# vLLM Introduction

vLLM is a high-performance inference engine for large language models that uses PagedAttention for efficient KV-cache memory management and continuous batching for high throughput. It exposes an OpenAI-compatible API server that integrates directly with OpenClaw for a fully self-hosted AI assistant.
## Prerequisites
- NVIDIA GPU (RTX 3090/4090 or A100 recommended)
- CUDA 12.1+
- Python 3.9+
- Sufficient disk space for model files
## Install vLLM

```bash
pip install vllm
```

Docker (recommended):

```bash
docker pull vllm/vllm-openai:latest
```
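Pulling the image alone does not start a server. A typical invocation, adapted from the vLLM Docker docs, looks like the following; the cache mount path and model are assumptions to adapt to your setup (`--ipc=host` is needed because vLLM uses shared memory between worker processes):

```bash
docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 --ipc=host \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --api-key vllm-local-key
```

Mounting the Hugging Face cache avoids re-downloading model weights on every container restart.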
## Start the vLLM Service

```bash
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --api-key "vllm-local-key" \
  --max-model-len 8192
```
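Once the server is up, any OpenAI-style client can talk to it. A minimal smoke-test sketch using only the Python standard library — the URL, API key, and model name mirror the serve command above; nothing here is OpenClaw-specific:

```python
import json
from urllib import request

VLLM_URL = "http://localhost:8000/v1/chat/completions"
API_KEY = "vllm-local-key"  # must match --api-key passed to vllm serve

def build_payload(prompt: str,
                  model: str = "meta-llama/Llama-3.1-8B-Instruct",
                  max_tokens: int = 128) -> dict:
    """Assemble an OpenAI-compatible chat-completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def ask(prompt: str) -> str:
    """POST the prompt to the local vLLM server and return the reply text."""
    req = request.Request(
        VLLM_URL,
        data=json.dumps(build_payload(prompt)).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_KEY}",
        },
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

# ask("Reply with the single word: pong")  # requires the server to be running
```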
## Configure in OpenClaw

```json
{
  "providers": {
    "vllm": {
      "type": "openai",
      "baseUrl": "http://localhost:8000/v1",
      "apiKey": "vllm-local-key",
      "models": ["meta-llama/Llama-3.1-8B-Instruct"]
    }
  },
  "models": {
    "local-llama": {
      "provider": "vllm",
      "model": "meta-llama/Llama-3.1-8B-Instruct",
      "temperature": 0.7,
      "maxTokens": 4096
    }
  }
}
```
## Performance Tuning

### Quantization

```bash
vllm serve TheBloke/Llama-3.1-8B-Instruct-AWQ \
  --quantization awq \
  --gpu-memory-utilization 0.9
```
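To see why quantization helps, a rough back-of-envelope for weight memory alone (ignoring KV cache and activation memory, which quantization does not shrink):

```python
def weight_memory_gib(n_params_billion: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GiB: params * bits / 8 bytes per byte."""
    return n_params_billion * 1e9 * bits_per_weight / 8 / 2**30

fp16 = weight_memory_gib(8, 16)  # ~14.9 GiB: tight on a 24 GB card once KV cache is added
awq4 = weight_memory_gib(8, 4)   # ~3.7 GiB: leaves room for longer contexts
print(f"fp16: {fp16:.1f} GiB, AWQ 4-bit: {awq4:.1f} GiB")
```

The freed memory goes to the KV cache, which is what `--gpu-memory-utilization` budgets.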
### Tensor Parallelism (Multi-GPU)

```bash
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4
```
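The idea behind tensor parallelism can be shown with a toy column-parallel matrix multiply: each "GPU" holds a slice of the weight matrix's columns, and concatenating the partial results reproduces the full product. This is a pure-Python illustration of the principle, not how vLLM implements it:

```python
def matmul(A, B):
    """Naive matrix multiply on nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

x = [[1.0, 2.0], [3.0, 4.0]]                       # activations
W = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 2.0]]   # weight matrix

# Shard W's columns across two "GPUs".
W0 = [row[:2] for row in W]
W1 = [row[2:] for row in W]

full = matmul(x, W)
sharded = [r0 + r1 for r0, r1 in zip(matmul(x, W0), matmul(x, W1))]
assert full == sharded  # concatenated shard outputs equal the full product
```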
### Batch Optimization

```bash
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --max-num-batched-tokens 8192 \
  --max-num-seqs 64
```
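These two flags interact: `--max-num-batched-tokens` caps the total tokens processed per scheduler step, while `--max-num-seqs` caps concurrent sequences. A rough sketch of which limit binds (the 256-token average is an assumption for illustration):

```python
max_num_batched_tokens = 8192  # per-step token budget (--max-num-batched-tokens)
max_num_seqs = 64              # concurrency cap (--max-num-seqs)
avg_tokens_per_seq = 256       # assumed average sequence length

# Effective concurrency is whichever limit binds first.
concurrent = min(max_num_batched_tokens // avg_tokens_per_seq, max_num_seqs)
print(concurrent)  # 32: here the token budget binds before the sequence cap
```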
## Common Model Configurations

- Qwen 2.5 (Chinese-optimized): `Qwen/Qwen2.5-14B-Instruct`
- Mistral (general): `mistralai/Mistral-7B-Instruct-v0.3`
- Custom fine-tuned models: load from a local path
## Production Deployment Tips

- Use systemd to manage the service for auto-start on boot
- Use Nginx as a reverse proxy for SSL termination and access control
- Monitor GPU usage with `nvidia-smi`
- Set up log rotation
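For the systemd tip, a unit file sketch — the user, binary path, and flags are assumptions to adapt; save as e.g. `/etc/systemd/system/vllm.service`:

```ini
[Unit]
Description=vLLM OpenAI-compatible server
After=network-online.target

[Service]
User=vllm
ExecStart=/usr/local/bin/vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --host 0.0.0.0 --port 8000 --api-key vllm-local-key
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
```

Enable it with `systemctl enable --now vllm` so the server starts on boot and restarts after crashes.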
## Common Questions

**Q: Running out of GPU memory?**
Use quantized models (AWQ/GPTQ), or reduce `--max-model-len` and `--gpu-memory-utilization`.

**Q: First startup is slow?**
The first run downloads model files from Hugging Face. Pre-download them with `huggingface-cli download` to avoid the wait.

**Q: How do I serve multiple models?**
Run multiple vLLM instances on different ports and configure each as a separate provider in OpenClaw.
## Summary

vLLM is a leading engine for self-hosted AI inference, combining excellent throughput with full OpenAI API compatibility. Paired with OpenClaw, it enables a fully private deployment where data never leaves your server, making it ideal for scenarios with strict privacy requirements.