
vLLM Local Inference Server Setup Guide

· 6 min read

vLLM Introduction

vLLM is a high-throughput inference and serving engine for large language models. It uses PagedAttention for efficient KV-cache memory management and continuously batches incoming requests, and it exposes an OpenAI-compatible API server that integrates directly with OpenClaw for a fully self-hosted AI assistant.

Prerequisites

  • NVIDIA GPU (RTX 3090/4090 or A100 recommended)
  • CUDA 12.1+
  • Python 3.9+
  • Sufficient disk space for model files
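As a rough sizing check before downloading anything: FP16 weights take about 2 bytes per parameter, so an 8B model needs roughly 15 GiB of VRAM for weights alone, before the KV cache. A back-of-the-envelope helper (this is a rule of thumb, not a vLLM API):

```python
def weight_vram_gb(params_billion: float, bytes_per_param: float = 2.0) -> float:
    """Approximate VRAM (GiB) for model weights alone.

    FP16/BF16 -> 2 bytes per parameter; 4-bit quantization (AWQ/GPTQ) -> ~0.5.
    The KV cache and CUDA overhead come on top of this.
    """
    return params_billion * 1e9 * bytes_per_param / 1024**3

# FP16 8B model: ~14.9 GiB of weights; the same model 4-bit quantized: ~3.7 GiB
```

This is why an 8B model fits comfortably on an RTX 3090/4090 (24 GB), while a 70B model needs multiple GPUs or aggressive quantization.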

Install vLLM

pip install vllm

Docker (recommended):

docker pull vllm/vllm-openai:latest
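To actually run the container with GPU access, a typical invocation looks like the following (the volume mount caches downloaded models on the host; arguments after the image name are passed through to the server):

```shell
docker run --gpus all \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --api-key vllm-local-key \
  --max-model-len 8192
```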

Start the vLLM Service

vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --api-key "vllm-local-key" \
  --max-model-len 8192
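Before wiring the server into OpenClaw, it is worth verifying the OpenAI-compatible endpoint directly. A minimal sketch using only the Python standard library (model name and API key match the serve command above):

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # the vLLM server started above
API_KEY = "vllm-local-key"
MODEL = "meta-llama/Llama-3.1-8B-Instruct"

def build_payload(prompt: str, max_tokens: int = 128) -> dict:
    """Assemble an OpenAI-style chat completion request body."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt: str) -> str:
    """POST the request to the vLLM server and return the reply text."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_payload(prompt)).encode(),
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# With the server running: chat("Say hello in one sentence.")
```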

Configure in OpenClaw

{
  "providers": {
    "vllm": {
      "type": "openai",
      "baseUrl": "http://localhost:8000/v1",
      "apiKey": "vllm-local-key",
      "models": ["meta-llama/Llama-3.1-8B-Instruct"]
    }
  },
  "models": {
    "local-llama": {
      "provider": "vllm",
      "model": "meta-llama/Llama-3.1-8B-Instruct",
      "temperature": 0.7,
      "maxTokens": 4096
    }
  }
}

Performance Tuning

Quantization

vllm serve hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 \
  --quantization awq \
  --gpu-memory-utilization 0.9

Tensor Parallelism (Multi-GPU)

vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4

Batch Optimization

vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --max-num-batched-tokens 8192 \
  --max-num-seqs 64
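These two flags bound the batch in different ways: --max-num-seqs caps the number of concurrent sequences, while --max-num-batched-tokens caps the total tokens the scheduler processes per step. Whichever limit binds first determines the effective batch size, as this sketch illustrates (a hypothetical helper for intuition, not vLLM internals):

```python
def effective_batch(max_num_seqs: int, max_num_batched_tokens: int,
                    avg_tokens_per_seq: int) -> int:
    """Effective concurrent sequences: whichever limit binds first wins."""
    return min(max_num_seqs, max_num_batched_tokens // avg_tokens_per_seq)

# With the flags above and ~512-token prompts, the token budget binds
# (8192 // 512 = 16), not the 64-sequence cap:
# effective_batch(64, 8192, 512) -> 16
```

So if your workload has long prompts, raising --max-num-batched-tokens tends to matter more than raising --max-num-seqs.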

Common Model Configurations

  • Qwen 2.5 (Chinese-optimized): Qwen/Qwen2.5-14B-Instruct
  • Mistral (general): mistralai/Mistral-7B-Instruct-v0.3
  • Custom fine-tuned models: Load from a local path

Production Deployment Tips

  1. Use systemd to manage the service for auto-start on boot
  2. Use Nginx as a reverse proxy for SSL and access control
  3. Monitor GPU usage with nvidia-smi
  4. Set up log rotation
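For step 1, a minimal systemd unit might look like the following (the user, paths, and environment are placeholders to adapt; save as /etc/systemd/system/vllm.service, then run systemctl daemon-reload and systemctl enable --now vllm):

```
[Unit]
Description=vLLM OpenAI-compatible inference server
After=network.target

[Service]
User=vllm
ExecStart=/usr/local/bin/vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 --port 8000 --api-key vllm-local-key --max-model-len 8192
Environment=HF_HOME=/var/lib/vllm/huggingface
Restart=on-failure

[Install]
WantedBy=multi-user.target
```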

Common Questions

Q: Running out of GPU memory? Use a quantized model (AWQ/GPTQ), reduce --max-model-len, or lower --gpu-memory-utilization if other processes share the GPU.

Q: First startup is slow? The first run downloads the model weights from Hugging Face. Pre-download them with huggingface-cli download to avoid the wait.

Q: How to serve multiple models? Run multiple vLLM instances on different ports and configure multiple providers in OpenClaw.
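For example, with a second instance serving Qwen on port 8001, the providers section might look like this (ports and key are illustrative):

```json
{
  "providers": {
    "vllm-llama": {
      "type": "openai",
      "baseUrl": "http://localhost:8000/v1",
      "apiKey": "vllm-local-key",
      "models": ["meta-llama/Llama-3.1-8B-Instruct"]
    },
    "vllm-qwen": {
      "type": "openai",
      "baseUrl": "http://localhost:8001/v1",
      "apiKey": "vllm-local-key",
      "models": ["Qwen/Qwen2.5-14B-Instruct"]
    }
  }
}
```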

Summary

vLLM is a leading engine for self-hosted AI inference, combining strong throughput with full OpenAI API compatibility. Paired with OpenClaw, it enables fully private deployment where data never leaves your server, making it well suited to scenarios with strict privacy requirements.

OpenClaw is a free, open-source personal AI assistant that supports WhatsApp, Telegram, Discord, and many more platforms