
vLLM Local Inference Server Setup Guide

· 6 min read

vLLM Introduction

vLLM is a high-throughput inference and serving engine for large language models. It uses PagedAttention for efficient KV-cache memory management and continuously batches incoming requests, and it exposes an OpenAI-compatible API server that integrates directly with OpenClaw for a fully self-hosted AI assistant.

Prerequisites

  • NVIDIA GPU (RTX 3090/4090 or A100 recommended)
  • CUDA 12.1+
  • Python 3.9+
  • Sufficient disk space for model files
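As a rough sizing check before downloading anything: FP16 weights take about 2 bytes per parameter, so an 8B model needs roughly 15 GiB of VRAM for weights alone, before the KV cache. A back-of-the-envelope helper (this is a rule of thumb, not a vLLM API):

```python
def weight_vram_gb(params_billion: float, bytes_per_param: float = 2.0) -> float:
    """Approximate VRAM (GiB) for model weights alone.

    FP16/BF16 -> 2 bytes per parameter; 4-bit quantization (AWQ/GPTQ) -> ~0.5.
    The KV cache and CUDA overhead come on top of this.
    """
    return params_billion * 1e9 * bytes_per_param / 1024**3

# FP16 8B model: ~14.9 GiB of weights; the same model 4-bit quantized: ~3.7 GiB
```

This is why an 8B model fits comfortably on an RTX 3090/4090 (24 GB), while a 70B model needs multiple GPUs or aggressive quantization.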

Install vLLM

pip install vllm

Docker (recommended):

docker pull vllm/vllm-openai:latest
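To actually run the container with GPU access, a typical invocation looks like the following (the volume mount caches downloaded models on the host; arguments after the image name are passed through to the server):

```shell
docker run --gpus all \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --api-key vllm-local-key \
  --max-model-len 8192
```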

Start the vLLM Service

vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --api-key "vllm-local-key" \
  --max-model-len 8192
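Before wiring the server into OpenClaw, it is worth verifying the OpenAI-compatible endpoint directly. A minimal sketch using only the Python standard library (model name and API key match the serve command above):

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # the vLLM server started above
API_KEY = "vllm-local-key"
MODEL = "meta-llama/Llama-3.1-8B-Instruct"

def build_payload(prompt: str, max_tokens: int = 128) -> dict:
    """Assemble an OpenAI-style chat completion request body."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt: str) -> str:
    """POST the request to the vLLM server and return the reply text."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_payload(prompt)).encode(),
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# With the server running: chat("Say hello in one sentence.")
```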

Configure in OpenClaw

{
  "providers": {
    "vllm": {
      "type": "openai",
      "baseUrl": "http://localhost:8000/v1",
      "apiKey": "vllm-local-key",
      "models": ["meta-llama/Llama-3.1-8B-Instruct"]
    }
  },
  "models": {
    "local-llama": {
      "provider": "vllm",
      "model": "meta-llama/Llama-3.1-8B-Instruct",
      "temperature": 0.7,
      "maxTokens": 4096
    }
  }
}

Performance Tuning

Quantization

vllm serve hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 \
  --quantization awq \
  --gpu-memory-utilization 0.9

Tensor Parallelism (Multi-GPU)

vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4

Batch Optimization

vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --max-num-batched-tokens 8192 \
  --max-num-seqs 64
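These two flags bound the batch in different ways: --max-num-seqs caps the number of concurrent sequences, while --max-num-batched-tokens caps the total tokens the scheduler processes per step. Whichever limit binds first determines the effective batch size, as this sketch illustrates (a hypothetical helper for intuition, not vLLM internals):

```python
def effective_batch(max_num_seqs: int, max_num_batched_tokens: int,
                    avg_tokens_per_seq: int) -> int:
    """Effective concurrent sequences: whichever limit binds first wins."""
    return min(max_num_seqs, max_num_batched_tokens // avg_tokens_per_seq)

# With the flags above and ~512-token prompts, the token budget binds
# (8192 // 512 = 16), not the 64-sequence cap:
# effective_batch(64, 8192, 512) -> 16
```

So if your workload has long prompts, raising --max-num-batched-tokens tends to matter more than raising --max-num-seqs.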

Common Model Configurations

  • Qwen 2.5 (Chinese-optimized): Qwen/Qwen2.5-14B-Instruct
  • Mistral (general): mistralai/Mistral-7B-Instruct-v0.3
  • Custom fine-tuned models: Load from a local path

Production Deployment Tips

  1. Use systemd to manage the service for auto-start on boot
  2. Use Nginx as a reverse proxy for SSL and access control
  3. Monitor GPU usage with nvidia-smi
  4. Set up log rotation
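For step 1, a minimal systemd unit might look like the following (the user, paths, and environment are placeholders to adapt; save as /etc/systemd/system/vllm.service, then run systemctl daemon-reload and systemctl enable --now vllm):

```
[Unit]
Description=vLLM OpenAI-compatible inference server
After=network.target

[Service]
User=vllm
ExecStart=/usr/local/bin/vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 --port 8000 --api-key vllm-local-key --max-model-len 8192
Environment=HF_HOME=/var/lib/vllm/huggingface
Restart=on-failure

[Install]
WantedBy=multi-user.target
```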

Common Questions

Q: Running out of GPU memory? Use a quantized model (AWQ/GPTQ), reduce --max-model-len, or lower --gpu-memory-utilization if other processes share the GPU.

Q: First startup is slow? The first run downloads the model weights from Hugging Face. Pre-download them with huggingface-cli download to avoid the wait.

Q: How to serve multiple models? Run multiple vLLM instances on different ports and configure multiple providers in OpenClaw.
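For example, with a second instance serving Qwen on port 8001, the providers section might look like this (ports and key are illustrative):

```json
{
  "providers": {
    "vllm-llama": {
      "type": "openai",
      "baseUrl": "http://localhost:8000/v1",
      "apiKey": "vllm-local-key",
      "models": ["meta-llama/Llama-3.1-8B-Instruct"]
    },
    "vllm-qwen": {
      "type": "openai",
      "baseUrl": "http://localhost:8001/v1",
      "apiKey": "vllm-local-key",
      "models": ["Qwen/Qwen2.5-14B-Instruct"]
    }
  }
}
```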

Summary

vLLM is a leading engine for self-hosted AI inference, combining strong throughput with full OpenAI API compatibility. Paired with OpenClaw, it enables fully private deployment where data never leaves your server, making it well suited to scenarios with strict privacy requirements.

OpenClaw is a free, open-source personal AI assistant that supports WhatsApp, Telegram, Discord, and many more platforms