Introduction
Local large language models are an important option within the OpenClaw ecosystem. They cost nothing to run beyond hardware and electricity, keep all data on your machine, and are not subject to network latency. With the rapid advancement of open-source models, running a capable LLM on consumer-grade GPUs is entirely feasible today. This article provides a detailed overview of local model deployment options and hardware-specific selection recommendations.
Comparing Local Deployment Options
There are currently three mainstream approaches for running local models:
| Solution | Highlights | Ideal For | Learning Curve |
|---|---|---|---|
| Ollama | CLI tool, one-click installation | Developers, Linux users | Low |
| LM Studio | Graphical interface, model store | Beginners, Windows/Mac users | Very low |
| llama.cpp | Low-level runtime, maximum flexibility | Advanced users, custom needs | High |
Ollama (Recommended)
Ollama is the most popular local model runtime and offers the most seamless integration with OpenClaw.
Installation:
# Linux / macOS
curl -fsSL https://ollama.com/install.sh | sh
# Windows
# Download the installer from ollama.com
Basic usage:
# Download a model
ollama pull llama3.3:70b
# Or download and run it interactively in one step
ollama run llama3.3:70b
# Start the Ollama service (usually starts automatically after installation)
ollama serve
# List downloaded models
ollama list
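Ollama also exposes a REST API on its default port 11434, which is what OpenClaw and other clients talk to. Below is a minimal standard-library Python sketch of the `/api/generate` endpoint; the model tag is just an example, and the `generate` helper needs a running Ollama server:

```python
import json
from urllib import request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default address

def build_generate_request(model: str, prompt: str) -> request.Request:
    """Build a POST request for Ollama's /api/generate endpoint.

    stream=False asks for one complete JSON object instead of
    newline-delimited streaming chunks.
    """
    payload = {"model": model, "prompt": prompt, "stream": False}
    return request.Request(
        OLLAMA_URL + "/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def generate(model: str, prompt: str) -> str:
    """Send a prompt and return the generated text (requires a running Ollama)."""
    with request.urlopen(build_generate_request(model, prompt)) as resp:
        return json.loads(resp.read())["response"]

req = build_generate_request("qwen2.5:7b-instruct-q4_K_M", "Why is the sky blue?")
print(req.full_url)  # http://localhost:11434/api/generate
```

In practice you rarely call this API by hand; OpenClaw's Ollama provider (configured below) does it for you.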
LM Studio
LM Studio provides a user-friendly graphical interface with one-click model downloading and running.
- Download and install from lmstudio.ai
- Search for and download models in the model store
- Start the local server (default port 1234)
llama.cpp
llama.cpp is the underlying inference engine that Ollama is built on. It is best suited for advanced users who need fine-grained control.
# Build (requires cmake)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON # GPU acceleration
cmake --build build --config Release
# Run a model
./build/bin/llama-server -m model.gguf --port 8080
Hardware Requirements and Model Selection
Choosing Models by VRAM
Model size and VRAM requirements are directly correlated. Here are recommendations for different VRAM levels:
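This correlation can be sketched numerically: a quantized model's weights take roughly parameters × bits-per-weight ÷ 8 bytes, plus overhead for the KV cache and activations. A back-of-the-envelope estimator (the bits-per-weight figure is an approximation for llama.cpp's Q4_K_M format, and the flat overhead is a simplification):

```python
# Approximate bits per weight for llama.cpp's Q4_K_M format (a ballpark
# figure; exact file sizes vary with model architecture).
Q4_K_M_BITS = 4.8

def estimate_vram_gb(params_billions: float,
                     bits_per_weight: float = Q4_K_M_BITS,
                     overhead_gb: float = 1.5) -> float:
    """Estimate VRAM: weight bytes plus a flat allowance for KV cache/activations."""
    weight_gb = params_billions * bits_per_weight / 8
    return round(weight_gb + overhead_gb, 1)

print(estimate_vram_gb(7))   # → 5.7, near the ~5 GB figure quoted for 7B models
print(estimate_vram_gb(32))  # → 20.7, near the ~20 GB figure quoted for 32B models
```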
8GB VRAM (RTX 4060 / RTX 3070, etc.)
| Model | Parameters | Quantization | VRAM Usage | Performance |
|---|---|---|---|---|
| Llama 3.2 3B | 3B | Q8_0 | ~4 GB | Adequate for simple conversations |
| Qwen 2.5 7B | 7B | Q4_K_M | ~5 GB | Excellent Chinese performance |
| Mistral 7B | 7B | Q4_K_M | ~5 GB | Strong English capabilities |
| DeepSeek V2 Lite 16B | 16B | Q3_K_M | ~7 GB | MoE architecture, fast inference |
# Recommended download for 8GB VRAM
ollama pull qwen2.5:7b-instruct-q4_K_M
16GB VRAM (RTX 4080 / RTX 4070 Ti, etc.)
| Model | Parameters | Quantization | VRAM Usage | Performance |
|---|---|---|---|---|
| Llama 3.1 8B | 8B | Q8_0 | ~9 GB | High quality, recommended (Llama 3.3 ships only as 70B) |
| Qwen 2.5 14B | 14B | Q4_K_M | ~10 GB | Among the best for Chinese |
| Mistral Small 22B | 22B | Q4_K_M | ~14 GB | Strong multilingual support |
| DeepSeek R1 Distill Qwen 14B | 14B | Q4_K_M | ~9 GB | Outstanding reasoning ability |
# Recommended downloads for 16GB VRAM
ollama pull qwen2.5:14b-instruct-q4_K_M
ollama pull llama3.1:8b-instruct-q8_0
24GB VRAM (RTX 4090 / RTX 3090, etc.)
| Model | Parameters | Quantization | VRAM Usage | Performance |
|---|---|---|---|---|
| Llama 3.3 70B | 70B | Q4_K_M | ~42 GB* | Requires CPU offload |
| Qwen 2.5 32B | 32B | Q4_K_M | ~20 GB | Exceptional Chinese ability |
| DeepSeek R1 32B | 32B | Q4_K_M | ~20 GB | Reasoning-enhanced model |
| Mistral Large 123B | 123B | Q2_K | ~48 GB* | Requires CPU offload |
*Models exceeding VRAM can be partially offloaded to system memory, but speed will decrease.
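How much of an oversized model lands on the GPU can be sketched with simple arithmetic, assuming roughly equal-sized layers (a simplification; real layer sizes vary, and in Ollama the GPU layer count can be pinned with the `num_gpu` parameter):

```python
def gpu_layers(model_gb: float, n_layers: int, vram_gb: float,
               reserve_gb: float = 2.0) -> int:
    """How many (assumed equal-sized) layers fit on the GPU,
    reserving some VRAM for the KV cache and scratch buffers."""
    per_layer_gb = model_gb / n_layers
    fit = int((vram_gb - reserve_gb) / per_layer_gb)
    return max(0, min(n_layers, fit))

# Llama 3.3 70B at Q4_K_M (~42 GB, 80 layers) on a 24 GB card:
print(gpu_layers(42, 80, 24))  # → 41, so roughly half the layers run on CPU
```

This is why the 70B row above is marked as requiring CPU offload: about half the weights end up streaming through system RAM on every token.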
# Recommended downloads for 24GB VRAM
ollama pull qwen2.5:32b-instruct-q4_K_M
ollama pull deepseek-r1:32b
No Dedicated GPU / CPU Inference
If you do not have a dedicated GPU, you can still run models using CPU inference, though it will be slower:
# Smaller models recommended for CPU inference
ollama pull llama3.2:3b-instruct-q4_K_M
ollama pull qwen2.5:3b-instruct-q4_K_M
We recommend at least 16GB of system RAM for 3B models and 32GB for 7B models.
Quantization Levels Explained
Quantization compresses model weights from high precision (FP16) to lower precision to reduce VRAM usage:
| Quantization Level | Precision Loss | Size (vs FP16) | Recommendation |
|---|---|---|---|
| Q8_0 | Minimal | ~50% | Best choice when VRAM allows |
| Q6_K | Very small | ~43% | Good balance of quality and size |
| Q5_K_M | Small | ~37% | Recommended |
| Q4_K_M | Moderate | ~30% | Most commonly used, recommended |
| Q3_K_M | Noticeable | ~23% | Use when VRAM is tight |
| Q2_K | Significant | ~18% | Use only as a last resort |
Rule of thumb: Q4_K_M offers the best value, and when VRAM permits, go with Q6_K or Q8_0.
OpenClaw Configuration
Connecting to Ollama
{
models: {
ollama: {
provider: "ollama",
baseUrl: "http://localhost:11434", // Ollama default port
defaultModel: "qwen2.5:14b-instruct-q4_K_M",
parameters: {
temperature: 0.7,
maxTokens: 4096,
numCtx: 8192, // Context window size
}
}
}
}
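The `numCtx` value maps to Ollama's `num_ctx` option. If you prefer, the same parameters can be baked into a model variant with an Ollama Modelfile (a sketch; the `my-qwen` name is arbitrary):

```
FROM qwen2.5:14b-instruct-q4_K_M
PARAMETER num_ctx 8192
PARAMETER temperature 0.7
```

Build it with `ollama create my-qwen -f Modelfile`, then point `defaultModel` at `my-qwen`.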
Connecting to LM Studio
{
models: {
lmstudio: {
provider: "openai", // LM Studio is compatible with the OpenAI API
baseUrl: "http://localhost:1234/v1", // LM Studio default address
apiKey: "lm-studio", // LM Studio does not validate the key
defaultModel: "loaded-model", // Uses the currently loaded model
}
}
}
Connecting to llama.cpp Server
{
models: {
llamacpp: {
provider: "openai", // Compatible with the OpenAI API
baseUrl: "http://localhost:8080/v1",
apiKey: "none",
defaultModel: "local-model",
}
}
}
Speed vs. Quality Trade-offs
Factors Affecting Inference Speed
- Memory bandwidth: RTX 4090 (1 TB/s) is much faster than RTX 4060 (272 GB/s)
- Model size: More parameters means slower inference
- Quantization level: Lower quantization is faster but reduces quality
- Context length: Longer conversations slow down inference
- Concurrency: Multiple simultaneous users reduce speed
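The first two factors interact: single-stream generation is usually memory-bandwidth bound, because producing each token requires reading essentially all of the active weights once. That gives a rough upper bound of bandwidth ÷ model size (this ignores compute, KV-cache reads, and kernel overhead, so measured speeds sit well below it):

```python
def ceiling_tokens_per_sec(bandwidth_gb_s: float, model_gb: float) -> int:
    """Theoretical generation ceiling for a memory-bandwidth-bound decoder."""
    return round(bandwidth_gb_s / model_gb)

# RTX 4090 (~1008 GB/s) with a 7B Q4_K_M model (~4.5 GB of weights):
print(ceiling_tokens_per_sec(1008, 4.5))  # → 224; measured speeds are ~80-100
# Same card with a 32B Q4_K_M model (~20 GB of weights):
print(ceiling_tokens_per_sec(1008, 20))   # → 50; measured speeds are ~20-30
```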
Speed Benchmarks (RTX 4090, Q4_K_M Quantization)
| Model | Generation Speed (tokens/s) | Experience |
|---|---|---|
| 3B | 120+ | Extremely fast |
| 7B | 80-100 | Very fast |
| 14B | 45-60 | Smooth |
| 32B | 20-30 | Acceptable |
| 70B (partial offload) | 5-10 | Slow |
Recommended Models Summary
| Use Case | Recommended Model | Notes |
|---|---|---|
| Chinese conversation | Qwen 2.5 (7B/14B/32B) | Best Chinese capabilities |
| English conversation | Llama 3.1 8B / Llama 3.3 70B | Excellent all-around performance |
| Code generation | DeepSeek Coder V2 | Specialized for coding |
| Reasoning and analysis | DeepSeek R1 (32B) | Chain-of-thought reasoning |
| Multilingual | Mistral (7B/22B) | Well-balanced across languages |
| Minimal resources | Llama 3.2 3B | Smallest viable model |
Troubleshooting
Cannot Connect to Ollama
# Check if Ollama is running
curl http://localhost:11434/api/version
# If not running, start it manually
ollama serve
Model Fails to Load
Error: model requires more memory than available
Solutions:
- Use a lower quantization level (e.g., Q3_K_M)
- Switch to a smaller model
- Close other programs that are consuming VRAM
Garbled Chinese Text
Some models have poor Chinese support. We recommend using the Qwen or DeepSeek series, which have been specifically optimized for Chinese.
Summary
Local LLMs are the best choice for users who prioritize data privacy and zero-cost operation. Ollama is the most convenient solution for integration with OpenClaw. When VRAM allows, choose the largest model and highest quantization level possible. For Chinese use cases, prioritize the Qwen series; for English, go with Llama. If your budget allows a GPU upgrade, a single RTX 4090 (24GB) is sufficient to run high-quality 32B-class models.