Introduction
Local large language models are an important option within the OpenClaw ecosystem. They cost nothing to run beyond hardware and electricity, keep all data on your machine, and are not subject to network latency. With the rapid advancement of open-source models, running a capable LLM on consumer-grade GPUs is entirely feasible today. This article provides a detailed overview of local model deployment options and hardware-specific selection recommendations.
Comparing Local Deployment Options
There are currently three mainstream approaches for running local models:
| Solution | Highlights | Ideal For | Learning Curve |
|---|---|---|---|
| Ollama | CLI tool, one-click installation | Developers, Linux users | Low |
| LM Studio | Graphical interface, model store | Beginners, Windows/Mac users | Very low |
| llama.cpp | Low-level runtime, maximum flexibility | Advanced users, custom needs | High |
Ollama (Recommended)
Ollama is the most popular local model runtime and offers the most seamless integration with OpenClaw.
Installation:
# Linux / macOS
curl -fsSL https://ollama.com/install.sh | sh
# Windows
# Download the installer from ollama.com
Basic usage:
# Download a model
ollama pull llama3.3:70b
# Or download and run it interactively in one step
ollama run llama3.3:70b
# Start the Ollama service (usually starts automatically after installation)
ollama serve
# List downloaded models
ollama list
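Ollama also exposes a REST API on its default port 11434, which is what OpenClaw and other clients talk to. Below is a minimal standard-library Python sketch of the `/api/generate` endpoint; the model tag is just an example, and the `generate` helper needs a running Ollama server:

```python
import json
from urllib import request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default address

def build_generate_request(model: str, prompt: str) -> request.Request:
    """Build a POST request for Ollama's /api/generate endpoint.

    stream=False asks for one complete JSON object instead of
    newline-delimited streaming chunks.
    """
    payload = {"model": model, "prompt": prompt, "stream": False}
    return request.Request(
        OLLAMA_URL + "/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def generate(model: str, prompt: str) -> str:
    """Send a prompt and return the generated text (requires a running Ollama)."""
    with request.urlopen(build_generate_request(model, prompt)) as resp:
        return json.loads(resp.read())["response"]

req = build_generate_request("qwen2.5:7b-instruct-q4_K_M", "Why is the sky blue?")
print(req.full_url)  # http://localhost:11434/api/generate
```

In practice you rarely call this API by hand; OpenClaw's Ollama provider (configured below) does it for you.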
LM Studio
LM Studio provides a user-friendly graphical interface with one-click model downloading and running.
- Download and install from lmstudio.ai
- Search for and download models in the model store
- Start the local server (default port 1234)
llama.cpp
llama.cpp is the underlying inference engine that Ollama is built on. It is best suited for advanced users who need fine-grained control.
# Build (requires cmake)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON # GPU acceleration
cmake --build build --config Release
# Run a model
./build/bin/llama-server -m model.gguf --port 8080
Hardware Requirements and Model Selection
Choosing Models by VRAM
Model size and VRAM requirements are directly correlated. Here are recommendations for different VRAM levels:
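This correlation can be sketched numerically: a quantized model's weights take roughly parameters × bits-per-weight ÷ 8 bytes, plus overhead for the KV cache and activations. A back-of-the-envelope estimator (the bits-per-weight figure is an approximation for llama.cpp's Q4_K_M format, and the flat overhead is a simplification):

```python
# Approximate bits per weight for llama.cpp's Q4_K_M format (a ballpark
# figure; exact file sizes vary with model architecture).
Q4_K_M_BITS = 4.8

def estimate_vram_gb(params_billions: float,
                     bits_per_weight: float = Q4_K_M_BITS,
                     overhead_gb: float = 1.5) -> float:
    """Estimate VRAM: weight bytes plus a flat allowance for KV cache/activations."""
    weight_gb = params_billions * bits_per_weight / 8
    return round(weight_gb + overhead_gb, 1)

print(estimate_vram_gb(7))   # → 5.7, near the ~5 GB figure quoted for 7B models
print(estimate_vram_gb(32))  # → 20.7, near the ~20 GB figure quoted for 32B models
```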
8GB VRAM (RTX 4060 / RTX 3070, etc.)
| Model | Parameters | Quantization | VRAM Usage | Performance |
|---|---|---|---|---|
| Llama 3.2 3B | 3B | Q8_0 | ~4 GB | Adequate for simple conversations |
| Qwen 2.5 7B | 7B | Q4_K_M | ~5 GB | Excellent Chinese performance |
| Mistral 7B | 7B | Q4_K_M | ~5 GB | Strong English capabilities |
| DeepSeek V2 Lite 16B | 16B | Q3_K_M | ~7 GB | MoE architecture, fast inference |
# Recommended download for 8GB VRAM
ollama pull qwen2.5:7b-instruct-q4_K_M
16GB VRAM (RTX 4080 / RTX 4070 Ti, etc.)
| Model | Parameters | Quantization | VRAM Usage | Performance |
|---|---|---|---|---|
| Llama 3.1 8B | 8B | Q8_0 | ~9 GB | High quality, recommended (Llama 3.3 ships only as 70B) |
| Qwen 2.5 14B | 14B | Q4_K_M | ~10 GB | Among the best for Chinese |
| Mistral Small 22B | 22B | Q4_K_M | ~14 GB | Strong multilingual support |
| DeepSeek R1 Distill Qwen 14B | 14B | Q4_K_M | ~9 GB | Outstanding reasoning ability |
# Recommended downloads for 16GB VRAM
ollama pull qwen2.5:14b-instruct-q4_K_M
ollama pull llama3.1:8b-instruct-q8_0
24GB VRAM (RTX 4090 / RTX 3090, etc.)
| Model | Parameters | Quantization | VRAM Usage | Performance |
|---|---|---|---|---|
| Llama 3.3 70B | 70B | Q4_K_M | ~42 GB* | Requires CPU offload |
| Qwen 2.5 32B | 32B | Q4_K_M | ~20 GB | Exceptional Chinese ability |
| DeepSeek R1 32B | 32B | Q4_K_M | ~20 GB | Reasoning-enhanced model |
| Mistral Large 123B | 123B | Q2_K | ~48 GB* | Requires CPU offload |
*Models exceeding VRAM can be partially offloaded to system memory, but speed will decrease.
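How much of an oversized model lands on the GPU can be sketched with simple arithmetic, assuming roughly equal-sized layers (a simplification; real layer sizes vary, and in Ollama the GPU layer count can be pinned with the `num_gpu` parameter):

```python
def gpu_layers(model_gb: float, n_layers: int, vram_gb: float,
               reserve_gb: float = 2.0) -> int:
    """How many (assumed equal-sized) layers fit on the GPU,
    reserving some VRAM for the KV cache and scratch buffers."""
    per_layer_gb = model_gb / n_layers
    fit = int((vram_gb - reserve_gb) / per_layer_gb)
    return max(0, min(n_layers, fit))

# Llama 3.3 70B at Q4_K_M (~42 GB, 80 layers) on a 24 GB card:
print(gpu_layers(42, 80, 24))  # → 41, so roughly half the layers run on CPU
```

This is why the 70B row above is marked as requiring CPU offload: about half the weights end up streaming through system RAM on every token.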
# Recommended downloads for 24GB VRAM
ollama pull qwen2.5:32b-instruct-q4_K_M
ollama pull deepseek-r1:32b
No Dedicated GPU / CPU Inference
If you do not have a dedicated GPU, you can still run models using CPU inference, though it will be slower:
# Smaller models recommended for CPU inference
ollama pull llama3.2:3b-instruct-q4_K_M
ollama pull qwen2.5:3b-instruct-q4_K_M
We recommend at least 16GB of system RAM for 3B models and 32GB for 7B models.
Quantization Levels Explained
Quantization compresses model weights from high precision (FP16) to lower precision to reduce VRAM usage:
| Quantization Level | Precision Loss | Size (vs FP16) | Recommendation |
|---|---|---|---|
| Q8_0 | Minimal | ~50% | Best choice when VRAM allows |
| Q6_K | Very small | ~43% | Good balance of quality and size |
| Q5_K_M | Small | ~37% | Recommended |
| Q4_K_M | Moderate | ~30% | Most commonly used, recommended |
| Q3_K_M | Noticeable | ~23% | Use when VRAM is tight |
| Q2_K | Significant | ~18% | Use only as a last resort |
Rule of thumb: Q4_K_M offers the best value, and when VRAM permits, go with Q6_K or Q8_0.
OpenClaw Configuration
Connecting to Ollama
{
models: {
ollama: {
provider: "ollama",
baseUrl: "http://localhost:11434", // Ollama default port
defaultModel: "qwen2.5:14b-instruct-q4_K_M",
parameters: {
temperature: 0.7,
maxTokens: 4096,
numCtx: 8192, // Context window size
}
}
}
}
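The `numCtx` value maps to Ollama's `num_ctx` option. If you prefer, the same parameters can be baked into a model variant with an Ollama Modelfile (a sketch; the `my-qwen` name is arbitrary):

```
FROM qwen2.5:14b-instruct-q4_K_M
PARAMETER num_ctx 8192
PARAMETER temperature 0.7
```

Build it with `ollama create my-qwen -f Modelfile`, then point `defaultModel` at `my-qwen`.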
Connecting to LM Studio
{
models: {
lmstudio: {
provider: "openai", // LM Studio is compatible with the OpenAI API
baseUrl: "http://localhost:1234/v1", // LM Studio default address
apiKey: "lm-studio", // LM Studio does not validate the key
defaultModel: "loaded-model", // Uses the currently loaded model
}
}
}
Connecting to llama.cpp Server
{
models: {
llamacpp: {
provider: "openai", // Compatible with the OpenAI API
baseUrl: "http://localhost:8080/v1",
apiKey: "none",
defaultModel: "local-model",
}
}
}
Speed vs. Quality Trade-offs
Factors Affecting Inference Speed
- Memory bandwidth: RTX 4090 (1 TB/s) is much faster than RTX 4060 (272 GB/s)
- Model size: More parameters means slower inference
- Quantization level: Lower quantization is faster but reduces quality
- Context length: Longer conversations slow down inference
- Concurrency: Multiple simultaneous users reduce speed
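The first two factors interact: single-stream generation is usually memory-bandwidth bound, because producing each token requires reading essentially all of the active weights once. That gives a rough upper bound of bandwidth ÷ model size (this ignores compute, KV-cache reads, and kernel overhead, so measured speeds sit well below it):

```python
def ceiling_tokens_per_sec(bandwidth_gb_s: float, model_gb: float) -> int:
    """Theoretical generation ceiling for a memory-bandwidth-bound decoder."""
    return round(bandwidth_gb_s / model_gb)

# RTX 4090 (~1008 GB/s) with a 7B Q4_K_M model (~4.5 GB of weights):
print(ceiling_tokens_per_sec(1008, 4.5))  # → 224; measured speeds are ~80-100
# Same card with a 32B Q4_K_M model (~20 GB of weights):
print(ceiling_tokens_per_sec(1008, 20))   # → 50; measured speeds are ~20-30
```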
Speed Benchmarks (RTX 4090, Q4_K_M Quantization)
| Model | Generation Speed (tokens/s) | Experience |
|---|---|---|
| 3B | 120+ | Extremely fast |
| 7B | 80-100 | Very fast |
| 14B | 45-60 | Smooth |
| 32B | 20-30 | Acceptable |
| 70B (partial offload) | 5-10 | Slow |
Recommended Models Summary
| Use Case | Recommended Model | Notes |
|---|---|---|
| Chinese conversation | Qwen 2.5 (7B/14B/32B) | Best Chinese capabilities |
| English conversation | Llama 3.1 8B / Llama 3.3 70B | Excellent all-around performance |
| Code generation | DeepSeek Coder V2 | Specialized for coding |
| Reasoning and analysis | DeepSeek R1 (32B) | Chain-of-thought reasoning |
| Multilingual | Mistral (7B/22B) | Well-balanced across languages |
| Minimal resources | Llama 3.2 3B | Smallest viable model |
Troubleshooting
Cannot Connect to Ollama
# Check if Ollama is running
curl http://localhost:11434/api/version
# If not running, start it manually
ollama serve
Model Fails to Load
Error: model requires more memory than available
Solutions:
- Use a lower quantization level (e.g., Q3_K_M)
- Switch to a smaller model
- Close other programs that are consuming VRAM
Garbled Chinese Text
Some models have poor Chinese support. We recommend using the Qwen or DeepSeek series, which have been specifically optimized for Chinese.
Summary
Local LLMs are the best choice for users who prioritize data privacy and zero-cost operation. Ollama is the most convenient solution for integration with OpenClaw. When VRAM allows, choose the largest model and highest quantization level possible. For Chinese use cases, prioritize the Qwen series; for English, go with Llama. If your budget allows a GPU upgrade, a single RTX 4090 (24GB) is sufficient to run high-quality 32B-class models.