
Local LLM Selection and Deployment Guide for OpenClaw

16 min read

Introduction

Local large language models are an important option within the OpenClaw ecosystem. They are completely free to use, keep all data on your machine, and are not subject to network latency. With the rapid advancement of open-source models, running a capable LLM on consumer-grade GPUs is entirely feasible today. This article provides a detailed overview of local model deployment options and hardware-specific selection recommendations.

Comparing Local Deployment Options

There are currently three mainstream approaches for running local models:

| Solution  | Highlights                              | Ideal For                    | Learning Curve |
|-----------|-----------------------------------------|------------------------------|----------------|
| Ollama    | CLI tool, one-click installation        | Developers, Linux users      | Low            |
| LM Studio | Graphical interface, model store        | Beginners, Windows/Mac users | Very low       |
| llama.cpp | Low-level runtime, maximum flexibility  | Advanced users, custom needs | High           |

Ollama (Recommended)

Ollama is the most popular local model runtime and offers the most seamless integration with OpenClaw.

Installation:

# Linux / macOS
curl -fsSL https://ollama.com/install.sh | sh

# Windows
# Download the installer from ollama.com

Basic usage:

# Download a model
ollama pull llama3.3:70b

# Run it interactively
ollama run llama3.3:70b

# Start the Ollama service (usually starts automatically after installation)
ollama serve

# List downloaded models
ollama list
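Once the service is running, OpenClaw (or any other client) talks to Ollama over its HTTP API on port 11434. A minimal sketch in Python using only the standard library (the model tag is an example; substitute any model you have pulled):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default address

def build_payload(model, prompt):
    """Build the JSON body for Ollama's /api/generate endpoint."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate(model, prompt):
    """Send a one-shot generation request and return the response text."""
    req = urllib.request.Request(
        OLLAMA_URL + "/api/generate",
        data=json.dumps(build_payload(model, prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Example (requires a running Ollama instance with the model pulled):
# print(generate("qwen2.5:7b-instruct-q4_K_M", "Hello!"))
```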

LM Studio

LM Studio provides a user-friendly graphical interface with one-click model downloading and running.

  1. Download and install from lmstudio.ai
  2. Search for and download models in the model store
  3. Start the local server (default port 1234)

llama.cpp

llama.cpp is the underlying inference engine that Ollama is built on. It is best suited for advanced users who need fine-grained control.

# Build (requires cmake)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON    # GPU acceleration
cmake --build build --config Release

# Run a model
./build/bin/llama-server -m model.gguf --port 8080
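llama-server exposes an OpenAI-compatible API, so any OpenAI-style client can query it. A standard-library Python sketch against the server started above (the `max_tokens` value is just an illustrative default):

```python
import json
import urllib.request

SERVER_URL = "http://localhost:8080"  # llama-server from the command above

def build_chat_payload(messages, max_tokens=256):
    """Build an OpenAI-style chat completion request body."""
    return {"messages": messages, "max_tokens": max_tokens}

def chat(messages):
    """POST to the OpenAI-compatible endpoint and return the assistant reply."""
    req = urllib.request.Request(
        SERVER_URL + "/v1/chat/completions",
        data=json.dumps(build_chat_payload(messages)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.loads(resp.read())
    return body["choices"][0]["message"]["content"]

# Example (requires llama-server to be running):
# print(chat([{"role": "user", "content": "Hello!"}]))
```

The same client shape works for LM Studio, which serves the identical API on port 1234.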

Hardware Requirements and Model Selection

Choosing Models by VRAM

Model size and VRAM requirements are directly correlated. Here are recommendations for different VRAM levels:

8GB VRAM (RTX 4060 / RTX 3070, etc.)

| Model                | Parameters | Quantization | VRAM Usage | Performance                       |
|----------------------|------------|--------------|------------|-----------------------------------|
| Llama 3.2 3B         | 3B         | Q8_0         | ~4 GB      | Adequate for simple conversations |
| Qwen 2.5 7B          | 7B         | Q4_K_M       | ~5 GB      | Excellent Chinese performance     |
| Mistral 7B           | 7B         | Q4_K_M       | ~5 GB      | Strong English capabilities       |
| DeepSeek V2 Lite 16B | 16B        | Q3_K_M       | ~7 GB      | MoE architecture, fast inference  |

# Recommended download for 8GB VRAM
ollama pull qwen2.5:7b-instruct-q4_K_M

16GB VRAM (RTX 4080 / RTX 4070 Ti, etc.)

| Model             | Parameters | Quantization | VRAM Usage | Performance                   |
|-------------------|------------|--------------|------------|-------------------------------|
| Llama 3.1 8B      | 8B         | Q8_0         | ~9 GB      | High quality, recommended     |
| Qwen 2.5 14B      | 14B        | Q4_K_M       | ~10 GB     | Among the best for Chinese    |
| Mistral Small 22B | 22B        | Q4_K_M       | ~14 GB     | Strong multilingual support   |
| DeepSeek V3 Lite  | 24B        | Q4_K_M       | ~15 GB     | Outstanding reasoning ability |

Note: the 8B model belongs to the Llama 3.1 generation; Llama 3.3 ships only as a 70B model.

# Recommended downloads for 16GB VRAM
ollama pull qwen2.5:14b-instruct-q4_K_M
ollama pull llama3.1:8b-instruct-q8_0

24GB VRAM (RTX 4090 / RTX 3090, etc.)

| Model             | Parameters | Quantization | VRAM Usage | Performance               |
|-------------------|------------|--------------|------------|---------------------------|
| Llama 3.3 70B     | 70B        | Q4_K_M       | ~42 GB*    | Requires CPU offload      |
| Qwen 2.5 32B      | 32B        | Q4_K_M       | ~20 GB     | Exceptional Chinese ability |
| DeepSeek R1 32B   | 32B        | Q4_K_M       | ~20 GB     | Reasoning-enhanced model  |
| Mistral Large 123B | 123B      | Q2_K         | ~48 GB*    | Requires CPU offload      |

*Models exceeding VRAM can be partially offloaded to system memory, but speed will decrease.

# Recommended downloads for 24GB VRAM
ollama pull qwen2.5:32b-instruct-q4_K_M
ollama pull deepseek-r1:32b
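For the rows marked with an asterisk, you can control how much of the model stays on the GPU. With llama.cpp this is the `--n-gpu-layers` flag on `llama-server`; Ollama exposes the analogous `num_gpu` option per request. A sketch of an Ollama request body that caps the number of GPU layers (the layer count here is an illustrative guess, not a tuned value):

```python
def build_offload_payload(model, prompt, gpu_layers):
    """Ollama /api/generate body keeping only `gpu_layers` layers on the GPU;
    the remaining layers run from system RAM (slower, but the model fits)."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_gpu": gpu_layers},
    }

# e.g. keep roughly half of a 70B model's layers on a 24GB card:
payload = build_offload_payload("llama3.3:70b", "Hello", gpu_layers=40)
```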

No Dedicated GPU / CPU Inference

If you do not have a dedicated GPU, you can still run models using CPU inference, though it will be slower:

# Smaller models recommended for CPU inference
ollama pull llama3.2:3b-instruct-q4_K_M
ollama pull qwen2.5:3b-instruct-q4_K_M

We recommend at least 16GB of RAM for 3B models and 32GB for 7B models.

Quantization Levels Explained

Quantization compresses model weights from high precision (FP16) to lower precision to reduce VRAM usage:

| Quantization Level | Precision Loss | Size (vs FP16) | Recommendation                   |
|--------------------|----------------|----------------|----------------------------------|
| Q8_0               | Minimal        | ~50%           | Best choice when VRAM allows     |
| Q6_K               | Very small     | ~43%           | Good balance of quality and size |
| Q5_K_M             | Small          | ~37%           | Recommended                      |
| Q4_K_M             | Moderate       | ~30%           | Most commonly used, recommended  |
| Q3_K_M             | Noticeable     | ~23%           | Use when VRAM is tight           |
| Q2_K               | Significant    | ~18%           | Use only as a last resort        |

Rule of thumb: Q4_K_M offers the best value, and when VRAM permits, go with Q6_K or Q8_0.
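The size column follows directly from bits per weight. A rough estimator (the bits-per-weight figures are approximate averages for GGUF quantization formats, and real files add some overhead for embeddings and metadata):

```python
# Approximate average bits per weight for common GGUF quantizations
BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q8_0": 8.5,
    "Q6_K": 6.6,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.85,
    "Q3_K_M": 3.9,
    "Q2_K": 3.35,
}

def estimated_size_gb(params_billion, quant):
    """Rough on-disk / in-VRAM size of the weights alone, in GB."""
    bits = BITS_PER_WEIGHT[quant]
    return params_billion * 1e9 * bits / 8 / 1e9

# A 7B model at Q4_K_M lands around 4-5 GB, matching the tables above
size = estimated_size_gb(7, "Q4_K_M")
```

Remember that inference also needs VRAM beyond the weights for the KV cache, which grows with context length.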

OpenClaw Configuration

Connecting to Ollama

{
  models: {
    ollama: {
      provider: "ollama",
      baseUrl: "http://localhost:11434",    // Ollama default port
      defaultModel: "qwen2.5:14b-instruct-q4_K_M",
      parameters: {
        temperature: 0.7,
        maxTokens: 4096,
        numCtx: 8192,       // Context window size
      }
    }
  }
}

Connecting to LM Studio

{
  models: {
    lmstudio: {
      provider: "openai",                   // LM Studio is compatible with the OpenAI API
      baseUrl: "http://localhost:1234/v1",   // LM Studio default address
      apiKey: "lm-studio",                   // LM Studio does not validate the key
      defaultModel: "loaded-model",          // Uses the currently loaded model
    }
  }
}

Connecting to llama.cpp Server

{
  models: {
    llamacpp: {
      provider: "openai",                      // Compatible with the OpenAI API
      baseUrl: "http://localhost:8080/v1",
      apiKey: "none",
      defaultModel: "local-model",
    }
  }
}

Speed vs. Quality Trade-offs

Factors Affecting Inference Speed

  1. Memory bandwidth: RTX 4090 (1 TB/s) is much faster than RTX 4060 (272 GB/s)
  2. Model size: More parameters means slower inference
  3. Quantization level: Lower quantization is faster but reduces quality
  4. Context length: Longer conversations slow down inference
  5. Concurrency: Multiple simultaneous users reduce speed
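For single-stream generation, decoding is largely memory-bandwidth bound: producing each token streams essentially all the weights through the GPU once, so an upper bound on speed is bandwidth divided by model size. A back-of-envelope sketch (real throughput lands well below the bound because of KV-cache reads, kernel launch overhead, and the CPU side):

```python
def max_tokens_per_second(bandwidth_gb_s, model_size_gb):
    """Theoretical ceiling: one full pass over the weights per generated token."""
    return bandwidth_gb_s / model_size_gb

# RTX 4090 (~1000 GB/s) with a ~4.2 GB 7B Q4_K_M model:
ceiling = max_tokens_per_second(1000, 4.2)
# The measured 80-100 tokens/s below is comfortably under this ceiling.
```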

Speed Benchmarks (RTX 4090, Q4_K_M Quantization)

| Model                 | Generation Speed (tokens/s) | Experience     |
|-----------------------|-----------------------------|----------------|
| 3B                    | 120+                        | Extremely fast |
| 7B                    | 80-100                      | Very fast      |
| 14B                   | 45-60                       | Smooth         |
| 32B                   | 20-30                       | Acceptable     |
| 70B (partial offload) | 5-10                        | Slow           |

Recommended Models Summary

| Use Case               | Recommended Model             | Notes                             |
|------------------------|-------------------------------|-----------------------------------|
| Chinese conversation   | Qwen 2.5 (7B/14B/32B)         | Best Chinese capabilities         |
| English conversation   | Llama 3.1 8B / Llama 3.3 70B  | Excellent all-around performance  |
| Code generation        | DeepSeek Coder V2             | Specialized for coding            |
| Reasoning and analysis | DeepSeek R1 (32B)             | Chain-of-thought reasoning        |
| Multilingual           | Mistral (7B/22B)              | Well-balanced across languages    |
| Minimal resources      | Llama 3.2 3B                  | Smallest viable model             |

Troubleshooting

Cannot Connect to Ollama

# Check if Ollama is running
curl http://localhost:11434/api/version

# If not running, start it manually
ollama serve
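The same check can be scripted. A small standard-library Python probe that returns False instead of raising when Ollama is unreachable (the base URL is Ollama's default):

```python
import json
import urllib.request
import urllib.error

def ollama_is_up(base_url="http://localhost:11434", timeout=2.0):
    """Return True if Ollama answers /api/version, False otherwise."""
    try:
        with urllib.request.urlopen(base_url + "/api/version", timeout=timeout) as resp:
            return "version" in json.loads(resp.read())
    except (urllib.error.URLError, OSError, ValueError):
        return False

# if not ollama_is_up():
#     print("Ollama is not reachable - try running: ollama serve")
```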

Model Fails to Load

Error: model requires more memory than available

Solutions:

  • Use a lower quantization level (e.g., Q3_K_M)
  • Switch to a smaller model
  • Close other programs that are consuming VRAM

Garbled Chinese Text

Some models have poor Chinese support. We recommend using the Qwen or DeepSeek series, which have been specifically optimized for Chinese.

Summary

Local LLMs are the best choice for users who prioritize data privacy and zero-cost operation. Ollama is the most convenient solution for integration with OpenClaw. When VRAM allows, choose the largest model and the least aggressive quantization (Q6_K or Q8_0) that fit. For Chinese use cases, prioritize the Qwen series; for English, go with Llama. If your budget allows a GPU upgrade, a single RTX 4090 (24GB) is sufficient to run high-quality 32B-class models.

OpenClaw is a free, open-source personal AI assistant that supports WhatsApp, Telegram, Discord, and many more platforms