
OpenClaw Knowledge Base and RAG Retrieval-Augmented Generation Guide


Introduction

By default, AI models can only answer questions based on their training data and the current conversation context. In real-world scenarios, however, you often want the AI to reference company documents, product manuals, or personal notes when answering. RAG (Retrieval-Augmented Generation) is the key technology that solves this problem. This article walks you through building a knowledge base system for OpenClaw step by step.

1. How RAG Works

1.1 What Is RAG

The RAG workflow is as follows:

User asks a question
  │
  ├── ① Convert the question into a vector (Embedding)
  │
  ├── ② Search for the most relevant document chunks in the vector database
  │
  ├── ③ Inject the relevant document chunks as context into the prompt
  │
  └── ④ The AI model generates an answer based on the context
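The four steps above can be sketched end to end. This is a toy illustration, not OpenClaw's implementation: it uses a trivial word-count "embedding" and an in-memory list in place of a real embedding model and vector database, but the shape of the pipeline is the same.

```python
import math
from collections import Counter

def embed(text):
    """① Toy embedding: a word-count vector (a real system calls an embedding model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# ② Search the "vector store" for the most relevant chunk
chunks = [
    "Full-time employees receive 5 days of paid annual leave.",
    "The office coffee machine is cleaned every Friday.",
]
question = "How many days of annual leave do employees get?"
q_vec = embed(question)
best = max(chunks, key=lambda c: cosine(q_vec, embed(c)))

# ③ Inject the retrieved chunk into the prompt; ④ the model answers from it
prompt = f"Reference materials:\n{best}\n---\nAnswer based on the materials above:\n{question}"
```

The retrieval step correctly surfaces the leave-policy chunk because it shares the most vocabulary with the question; a real embedding model captures semantic similarity even without word overlap.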

1.2 RAG vs. Direct File Upload

| Approach | Pros | Cons |
|---|---|---|
| Paste text directly | Simple and straightforward | Limited by context window |
| RAG retrieval | Supports massive document volumes | Requires additional setup |
| Fine-tune the model | Knowledge is internalized | Expensive and inflexible |

For most teams, RAG offers the best cost-performance ratio.

2. Configuring the Knowledge Base Directory

2.1 Basic Configuration

// ~/.config/openclaw/openclaw.json5
{
  "knowledgeBase": {
    "enabled": true,
    // Knowledge base document directories (supports multiple)
    "directories": [
      {
        "path": "/data/knowledge/company-docs",
        "name": "Company Documents",
        "description": "Company policies, processes, standards, etc."
      },
      {
        "path": "/data/knowledge/product-docs",
        "name": "Product Documentation",
        "description": "Product user manuals, API documentation, etc."
      },
      {
        "path": "/home/user/notes",
        "name": "Personal Notes",
        "description": "Personal study notes and memos"
      }
    ],
    // File change monitoring
    "watchChanges": true,
    // Auto re-index after changes
    "autoReindex": true
  }
}

2.2 Creating Knowledge Base Directories

# Create document directories
sudo mkdir -p /data/knowledge/company-docs
sudo mkdir -p /data/knowledge/product-docs
sudo chown -R $USER:$USER /data/knowledge

# Add your documents
cp ~/Documents/company-policies.pdf /data/knowledge/company-docs/
cp ~/Documents/api-docs.md /data/knowledge/product-docs/
cp -r ~/Documents/technical-specs/ /data/knowledge/company-docs/

3. Supported File Formats

3.1 File Format Support List

| Format | Extension | Support Level | Notes |
|---|---|---|---|
| Markdown | .md | Full support | Recommended format, preserves structure |
| Plain text | .txt | Full support | Simple and direct |
| PDF | .pdf | Full support | Automatic text extraction |
| Word | .docx | Full support | Automatic text and table extraction |
| HTML | .html | Supported | Extracts body content |
| CSV | .csv | Supported | Tabular data |
| JSON | .json | Supported | Structured data |
| EPUB | .epub | Basic support | E-book content |
| Code files | .py/.js/.go etc. | Supported | Code comments and documentation |

3.2 File Format Configuration

{
  "knowledgeBase": {
    "fileTypes": {
      "include": [".md", ".txt", ".pdf", ".docx", ".html", ".csv"],
      "exclude": [".tmp", ".bak", ".log"],
      // Maximum single file size
      "maxFileSize": "10MB",
      // Directories to ignore
      "ignoreDirs": ["node_modules", ".git", "__pycache__"]
    }
  }
}
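The filter rules above are straightforward to mirror in code. The following is a minimal sketch (not OpenClaw's actual scanner) that applies the include/exclude extension lists, the ignored directories, and the size cap while walking a directory tree:

```python
import os

MAX_FILE_SIZE = 10 * 1024 * 1024  # the "10MB" cap from the config
INCLUDE = {".md", ".txt", ".pdf", ".docx", ".html", ".csv"}
EXCLUDE = {".tmp", ".bak", ".log"}
IGNORE_DIRS = {"node_modules", ".git", "__pycache__"}

def scan(root):
    """Yield files under root that pass the knowledge-base filters."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune ignored directories in place so os.walk never descends into them
        dirnames[:] = [d for d in dirnames if d not in IGNORE_DIRS]
        for name in filenames:
            ext = os.path.splitext(name)[1].lower()
            if ext in EXCLUDE or ext not in INCLUDE:
                continue
            path = os.path.join(dirpath, name)
            if os.path.getsize(path) <= MAX_FILE_SIZE:
                yield path
```

Pruning `dirnames` in place is the standard way to make `os.walk` skip whole subtrees, which matters when `node_modules` holds tens of thousands of files.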

3.3 PDF Processing Optimization

# Install PDF processing dependencies (if PDF parsing issues arise)
sudo apt install -y poppler-utils

# Test PDF text extraction
pdftotext /data/knowledge/company-docs/company-policies.pdf -

4. Vector Store Configuration

4.1 Built-In Vector Store (Default)

OpenClaw includes a lightweight built-in vector store engine, suitable for small to medium knowledge bases:

{
  "knowledgeBase": {
    "vectorStore": {
      "type": "builtin",
      "storagePath": "/var/lib/openclaw/vectors",
      // Vector dimensions (determined by the Embedding model)
      "dimensions": 1536,
      // Similarity algorithm
      "similarityMetric": "cosine"
    }
  }
}

4.2 Using ChromaDB

For larger knowledge bases, ChromaDB is recommended:

# Install ChromaDB
docker run -d \
  --name chromadb \
  --restart=always \
  -p 8000:8000 \
  -v chroma-data:/chroma/chroma \
  chromadb/chroma:latest

Then point OpenClaw at the running instance:

{
  "knowledgeBase": {
    "vectorStore": {
      "type": "chroma",
      "url": "http://localhost:8000",
      "collection": "openclaw_knowledge"
    }
  }
}

4.3 Using Qdrant

Qdrant provides more powerful filtering and search capabilities:

# Install Qdrant
docker run -d \
  --name qdrant \
  --restart=always \
  -p 6333:6333 \
  -v qdrant-data:/qdrant/storage \
  qdrant/qdrant:latest

Then reference it in the configuration:

{
  "knowledgeBase": {
    "vectorStore": {
      "type": "qdrant",
      "url": "http://localhost:6333",
      "collection": "openclaw_knowledge"
    }
  }
}

4.4 Vector Store Comparison

| Store | Document Scale | Query Speed | Deployment Complexity | Use Case |
|---|---|---|---|---|
| Built-in | < 10K chunks | Fast | No extra deployment needed | Personal / small team |
| ChromaDB | < 100K chunks | Fast | Simple | Medium scale |
| Qdrant | < 1M chunks | Very fast | Moderate | Large scale / production |

5. Chunking Strategies

Documents need to be split into appropriately sized chunks so they can be effectively indexed and retrieved.

5.1 Chunking Parameter Configuration

{
  "knowledgeBase": {
    "chunking": {
      // Chunking strategy
      "strategy": "recursive",
      // Target size per chunk (characters)
      "chunkSize": 1000,
      // Overlap between adjacent chunks
      "chunkOverlap": 200,
      // Separator priority (highest to lowest)
      "separators": ["\n## ", "\n### ", "\n\n", "\n", "。", ".", " "],
      // Preserve document metadata
      "preserveMetadata": true
    }
  }
}

5.2 Chunking Strategy Descriptions

| Strategy | Description | Use Case |
|---|---|---|
| fixed | Fixed character count split | Text without clear structure |
| recursive | Recursive split by separators | General documents (recommended) |
| markdown | Split by Markdown headings | Markdown files |
| semantic | Split by semantic boundaries | High-quality requirements |

5.3 Choosing the Right Chunk Size

Chunks too large (> 2000 characters):
  ✗ Low retrieval precision, includes irrelevant content
  ✗ Consumes more of the context window

Chunks too small (< 200 characters):
  ✗ Lacks context, information is incomplete
  ✗ Increases the number of retrieval queries

Recommended range: 500-1500 characters, with 100-300 character overlap
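These numbers are easier to reason about in code. Below is a simplified sketch of overlap chunking, a stand-in for the real recursive splitter: split on paragraph breaks, pack paragraphs into chunks up to chunkSize, and carry the tail of each finished chunk forward as overlap. (Paragraphs longer than chunkSize are left whole in this sketch.)

```python
def chunk_text(text, chunk_size=1000, overlap=200):
    """Split text into chunks of roughly chunk_size characters,
    each chunk (after the first) starting with the tail of the previous one."""
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        if current and len(current) + len(para) + 2 > chunk_size:
            chunks.append(current)
            current = current[-overlap:]  # carry overlap forward into the next chunk
        current = (current + "\n\n" + para) if current else para
    if current:
        chunks.append(current)
    return chunks
```

The overlap means a sentence that lands near a chunk boundary appears in both neighboring chunks, so a query matching it can retrieve either chunk with enough surrounding context.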

6. Embedding Model Configuration

6.1 Choosing an Embedding Model

{
  "knowledgeBase": {
    "embedding": {
      // Using OpenAI Embedding
      "provider": "openai",
      "model": "text-embedding-3-small",
      "dimensions": 1536,
      "batchSize": 100
    }
  }
}
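The batchSize setting controls how many chunks are sent to the embedding API per request, which trades round trips against request size. A hedged sketch of the batching loop, where embed_batch stands in for the real provider call:

```python
def batched(items, batch_size=100):
    """Yield successive slices of at most batch_size items."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

def embed_all(chunks, embed_batch, batch_size=100):
    """Embed every chunk, batch_size at a time, preserving input order."""
    vectors = []
    for batch in batched(chunks, batch_size):
        vectors.extend(embed_batch(batch))  # one API round trip per batch
    return vectors
```

With 1,468 chunks and a batchSize of 100, indexing makes 15 embedding requests instead of 1,468.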

6.2 Embedding Model Comparison

| Model | Dimensions | Quality | Speed | Cost |
|---|---|---|---|---|
| OpenAI text-embedding-3-large | 3072 | Highest | Fast | $0.13/1M tokens |
| OpenAI text-embedding-3-small | 1536 | High | Very fast | $0.02/1M tokens |
| Ollama nomic-embed-text | 768 | Medium | Medium | Free (local) |
| Ollama mxbai-embed-large | 1024 | Medium-High | Medium | Free (local) |

6.3 Using Local Embeddings (Ollama)

# Download the Embedding model
ollama pull nomic-embed-text

Then switch the Embedding provider to Ollama:

{
  "knowledgeBase": {
    "embedding": {
      "provider": "ollama",
      "model": "nomic-embed-text",
      "dimensions": 768,
      "baseUrl": "http://localhost:11434"
    }
  }
}

7. Index Management

7.1 Building the Index

# Build the knowledge base index for the first time
openclaw knowledge index

# Example output:
# Scanning directory: /data/knowledge/company-docs
# Files found: 45
# Processing... ████████████████████ 100%
# Vectors generated: 1,234 document chunks
# Indexing complete! Time elapsed: 2m 15s

# Update the index (only processes changed files)
openclaw knowledge index --update

# Rebuild the index (clear and rebuild from scratch)
openclaw knowledge index --rebuild
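The --update mode only re-processes files that changed since the last run. A minimal sketch of how such change detection can work, comparing modification times against those recorded during the previous indexing pass (OpenClaw's actual mechanism may differ, e.g. by hashing content):

```python
import os

def changed_files(root, last_index):
    """Return files whose mtime differs from last_index
    (a dict of path -> mtime captured during the previous run),
    which also catches files added since then."""
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if last_index.get(path) != os.path.getmtime(path):
                changed.append(path)
    return changed
```

Only the returned files need to be re-chunked and re-embedded, which is why incremental updates finish in seconds while a full rebuild can take minutes.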

7.2 Viewing Index Status

# View knowledge base status
openclaw knowledge status

# Example output:
# ┌──────────────────┬────────┬──────────┬──────────────┐
# │ Knowledge Base   │ Files  │ Chunks   │ Last Updated │
# ├──────────────────┼────────┼──────────┼──────────────┤
# │ Company Docs     │ 45     │ 892      │ 2026-04-09   │
# │ Product Docs     │ 23     │ 342      │ 2026-04-08   │
# │ Personal Notes   │ 67     │ 234      │ 2026-04-09   │
# ├──────────────────┼────────┼──────────┼──────────────┤
# │ Total            │ 135    │ 1,468    │              │
# └──────────────────┴────────┴──────────┴──────────────┘

7.3 Testing Retrieval Quality

# Test retrieval
openclaw knowledge search "What is the annual leave policy"

# Outputs retrieved document chunks with similarity scores
# [0.92] company-docs/employee-handbook.pdf (Chapter 3)
#   "Full-time employees are entitled to 5 days of paid annual leave after one year..."
# [0.87] company-docs/hr-policies.md
#   "Annual leave requests must be submitted at least 3 business days in advance..."

8. Retrieval Optimization

8.1 Query Optimization Configuration

{
  "knowledgeBase": {
    "retrieval": {
      // Return the top K most relevant results
      "topK": 5,
      // Minimum similarity threshold (results below this are discarded)
      "minScore": 0.7,
      // Enable query rewriting (AI optimizes search keywords)
      "queryRewrite": true,
      // Enable hybrid search (vector + keyword)
      "hybridSearch": true,
      "hybridWeight": 0.7  // Vector search weight
    }
  }
}
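A hybridWeight of 0.7 means the final score is a weighted blend of the two signals: 70% vector similarity, 30% keyword match. The following sketch shows the blend-filter-rank step; the keyword score here is assumed to be a simple pre-computed overlap score in [0, 1], whereas real systems often use BM25:

```python
def rank(results, top_k=5, min_score=0.7, w=0.7):
    """results: list of (chunk, vector_score, keyword_score), each score in [0, 1].
    Blend the scores, drop anything below min_score, return the top K."""
    scored = [(chunk, w * vs + (1 - w) * ks) for chunk, vs, ks in results]
    kept = [(c, s) for c, s in scored if s >= min_score]
    return sorted(kept, key=lambda cs: cs[1], reverse=True)[:top_k]
```

Raising minScore trims marginal matches (fewer, more relevant chunks reach the prompt); lowering it favors recall at the cost of noise.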

8.2 Injecting Retrieval Results into the Prompt

{
  "knowledgeBase": {
    "promptTemplate": {
      // Template for injecting retrieval results into the system prompt
      "prefix": "Please answer the user's question based on the following reference materials. If the reference materials do not contain relevant information, please say so honestly.\n\n---\nReference Materials:\n",
      "suffix": "\n---\n\nPlease answer based on the reference materials above:",
      // Maximum tokens to inject
      "maxTokens": 4000
    }
  }
}
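Putting the template together: the sketch below wraps the retrieved chunks in the prefix/suffix and stops adding chunks before the injected context exceeds maxTokens. It uses the rough 4-characters-per-token heuristic purely for illustration; OpenClaw's real token counting will differ.

```python
def build_prompt(chunks, question, prefix, suffix, max_tokens=4000):
    """Assemble the prompt, dropping lowest-ranked chunks that would
    push the injected context past the token budget."""
    body, used = [], 0
    for chunk in chunks:  # assumed sorted best-first by retrieval score
        tokens = len(chunk) // 4 + 1  # crude chars-to-tokens estimate
        if used + tokens > max_tokens:
            break
        body.append(chunk)
        used += tokens
    return prefix + "\n\n".join(body) + suffix + "\n" + question
```

Because chunks arrive best-first, truncation always discards the least relevant material, keeping the highest-scoring context inside the budget.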

9. Real-World Use Cases

9.1 Enterprise Internal Knowledge Base

# Place company documents in the knowledge base
/data/knowledge/company-docs/
├── employee-handbook.pdf
├── expense-reimbursement.md
├── technical-specs/
│   ├── coding-standards.md
│   ├── deployment-process.md
│   └── security-standards.md
└── product-docs/
    ├── user-guide.pdf
    └── api-docs.md

When a user asks in Telegram: "What is the reimbursement process?", OpenClaw automatically retrieves the relevant documents and provides an accurate answer.

9.2 Personal Notes Assistant

# Connect Obsidian notes directly
/home/user/notes/
├── tech-learning/
│   ├── docker-intro.md
│   ├── kubernetes-notes.md
│   └── linux-command-reference.md
├── book-notes/
└── project-records/

9.3 Customer Service Knowledge Base

{
  "knowledgeBase": {
    "directories": [
      {
        "path": "/data/knowledge/faq",
        "name": "FAQ",
        "description": "Frequently asked questions and standard answers"
      },
      {
        "path": "/data/knowledge/product-manual",
        "name": "Product Manual",
        "description": "Product feature descriptions and usage tutorials"
      }
    ]
  }
}

10. Maintenance and Best Practices

10.1 Document Quality Affects Retrieval Quality

  • Keep documents well-structured (use headings, lists, etc.)
  • Avoid large amounts of duplicate content
  • Regularly clean out outdated documents
  • Markdown format yields the best chunking results

10.2 Regular Maintenance

# Rebuild the index weekly (cron)
0 3 * * 0 openclaw knowledge index --rebuild >> /var/log/openclaw-index.log 2>&1

# Check index health
openclaw knowledge status

10.3 Performance Optimization

| Optimization Target | Method |
|---|---|
| Slow indexing | Increase batchSize, use a faster Embedding model |
| Inaccurate retrieval | Adjust chunkSize and overlap, enable hybrid search |
| Irrelevant results | Increase the minScore threshold, reduce topK |
| High memory usage | Use an external vector store (ChromaDB/Qdrant) |

By building a knowledge base and RAG system, your OpenClaw will evolve from a general-purpose AI assistant into a dedicated knowledge Q&A expert, capable of accurately answering questions based on your own documents.
