
OpenClaw Knowledge Base and RAG Retrieval-Augmented Generation Guide


Introduction

By default, AI models can only answer questions based on their training data and the current conversation context. In real-world scenarios, however, you often want the AI to reference company documents, product manuals, or personal notes when answering. RAG (Retrieval-Augmented Generation) is the key technology that solves this problem. This article walks you through building a knowledge base system for OpenClaw step by step.

1. How RAG Works

1.1 What Is RAG

The RAG workflow is as follows:

User asks a question
  │
  ├── ① Convert the question into a vector (Embedding)
  │
  ├── ② Search for the most relevant document chunks in the vector database
  │
  ├── ③ Inject the relevant document chunks as context into the prompt
  │
  └── ④ The AI model generates an answer based on the context
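The four steps above can be sketched end to end. This is a toy illustration, not OpenClaw's implementation: it uses a trivial word-count "embedding" and an in-memory list in place of a real embedding model and vector database, but the shape of the pipeline is the same.

```python
import math
from collections import Counter

def embed(text):
    """① Toy embedding: a word-count vector (a real system calls an embedding model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# ② Search the "vector store" for the most relevant chunk
chunks = [
    "Full-time employees receive 5 days of paid annual leave.",
    "The office coffee machine is cleaned every Friday.",
]
question = "How many days of annual leave do employees get?"
q_vec = embed(question)
best = max(chunks, key=lambda c: cosine(q_vec, embed(c)))

# ③ Inject the retrieved chunk into the prompt; ④ the model answers from it
prompt = f"Reference materials:\n{best}\n---\nAnswer based on the materials above:\n{question}"
```

The retrieval step correctly surfaces the leave-policy chunk because it shares the most vocabulary with the question; a real embedding model captures semantic similarity even without word overlap.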

1.2 RAG vs. Direct File Upload

| Approach | Pros | Cons |
|---|---|---|
| Paste text directly | Simple and straightforward | Limited by context window |
| RAG retrieval | Supports massive document volumes | Requires additional setup |
| Fine-tune the model | Knowledge is internalized | Expensive and inflexible |

For most teams, RAG offers the best cost-performance ratio.

2. Configuring the Knowledge Base Directory

2.1 Basic Configuration

// ~/.config/openclaw/openclaw.json5
{
  "knowledgeBase": {
    "enabled": true,
    // Knowledge base document directories (supports multiple)
    "directories": [
      {
        "path": "/data/knowledge/company-docs",
        "name": "Company Documents",
        "description": "Company policies, processes, standards, etc."
      },
      {
        "path": "/data/knowledge/product-docs",
        "name": "Product Documentation",
        "description": "Product user manuals, API documentation, etc."
      },
      {
        "path": "/home/user/notes",
        "name": "Personal Notes",
        "description": "Personal study notes and memos"
      }
    ],
    // File change monitoring
    "watchChanges": true,
    // Auto re-index after changes
    "autoReindex": true
  }
}

2.2 Creating Knowledge Base Directories

# Create document directories
sudo mkdir -p /data/knowledge/company-docs
sudo mkdir -p /data/knowledge/product-docs
sudo chown -R $USER:$USER /data/knowledge

# Add your documents
cp ~/Documents/company-policies.pdf /data/knowledge/company-docs/
cp ~/Documents/api-docs.md /data/knowledge/product-docs/
cp -r ~/Documents/technical-specs/ /data/knowledge/company-docs/

3. Supported File Formats

3.1 File Format Support List

| Format | Extension | Support Level | Notes |
|---|---|---|---|
| Markdown | .md | Full support | Recommended format, preserves structure |
| Plain text | .txt | Full support | Simple and direct |
| PDF | .pdf | Full support | Automatic text extraction |
| Word | .docx | Full support | Automatic text and table extraction |
| HTML | .html | Supported | Extracts body content |
| CSV | .csv | Supported | Tabular data |
| JSON | .json | Supported | Structured data |
| EPUB | .epub | Basic support | E-book content |
| Code files | .py/.js/.go etc. | Supported | Code comments and documentation |

3.2 File Format Configuration

{
  "knowledgeBase": {
    "fileTypes": {
      "include": [".md", ".txt", ".pdf", ".docx", ".html", ".csv"],
      "exclude": [".tmp", ".bak", ".log"],
      // Maximum single file size
      "maxFileSize": "10MB",
      // Directories to ignore
      "ignoreDirs": ["node_modules", ".git", "__pycache__"]
    }
  }
}
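The filter rules above are straightforward to mirror in code. The following is a minimal sketch (not OpenClaw's actual scanner) that applies the include/exclude extension lists, the ignored directories, and the size cap while walking a directory tree:

```python
import os

MAX_FILE_SIZE = 10 * 1024 * 1024  # the "10MB" cap from the config
INCLUDE = {".md", ".txt", ".pdf", ".docx", ".html", ".csv"}
EXCLUDE = {".tmp", ".bak", ".log"}
IGNORE_DIRS = {"node_modules", ".git", "__pycache__"}

def scan(root):
    """Yield files under root that pass the knowledge-base filters."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune ignored directories in place so os.walk never descends into them
        dirnames[:] = [d for d in dirnames if d not in IGNORE_DIRS]
        for name in filenames:
            ext = os.path.splitext(name)[1].lower()
            if ext in EXCLUDE or ext not in INCLUDE:
                continue
            path = os.path.join(dirpath, name)
            if os.path.getsize(path) <= MAX_FILE_SIZE:
                yield path
```

Pruning `dirnames` in place is the standard way to make `os.walk` skip whole subtrees, which matters when `node_modules` holds tens of thousands of files.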

3.3 PDF Processing Optimization

# Install PDF processing dependencies (if PDF parsing issues arise)
sudo apt install -y poppler-utils

# Test PDF text extraction
pdftotext /data/knowledge/company-docs/company-policies.pdf -

4. Vector Store Configuration

4.1 Built-In Vector Store (Default)

OpenClaw includes a lightweight built-in vector store engine, suitable for small to medium knowledge bases:

{
  "knowledgeBase": {
    "vectorStore": {
      "type": "builtin",
      "storagePath": "/var/lib/openclaw/vectors",
      // Vector dimensions (determined by the Embedding model)
      "dimensions": 1536,
      // Similarity algorithm
      "similarityMetric": "cosine"
    }
  }
}

4.2 Using ChromaDB

For larger knowledge bases, ChromaDB is recommended:

# Install ChromaDB
docker run -d \
  --name chromadb \
  --restart=always \
  -p 8000:8000 \
  -v chroma-data:/chroma/chroma \
  chromadb/chroma:latest

Then point OpenClaw at the running instance:

{
  "knowledgeBase": {
    "vectorStore": {
      "type": "chroma",
      "url": "http://localhost:8000",
      "collection": "openclaw_knowledge"
    }
  }
}

4.3 Using Qdrant

Qdrant provides more powerful filtering and search capabilities:

# Install Qdrant
docker run -d \
  --name qdrant \
  --restart=always \
  -p 6333:6333 \
  -v qdrant-data:/qdrant/storage \
  qdrant/qdrant:latest

Then reference it in the configuration:

{
  "knowledgeBase": {
    "vectorStore": {
      "type": "qdrant",
      "url": "http://localhost:6333",
      "collection": "openclaw_knowledge"
    }
  }
}

4.4 Vector Store Comparison

| Store | Document Scale | Query Speed | Deployment Complexity | Use Case |
|---|---|---|---|---|
| Built-in | < 10K chunks | Fast | No extra deployment needed | Personal / small team |
| ChromaDB | < 100K chunks | Fast | Simple | Medium scale |
| Qdrant | < 1M chunks | Very fast | Moderate | Large scale / production |

5. Chunking Strategies

Documents need to be split into appropriately sized chunks so they can be effectively indexed and retrieved.

5.1 Chunking Parameter Configuration

{
  "knowledgeBase": {
    "chunking": {
      // Chunking strategy
      "strategy": "recursive",
      // Target size per chunk (characters)
      "chunkSize": 1000,
      // Overlap between adjacent chunks
      "chunkOverlap": 200,
      // Separator priority (highest to lowest)
      "separators": ["\n## ", "\n### ", "\n\n", "\n", "。", ".", " "],
      // Preserve document metadata
      "preserveMetadata": true
    }
  }
}

5.2 Chunking Strategy Descriptions

| Strategy | Description | Use Case |
|---|---|---|
| fixed | Fixed character count split | Text without clear structure |
| recursive | Recursive split by separators | General documents (recommended) |
| markdown | Split by Markdown headings | Markdown files |
| semantic | Split by semantic boundaries | High-quality requirements |

5.3 Choosing the Right Chunk Size

Chunks too large (> 2000 characters):
  ✗ Low retrieval precision, includes irrelevant content
  ✗ Consumes more of the context window

Chunks too small (< 200 characters):
  ✗ Lacks context, information is incomplete
  ✗ Increases the number of retrieval queries

Recommended range: 500-1500 characters, with 100-300 character overlap
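These numbers are easier to reason about in code. Below is a simplified sketch of overlap chunking, a stand-in for the real recursive splitter: split on paragraph breaks, pack paragraphs into chunks up to chunkSize, and carry the tail of each finished chunk forward as overlap. (Paragraphs longer than chunkSize are left whole in this sketch.)

```python
def chunk_text(text, chunk_size=1000, overlap=200):
    """Split text into chunks of roughly chunk_size characters,
    each chunk (after the first) starting with the tail of the previous one."""
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        if current and len(current) + len(para) + 2 > chunk_size:
            chunks.append(current)
            current = current[-overlap:]  # carry overlap forward into the next chunk
        current = (current + "\n\n" + para) if current else para
    if current:
        chunks.append(current)
    return chunks
```

The overlap means a sentence that lands near a chunk boundary appears in both neighboring chunks, so a query matching it can retrieve either chunk with enough surrounding context.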

6. Embedding Model Configuration

6.1 Choosing an Embedding Model

{
  "knowledgeBase": {
    "embedding": {
      // Using OpenAI Embedding
      "provider": "openai",
      "model": "text-embedding-3-small",
      "dimensions": 1536,
      "batchSize": 100
    }
  }
}
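The batchSize setting controls how many chunks are sent to the embedding API per request, which trades round trips against request size. A hedged sketch of the batching loop, where embed_batch stands in for the real provider call:

```python
def batched(items, batch_size=100):
    """Yield successive slices of at most batch_size items."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

def embed_all(chunks, embed_batch, batch_size=100):
    """Embed every chunk, batch_size at a time, preserving input order."""
    vectors = []
    for batch in batched(chunks, batch_size):
        vectors.extend(embed_batch(batch))  # one API round trip per batch
    return vectors
```

With 1,468 chunks and a batchSize of 100, indexing makes 15 embedding requests instead of 1,468.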

6.2 Embedding Model Comparison

| Model | Dimensions | Quality | Speed | Cost |
|---|---|---|---|---|
| OpenAI text-embedding-3-large | 3072 | Highest | Fast | $0.13/1M tokens |
| OpenAI text-embedding-3-small | 1536 | High | Very fast | $0.02/1M tokens |
| Ollama nomic-embed-text | 768 | Medium | Medium | Free (local) |
| Ollama mxbai-embed-large | 1024 | Medium-High | Medium | Free (local) |

6.3 Using Local Embeddings (Ollama)

# Download the Embedding model
ollama pull nomic-embed-text

Then switch the Embedding provider to Ollama:

{
  "knowledgeBase": {
    "embedding": {
      "provider": "ollama",
      "model": "nomic-embed-text",
      "dimensions": 768,
      "baseUrl": "http://localhost:11434"
    }
  }
}

7. Index Management

7.1 Building the Index

# Build the knowledge base index for the first time
openclaw knowledge index

# Example output:
# Scanning directory: /data/knowledge/company-docs
# Files found: 45
# Processing... ████████████████████ 100%
# Vectors generated: 1,234 document chunks
# Indexing complete! Time elapsed: 2m 15s

# Update the index (only processes changed files)
openclaw knowledge index --update

# Rebuild the index (clear and rebuild from scratch)
openclaw knowledge index --rebuild
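The --update mode only re-processes files that changed since the last run. A minimal sketch of how such change detection can work, comparing modification times against those recorded during the previous indexing pass (OpenClaw's actual mechanism may differ, e.g. by hashing content):

```python
import os

def changed_files(root, last_index):
    """Return files whose mtime differs from last_index
    (a dict of path -> mtime captured during the previous run),
    which also catches files added since then."""
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if last_index.get(path) != os.path.getmtime(path):
                changed.append(path)
    return changed
```

Only the returned files need to be re-chunked and re-embedded, which is why incremental updates finish in seconds while a full rebuild can take minutes.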

7.2 Viewing Index Status

# View knowledge base status
openclaw knowledge status

# Example output:
# ┌──────────────────┬────────┬──────────┬──────────────┐
# │ Knowledge Base   │ Files  │ Chunks   │ Last Updated │
# ├──────────────────┼────────┼──────────┼──────────────┤
# │ Company Docs     │ 45     │ 892      │ 2026-04-09   │
# │ Product Docs     │ 23     │ 342      │ 2026-04-08   │
# │ Personal Notes   │ 67     │ 234      │ 2026-04-09   │
# ├──────────────────┼────────┼──────────┼──────────────┤
# │ Total            │ 135    │ 1,468    │              │
# └──────────────────┴────────┴──────────┴──────────────┘

7.3 Testing Retrieval Quality

# Test retrieval
openclaw knowledge search "What is the annual leave policy"

# Outputs retrieved document chunks with similarity scores
# [0.92] company-docs/employee-handbook.pdf (Chapter 3)
#   "Full-time employees are entitled to 5 days of paid annual leave after one year..."
# [0.87] company-docs/hr-policies.md
#   "Annual leave requests must be submitted at least 3 business days in advance..."

8. Retrieval Optimization

8.1 Query Optimization Configuration

{
  "knowledgeBase": {
    "retrieval": {
      // Return the top K most relevant results
      "topK": 5,
      // Minimum similarity threshold (results below this are discarded)
      "minScore": 0.7,
      // Enable query rewriting (AI optimizes search keywords)
      "queryRewrite": true,
      // Enable hybrid search (vector + keyword)
      "hybridSearch": true,
      "hybridWeight": 0.7  // Vector search weight
    }
  }
}
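A hybridWeight of 0.7 means the final score is a weighted blend of the two signals: 70% vector similarity, 30% keyword match. The following sketch shows the blend-filter-rank step; the keyword score here is assumed to be a simple pre-computed overlap score in [0, 1], whereas real systems often use BM25:

```python
def rank(results, top_k=5, min_score=0.7, w=0.7):
    """results: list of (chunk, vector_score, keyword_score), each score in [0, 1].
    Blend the scores, drop anything below min_score, return the top K."""
    scored = [(chunk, w * vs + (1 - w) * ks) for chunk, vs, ks in results]
    kept = [(c, s) for c, s in scored if s >= min_score]
    return sorted(kept, key=lambda cs: cs[1], reverse=True)[:top_k]
```

Raising minScore trims marginal matches (fewer, more relevant chunks reach the prompt); lowering it favors recall at the cost of noise.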

8.2 Injecting Retrieval Results into the Prompt

{
  "knowledgeBase": {
    "promptTemplate": {
      // Template for injecting retrieval results into the system prompt
      "prefix": "Please answer the user's question based on the following reference materials. If the reference materials do not contain relevant information, please say so honestly.\n\n---\nReference Materials:\n",
      "suffix": "\n---\n\nPlease answer based on the reference materials above:",
      // Maximum tokens to inject
      "maxTokens": 4000
    }
  }
}
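Putting the template together: the sketch below wraps the retrieved chunks in the prefix/suffix and stops adding chunks before the injected context exceeds maxTokens. It uses the rough 4-characters-per-token heuristic purely for illustration; OpenClaw's real token counting will differ.

```python
def build_prompt(chunks, question, prefix, suffix, max_tokens=4000):
    """Assemble the prompt, dropping lowest-ranked chunks that would
    push the injected context past the token budget."""
    body, used = [], 0
    for chunk in chunks:  # assumed sorted best-first by retrieval score
        tokens = len(chunk) // 4 + 1  # crude chars-to-tokens estimate
        if used + tokens > max_tokens:
            break
        body.append(chunk)
        used += tokens
    return prefix + "\n\n".join(body) + suffix + "\n" + question
```

Because chunks arrive best-first, truncation always discards the least relevant material, keeping the highest-scoring context inside the budget.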

9. Real-World Use Cases

9.1 Enterprise Internal Knowledge Base

# Place company documents in the knowledge base
/data/knowledge/company-docs/
├── employee-handbook.pdf
├── expense-reimbursement.md
├── technical-specs/
│   ├── coding-standards.md
│   ├── deployment-process.md
│   └── security-standards.md
└── product-docs/
    ├── user-guide.pdf
    └── api-docs.md

When a user asks in Telegram: "What is the reimbursement process?", OpenClaw automatically retrieves the relevant documents and provides an accurate answer.

9.2 Personal Notes Assistant

# Connect Obsidian notes directly
/home/user/notes/
├── tech-learning/
│   ├── docker-intro.md
│   ├── kubernetes-notes.md
│   └── linux-command-reference.md
├── book-notes/
└── project-records/

9.3 Customer Service Knowledge Base

{
  "knowledgeBase": {
    "directories": [
      {
        "path": "/data/knowledge/faq",
        "name": "FAQ",
        "description": "Frequently asked questions and standard answers"
      },
      {
        "path": "/data/knowledge/product-manual",
        "name": "Product Manual",
        "description": "Product feature descriptions and usage tutorials"
      }
    ]
  }
}

10. Maintenance and Best Practices

10.1 Document Quality Affects Retrieval Quality

  • Keep documents well-structured (use headings, lists, etc.)
  • Avoid large amounts of duplicate content
  • Regularly clean out outdated documents
  • Markdown format yields the best chunking results

10.2 Regular Maintenance

# Rebuild the index weekly (cron)
0 3 * * 0 openclaw knowledge index --rebuild >> /var/log/openclaw-index.log 2>&1

# Check index health
openclaw knowledge status

10.3 Performance Optimization

| Optimization Target | Method |
|---|---|
| Slow indexing | Increase batchSize, use a faster Embedding model |
| Inaccurate retrieval | Adjust chunkSize and overlap, enable hybrid search |
| Irrelevant results | Increase the minScore threshold, reduce topK |
| High memory usage | Use an external vector store (ChromaDB/Qdrant) |

By building a knowledge base and RAG system, your OpenClaw will evolve from a general-purpose AI assistant into a dedicated knowledge Q&A expert, capable of accurately answering questions based on your own documents.
