Introduction
By default, AI models can only answer questions based on their training data and the current conversation context. In real-world scenarios, however, you often want the AI to reference company documents, product manuals, or personal notes when answering. RAG (Retrieval-Augmented Generation) is the key technology that solves this problem. This article walks you through building a knowledge base system for OpenClaw step by step.
1. How RAG Works
1.1 What Is RAG
The RAG workflow is as follows:
User asks a question
│
├── ① Convert the question into a vector (Embedding)
│
├── ② Search for the most relevant document chunks in the vector database
│
├── ③ Inject the relevant document chunks as context into the prompt
│
└── ④ The AI model generates an answer based on the context
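The four steps above can be sketched end to end. A toy walkthrough, using a bag-of-words vector as a stand-in for a real embedding model (`embed`, `retrieve`, and `build_prompt` here are hypothetical helpers for illustration, not OpenClaw APIs):

```python
# Toy RAG pipeline: bag-of-words "embedding" + cosine retrieval.
# Illustrative only; real systems use a neural embedding model.
from collections import Counter
import math

def embed(text):
    """Step 1: turn text into a vector (here: sparse word counts)."""
    return Counter(text.lower().split())

def similarity(a, b):
    """Cosine similarity over sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = [
    "annual leave policy is five days",
    "deployment requires two approvals",
]
index = [(d, embed(d)) for d in docs]  # index the document chunks up front

def retrieve(question, k=1):
    """Step 2: find the chunks most similar to the question."""
    qv = embed(question)
    return sorted(index, key=lambda t: similarity(qv, t[1]), reverse=True)[:k]

def build_prompt(question):
    """Step 3: inject retrieved chunks as context; step 4 sends this to the model."""
    context = "\n".join(doc for doc, _ in retrieve(question))
    return f"Context:\n{context}\n\nQuestion: {question}"
```

Asking "how many days of annual leave" retrieves the leave-policy chunk, not the deployment one, because only its words overlap with the question.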
1.2 RAG vs. Direct File Upload
| Approach | Pros | Cons |
|---|---|---|
| Paste text directly | Simple and straightforward | Limited by context window |
| RAG retrieval | Supports massive document volumes | Requires additional setup |
| Fine-tune the model | Knowledge is internalized | Expensive and inflexible |
For most teams, RAG offers the best cost-performance ratio.
2. Configuring the Knowledge Base Directory
2.1 Basic Configuration
// ~/.config/openclaw/openclaw.json5
{
  "knowledgeBase": {
    "enabled": true,
    // Knowledge base document directories (supports multiple)
    "directories": [
      {
        "path": "/data/knowledge/company-docs",
        "name": "Company Documents",
        "description": "Company policies, processes, standards, etc."
      },
      {
        "path": "/data/knowledge/product-docs",
        "name": "Product Documentation",
        "description": "Product user manuals, API documentation, etc."
      },
      {
        "path": "/home/user/notes",
        "name": "Personal Notes",
        "description": "Personal study notes and memos"
      }
    ],
    // File change monitoring
    "watchChanges": true,
    // Auto re-index after changes
    "autoReindex": true
  }
}
2.2 Creating Knowledge Base Directories
# Create document directories
sudo mkdir -p /data/knowledge/company-docs
sudo mkdir -p /data/knowledge/product-docs
sudo chown -R $USER:$USER /data/knowledge
# Add your documents
cp ~/Documents/company-policies.pdf /data/knowledge/company-docs/
cp ~/Documents/api-docs.md /data/knowledge/product-docs/
cp -r ~/Documents/technical-specs/ /data/knowledge/company-docs/
3. Supported File Formats
3.1 File Format Support List
| Format | Extension | Support Level | Notes |
|---|---|---|---|
| Markdown | .md | Full support | Recommended format, preserves structure |
| Plain text | .txt | Full support | Simple and direct |
| PDF | .pdf | Full support | Automatic text extraction |
| Word | .docx | Full support | Automatic text and table extraction |
| HTML | .html | Supported | Extracts body content |
| CSV | .csv | Supported | Tabular data |
| JSON | .json | Supported | Structured data |
| EPUB | .epub | Basic support | E-book content |
| Code files | .py/.js/.go etc. | Supported | Code comments and documentation |
3.2 File Format Configuration
{
  "knowledgeBase": {
    "fileTypes": {
      "include": [".md", ".txt", ".pdf", ".docx", ".html", ".csv"],
      "exclude": [".tmp", ".bak", ".log"],
      // Maximum single file size
      "maxFileSize": "10MB",
      // Directories to ignore
      "ignoreDirs": ["node_modules", ".git", "__pycache__"]
    }
  }
}
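An indexer honoring these filters would walk the directory tree roughly like this. A sketch: the constants mirror the config above, and `eligible_files` is a hypothetical helper, not part of OpenClaw:

```python
import os
from pathlib import Path

# Mirrors the fileTypes config above (hypothetical constants).
INCLUDE = {".md", ".txt", ".pdf", ".docx", ".html", ".csv"}
EXCLUDE = {".tmp", ".bak", ".log"}       # guards if INCLUDE is broadened
IGNORE_DIRS = {"node_modules", ".git", "__pycache__"}
MAX_FILE_SIZE = 10 * 1024 * 1024         # "10MB"

def eligible_files(root):
    """Yield files an indexer honoring the filters above might pick up."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune ignored directories in place so os.walk never descends into them
        dirnames[:] = [d for d in dirnames if d not in IGNORE_DIRS]
        for name in filenames:
            p = Path(dirpath) / name
            if (p.suffix in INCLUDE and p.suffix not in EXCLUDE
                    and p.stat().st_size <= MAX_FILE_SIZE):
                yield p
```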
3.3 PDF Processing Optimization
# Install PDF processing dependencies (if PDF parsing issues arise)
sudo apt install -y poppler-utils
# Test PDF text extraction
pdftotext /data/knowledge/company-docs/policies.pdf -
4. Vector Store Configuration
4.1 Built-In Vector Store (Default)
OpenClaw includes a lightweight built-in vector store engine, suitable for small to medium knowledge bases:
{
  "knowledgeBase": {
    "vectorStore": {
      "type": "builtin",
      "storagePath": "/var/lib/openclaw/vectors",
      // Vector dimensions (determined by the Embedding model)
      "dimensions": 1536,
      // Similarity algorithm
      "similarityMetric": "cosine"
    }
  }
}
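The "cosine" metric scores how closely two embedding directions align, ignoring vector length. A quick sketch of the math:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity: 1.0 for identical directions, 0.0 for orthogonal ones."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```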
4.2 Using ChromaDB
For larger knowledge bases, ChromaDB is recommended:
# Install ChromaDB
docker run -d \
--name chromadb \
--restart=always \
-p 8000:8000 \
-v chroma-data:/chroma/chroma \
chromadb/chroma:latest
{
  "knowledgeBase": {
    "vectorStore": {
      "type": "chroma",
      "url": "http://localhost:8000",
      "collection": "openclaw_knowledge"
    }
  }
}
4.3 Using Qdrant
Qdrant provides more powerful filtering and search capabilities:
# Install Qdrant
docker run -d \
--name qdrant \
--restart=always \
-p 6333:6333 \
-v qdrant-data:/qdrant/storage \
qdrant/qdrant:latest
{
  "knowledgeBase": {
    "vectorStore": {
      "type": "qdrant",
      "url": "http://localhost:6333",
      "collection": "openclaw_knowledge"
    }
  }
}
4.4 Vector Store Comparison
| Store | Document Scale | Query Speed | Deployment Complexity | Use Case |
|---|---|---|---|---|
| Built-in | < 10K chunks | Fast | No extra deployment needed | Personal / small team |
| ChromaDB | < 100K chunks | Fast | Simple | Medium scale |
| Qdrant | < 1M chunks | Very fast | Moderate | Large scale / production |
5. Chunking Strategies
Documents need to be split into appropriately sized chunks so they can be effectively indexed and retrieved.
5.1 Chunking Parameter Configuration
{
  "knowledgeBase": {
    "chunking": {
      // Chunking strategy
      "strategy": "recursive",
      // Target size per chunk (characters)
      "chunkSize": 1000,
      // Overlap between adjacent chunks
      "chunkOverlap": 200,
      // Separator priority (highest to lowest)
      "separators": ["\n## ", "\n### ", "\n\n", "\n", "。", ".", " "],
      // Preserve document metadata
      "preserveMetadata": true
    }
  }
}
5.2 Chunking Strategy Descriptions
| Strategy | Description | Use Case |
|---|---|---|
| fixed | Fixed character count split | Text without clear structure |
| recursive | Recursive split by separators | General documents (recommended) |
| markdown | Split by Markdown headings | Markdown files |
| semantic | Split by semantic boundaries | High-quality requirements |
5.3 Choosing the Right Chunk Size
Chunks too large (> 2000 characters):
✗ Low retrieval precision, includes irrelevant content
✗ Consumes more of the context window
Chunks too small (< 200 characters):
✗ Lacks context, information is incomplete
✗ Increases the number of retrieval queries
Recommended range: 500-1500 characters, with 100-300 character overlap
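A simplified sketch of the recursive strategy: try the highest-priority separator first, fall through to finer ones, and hard-split only as a last resort. `split_text` is hypothetical, not OpenClaw's implementation, and overlap here is only applied in the hard-split fallback (real implementations also overlap at separator boundaries):

```python
def split_text(text, chunk_size=1000, overlap=200, seps=("\n\n", "\n", " ")):
    """Recursive split: pieces that are still too large fall through to
    finer separators, and finally to a hard character split with overlap."""
    if len(text) <= chunk_size:
        return [text]
    sep = next((s for s in seps if s in text), None)
    if sep is None:  # no separator left: overlapping fixed-size windows
        step = max(1, chunk_size - overlap)
        return [text[i:i + chunk_size] for i in range(0, len(text), step)]
    finer = tuple(s for s in seps if s != sep)
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = f"{current}{sep}{piece}" if current else piece
        if len(candidate) <= chunk_size:
            current = candidate      # greedily grow the current chunk
            continue
        if current:
            chunks.append(current)
        if len(piece) > chunk_size:  # piece alone is too big: recurse
            chunks.extend(split_text(piece, chunk_size, overlap, finer))
            current = ""
        else:
            current = piece
    if current:
        chunks.append(current)
    return chunks
```

Every returned chunk is at most `chunk_size` characters, and no content is lost: joining the chunks with the chosen separator reconstructs the input.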
6. Embedding Model Configuration
6.1 Choosing an Embedding Model
{
  "knowledgeBase": {
    "embedding": {
      // Using OpenAI Embedding
      "provider": "openai",
      "model": "text-embedding-3-small",
      "dimensions": 1536,
      "batchSize": 100
    }
  }
}
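The `batchSize` option bounds how many chunks go into a single embedding request; larger batches mean fewer round trips at the cost of bigger payloads. The grouping itself is straightforward (a generic sketch, not OpenClaw's internal code):

```python
def batched(items, batch_size=100):
    """Yield successive slices of at most batch_size items,
    one slice per embedding API call."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]
```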
6.2 Embedding Model Comparison
| Model | Dimensions | Quality | Speed | Cost |
|---|---|---|---|---|
| OpenAI text-embedding-3-large | 3072 | Highest | Fast | $0.13/1M tokens |
| OpenAI text-embedding-3-small | 1536 | High | Very fast | $0.02/1M tokens |
| Ollama nomic-embed-text | 768 | Medium | Medium | Free (local) |
| Ollama mxbai-embed-large | 1024 | Medium-High | Medium | Free (local) |
6.3 Using Local Embeddings (Ollama)
# Download the Embedding model
ollama pull nomic-embed-text
{
  "knowledgeBase": {
    "embedding": {
      "provider": "ollama",
      "model": "nomic-embed-text",
      "dimensions": 768,
      "baseUrl": "http://localhost:11434"
    }
  }
}
7. Index Management
7.1 Building the Index
# Build the knowledge base index for the first time
openclaw knowledge index
# Example output:
# Scanning directory: /data/knowledge/company-docs
# Files found: 45
# Processing... ████████████████████ 100%
# Vectors generated: 1,234 document chunks
# Indexing complete! Time elapsed: 2m 15s
# Update the index (only processes changed files)
openclaw knowledge index --update
# Rebuild the index (clear and rebuild from scratch)
openclaw knowledge index --rebuild
7.2 Viewing Index Status
# View knowledge base status
openclaw knowledge status
# Example output:
# ┌──────────────────┬────────┬──────────┬──────────────┐
# │ Knowledge Base │ Files │ Chunks │ Last Updated │
# ├──────────────────┼────────┼──────────┼──────────────┤
# │ Company Docs │ 45 │ 892 │ 2026-04-09 │
# │ Product Docs │ 23 │ 342 │ 2026-04-08 │
# │ Personal Notes │ 67 │ 234 │ 2026-04-09 │
# ├──────────────────┼────────┼──────────┼──────────────┤
# │ Total │ 135 │ 1,468 │ │
# └──────────────────┴────────┴──────────┴──────────────┘
7.3 Testing Retrieval Quality
# Test retrieval
openclaw knowledge search "What is the annual leave policy"
# Outputs retrieved document chunks with similarity scores
# [0.92] company-docs/employee-handbook.pdf (Chapter 3)
# "Full-time employees are entitled to 5 days of paid annual leave after one year..."
# [0.87] company-docs/hr-policies.md
# "Annual leave requests must be submitted at least 3 business days in advance..."
8. Retrieval Optimization
8.1 Query Optimization Configuration
{
  "knowledgeBase": {
    "retrieval": {
      // Return the top K most relevant results
      "topK": 5,
      // Minimum similarity threshold (results below this are discarded)
      "minScore": 0.7,
      // Enable query rewriting (AI optimizes search keywords)
      "queryRewrite": true,
      // Enable hybrid search (vector + keyword)
      "hybridSearch": true,
      "hybridWeight": 0.7 // Vector search weight
    }
  }
}
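Putting `topK`, `minScore`, and `hybridWeight` together, the post-retrieval ranking might look like this sketch (`rank_results` is illustrative; OpenClaw's actual scorer may differ):

```python
def rank_results(vector_scores, keyword_scores,
                 top_k=5, min_score=0.7, hybrid_weight=0.7):
    """Blend vector and keyword scores, drop weak matches, keep the top K."""
    combined = {}
    for doc_id in set(vector_scores) | set(keyword_scores):
        v = vector_scores.get(doc_id, 0.0)   # vector (semantic) score
        k = keyword_scores.get(doc_id, 0.0)  # keyword (lexical) score
        combined[doc_id] = hybrid_weight * v + (1 - hybrid_weight) * k
    kept = [(d, s) for d, s in combined.items() if s >= min_score]
    return sorted(kept, key=lambda item: item[1], reverse=True)[:top_k]
```

With the defaults, a document scoring 0.9 on vectors and 0.8 on keywords blends to 0.87 and is kept, while one scoring 0.5/0.9 blends to 0.62 and falls below `minScore`.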
8.2 Injecting Retrieval Results into the Prompt
{
  "knowledgeBase": {
    "promptTemplate": {
      // Template for injecting retrieval results into the system prompt
      "prefix": "Please answer the user's question based on the following reference materials. If the reference materials do not contain relevant information, please say so honestly.\n\n---\nReference Materials:\n",
      "suffix": "\n---\n\nPlease answer based on the reference materials above:",
      // Maximum tokens to inject
      "maxTokens": 4000
    }
  }
}
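The template is filled in roughly like this. A sketch using a crude 4-characters-per-token estimate for the `maxTokens` budget (a real implementation would count with the model's tokenizer):

```python
PREFIX = ("Please answer the user's question based on the following reference "
          "materials. If the reference materials do not contain relevant "
          "information, please say so honestly.\n\n---\nReference Materials:\n")
SUFFIX = "\n---\n\nPlease answer based on the reference materials above:"

def build_prompt(chunks, max_tokens=4000, chars_per_token=4):
    """Inject retrieved chunks between prefix and suffix, stopping before
    the context exceeds the (estimated) token budget."""
    budget = max_tokens * chars_per_token  # rough chars-per-token assumption
    kept, used = [], 0
    for chunk in chunks:
        if used + len(chunk) > budget:
            break                          # later (lower-ranked) chunks are dropped
        kept.append(chunk)
        used += len(chunk)
    return PREFIX + "\n\n".join(kept) + SUFFIX
```

Because retrieval results arrive ranked, truncation discards the least relevant chunks first.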
9. Real-World Use Cases
9.1 Enterprise Internal Knowledge Base
# Place company documents in the knowledge base
/data/knowledge/company-docs/
├── employee-handbook.pdf
├── expense-reimbursement.md
├── technical-specs/
│ ├── coding-standards.md
│ ├── deployment-process.md
│ └── security-standards.md
└── product-docs/
├── user-guide.pdf
└── api-docs.md
When a user asks in Telegram: "What is the reimbursement process?", OpenClaw automatically retrieves the relevant documents and provides an accurate answer.
9.2 Personal Notes Assistant
# Connect Obsidian notes directly
/home/user/notes/
├── tech-learning/
│ ├── docker-intro.md
│ ├── kubernetes-notes.md
│ └── linux-command-reference.md
├── book-notes/
└── project-records/
9.3 Customer Service Knowledge Base
{
  "knowledgeBase": {
    "directories": [
      {
        "path": "/data/knowledge/faq",
        "name": "FAQ",
        "description": "Frequently asked questions and standard answers"
      },
      {
        "path": "/data/knowledge/product-manual",
        "name": "Product Manual",
        "description": "Product feature descriptions and usage tutorials"
      }
    ]
  }
}
10. Maintenance and Best Practices
10.1 Document Quality Affects Retrieval Quality
- Keep documents well-structured (use headings, lists, etc.)
- Avoid large amounts of duplicate content
- Regularly clean out outdated documents
- Markdown format yields the best chunking results
10.2 Regular Maintenance
# Rebuild the index weekly (cron)
0 3 * * 0 openclaw knowledge index --rebuild >> /var/log/openclaw-index.log 2>&1
# Check index health
openclaw knowledge status
10.3 Performance Optimization
| Optimization Target | Method |
|---|---|
| Slow indexing | Increase batchSize, use a faster Embedding model |
| Inaccurate retrieval | Adjust chunkSize and overlap, enable hybrid search |
| Irrelevant results | Increase the minScore threshold, reduce topK |
| High memory usage | Use an external vector store (ChromaDB/Qdrant) |
With a knowledge base and RAG pipeline in place, OpenClaw evolves from a general-purpose AI assistant into a dedicated knowledge Q&A expert, able to answer accurately from your own documents.