DocRAG - OpenAI-Compatible RAG Server
A custom RAG (Retrieval-Augmented Generation) system that appears as a standard OpenAI API server to clients like Open WebUI. Behind the scenes, it:
- Processes user queries through a RAG system
- Retrieves relevant context from a knowledge base
- Passes the enriched context to GLM-4.7-Flash for response generation
- Optionally uses tools like website_downloader for enhanced capabilities
Users interact with what appears to be a normal chat experience, while sophisticated RAG operations happen transparently in the background.
Features
- OpenAI-Compatible API: Works with any OpenAI client (Open WebUI, custom apps, etc.)
- RAG Integration: Automatic context retrieval for enhanced responses
- Document Management: Upload and manage documents in the knowledge base
- Tool Support: Built-in tools like website_downloader for extended capabilities
- Streaming Support: Real-time streaming responses
- Easy Configuration: Environment-based configuration
Quick Start
1. Install Dependencies
pip install -r requirements.txt
2. Configure Environment
cp .env.example .env
# Edit .env and add your ZAI_API_KEY
3. Run the Server
python main.py
The server will start on http://0.0.0.0:8000
4. Use with Open WebUI
- Open Open WebUI settings
- Add a new OpenAI-compatible connection
- Set the base URL to http://your-server:8000/v1
- Leave the API key empty or use any value (not validated)
- Select the "DocRAG-GLM-4.7" model
API Endpoints
OpenAI-Compatible Endpoints
| Endpoint | Method | Description |
|---|---|---|
| /v1/chat/completions | POST | Chat completions (streaming supported) |
| /v1/models | GET | List available models |
| /v1/models/{model_id} | GET | Get model information |
Document Management Endpoints
| Endpoint | Method | Description |
|---|---|---|
| /v1/documents | GET | List documents in knowledge base |
| /v1/documents/upload | POST | Upload a document |
| /v1/documents/url | POST | Add document from URL |
| /v1/documents/{doc_id} | DELETE | Delete a document |
Health & Status
| Endpoint | Method | Description |
|---|---|---|
| /health | GET | Health check |
| / | GET | API information |
Usage Examples
Chat Completion
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "DocRAG-GLM-4.7",
    "messages": [
      {"role": "user", "content": "What is machine learning?"}
    ],
    "stream": false
  }'
Upload Document
curl -X POST http://localhost:8000/v1/documents/upload \
  -F "file=@document.pdf"
Add Document from URL
curl -X POST http://localhost:8000/v1/documents/url \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/article.html"}'
Python Client
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",  # API key is not validated
)

response = client.chat.completions.create(
    model="DocRAG-GLM-4.7",
    messages=[
        {"role": "user", "content": "Explain quantum computing"}
    ],
    stream=True,
)

# Print tokens as they arrive; skip chunks that carry no text
for chunk in response:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
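The document endpoints can be called from Python as well. The following is a minimal sketch using only the standard library, assuming `/v1/documents/url` accepts the same JSON body as the curl example above. The helper names `build_add_url_request` and `send` are hypothetical, and since the response shape is not documented here, the reply is returned as parsed JSON without further interpretation.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # same host and port as the curl examples


def build_add_url_request(url: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a POST request for /v1/documents/url."""
    body = json.dumps({"url": url}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/documents/url",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send the request and decode the JSON reply (its shape is an assumption)."""
    with urllib.request.urlopen(request, timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Usage, with the server running:
#   print(send(build_add_url_request("https://example.com/article.html")))
```

Building the request separately from sending it makes it easy to inspect the method, URL, and body before anything touches the network.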
Configuration
Configure via environment variables or .env file:
| Variable | Default | Description |
|---|---|---|
| HOST | 0.0.0.0 | Server host |
| PORT | 8000 | Server port |
| DEBUG | false | Enable debug mode |
| MODEL_NAME | DocRAG-GLM-4.7 | Display model name |
| UPSTREAM_MODEL | glm-4.7 | Upstream model to use |
| ZAI_API_KEY / OPENROUTER_API_KEY | (required) | API key for upstream LLM (OpenRouter) |
| EMBEDDING_MODEL | text-embedding-3-small | Embedding model |
| VECTOR_STORE_PATH | ./data/vectors | Vector store location |
| DOCUMENTS_PATH | ./data/documents | Document storage |
| CHUNK_SIZE | 1000 | Document chunk size |
| CHUNK_OVERLAP | 200 | Chunk overlap |
| TOP_K_RESULTS | 5 | Number of context results |
| ENABLE_TOOLS | true | Enable tool support |
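As an illustration of how these variables could be read, here is a minimal sketch of an environment-based loader. It is not the code in main.py: the `Settings` class, `load_settings` function, the accepted spellings for boolean values, and the precedence of `ZAI_API_KEY` over `OPENROUTER_API_KEY` are all assumptions. Only the variable names and defaults come from the table above, and a subset of them is shown.

```python
import os
from dataclasses import dataclass
from typing import Mapping


def _as_bool(value: str) -> bool:
    """Interpret common truthy spellings (an assumption, not the server's rule)."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    host: str
    port: int
    debug: bool
    api_key: str
    chunk_size: int
    chunk_overlap: int
    top_k_results: int
    enable_tools: bool


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from a mapping, falling back to the documented defaults."""
    # Assumed precedence: ZAI_API_KEY first, then OPENROUTER_API_KEY.
    api_key = env.get("ZAI_API_KEY") or env.get("OPENROUTER_API_KEY")
    if not api_key:
        raise RuntimeError("Set ZAI_API_KEY or OPENROUTER_API_KEY")
    return Settings(
        host=env.get("HOST", "0.0.0.0"),
        port=int(env.get("PORT", "8000")),
        debug=_as_bool(env.get("DEBUG", "false")),
        api_key=api_key,
        chunk_size=int(env.get("CHUNK_SIZE", "1000")),
        chunk_overlap=int(env.get("CHUNK_OVERLAP", "200")),
        top_k_results=int(env.get("TOP_K_RESULTS", "5")),
        enable_tools=_as_bool(env.get("ENABLE_TOOLS", "true")),
    )
```

Passing the environment in as a mapping keeps the loader testable without touching the real process environment.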
Project Structure
docrag/
├── main.py                     # FastAPI application entry point
├── rag/
│   ├── __init__.py             # RAG system main class
│   ├── document_processor.py   # Document parsing and chunking
│   ├── vector_store.py         # Vector storage and search
│   └── retriever.py            # Context retrieval logic
├── tools/
│   └── __init__.py             # Tool management (website_downloader, etc.)
├── website_downloader.py       # CLI website downloader
├── website_downloader_tool.py  # Tool wrapper for GLM-4.7-Flash
├── requirements.txt            # Python dependencies
├── .env.example                # Configuration template
└── README.md                   # This file
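To make the chunking step in document_processor.py more concrete, here is a hedged sketch of fixed-size chunking with overlap. It assumes `CHUNK_SIZE` and `CHUNK_OVERLAP` are measured in characters, which this README does not state; the function name `chunk_text` is hypothetical and the real implementation may split on tokens, sentences, or something else.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

With the default values, each chunk repeats the last 200 characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.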
How It Works
Request Flow
- User sends message → OpenAI-compatible endpoint receives request
- RAG Retrieval → Query is processed and relevant context is retrieved
- Context Enhancement → Retrieved context is added to the prompt
- Tool Execution → If needed, tools are invoked (e.g., website_downloader)
- LLM Generation → GLM-4.7-Flash generates response with context
- Response → User receives response (streaming supported)
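The retrieval and context enhancement steps can be sketched as a small function. This is an illustration under stated assumptions, not the server's code: the name `build_augmented_messages`, the `retrieve` callable, and the wording and placement of the system message are all hypothetical.

```python
from typing import Callable

Message = dict[str, str]
Retriever = Callable[[str, int], list[str]]  # (query, top_k) -> passages


def build_augmented_messages(
    messages: list[Message],
    retrieve: Retriever,
    top_k: int = 5,
) -> list[Message]:
    """Return a copy of messages with retrieved context prepended as a system message."""
    # Use the most recent user message as the retrieval query.
    query = next((m["content"] for m in reversed(messages) if m["role"] == "user"), None)
    if query is None:
        return list(messages)

    passages = retrieve(query, top_k)
    if not passages:
        return list(messages)  # nothing relevant found; forward the chat unchanged

    context = "\n\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    system = {
        "role": "system",
        "content": "Use the following context when it is relevant:\n\n" + context,
    }
    return [system, *messages]
```

The augmented list is what would be forwarded to the upstream model, so the client never sees the retrieval step.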
RAG Pipeline
    User Query
         │
         ▼
┌─────────────────┐
│ Query Processor │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Vector Search  │ ← Knowledge Base
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Context Builder │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  GLM-4.7-Flash  │
└────────┬────────┘
         │
         ▼
     Response
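The vector search stage can be illustrated with a brute-force cosine similarity ranking. This sketch assumes embeddings are plain lists of floats held in memory; the function names are hypothetical, and the actual store in rag/vector_store.py may use a different similarity measure or index.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query_vec: Sequence[float],
    index: list[tuple[str, Sequence[float]]],
    k: int = 5,
) -> list[tuple[str, float]]:
    """Rank (text, embedding) pairs by similarity to query_vec and keep the best k."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in index]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

Here `k` plays the role of `TOP_K_RESULTS`: it caps how many passages are handed to the context builder.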
Supported Document Formats
- Text: .txt, .md, .rst, .log
- Documents: .pdf, .docx
- Web: .html, .htm
- Data: .json, .yaml, .yml, .xml, .toml, .csv, .tsv
- Code: .py, .js, .ts, .java, .cpp, .c, .go, .rs, .rb, .php, etc.
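A lookup like the following could be used to check a file before uploading it. It is only a sketch: the `FORMAT_CATEGORIES` table and the `classify` function are hypothetical, and the code list covers just the extensions named above, while the server accepts more (the "etc.").

```python
from pathlib import Path
from typing import Optional

# Extensions copied from the list above; the "code" set is known to be incomplete.
FORMAT_CATEGORIES = {
    "text": {".txt", ".md", ".rst", ".log"},
    "document": {".pdf", ".docx"},
    "web": {".html", ".htm"},
    "data": {".json", ".yaml", ".yml", ".xml", ".toml", ".csv", ".tsv"},
    "code": {".py", ".js", ".ts", ".java", ".cpp", ".c", ".go", ".rs", ".rb", ".php"},
}


def classify(filename: str) -> Optional[str]:
    """Return the category for a filename, or None if the extension is not listed."""
    suffix = Path(filename).suffix.lower()  # case-insensitive match
    for category, extensions in FORMAT_CATEGORIES.items():
        if suffix in extensions:
            return category
    return None
```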
Extending
Adding New Tools
# In tools/__init__.py

def my_custom_tool(param1: str, param2: int = 10) -> dict:
    """Your tool implementation."""
    return {"result": "success"}


# Register the tool
tool_manager.register_tool(
    name="my_custom_tool",
    function=my_custom_tool,
    schema={
        "type": "function",
        "function": {
            "name": "my_custom_tool",
            "description": "Description of your tool",
            "parameters": {
                "type": "object",
                "properties": {
                    "param1": {"type": "string", "description": "..."},
                    "param2": {"type": "integer", "description": "...", "default": 10}
                },
                "required": ["param1"]
            }
        }
    }
)
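For readers who want to see what sits behind `register_tool`, here is a hypothetical stand-in for the tool manager. The real class in tools/__init__.py is not shown in this README, so everything beyond the `register_tool(name, function, schema)` signature used above is an assumption, including the `schemas` and `execute` methods and the error format.

```python
import json
from typing import Any, Callable, Union


class ToolManager:
    """Toy registry: maps tool names to callables and their JSON schemas."""

    def __init__(self) -> None:
        self._functions: dict[str, Callable[..., Any]] = {}
        self._schemas: dict[str, dict] = {}

    def register_tool(self, name: str, function: Callable[..., Any], schema: dict) -> None:
        self._functions[name] = function
        self._schemas[name] = schema

    def schemas(self) -> list[dict]:
        """Schemas in the OpenAI 'tools' format, ready to send to the model."""
        return list(self._schemas.values())

    def execute(self, name: str, arguments: Union[str, dict]) -> Any:
        """Run a tool; arguments may be a JSON string (as models emit) or a dict."""
        function = self._functions.get(name)
        if function is None:
            return {"error": f"unknown tool: {name}"}
        try:
            kwargs = json.loads(arguments) if isinstance(arguments, str) else dict(arguments)
            return function(**kwargs)
        except (json.JSONDecodeError, TypeError) as exc:
            # Malformed JSON or arguments that do not match the function signature.
            return {"error": f"bad arguments for {name}: {exc}"}
```

Returning an error dictionary instead of raising lets the failure be passed back to the model as a tool result, which is one possible design rather than the project's documented behavior.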
Using Different Vector Stores
The default implementation uses a simple file-based store. To use ChromaDB:
- Install: pip install chromadb
- Modify rag/vector_store.py to use the ChromaDB client
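A ChromaDB-backed store might look roughly like the sketch below. The interface of the existing class in rag/vector_store.py is not documented here, so the `ChromaVectorStore` wrapper, its `add` and `search` methods, and the collection name `docrag` are assumptions. The ChromaDB calls themselves (`PersistentClient`, `get_or_create_collection`, `collection.add`, `collection.query`) are part of its public API.

```python
from typing import Any, Sequence


class ChromaVectorStore:
    """Thin adapter around a ChromaDB collection (hypothetical interface)."""

    def __init__(self, collection: Any) -> None:
        self._collection = collection

    def add(
        self,
        ids: Sequence[str],
        texts: Sequence[str],
        embeddings: Sequence[Sequence[float]],
    ) -> None:
        self._collection.add(ids=list(ids), documents=list(texts), embeddings=list(embeddings))

    def search(self, query_embedding: Sequence[float], top_k: int = 5) -> list[tuple[str, float]]:
        """Return (text, distance) pairs; a smaller distance means a closer match."""
        result = self._collection.query(
            query_embeddings=[list(query_embedding)],
            n_results=top_k,
        )
        # Chroma returns one result list per query; only one query was sent.
        return list(zip(result["documents"][0], result["distances"][0]))


def open_chroma_store(path: str = "./data/vectors", name: str = "docrag") -> ChromaVectorStore:
    """Open (or create) a persistent collection under VECTOR_STORE_PATH."""
    import chromadb  # imported lazily so the adapter can be tested without it

    client = chromadb.PersistentClient(path=path)
    return ChromaVectorStore(client.get_or_create_collection(name=name))
```

Because the adapter only depends on the collection's `add` and `query` methods, it can be exercised with a fake collection in tests before ChromaDB is installed.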
Development
Running in Development Mode
DEBUG=true python main.py
Running Tests
pip install pytest pytest-asyncio
pytest tests/
License
Private repository - All rights reserved.