Z User 8a46a78a4e	2026-03-29 18:11:43 +00:00
Fix: add robust parsing, logging, and safety net for empty responses

Three fixes for the 'I apologize, couldnt generate a response' bug:

1. Safety net: if _clean_tool_syntax strips ALL content (e.g. the LLM output only the JSON tool call block and nothing else), return the original content instead of the useless error message.
2. Detailed logging: now logs the first 300 chars of every LLM response so we can see exactly what the model outputs. Also logs which parse pattern matched and which tool names were found.
3. Desperate fallback parser (Pattern 4): if none of the regex/brace patterns match, tries to json.loads() the entire content and looks for known tool names. Catches LLMs that output the array directly or use slightly different formatting.
rag	Fix tool call parsing, improve embeddings, and fix async issues	2026-03-29 17:49:32 +00:00
tools	Implement tool calling loop for LLM	2026-03-29 16:07:56 +00:00
.gitignore	Implement tool calling loop for LLM	2026-03-29 16:07:56 +00:00
main.py	Fix: add robust parsing, logging, and safety net for empty responses	2026-03-29 18:11:43 +00:00
README.md	Fix tool call parsing, improve embeddings, and fix async issues	2026-03-29 17:49:32 +00:00
requirements.txt	Implement tool calling loop for LLM	2026-03-29 16:07:56 +00:00
tools.md	Implement tool calling loop for LLM	2026-03-29 16:07:56 +00:00
website_downloader_tool.py	Implement tool calling loop for LLM	2026-03-29 16:07:56 +00:00
website_downloader.py	Implement tool calling loop for LLM	2026-03-29 16:07:56 +00:00

README.md

DocRAG - OpenAI-Compatible RAG Server

A custom RAG (Retrieval-Augmented Generation) system that appears as a standard OpenAI API server to clients like Open WebUI. Behind the scenes, it:

Processes user queries through a RAG system
Retrieves relevant context from a knowledge base
Passes the enriched context to GLM-4.7-Flash for response generation
Optionally uses tools like website_downloader for enhanced capabilities

Users see an ordinary chat experience while retrieval, context injection, and optional tool calls run transparently on the server.

Features

OpenAI-Compatible API: Works with any OpenAI client (Open WebUI, custom apps, etc.)
RAG Integration: Automatic context retrieval for enhanced responses
Document Management: Upload and manage documents in the knowledge base
Tool Support: Built-in tools like website_downloader for extended capabilities
Streaming Support: Real-time streaming responses
Easy Configuration: Environment-based configuration

Quick Start

1. Install Dependencies

pip install -r requirements.txt

2. Configure Environment

cp .env.example .env
# Edit .env and add your ZAI_API_KEY

3. Run the Server

python main.py

The server listens on 0.0.0.0:8000 by default (see HOST and PORT under Configuration). From the same machine, reach it at http://localhost:8000

4. Use with Open WebUI

1. Open the Open WebUI settings
2. Add a new OpenAI-compatible connection
3. Set the base URL to http://your-server:8000/v1
4. Leave the API key empty or enter any value (it is not validated)
5. Select the "DocRAG-GLM-4.7" model

API Endpoints

OpenAI-Compatible Endpoints

Endpoint	Method	Description
`/v1/chat/completions`	POST	Chat completions (streaming supported)
`/v1/models`	GET	List available models
`/v1/models/{model_id}`	GET	Get model information

Document Management Endpoints

Endpoint	Method	Description
`/v1/documents`	GET	List documents in knowledge base
`/v1/documents/upload`	POST	Upload a document
`/v1/documents/url`	POST	Add document from URL
`/v1/documents/{doc_id}`	DELETE	Delete a document

Health & Status

Endpoint	Method	Description
`/health`	GET	Health check
`/`	GET	API information

Usage Examples

Chat Completion

curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "DocRAG-GLM-4.7",
    "messages": [
      {"role": "user", "content": "What is machine learning?"}
    ],
    "stream": false
  }'

Upload Document

curl -X POST http://localhost:8000/v1/documents/upload \
  -F "file=@document.pdf"

Add Document from URL

curl -X POST http://localhost:8000/v1/documents/url \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/article.html"}'

Python Client

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"  # API key not validated
)

response = client.chat.completions.create(
    model="DocRAG-GLM-4.7",
    messages=[
        {"role": "user", "content": "Explain quantum computing"}
    ],
    stream=True
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

Configuration

Configure via environment variables or a .env file:

Variable	Default	Description
`HOST`	`0.0.0.0`	Server host
`PORT`	`8000`	Server port
`DEBUG`	`false`	Enable debug mode
`MODEL_NAME`	`DocRAG-GLM-4.7`	Display model name
`UPSTREAM_MODEL`	`glm-4.7`	Upstream model to use
`ZAI_API_KEY` / `OPENROUTER_API_KEY`	(required)	API key for upstream LLM (OpenRouter)
`EMBEDDING_MODEL`	`text-embedding-3-small`	Embedding model
`VECTOR_STORE_PATH`	`./data/vectors`	Vector store location
`DOCUMENTS_PATH`	`./data/documents`	Document storage
`CHUNK_SIZE`	`1000`	Document chunk size
`CHUNK_OVERLAP`	`200`	Chunk overlap
`TOP_K_RESULTS`	`5`	Number of context results
`ENABLE_TOOLS`	`true`	Enable tool support
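As an illustration of how the table might be read at startup, the sketch below loads the variables with their documented defaults. It is not the code in main.py: the `Settings` class, `load_settings`, the accepted boolean spellings, the precedence of `ZAI_API_KEY` over `OPENROUTER_API_KEY`, and the overlap check are all assumptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


def _as_bool(value: str) -> bool:
    # Assumption: these spellings count as "true"; anything else is false.
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    host: str
    port: int
    debug: bool
    model_name: str
    upstream_model: str
    api_key: str
    embedding_model: str
    vector_store_path: str
    documents_path: str
    chunk_size: int
    chunk_overlap: int
    top_k_results: int
    enable_tools: bool


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from a mapping (os.environ by default) using the table's defaults."""
    env = os.environ if env is None else env
    # Assumption: ZAI_API_KEY is preferred when both keys are set.
    api_key = env.get("ZAI_API_KEY") or env.get("OPENROUTER_API_KEY")
    if not api_key:
        raise RuntimeError("Set ZAI_API_KEY or OPENROUTER_API_KEY")
    chunk_size = int(env.get("CHUNK_SIZE", "1000"))
    chunk_overlap = int(env.get("CHUNK_OVERLAP", "200"))
    if not 0 <= chunk_overlap < chunk_size:
        # Assumption: an overlap as large as the chunk itself is a config error.
        raise ValueError("CHUNK_OVERLAP must be >= 0 and smaller than CHUNK_SIZE")
    return Settings(
        host=env.get("HOST", "0.0.0.0"),
        port=int(env.get("PORT", "8000")),
        debug=_as_bool(env.get("DEBUG", "false")),
        model_name=env.get("MODEL_NAME", "DocRAG-GLM-4.7"),
        upstream_model=env.get("UPSTREAM_MODEL", "glm-4.7"),
        api_key=api_key,
        embedding_model=env.get("EMBEDDING_MODEL", "text-embedding-3-small"),
        vector_store_path=env.get("VECTOR_STORE_PATH", "./data/vectors"),
        documents_path=env.get("DOCUMENTS_PATH", "./data/documents"),
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        top_k_results=int(env.get("TOP_K_RESULTS", "5")),
        enable_tools=_as_bool(env.get("ENABLE_TOOLS", "true")),
    )


settings = load_settings({"ZAI_API_KEY": "example-key", "PORT": "9000"})
```

Passing a plain dict instead of `os.environ` keeps the loader easy to test without touching the real environment.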

Project Structure

docrag/
├── main.py                    # FastAPI application entry point
├── rag/
│   ├── __init__.py           # RAG system main class
│   ├── document_processor.py # Document parsing and chunking
│   ├── vector_store.py       # Vector storage and search
│   └── retriever.py          # Context retrieval logic
├── tools/
│   └── __init__.py           # Tool management (website_downloader, etc.)
├── website_downloader.py     # CLI website downloader
├── website_downloader_tool.py # Tool wrapper for GLM-4.7-Flash
├── requirements.txt          # Python dependencies
├── .env.example              # Configuration template
└── README.md                 # This file

How It Works

Request Flow

1. User sends message → OpenAI-compatible endpoint receives request
2. RAG Retrieval → Query is processed and relevant context is retrieved
3. Context Enhancement → Retrieved context is added to the prompt
4. Tool Execution → If needed, tools are invoked (e.g., website_downloader)
5. LLM Generation → GLM-4.7-Flash generates response with context
6. Response → User receives response (streaming supported)
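The context enhancement step can be pictured as a pure function over the OpenAI-style message list. The sketch below is one possible shape, not DocRAG's actual code: `last_user_query`, `enhance_messages`, and the wording of the system prompt are hypothetical, and injecting the context as a leading system message is an assumption.

```python
from typing import Dict, List

Message = Dict[str, str]


def last_user_query(messages: List[Message]) -> str:
    """Return the most recent user message, used here as the retrieval query."""
    for message in reversed(messages):
        if message.get("role") == "user":
            return message.get("content", "")
    return ""


def enhance_messages(messages: List[Message], context_chunks: List[str]) -> List[Message]:
    """Return a new list with the retrieved context prepended as a system message."""
    if not context_chunks:
        # Nothing retrieved: forward the conversation unchanged.
        return list(messages)
    context = "\n\n".join(context_chunks)
    system_message = {
        "role": "system",
        # Hypothetical prompt wording, chosen only for illustration.
        "content": "Use the following context when it is relevant:\n\n" + context,
    }
    return [system_message] + list(messages)


incoming = [{"role": "user", "content": "What is machine learning?"}]
query = last_user_query(incoming)
enhanced = enhance_messages(incoming, ["ML is a subfield of AI."])
```

Returning a new list rather than mutating the input keeps the client's original messages intact, which helps if the request needs to be logged or retried.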

RAG Pipeline

User Query
    │
    ▼
┌─────────────────┐
│ Query Processor │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Vector Search   │ ← Knowledge Base
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Context Builder │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  GLM-4.7-Flash  │
└────────┬────────┘
         │
         ▼
     Response
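The Vector Search and Context Builder boxes can be sketched in a few lines. This assumes cosine similarity over in-memory embeddings and a numbered context format. The README does not say which similarity measure or layout rag/ actually uses, so `cosine`, `top_k`, and `build_context` are illustrative names only.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float],
    indexed: List[Tuple[str, Sequence[float]]],
    k: int = 5,
) -> List[Tuple[float, str]]:
    """Score every (text, embedding) pair and keep the k best, highest first."""
    scored = [(cosine(query_vec, vec), text) for text, vec in indexed]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def build_context(results: List[Tuple[float, str]]) -> str:
    """Format the retrieved chunks as a numbered block for the prompt."""
    return "\n\n".join(f"[{i}] {text}" for i, (_, text) in enumerate(results, start=1))


# Toy two-dimensional embeddings, made up for the demo.
index = [
    ("cats purr", [1.0, 0.0]),
    ("dogs bark", [0.0, 1.0]),
    ("kittens purr softly", [0.9, 0.1]),
]
results = top_k([1.0, 0.0], index, k=2)
context = build_context(results)
```

The default `k=5` mirrors the `TOP_K_RESULTS` default from the configuration table.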

Supported Document Formats

Text: .txt, .md, .rst, .log
Documents: .pdf, .docx
Web: .html, .htm
Data: .json, .yaml, .yml, .xml, .toml, .csv, .tsv
Code: .py, .js, .ts, .java, .cpp, .c, .go, .rs, .rb, .php, etc.
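Whatever the format, parsed text is split into chunks governed by `CHUNK_SIZE` and `CHUNK_OVERLAP`. The sketch below shows how the two values interact, assuming both are measured in characters. The README does not state the unit, and `chunk_text` is a hypothetical helper rather than the function in rag/document_processor.py.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> List[str]:
    """Split text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            # This window already reaches the end; stop to avoid a redundant tail.
            break
    return chunks


pieces = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2)
```

With the defaults, each window starts 800 characters after the previous one, so neighbouring chunks share 200 characters.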

Extending

Adding New Tools

# In tools/__init__.py

def my_custom_tool(param1: str, param2: int = 10) -> dict:
    """Your tool implementation."""
    return {"result": "success"}

# Register the tool
tool_manager.register_tool(
    name="my_custom_tool",
    function=my_custom_tool,
    schema={
        "type": "function",
        "function": {
            "name": "my_custom_tool",
            "description": "Description of your tool",
            "parameters": {
                "type": "object",
                "properties": {
                    "param1": {"type": "string", "description": "..."},
                    "param2": {"type": "integer", "description": "...", "default": 10}
                },
                "required": ["param1"]
            }
        }
    }
)
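To show what a registration like the one above might feed into, here is a minimal registry sketch. Only the `register_tool(name=..., function=..., schema=...)` call shape comes from the example. The `ToolManager` class, its `schemas` and `execute` methods, and the error format are assumptions, and the real tools/__init__.py may differ. The one grounded detail is that OpenAI-style tool calls deliver their arguments as a JSON string.

```python
import json
from typing import Any, Callable, Dict, List


class ToolManager:
    """Hypothetical registry that maps tool names to callables and their schemas."""

    def __init__(self) -> None:
        self._functions: Dict[str, Callable[..., dict]] = {}
        self._schemas: Dict[str, dict] = {}

    def register_tool(self, name: str, function: Callable[..., dict], schema: dict) -> None:
        self._functions[name] = function
        self._schemas[name] = schema

    def schemas(self) -> List[dict]:
        """Schemas in the shape expected by the `tools` field of a chat request."""
        return list(self._schemas.values())

    def execute(self, name: str, arguments: str) -> Dict[str, Any]:
        """Run a tool given its name and a JSON-encoded argument string."""
        if name not in self._functions:
            return {"error": f"unknown tool: {name}"}
        try:
            kwargs = json.loads(arguments) if arguments else {}
            return self._functions[name](**kwargs)
        except (json.JSONDecodeError, TypeError) as exc:
            # Malformed JSON or mismatched parameters; note that a TypeError
            # raised inside the tool itself is reported the same way.
            return {"error": f"bad arguments for {name}: {exc}"}


def echo_tool(text: str, times: int = 1) -> dict:
    """Toy tool used only for this demo."""
    return {"result": text * times}


manager = ToolManager()
manager.register_tool(
    name="echo_tool",
    function=echo_tool,
    schema={"type": "function", "function": {"name": "echo_tool"}},
)
ok = manager.execute("echo_tool", '{"text": "ab", "times": 2}')
missing = manager.execute("no_such_tool", "{}")
bad = manager.execute("echo_tool", "not json")
```

Returning an error dict instead of raising lets the failure be passed back to the model as a tool result, so one bad call does not abort the whole request.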

Using Different Vector Stores

The default implementation uses a simple file-based store. To use ChromaDB:

1. Install: pip install chromadb
2. Modify rag/vector_store.py to use the ChromaDB client
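A swap like this is easier when retrieval code depends on a small interface rather than a concrete store. The sketch below assumes such an interface. `VectorStore`, `InMemoryStore`, and `retrieve` are hypothetical names, and the methods in rag/vector_store.py may look different. A ChromaDB-backed class would implement the same two methods and could be passed to `retrieve` unchanged.

```python
from typing import List, Protocol, Sequence, Tuple


class VectorStore(Protocol):
    """Hypothetical interface that any backend would satisfy."""

    def add(self, doc_id: str, embedding: Sequence[float], text: str) -> None: ...

    def search(self, embedding: Sequence[float], top_k: int) -> List[Tuple[str, str]]: ...


class InMemoryStore:
    """Stand-in backend that ranks stored rows by dot product with the query."""

    def __init__(self) -> None:
        self._rows: List[Tuple[str, List[float], str]] = []

    def add(self, doc_id: str, embedding: Sequence[float], text: str) -> None:
        self._rows.append((doc_id, list(embedding), text))

    def search(self, embedding: Sequence[float], top_k: int) -> List[Tuple[str, str]]:
        def score(row: Tuple[str, List[float], str]) -> float:
            return sum(q * v for q, v in zip(embedding, row[1]))

        ranked = sorted(self._rows, key=score, reverse=True)
        return [(doc_id, text) for doc_id, _, text in ranked[:top_k]]


def retrieve(store: VectorStore, query_embedding: Sequence[float], top_k: int = 5) -> List[str]:
    """Return the best-matching texts; only the interface is used, not the backend."""
    return [text for _, text in store.search(query_embedding, top_k)]


store = InMemoryStore()
store.add("a", [1.0, 0.0], "about cats")
store.add("b", [0.0, 1.0], "about dogs")
hits = retrieve(store, [0.9, 0.1], top_k=1)
```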

Development

Running in Development Mode

DEBUG=true python main.py

Running Tests

pip install pytest pytest-asyncio
pytest tests/