# DocRAG - Custom RAG with Document Loader

DocRAG is a Retrieval-Augmented Generation (RAG) system with its own document loader. It runs as a local OpenAI-compatible server backed by a remote LLM with custom tools.

## Components

### Website Downloader Tool

The `website_downloader_tool.py` module wraps the website downloader as a callable tool that downloads and mirrors websites for offline use or RAG ingestion. GLM-4.7-Flash can call it via the z-ai-web-dev-sdk.

#### Features

- Downloads HTML pages and all linked assets (CSS, JS, images, fonts, etc.)
- Rewrites links for offline viewing
- Supports concurrent downloads with configurable thread count
- Optional external asset downloading from CDNs
- Domain whitelisting for external assets
- Comprehensive error handling and statistics

#### Tool Schema

The tool follows the OpenAI function calling format:

```python
from website_downloader_tool import get_tool_schema, website_downloader

# Get the tool schema for registration
schema = get_tool_schema()
```
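
For orientation, here is a rough sketch of what a schema in the OpenAI function calling format could look like for this tool. It is reconstructed from the parameter table in this README, so the name `EXAMPLE_SCHEMA`, the descriptions, and the exact JSON Schema keywords are assumptions; the actual output of `get_tool_schema()` may differ.

```python
# Hypothetical schema, reconstructed from the parameter table in this README.
# The real get_tool_schema() output may differ in wording or structure.
EXAMPLE_SCHEMA = {
    "type": "function",
    "function": {
        "name": "website_downloader",
        "description": "Download and mirror a website for offline use or RAG ingestion.",
        "parameters": {
            "type": "object",
            "properties": {
                "url": {"type": "string", "description": "Starting URL to crawl"},
                "destination": {"type": "string", "description": "Output folder path"},
                "max_pages": {"type": "integer", "minimum": 1, "maximum": 1000, "default": 50},
                "threads": {"type": "integer", "minimum": 1, "maximum": 20, "default": 6},
                "download_external_assets": {"type": "boolean", "default": False},
                "external_domains": {"type": "array", "items": {"type": "string"}},
            },
            # Only the starting URL is mandatory; everything else has a default.
            "required": ["url"],
        },
    },
}
```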

#### Usage with GLM-4.7-Flash

```python
import json

from zai import ZaiClient
from website_downloader_tool import get_tool_schema, website_downloader

client = ZaiClient(api_key="your-api-key")

# Register the tool schema with the model
tools = [get_tool_schema()]

# Create a streaming chat completion with tools enabled
response = client.chat.completions.create(
    model="glm-4.7",
    messages=[
        {
            "role": "user",
            "content": "Please download https://example.com for offline use"
        }
    ],
    tools=tools,
    stream=True,
)

# Handle tool calls in the response.
# This assumes each tool call arrives complete in a single chunk; if the
# arguments are streamed in fragments, accumulate them before parsing.
for chunk in response:
    if not chunk.choices:
        continue
    tool_calls = chunk.choices[0].delta.tool_calls
    if not tool_calls:
        continue
    for tool_call in tool_calls:
        if tool_call.function.name == "website_downloader":
            args = json.loads(tool_call.function.arguments)
            result = website_downloader(**args)
            print(result)
```
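
In OpenAI-style streaming, the JSON arguments of a tool call may arrive split across several chunks, in which case parsing a single chunk fails. The sketch below shows one way to merge the fragments before parsing. It assumes each delta carries an `index`, a `function.name`, and a `function.arguments` attribute as in the OpenAI format; `collect_tool_calls` and `_fake_chunk` are hypothetical helpers, and the fake chunks only stand in for real SDK objects.

```python
import json
from types import SimpleNamespace


def collect_tool_calls(chunks):
    """Merge streamed tool-call deltas into complete calls, keyed by index."""
    calls = {}
    for chunk in chunks:
        if not chunk.choices:
            continue
        for delta in chunk.choices[0].delta.tool_calls or []:
            entry = calls.setdefault(delta.index, {"name": "", "arguments": ""})
            if delta.function.name:
                entry["name"] = delta.function.name
            if delta.function.arguments:
                # Argument text is concatenated in arrival order.
                entry["arguments"] += delta.function.arguments
    return [
        {"name": call["name"], "arguments": json.loads(call["arguments"] or "{}")}
        for _, call in sorted(calls.items())
    ]


def _fake_chunk(index, name=None, arguments=None):
    """Build a stand-in chunk for demonstration; not a real SDK object."""
    function = SimpleNamespace(name=name, arguments=arguments)
    delta = SimpleNamespace(tool_calls=[SimpleNamespace(index=index, function=function)])
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


demo_chunks = [
    _fake_chunk(0, name="website_downloader", arguments='{"url": "https://'),
    _fake_chunk(0, arguments='example.com"}'),
]
print(collect_tool_calls(demo_chunks))
```

Each merged call can then be dispatched with `website_downloader(**call["arguments"])` once the stream has ended.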

#### Direct Usage

```python
from website_downloader_tool import website_downloader

# Download a website
result = website_downloader(
    url="https://example.com",
    destination="./downloaded_site",  # Optional
    max_pages=50,                     # Max pages to crawl
    threads=6,                        # Concurrent downloads
    download_external_assets=False,   # Include CDN assets
    external_domains=["cdn.example.com"]  # Whitelist external domains
)

if result["success"]:
    print(f"Downloaded to: {result['output_directory']}")
    print(f"Pages: {result['stats']['pages_crawled']}")
    print(f"Assets: {result['stats']['assets_downloaded']}")
else:
    print(f"Error: {result['message']}")
```

#### Parameters

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `url` | string | Yes | - | Starting URL to crawl |
| `destination` | string | No | Derived from URL | Output folder path |
| `max_pages` | integer | No | 50 | Maximum number of HTML pages to crawl (1-1000) |
| `threads` | integer | No | 6 | Concurrent download threads (1-20) |
| `download_external_assets` | boolean | No | False | Also download assets hosted on external domains such as CDNs |
| `external_domains` | array of strings | No | None | Whitelist of external domains |
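
Because the arguments come from a model, it can be worth checking them against the table above before calling the tool. The following is a minimal sketch of such a check; `check_downloader_args` is a hypothetical helper that is not part of `website_downloader_tool.py`, and the tool may already perform its own validation.

```python
# Hypothetical pre-flight check based on the parameter table in this README.
ALLOWED = {
    "url", "destination", "max_pages", "threads",
    "download_external_assets", "external_domains",
}
RANGES = {"max_pages": (1, 1000), "threads": (1, 20)}


def check_downloader_args(args: dict) -> list[str]:
    """Return a list of problems found in args; an empty list means OK."""
    errors = []
    unknown = sorted(set(args) - ALLOWED)
    if unknown:
        errors.append(f"unknown parameters: {unknown}")
    url = args.get("url")
    if not isinstance(url, str) or not url:
        errors.append("url is required and must be a non-empty string")
    for name, (low, high) in RANGES.items():
        if name in args:
            value = args[name]
            # bool is a subclass of int, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, int) or not low <= value <= high:
                errors.append(f"{name} must be an integer in [{low}, {high}]")
    return errors


print(check_downloader_args({"url": "https://example.com", "threads": 50}))
```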

#### Return Value

```python
{
    "success": bool,               # True on success, False on failure
    "message": str,                # Human-readable summary
    "stats": {
        "pages_crawled": int,
        "assets_downloaded": int,
        "failed_downloads": int,
        "elapsed_seconds": float,
        "output_directory": str,
        "pages": [...],            # List of downloaded pages
        "downloaded_items": [...]  # List of downloaded assets
    },
    "output_directory": str        # e.g. "/path/to/downloaded/site"
}
```
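
As an illustration of consuming this structure, the sketch below turns a result dictionary into a one-line summary. `summarize_result` is a hypothetical helper, and the sample dictionary is made-up data that only mirrors the documented keys.

```python
def summarize_result(result: dict) -> str:
    """Format a website_downloader-style result dict as a single line."""
    if not result.get("success"):
        return f"Download failed: {result.get('message', 'no message')}"
    stats = result.get("stats", {})
    return (
        f"{stats.get('pages_crawled', 0)} pages, "
        f"{stats.get('assets_downloaded', 0)} assets, "
        f"{stats.get('failed_downloads', 0)} failed "
        f"in {stats.get('elapsed_seconds', 0.0):.1f}s "
        f"-> {result.get('output_directory', '?')}"
    )


# Made-up sample data following the documented return shape.
sample = {
    "success": True,
    "message": "ok",
    "stats": {
        "pages_crawled": 3,
        "assets_downloaded": 10,
        "failed_downloads": 1,
        "elapsed_seconds": 2.0,
    },
    "output_directory": "./site",
}
print(summarize_result(sample))
```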

### Website Downloader CLI

The original `website-downloader.py` can still be used as a standalone CLI tool:

```bash
python website-downloader.py --url https://example.com --max-pages 50 --threads 6
```

#### CLI Options

- `--url`: Starting URL to crawl (required)
- `--destination`: Output folder (optional, derived from URL if not provided)
- `--max-pages`: Maximum pages to crawl (default: 50)
- `--threads`: Number of download threads (default: 6)
- `--download-external-assets`: Enable external asset downloading
- `--external-domains`: Whitelist of external domains to download from

## Installation

```bash
pip install -r requirements.txt
```

## Project Structure

```
docrag/
├── website-downloader.py      # Core website downloader (CLI)
├── website_downloader_tool.py # Tool wrapper for GLM-4.7-Flash
├── requirements.txt           # Python dependencies
└── README.md                  # This file
```

## Integration with RAG

The downloaded website content can be fed into a RAG pipeline:

1. Use the tool to download website content
2. Parse the downloaded HTML files
3. Extract text content and metadata
4. Chunk and embed the content
5. Store the embeddings in your vector database
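
Steps 2 to 4 can be sketched with the standard library alone. This is a minimal illustration, not the project's loader: `html_to_text`, `chunk_text`, and `load_chunks` are hypothetical names, the character-based chunk size and overlap are arbitrary, and embedding and storage are left to whichever model and vector database you use.

```python
from html.parser import HTMLParser
from pathlib import Path


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML document as one string."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list:
    """Split text into fixed-size character chunks that overlap slightly."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


def load_chunks(site_dir: str) -> list:
    """Walk a mirrored site and return chunk records with source metadata."""
    records = []
    for path in sorted(Path(site_dir).rglob("*.html")):
        text = html_to_text(path.read_text(encoding="utf-8", errors="ignore"))
        for index, chunk in enumerate(chunk_text(text)):
            records.append({"source": str(path), "chunk_index": index, "text": chunk})
    return records
```

Each record returned by `load_chunks("./downloaded_site")` can then be embedded and written to the vector store together with its `source` and `chunk_index` metadata.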

## License

Private repository - All rights reserved.