docrag/README.md
Z User aa69b2f496 Add website downloader tool wrapper for GLM-4.7-Flash
- Create website_downloader_tool.py with OpenAI function calling schema
- Add comprehensive tool documentation
- Update README with usage examples
- Update requirements.txt with optional sdk dependency
2026-03-29 00:16:54 +00:00

DocRAG - Custom RAG with Document Loader

DocRAG is a Retrieval-Augmented Generation (RAG) system built around a custom document loader. It acts as a local OpenAI-compatible server, backed by a remote LLM with custom tools.

Components

Website Downloader Tool

The website_downloader_tool.py provides a tool interface for downloading and mirroring websites for offline use or RAG ingestion. It can be used by GLM-4.7-Flash via the z-ai-web-dev-sdk.

Features

  • Downloads HTML pages and all linked assets (CSS, JS, images, fonts, etc.)
  • Rewrites links for offline viewing
  • Supports concurrent downloads with configurable thread count
  • Optional external asset downloading from CDNs
  • Domain whitelisting for external assets
  • Comprehensive error handling and statistics
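
As an illustration of how the external-asset options could interact, here is a minimal sketch of a whitelist check. The function name should_download and the rule that an absent whitelist allows every external host are assumptions made for this example; the actual logic in website_downloader_tool.py is not shown in this README and may differ.

```python
from urllib.parse import urlparse


def should_download(asset_url, site_url,
                    download_external_assets=False, external_domains=None):
    """Decide whether an asset URL is eligible for download (hypothetical helper)."""
    asset_host = urlparse(asset_url).hostname
    site_host = urlparse(site_url).hostname
    if asset_host is None:
        return False  # relative or malformed URL; resolve it before calling
    if asset_host == site_host:
        return True   # same-site assets are always eligible
    if not download_external_assets:
        return False  # external assets are disabled
    if external_domains is None:
        return True   # assumption: no whitelist means all external hosts
    return asset_host in external_domains


print(should_download("https://cdn.example.com/app.js", "https://example.com",
                      download_external_assets=True,
                      external_domains=["cdn.example.com"]))
```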

Tool Schema

The tool follows the OpenAI function calling format:

from website_downloader_tool import get_tool_schema, website_downloader

# Get the tool schema for registration
schema = get_tool_schema()
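
For orientation, a schema in the OpenAI function calling format for this tool might look roughly like the following. This is a hand-written sketch reconstructed from the Parameters table in this README, not the actual output of get_tool_schema(); the name EXAMPLE_SCHEMA and the description strings are assumptions.

```python
# Illustrative only: derived from the Parameters table, not copied from
# website_downloader_tool.py. The real schema may differ in wording or detail.
EXAMPLE_SCHEMA = {
    "type": "function",
    "function": {
        "name": "website_downloader",
        "description": "Download and mirror a website for offline use or RAG ingestion.",
        "parameters": {
            "type": "object",
            "properties": {
                "url": {"type": "string", "description": "Starting URL to crawl"},
                "destination": {"type": "string", "description": "Output folder path"},
                "max_pages": {"type": "integer", "minimum": 1, "maximum": 1000, "default": 50},
                "threads": {"type": "integer", "minimum": 1, "maximum": 20, "default": 6},
                "download_external_assets": {"type": "boolean", "default": False},
                "external_domains": {"type": "array", "items": {"type": "string"}},
            },
            "required": ["url"],
        },
    },
}
```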

Usage with GLM-4.7-Flash

import json

from zai import ZaiClient
from website_downloader_tool import get_tool_schema, website_downloader

client = ZaiClient(api_key="your-api-key")

# Register the tool with the model
tools = [get_tool_schema()]

# Request a non-streaming completion so that each tool call arrives with
# its complete JSON arguments (streamed tool calls arrive in fragments)
response = client.chat.completions.create(
    model="glm-4.7",
    messages=[
        {
            "role": "user",
            "content": "Please download https://example.com for offline use"
        }
    ],
    tools=tools,
)

# Run every website_downloader call the model requested
for tool_call in response.choices[0].message.tool_calls or []:
    if tool_call.function.name == "website_downloader":
        args = json.loads(tool_call.function.arguments)
        result = website_downloader(**args)
        print(result)
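
When stream=True is used with OpenAI-compatible APIs, the arguments of a tool call usually arrive as string fragments spread over several chunks, so they can only be parsed as JSON once the stream has ended. The sketch below shows one way to merge such fragments. It operates on plain dicts as a simplified stand-in for the SDK's delta objects; the helper name accumulate_tool_calls and the dict layout are assumptions, and whether the zai SDK streams tool calls in exactly this shape has not been verified here.

```python
import json


def accumulate_tool_calls(deltas):
    """Merge streamed tool-call fragments into complete calls, keyed by index."""
    calls = {}
    for delta in deltas:
        call = calls.setdefault(delta["index"], {"name": "", "arguments": ""})
        if delta.get("name"):
            call["name"] = delta["name"]  # the name is typically sent once
        call["arguments"] += delta.get("arguments") or ""
    # Parse the JSON only after every fragment has been concatenated.
    return [
        {"name": c["name"], "arguments": json.loads(c["arguments"] or "{}")}
        for _, c in sorted(calls.items())
    ]


# Made-up fragments that split one JSON argument string in two.
fragments = [
    {"index": 0, "name": "website_downloader", "arguments": '{"url": "https://'},
    {"index": 0, "arguments": 'example.com", "max_pages": 10}'},
]
print(accumulate_tool_calls(fragments))
```

Each merged call can then be dispatched with website_downloader(**call["arguments"]) once its name has been checked.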

Direct Usage

from website_downloader_tool import website_downloader

# Download a website
result = website_downloader(
    url="https://example.com",
    destination="./downloaded_site",  # Optional
    max_pages=50,                     # Max pages to crawl
    threads=6,                        # Concurrent downloads
    download_external_assets=False,   # Include CDN assets
    external_domains=["cdn.example.com"]  # Whitelist external domains
)

if result["success"]:
    print(f"Downloaded to: {result['output_directory']}")
    print(f"Pages: {result['stats']['pages_crawled']}")
    print(f"Assets: {result['stats']['assets_downloaded']}")
else:
    print(f"Error: {result['message']}")

Parameters

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| url | string | Yes | - | Starting URL to crawl |
| destination | string | No | Derived from URL | Output folder path |
| max_pages | integer | No | 50 | Max HTML pages (1-1000) |
| threads | integer | No | 6 | Concurrent download threads (1-20) |
| download_external_assets | boolean | No | False | Download CDN assets |
| external_domains | array | No | None | Whitelist of external domains |
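
Because the arguments of a tool call are produced by the model, it can help to check them against the documented ranges before invoking the tool. The following is a minimal sketch; validate_args is a hypothetical helper that is not part of website_downloader_tool.py, and the tool may already perform equivalent checks internally.

```python
def validate_args(args):
    """Return a copy of args with documented defaults applied, or raise ValueError."""
    if not isinstance(args.get("url"), str) or not args["url"]:
        raise ValueError("url is required and must be a non-empty string")
    checked = dict(args)
    checked.setdefault("max_pages", 50)
    checked.setdefault("threads", 6)
    checked.setdefault("download_external_assets", False)
    # Ranges taken from the Parameters table above.
    for name, low, high in (("max_pages", 1, 1000), ("threads", 1, 20)):
        value = checked[name]
        if isinstance(value, bool) or not isinstance(value, int):
            raise ValueError(f"{name} must be an integer")
        if not low <= value <= high:
            raise ValueError(f"{name} must be between {low} and {high}")
    return checked


print(validate_args({"url": "https://example.com", "threads": 4}))
```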

Return Value

{
    "success": bool,
    "message": "Human-readable summary",
    "stats": {
        "pages_crawled": int,
        "assets_downloaded": int,
        "failed_downloads": int,
        "elapsed_seconds": float,
        "output_directory": str,
        "pages": [...],       # List of downloaded pages
        "downloaded_items": [...]  # List of downloaded assets
    },
    "output_directory": "/path/to/downloaded/site"
}
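
A small helper can turn this structure into a one-line summary, for example for logging. This is a sketch that assumes only the keys listed above; summarize_result is a hypothetical name, and the sample values are made up for illustration.

```python
def summarize_result(result):
    """Format a website_downloader result dict as a single line."""
    if not result.get("success"):
        return f"Download failed: {result.get('message', 'unknown error')}"
    stats = result.get("stats", {})
    return (
        f"{stats.get('pages_crawled', 0)} pages, "
        f"{stats.get('assets_downloaded', 0)} assets, "
        f"{stats.get('failed_downloads', 0)} failures "
        f"in {stats.get('elapsed_seconds', 0.0):.1f}s "
        f"-> {result.get('output_directory', '?')}"
    )


# Made-up sample result, shaped like the structure documented above.
sample = {
    "success": True,
    "message": "Done",
    "stats": {"pages_crawled": 3, "assets_downloaded": 12,
              "failed_downloads": 1, "elapsed_seconds": 2.5},
    "output_directory": "./downloaded_site",
}
print(summarize_result(sample))
```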

Website Downloader CLI

The original website-downloader.py can still be used as a standalone CLI tool:

python website-downloader.py --url https://example.com --max-pages 50 --threads 6

CLI Options

  • --url: Starting URL to crawl (required)
  • --destination: Output folder (optional, derived from URL if not provided)
  • --max-pages: Maximum pages to crawl (default: 50)
  • --threads: Number of download threads (default: 6)
  • --download-external-assets: Enable external asset downloading
  • --external-domains: Whitelist of external domains to download from
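
The options above map naturally onto an argparse parser. The sketch below is not the code from website-downloader.py; it only mirrors the documented flags and defaults. Accepting --external-domains as a space-separated list (nargs="*") is an assumption, since the README does not state how multiple domains are passed.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented CLI options (illustrative)."""
    parser = argparse.ArgumentParser(description="Mirror a website for offline use.")
    parser.add_argument("--url", required=True, help="Starting URL to crawl")
    parser.add_argument("--destination", default=None,
                        help="Output folder (derived from the URL if omitted)")
    parser.add_argument("--max-pages", type=int, default=50,
                        help="Maximum pages to crawl")
    parser.add_argument("--threads", type=int, default=6,
                        help="Number of download threads")
    parser.add_argument("--download-external-assets", action="store_true",
                        help="Enable external asset downloading")
    # Assumption: domains are given as space-separated values.
    parser.add_argument("--external-domains", nargs="*", default=None,
                        help="Whitelist of external domains")
    return parser


args = build_parser().parse_args(["--url", "https://example.com", "--max-pages", "10"])
print(args.max_pages, args.threads, args.download_external_assets)
```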

Installation

pip install -r requirements.txt

Project Structure

docrag/
├── website-downloader.py      # Core website downloader (CLI)
├── website_downloader_tool.py # Tool wrapper for GLM-4.7-Flash
├── requirements.txt           # Python dependencies
└── README.md                  # This file

Integration with RAG

The downloaded website content can be processed for RAG systems:

  1. Use the tool to download website content
  2. Parse the downloaded HTML files
  3. Extract text content and metadata
  4. Chunk and embed the content
  5. Store in your vector database
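
Steps 2 to 4 can be sketched with the standard library alone. The example below extracts visible text from the downloaded HTML files and splits it into overlapping character chunks; embedding and storage (the rest of step 4 and step 5) depend on the chosen model and vector database and are left out. All names here (TextExtractor, html_to_text, chunk_text, load_chunks) and the chunk sizes are assumptions for illustration, not part of DocRAG.

```python
from html.parser import HTMLParser
from pathlib import Path


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML document as one string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def chunk_text(text, size=500, overlap=50):
    """Split text into character chunks of at most `size`, sharing `overlap` characters."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]


def load_chunks(site_dir):
    """Yield (source_path, chunk) pairs for every .html file under site_dir."""
    for path in sorted(Path(site_dir).rglob("*.html")):
        text = html_to_text(path.read_text(encoding="utf-8", errors="ignore"))
        for chunk in chunk_text(text):
            yield str(path), chunk
```

Each (source_path, chunk) pair from load_chunks("./downloaded_site") can then be passed to an embedding model, with the path kept as metadata in the vector database.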

License

Private repository - All rights reserved.