Z User aa69b2f496 Add website downloader tool wrapper for GLM-4.7-Flash - Create website_downloader_tool.py with OpenAI function calling schema - Add comprehensive tool documentation - Update README with usage examples - Update requirements.txt with optional sdk dependency		2026-03-29 00:16:54 +00:00
.gitignore	Initial commit	2026-03-28 15:51:14 -07:00
README.md	Add website downloader tool wrapper for GLM-4.7-Flash	2026-03-29 00:16:54 +00:00
requirements.txt	Add website downloader tool wrapper for GLM-4.7-Flash	2026-03-29 00:16:54 +00:00
website_downloader_tool.py	Add website downloader tool wrapper for GLM-4.7-Flash	2026-03-29 00:16:54 +00:00
website-downloader.py	tool1 and init req file	2026-03-28 16:04:27 -07:00

README.md

DocRAG - Custom RAG with Document Loader

A RAG (Retrieval-Augmented Generation) system with its own document loader. It acts as a local OpenAI-compatible server backed by a remote LLM with custom tools.

Components

Website Downloader Tool

The website_downloader_tool.py provides a tool interface for downloading and mirroring websites for offline use or RAG ingestion. It can be used by GLM-4.7-Flash via the z-ai-web-dev-sdk.

Features

Downloads HTML pages and all linked assets (CSS, JS, images, fonts, etc.)
Rewrites links for offline viewing
Supports concurrent downloads with configurable thread count
Optional external asset downloading from CDNs
Domain whitelisting for external assets
Comprehensive error handling and statistics

Tool Schema

The tool follows the OpenAI function calling format:

from website_downloader_tool import get_tool_schema, website_downloader

# Get the tool schema for registration
schema = get_tool_schema()
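
For orientation, a schema in the OpenAI function calling format for this tool might look roughly like the one below. This is a sketch reconstructed from the Parameters table in this README, not the actual output of get_tool_schema(); the name EXAMPLE_SCHEMA, the description strings, and the use of "minimum"/"maximum" keys are assumptions.

```python
# Illustrative only: reconstructed from the Parameters table in this README.
# The real get_tool_schema() may differ in wording and detail.
EXAMPLE_SCHEMA = {
    "type": "function",
    "function": {
        "name": "website_downloader",
        "description": "Download and mirror a website for offline use or RAG ingestion.",
        "parameters": {
            "type": "object",
            "properties": {
                "url": {"type": "string", "description": "Starting URL to crawl"},
                "destination": {"type": "string", "description": "Output folder path"},
                "max_pages": {"type": "integer", "minimum": 1, "maximum": 1000, "default": 50},
                "threads": {"type": "integer", "minimum": 1, "maximum": 20, "default": 6},
                "download_external_assets": {"type": "boolean", "default": False},
                "external_domains": {"type": "array", "items": {"type": "string"}},
            },
            # Only the starting URL is mandatory; everything else has a default.
            "required": ["url"],
        },
    },
}
```

Comparing this shape against the real schema is a quick way to check that the model sees the same parameter names the function accepts.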

Usage with GLM-4.7-Flash

from zai import ZaiClient
from website_downloader_tool import get_tool_schema, website_downloader

client = ZaiClient(api_key="your-api-key")

# Define the tool
tools = [get_tool_schema()]

# Create a chat completion with tools
response = client.chat.completions.create(
    model="glm-4.7",
    messages=[
        {
            "role": "user",
            "content": "Please download https://example.com for offline use"
        }
    ],
    tools=tools,
    stream=True,
)

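# Handle tool calls in the response. Streamed tool calls can arrive in
# fragments, so accumulate each call's name and arguments before parsing.
# Assumes an OpenAI-style `index` field on each streamed tool call.
import json

pending = {}  # tool call index -> {"name": str, "arguments": str}
for chunk in response:
    if not chunk.choices:
        continue
    for tool_call in chunk.choices[0].delta.tool_calls or []:
        if tool_call.function is None:
            continue
        call = pending.setdefault(tool_call.index, {"name": "", "arguments": ""})
        if tool_call.function.name:
            call["name"] = tool_call.function.name
        call["arguments"] += tool_call.function.arguments or ""

# Parse and run each completed call once the stream has ended
for call in pending.values():
    if call["name"] == "website_downloader":
        args = json.loads(call["arguments"])
        result = website_downloader(**args)
        print(result)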
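
Once a tool call's name and JSON arguments are available, routing them to the right Python function can be kept separate from the chat loop. The helper below is a minimal sketch of that idea; dispatch_tool_call and the registry dictionary are hypothetical names, not part of website_downloader_tool.py.

```python
import json
from typing import Any, Callable, Dict


def dispatch_tool_call(name: str, arguments_json: str,
                       registry: Dict[str, Callable[..., Any]]) -> str:
    """Run a registered tool and return a JSON string suitable for a tool message."""
    func = registry.get(name)
    if func is None:
        return json.dumps({"success": False, "message": f"Unknown tool: {name}"})
    try:
        args = json.loads(arguments_json or "{}")
    except json.JSONDecodeError as exc:
        return json.dumps({"success": False, "message": f"Invalid arguments: {exc}"})
    if not isinstance(args, dict):
        return json.dumps({"success": False, "message": "Arguments must be a JSON object"})
    try:
        result = func(**args)
    except TypeError as exc:
        # Typically raised for unexpected or missing keyword arguments.
        return json.dumps({"success": False, "message": f"Bad call: {exc}"})
    # default=str keeps the dump from failing on non-JSON values such as paths.
    return json.dumps(result, default=str)
```

With the real tool, the registry would be {"website_downloader": website_downloader}, and the returned string can be sent back to the model as the content of a "tool" role message.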

Direct Usage

from website_downloader_tool import website_downloader

# Download a website
result = website_downloader(
    url="https://example.com",
    destination="./downloaded_site",  # Optional
    max_pages=50,                     # Max pages to crawl
    threads=6,                        # Concurrent downloads
    download_external_assets=False,   # Include CDN assets
    external_domains=["cdn.example.com"]  # Whitelist external domains
)

if result["success"]:
    print(f"Downloaded to: {result['output_directory']}")
    print(f"Pages: {result['stats']['pages_crawled']}")
    print(f"Assets: {result['stats']['assets_downloaded']}")
else:
    print(f"Error: {result['message']}")

Parameters

Parameter	Type	Required	Default	Description
`url`	string	Yes	-	Starting URL to crawl
`destination`	string	No	Derived from URL	Output folder path
`max_pages`	integer	No	50	Max HTML pages (1-1000)
`threads`	integer	No	6	Concurrent download threads (1-20)
`download_external_assets`	boolean	No	False	Download CDN assets
`external_domains`	array	No	None	Whitelist of external domains
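
Because model-generated arguments can fall outside the documented ranges, it may help to check them before starting a crawl. The following is a sketch under assumptions: validate_args is a hypothetical helper, and the http/https scheme check is not stated anywhere in this README.

```python
from typing import Any, Dict, List, Optional


def validate_args(url: str,
                  max_pages: int = 50,
                  threads: int = 6,
                  download_external_assets: bool = False,
                  external_domains: Optional[List[str]] = None) -> Dict[str, Any]:
    """Check arguments against the ranges in the table; raise ValueError if invalid."""
    # Assumption: only http(s) URLs are meaningful for a website crawl.
    if not url.startswith(("http://", "https://")):
        raise ValueError("url must start with http:// or https://")
    if not 1 <= max_pages <= 1000:
        raise ValueError("max_pages must be between 1 and 1000")
    if not 1 <= threads <= 20:
        raise ValueError("threads must be between 1 and 20")
    return {
        "url": url,
        "max_pages": max_pages,
        "threads": threads,
        "download_external_assets": download_external_assets,
        "external_domains": external_domains,
    }
```

The returned dictionary uses the same keys as the table, so it can be passed on as website_downloader(**validated).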

Return Value

{
    "success": True/False,
    "message": "Human-readable summary",
    "stats": {
        "pages_crawled": int,
        "assets_downloaded": int,
        "failed_downloads": int,
        "elapsed_seconds": float,
        "output_directory": str,
        "pages": [...],       # List of downloaded pages
        "downloaded_items": [...]  # List of downloaded assets
    },
    "output_directory": "/path/to/downloaded/site"
}
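
A small helper can turn this dictionary into a one-line status for logs or for a reply to the model. This is only a sketch: summarize_result is a hypothetical name, and it reads only the keys listed above, using defaults for any that are missing.

```python
from typing import Any, Dict


def summarize_result(result: Dict[str, Any]) -> str:
    """Build a one-line summary from a website_downloader-style result dict."""
    if not result.get("success"):
        return f"Download failed: {result.get('message', 'no message')}"
    stats = result.get("stats", {})
    return (
        f"{stats.get('pages_crawled', 0)} pages, "
        f"{stats.get('assets_downloaded', 0)} assets, "
        f"{stats.get('failed_downloads', 0)} failures "
        f"in {stats.get('elapsed_seconds', 0.0):.1f}s "
        f"-> {result.get('output_directory', '?')}"
    )
```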

Website Downloader CLI

The original website-downloader.py can still be used as a standalone CLI tool:

python website-downloader.py --url https://example.com --max-pages 50 --threads 6

CLI Options

--url: Starting URL to crawl (required)
--destination: Output folder (optional, derived from URL if not provided)
--max-pages: Maximum pages to crawl (default: 50)
--threads: Number of download threads (default: 6)
--download-external-assets: Enable external asset downloading
--external-domains: Whitelist of external domains to download from
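
When the CLI is driven from another script, the options above can be assembled into an argument list programmatically. The sketch below uses a hypothetical helper, build_cli_command; it assumes that --external-domains accepts several space-separated values, which this README does not confirm.

```python
import sys
from typing import List, Optional


def build_cli_command(url: str,
                      destination: Optional[str] = None,
                      max_pages: int = 50,
                      threads: int = 6,
                      download_external_assets: bool = False,
                      external_domains: Optional[List[str]] = None) -> List[str]:
    """Build an argv list for website-downloader.py from keyword options."""
    cmd = [sys.executable, "website-downloader.py",
           "--url", url,
           "--max-pages", str(max_pages),
           "--threads", str(threads)]
    if destination:
        cmd += ["--destination", destination]
    if download_external_assets:
        cmd.append("--download-external-assets")
    if external_domains:
        # Assumption: the flag takes space-separated domain names.
        cmd += ["--external-domains", *external_domains]
    return cmd
```

The resulting list can be passed to subprocess.run(cmd, check=True) from the directory that contains website-downloader.py.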

Installation

pip install -r requirements.txt

Project Structure

docrag/
├── website-downloader.py      # Core website downloader (CLI)
├── website_downloader_tool.py # Tool wrapper for GLM-4.7-Flash
├── requirements.txt           # Python dependencies
└── README.md                  # This file

Integration with RAG

The downloaded website content can be processed for RAG systems:

Use the tool to download website content
Parse the downloaded HTML files
Extract text content and metadata
Chunk and embed the content
Store in your vector database
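
The parsing, extraction, and chunking steps above can be sketched with the standard library alone. The names TextExtractor, html_to_text, chunk_text, and iter_site_chunks are hypothetical, the word-based chunk sizes are arbitrary, and embedding and vector storage are left out because they depend on the model and database you choose.

```python
from html.parser import HTMLParser
from pathlib import Path
from typing import Dict, Iterator, List, Union


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML document as one string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def chunk_text(text: str, size: int = 200, overlap: int = 20) -> List[str]:
    """Split text into chunks of `size` words, each sharing `overlap` words with the previous."""
    if size <= overlap:
        raise ValueError("size must be greater than overlap")
    words = text.split()
    chunks: List[str] = []
    for start in range(0, len(words), size - overlap):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


def iter_site_chunks(root: str) -> Iterator[Dict[str, Union[str, int]]]:
    """Yield chunk records for every .html file under the output directory."""
    for path in sorted(Path(root).rglob("*.html")):
        text = html_to_text(path.read_text(encoding="utf-8", errors="ignore"))
        for index, chunk in enumerate(chunk_text(text)):
            yield {"source": str(path), "chunk_index": index, "text": chunk}
```

Each yielded record carries its source path and position, which gives the embedding step enough metadata to trace a retrieved chunk back to the page it came from.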