DocRAG - Custom RAG with Document Loader
DocRAG is a custom RAG (Retrieval-Augmented Generation) system with its own document loader. It acts as a local OpenAI-compatible server and uses a remote LLM with custom tools.
Components
Website Downloader Tool
The website_downloader_tool.py provides a tool interface for downloading and mirroring websites for offline use or RAG ingestion. It can be used by GLM-4.7-Flash via the z-ai-web-dev-sdk.
Features
- Downloads HTML pages and all linked assets (CSS, JS, images, fonts, etc.)
- Rewrites links for offline viewing
- Supports concurrent downloads with configurable thread count
- Optional external asset downloading from CDNs
- Domain whitelisting for external assets
- Error handling with per-run statistics (pages crawled, assets downloaded, failed downloads, elapsed time)
Tool Schema
The tool follows the OpenAI function calling format:
from website_downloader_tool import get_tool_schema, website_downloader
# Get the tool schema for registration
schema = get_tool_schema()
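For orientation, a schema in the OpenAI function calling format might look roughly like the sketch below. It is reconstructed from the Parameters table in this README, not copied from `website_downloader_tool.py`, so the name `EXAMPLE_SCHEMA`, the description strings, and the exact keys returned by `get_tool_schema()` are assumptions.

```python
# Hypothetical sketch of an OpenAI function calling schema for this tool.
# The real get_tool_schema() may return different wording or extra fields.
EXAMPLE_SCHEMA = {
    "type": "function",
    "function": {
        "name": "website_downloader",
        "description": "Download and mirror a website for offline use or RAG ingestion.",
        "parameters": {
            "type": "object",
            "properties": {
                "url": {"type": "string", "description": "Starting URL to crawl"},
                "destination": {"type": "string", "description": "Output folder path"},
                "max_pages": {"type": "integer", "minimum": 1, "maximum": 1000, "default": 50},
                "threads": {"type": "integer", "minimum": 1, "maximum": 20, "default": 6},
                "download_external_assets": {"type": "boolean", "default": False},
                "external_domains": {"type": "array", "items": {"type": "string"}},
            },
            # Only the starting URL is mandatory; everything else has a default.
            "required": ["url"],
        },
    },
}
```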
Usage with GLM-4.7-Flash
import json
from zai import ZaiClient
from website_downloader_tool import get_tool_schema, website_downloader
client = ZaiClient(api_key="your-api-key")
# Define the tool
tools = [get_tool_schema()]
# Create a chat completion with tools. Streaming is left off here so that
# each tool call arrives with its complete JSON arguments.
response = client.chat.completions.create(
    model="glm-4.7",
    messages=[
        {
            "role": "user",
            "content": "Please download https://example.com for offline use"
        }
    ],
    tools=tools,
)
# Handle tool calls in the response
for tool_call in response.choices[0].message.tool_calls or []:
    if tool_call.function.name == "website_downloader":
        args = json.loads(tool_call.function.arguments)
        result = website_downloader(**args)
        print(result)
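If streaming is enabled (`stream=True`), OpenAI-compatible APIs typically deliver a tool call's arguments as string fragments spread over several chunks, so the JSON can only be parsed once the stream has ended. The sketch below merges fragments by their `index` field. It assumes the SDK follows the OpenAI streaming layout (`chunk.choices[0].delta.tool_calls`); `collect_tool_calls` and `_fake_chunk` are hypothetical helpers, and the fake chunks only stand in for a real response.

```python
import json
from types import SimpleNamespace


def collect_tool_calls(chunks):
    """Merge streamed tool-call fragments into complete calls, keyed by index."""
    calls = {}
    for chunk in chunks:
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta
        for fragment in getattr(delta, "tool_calls", None) or []:
            entry = calls.setdefault(fragment.index, {"name": "", "arguments": ""})
            # The function name usually appears only in the first fragment.
            if fragment.function.name:
                entry["name"] = fragment.function.name
            # Argument text is appended until the stream ends.
            if fragment.function.arguments:
                entry["arguments"] += fragment.function.arguments
    return [
        {"name": call["name"], "arguments": json.loads(call["arguments"] or "{}")}
        for _, call in sorted(calls.items())
    ]


def _fake_chunk(index, name, arguments):
    """Build a stand-in for one streamed chunk (demonstration only)."""
    function = SimpleNamespace(name=name, arguments=arguments)
    fragment = SimpleNamespace(index=index, function=function)
    delta = SimpleNamespace(tool_calls=[fragment])
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


demo_chunks = [
    _fake_chunk(0, "website_downloader", '{"url": "https://'),
    _fake_chunk(0, None, 'example.com", "max_pages": 10}'),
]
demo_calls = collect_tool_calls(demo_chunks)
```

Each entry in `demo_calls` can then be dispatched with `website_downloader(**call["arguments"])` when `call["name"]` matches the tool.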
Direct Usage
from website_downloader_tool import website_downloader
# Download a website
result = website_downloader(
    url="https://example.com",
    destination="./downloaded_site",       # Optional
    max_pages=50,                          # Max pages to crawl
    threads=6,                             # Concurrent downloads
    download_external_assets=False,        # Include CDN assets
    external_domains=["cdn.example.com"]   # Whitelist external domains
)
if result["success"]:
    print(f"Downloaded to: {result['output_directory']}")
    print(f"Pages: {result['stats']['pages_crawled']}")
    print(f"Assets: {result['stats']['assets_downloaded']}")
else:
    print(f"Error: {result['message']}")
Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `url` | string | Yes | - | Starting URL to crawl |
| `destination` | string | No | Derived from URL | Output folder path |
| `max_pages` | integer | No | 50 | Max HTML pages (1-1000) |
| `threads` | integer | No | 6 | Concurrent download threads (1-20) |
| `download_external_assets` | boolean | No | False | Download CDN assets |
| `external_domains` | array | No | None | Whitelist of external domains |
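Arguments produced by a model are not guaranteed to respect these ranges, so it can help to normalize them before calling the tool. The sketch below applies the documented defaults and clamps `max_pages` and `threads` to their limits. `normalize_args` is a hypothetical helper, not part of `website_downloader_tool.py`, and whether the tool performs similar checks internally is not stated here. Clamping is one possible policy; rejecting out-of-range values would be equally reasonable.

```python
def normalize_args(args):
    """Apply the documented defaults and range limits to a dict of tool arguments."""
    if not isinstance(args.get("url"), str) or not args["url"]:
        raise ValueError("'url' is required and must be a non-empty string")

    cleaned = {"url": args["url"]}
    # destination is optional; when omitted the tool derives it from the URL.
    if args.get("destination"):
        cleaned["destination"] = str(args["destination"])

    # Clamp numeric values to the documented ranges (1-1000 and 1-20).
    cleaned["max_pages"] = min(max(int(args.get("max_pages", 50)), 1), 1000)
    cleaned["threads"] = min(max(int(args.get("threads", 6)), 1), 20)

    cleaned["download_external_assets"] = bool(args.get("download_external_assets", False))
    domains = args.get("external_domains")
    cleaned["external_domains"] = list(domains) if domains else None
    return cleaned
```

For example, `normalize_args({"url": "https://example.com", "threads": 99})` yields `threads=20` and `max_pages=50`.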
Return Value
{
    "success": bool,
    "message": "Human-readable summary",
    "stats": {
        "pages_crawled": int,
        "assets_downloaded": int,
        "failed_downloads": int,
        "elapsed_seconds": float,
        "output_directory": str,
        "pages": [...],             # List of downloaded pages
        "downloaded_items": [...]   # List of downloaded assets
    },
    "output_directory": "/path/to/downloaded/site"
}
Website Downloader CLI
The original website-downloader.py can still be used as a standalone CLI tool:
python website-downloader.py --url https://example.com --max-pages 50 --threads 6
CLI Options
- --url: Starting URL to crawl (required)
- --destination: Output folder (optional, derived from URL if not provided)
- --max-pages: Maximum pages to crawl (default: 50)
- --threads: Number of download threads (default: 6)
- --download-external-assets: Enable external asset downloading
- --external-domains: Whitelist of external domains to download from
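As an illustration of how these options fit together, the sketch below declares them with `argparse`. It is not the parser from `website-downloader.py`: `build_parser` is a hypothetical name, and passing several `--external-domains` values separated by spaces is an assumption about the flag's syntax.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented CLI options."""
    parser = argparse.ArgumentParser(description="Mirror a website for offline use.")
    parser.add_argument("--url", required=True, help="Starting URL to crawl")
    parser.add_argument("--destination", default=None, help="Output folder")
    parser.add_argument("--max-pages", type=int, default=50, help="Maximum pages to crawl")
    parser.add_argument("--threads", type=int, default=6, help="Number of download threads")
    parser.add_argument("--download-external-assets", action="store_true",
                        help="Enable external asset downloading")
    # Assumption: domains are listed space-separated after the flag.
    parser.add_argument("--external-domains", nargs="*", default=None,
                        help="Whitelist of external domains")
    return parser


demo_args = build_parser().parse_args(
    ["--url", "https://example.com", "--external-domains", "cdn.example.com"]
)
```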
Installation
pip install -r requirements.txt
Project Structure
docrag/
├── website-downloader.py # Core website downloader (CLI)
├── website_downloader_tool.py # Tool wrapper for GLM-4.7-Flash
├── requirements.txt # Python dependencies
└── README.md # This file
Integration with RAG
The downloaded website content can be processed for RAG systems:
- Use the tool to download website content
- Parse the downloaded HTML files
- Extract text content and metadata
- Chunk and embed the content
- Store in your vector database
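The parsing, extraction, and chunking steps above can be sketched with the standard library alone. This is a minimal illustration, not the project's loader: `TextExtractor`, `html_to_text`, `chunk_text`, and `load_chunks` are hypothetical names, fixed-size character windows are only one possible chunking strategy, and embedding and storage are left out because they depend on the model and vector database you choose.

```python
from html.parser import HTMLParser
from pathlib import Path


class TextExtractor(HTMLParser):
    """Collect visible text and the <title>, skipping script and style content."""

    SKIPPED = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.title = ""
        self._skip_depth = 0
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        text = data.strip()
        if not text or self._skip_depth:
            return
        if self._in_title:
            self.title += text
        else:
            self.parts.append(text)


def html_to_text(html):
    """Return (title, text) for one HTML document."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.title, " ".join(parser.parts)


def chunk_text(text, size=500, overlap=50):
    """Split text into overlapping character windows."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]


def load_chunks(site_dir, size=500, overlap=50):
    """Yield one record per chunk for every .html file under site_dir."""
    for path in sorted(Path(site_dir).rglob("*.html")):
        html = path.read_text(encoding="utf-8", errors="ignore")
        title, text = html_to_text(html)
        for n, chunk in enumerate(chunk_text(text, size, overlap)):
            yield {"source": str(path), "title": title, "chunk": n, "text": chunk}
```

Calling `list(load_chunks(result["output_directory"]))` on a successful download would give records ready to pass to an embedding model, with `source` and `title` kept as metadata.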
License
Private repository - All rights reserved.