docrag/README.md
Z User aa69b2f496 Add website downloader tool wrapper for GLM-4.7-Flash
- Create website_downloader_tool.py with OpenAI function calling schema
- Add comprehensive tool documentation
- Update README with usage examples
- Update requirements.txt with optional sdk dependency
2026-03-29 00:16:54 +00:00

# DocRAG - Custom RAG with Document Loader
DocRAG is a custom RAG (Retrieval-Augmented Generation) system with its own document loader. It acts as a local OpenAI-compatible server backed by a remote LLM with custom tools.
## Components
### Website Downloader Tool
The `website_downloader_tool.py` module provides a tool interface for downloading and mirroring websites, either for offline use or for RAG ingestion. GLM-4.7-Flash can call it via the z-ai-web-dev-sdk.
#### Features
- Downloads HTML pages and all linked assets (CSS, JS, images, fonts, etc.)
- Rewrites links for offline viewing
- Supports concurrent downloads with configurable thread count
- Optional external asset downloading from CDNs
- Domain whitelisting for external assets
- Comprehensive error handling and statistics
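
The interaction between external asset downloading and domain whitelisting can be illustrated with a small standalone check. This is a sketch under stated assumptions, not the downloader's actual logic: the function name `is_download_allowed` is hypothetical, and the policy (same-host assets always allowed, external hosts only when enabled, an empty whitelist meaning "any external host") is one plausible reading of the features above.

```python
from urllib.parse import urlparse


def is_download_allowed(asset_url, start_url, download_external_assets=False,
                        external_domains=None):
    """Decide whether an asset URL should be fetched (illustrative policy only)."""
    asset_host = urlparse(asset_url).hostname
    start_host = urlparse(start_url).hostname
    if asset_host is None or start_host is None:
        return False
    # Assets on the same host as the starting URL are always in scope.
    if asset_host == start_host:
        return True
    # External hosts are skipped unless external downloading is enabled.
    if not download_external_assets:
        return False
    # Assumption: no whitelist means every external host is acceptable.
    if not external_domains:
        return True
    return asset_host in external_domains
```

With this policy, `https://cdn.example.com/app.js` is fetched for a crawl of `https://example.com/` only when `download_external_assets=True` and the whitelist is either empty or contains `cdn.example.com`.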
#### Tool Schema
The tool follows the OpenAI function calling format:
```python
from website_downloader_tool import get_tool_schema, website_downloader
# Get the tool schema for registration
schema = get_tool_schema()
```
#### Usage with GLM-4.7-Flash
```python
import json

from zai import ZaiClient
from website_downloader_tool import get_tool_schema, website_downloader

client = ZaiClient(api_key="your-api-key")

# Define the tool
tools = [get_tool_schema()]

# Create a chat completion with tools
response = client.chat.completions.create(
    model="glm-4.7",
    messages=[
        {
            "role": "user",
            "content": "Please download https://example.com for offline use"
        }
    ],
    tools=tools,
    stream=True,
)

# Streamed tool calls can arrive in fragments, so collect the function name
# and argument text across chunks before parsing the JSON.
# This handles a single tool call per response.
name, arguments = None, ""
for chunk in response:
    if not chunk.choices:
        continue
    tool_calls = chunk.choices[0].delta.tool_calls
    if tool_calls:
        function = tool_calls[0].function
        name = function.name or name
        arguments += function.arguments or ""

# Run the tool once the full argument string is available
if name == "website_downloader":
    args = json.loads(arguments)
    result = website_downloader(**args)
    print(result)
```
#### Direct Usage
```python
from website_downloader_tool import website_downloader

# Download a website
result = website_downloader(
    url="https://example.com",
    destination="./downloaded_site",       # Optional
    max_pages=50,                          # Max pages to crawl
    threads=6,                             # Concurrent downloads
    download_external_assets=False,        # Include CDN assets
    external_domains=["cdn.example.com"]   # Whitelist external domains
)

if result["success"]:
    print(f"Downloaded to: {result['output_directory']}")
    print(f"Pages: {result['stats']['pages_crawled']}")
    print(f"Assets: {result['stats']['assets_downloaded']}")
else:
    print(f"Error: {result['message']}")
```
#### Parameters
| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `url` | string | Yes | - | Starting URL to crawl |
| `destination` | string | No | Derived from URL | Output folder path |
| `max_pages` | integer | No | 50 | Max HTML pages (1-1000) |
| `threads` | integer | No | 6 | Concurrent download threads (1-20) |
| `download_external_assets` | boolean | No | False | Download CDN assets |
| `external_domains` | array | No | None | Whitelist of external domains |
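
For orientation, the table above maps onto an OpenAI function calling schema roughly as follows. This is a hypothetical sketch assembled from the parameter table: the constant name `WEBSITE_DOWNLOADER_SCHEMA` and the description strings are assumptions, and the actual output of `get_tool_schema()` may differ in wording or structure.

```python
# Hypothetical schema built from the parameter table; the real
# get_tool_schema() output is not shown in this README and may differ.
WEBSITE_DOWNLOADER_SCHEMA = {
    "type": "function",
    "function": {
        "name": "website_downloader",
        "description": "Download a website and its assets for offline use.",
        "parameters": {
            "type": "object",
            "properties": {
                "url": {
                    "type": "string",
                    "description": "Starting URL to crawl",
                },
                "destination": {
                    "type": "string",
                    "description": "Output folder path",
                },
                "max_pages": {
                    "type": "integer",
                    "minimum": 1,
                    "maximum": 1000,
                    "default": 50,
                },
                "threads": {
                    "type": "integer",
                    "minimum": 1,
                    "maximum": 20,
                    "default": 6,
                },
                "download_external_assets": {
                    "type": "boolean",
                    "default": False,
                },
                "external_domains": {
                    "type": "array",
                    "items": {"type": "string"},
                },
            },
            # Only the starting URL is mandatory.
            "required": ["url"],
        },
    },
}
```

The `minimum` and `maximum` keywords carry the documented ranges, so a model or validator that honors JSON Schema can reject out-of-range values before the tool runs.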
#### Return Value
```python
{
    "success": bool,                  # True on success, False on failure
    "message": "Human-readable summary",
    "stats": {
        "pages_crawled": int,
        "assets_downloaded": int,
        "failed_downloads": int,
        "elapsed_seconds": float,
        "output_directory": str,
        "pages": [...],               # List of downloaded pages
        "downloaded_items": [...]     # List of downloaded assets
    },
    "output_directory": "/path/to/downloaded/site"
}
```
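
A caller will often want to reduce this dictionary to a short status line, for example when returning the tool output to the model. The helper below is a minimal sketch; `summarize_result` is a hypothetical name, and the example values are made up to match the shape documented above.

```python
def summarize_result(result):
    """Build a one-line summary from a website_downloader() result dict."""
    if not result.get("success"):
        return f"Download failed: {result.get('message', 'unknown error')}"
    stats = result.get("stats", {})
    return (
        f"{stats.get('pages_crawled', 0)} pages, "
        f"{stats.get('assets_downloaded', 0)} assets, "
        f"{stats.get('failed_downloads', 0)} failures "
        f"in {stats.get('elapsed_seconds', 0.0):.1f}s "
        f"-> {result.get('output_directory', '')}"
    )


# Made-up values shaped like the documented return value.
example = {
    "success": True,
    "message": "ok",
    "stats": {
        "pages_crawled": 3,
        "assets_downloaded": 12,
        "failed_downloads": 1,
        "elapsed_seconds": 2.5,
    },
    "output_directory": "./downloaded_site",
}
print(summarize_result(example))
```

Using `.get()` with defaults keeps the helper from raising if a failed run returns only `success` and `message`.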
### Website Downloader CLI
The original `website-downloader.py` can still be used as a standalone CLI tool:
```bash
python website-downloader.py --url https://example.com --max-pages 50 --threads 6
```
#### CLI Options
- `--url`: Starting URL to crawl (required)
- `--destination`: Output folder (optional, derived from URL if not provided)
- `--max-pages`: Maximum pages to crawl (default: 50)
- `--threads`: Number of download threads (default: 6)
- `--download-external-assets`: Enable external asset downloading
- `--external-domains`: Whitelist of external domains to download from
## Installation
```bash
pip install -r requirements.txt
```
## Project Structure
```
docrag/
├── website-downloader.py # Core website downloader (CLI)
├── website_downloader_tool.py # Tool wrapper for GLM-4.7-Flash
├── requirements.txt # Python dependencies
└── README.md # This file
```
## Integration with RAG
The downloaded website content can be processed for RAG systems:
1. Use the tool to download website content
2. Parse the downloaded HTML files
3. Extract text content and metadata
4. Chunk and embed the content
5. Store in your vector database
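
Steps 2 to 4 can be sketched with the standard library alone. This is an illustrative outline rather than DocRAG's actual loader: the names `TextExtractor`, `html_to_text`, `chunk_text`, and `load_chunks` are hypothetical, chunking is by fixed character windows, and embedding and storage are left out because they depend on the model and vector database you choose.

```python
from html.parser import HTMLParser
from pathlib import Path


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML document as one string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def chunk_text(text, size=500, overlap=50):
    """Split text into overlapping character windows."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]


def load_chunks(site_dir, size=500, overlap=50):
    """Yield (source_path, chunk) pairs for every HTML file under site_dir."""
    for path in sorted(Path(site_dir).rglob("*.html")):
        html = path.read_text(encoding="utf-8", errors="ignore")
        for chunk in chunk_text(html_to_text(html), size, overlap):
            yield str(path), chunk
```

Each `(source_path, chunk)` pair can then be passed to an embedding model, with the path kept as metadata so retrieved chunks can be traced back to their page.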
## License
Private repository - All rights reserved.