# DocRAG - Custom RAG with Document Loader

A custom RAG (Retrieval-Augmented Generation) system with a custom document loader that acts as a local OpenAI-compatible server using a remote LLM with custom tools.

## Components

### Website Downloader Tool

The `website_downloader_tool.py` module provides a tool interface for downloading and mirroring websites for offline use or RAG ingestion. It can be used by GLM-4.7-Flash via the z-ai-web-dev-sdk.

#### Features

- Downloads HTML pages and all linked assets (CSS, JS, images, fonts, etc.)
- Rewrites links for offline viewing
- Supports concurrent downloads with a configurable thread count
- Optionally downloads external assets from CDNs
- Supports domain whitelisting for external assets
- Provides comprehensive error handling and statistics

#### Tool Schema

The tool follows the OpenAI function calling format:

```python
from website_downloader_tool import get_tool_schema, website_downloader

# Get the tool schema for registration
schema = get_tool_schema()
```

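The exact output of `get_tool_schema()` is not reproduced in this README. As a rough sketch, an OpenAI-style function schema covering the parameters documented in the Parameters table could look like the following. The name `EXAMPLE_SCHEMA` and the description strings are illustrative assumptions, not the module's actual contents.

```python
# Illustrative only: a hand-written schema in the OpenAI function calling
# format. The real get_tool_schema() may differ in wording and details.
EXAMPLE_SCHEMA = {
    "type": "function",
    "function": {
        "name": "website_downloader",
        "description": "Download and mirror a website for offline use.",
        "parameters": {
            "type": "object",
            "properties": {
                "url": {"type": "string", "description": "Starting URL to crawl"},
                "destination": {"type": "string", "description": "Output folder path"},
                "max_pages": {"type": "integer", "minimum": 1, "maximum": 1000, "default": 50},
                "threads": {"type": "integer", "minimum": 1, "maximum": 20, "default": 6},
                "download_external_assets": {"type": "boolean", "default": False},
                "external_domains": {"type": "array", "items": {"type": "string"}},
            },
            # Only the starting URL is mandatory; everything else has a default
            "required": ["url"],
        },
    },
}

print(sorted(EXAMPLE_SCHEMA["function"]["parameters"]["properties"]))
```
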
#### Usage with GLM-4.7-Flash

```python
import json

from zai import ZaiClient
from website_downloader_tool import get_tool_schema, website_downloader

client = ZaiClient(api_key="your-api-key")

# Define the tool
tools = [get_tool_schema()]

# Create a chat completion with tools
response = client.chat.completions.create(
    model="glm-4.7",
    messages=[
        {
            "role": "user",
            "content": "Please download https://example.com for offline use"
        }
    ],
    tools=tools,
    stream=True,
)

# Handle tool calls in the response
for chunk in response:
    if chunk.choices[0].delta.tool_calls:
        tool_call = chunk.choices[0].delta.tool_calls[0]
        if tool_call.function.name == "website_downloader":
            args = json.loads(tool_call.function.arguments)
            result = website_downloader(**args)
            print(result)
```

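Some OpenAI-compatible servers split a tool call across several streamed chunks, so `function.arguments` may arrive as partial JSON. Whether GLM does this depends on the API configuration, so treat the following as a defensive sketch rather than required code. It accumulates fragments per tool-call index and parses them only once the stream has ended. `collect_tool_calls` and `fake_chunk` are hypothetical helpers, the `index` attribute is assumed from the OpenAI streaming format, and the chunks are simulated with `SimpleNamespace` so the example runs without an API key.

```python
import json
from types import SimpleNamespace as NS


def collect_tool_calls(chunks):
    """Merge streamed tool-call fragments into {index: {"name", "arguments"}}."""
    calls = {}
    for chunk in chunks:
        if not chunk.choices:
            continue
        for tc in chunk.choices[0].delta.tool_calls or []:
            entry = calls.setdefault(tc.index, {"name": "", "arguments": ""})
            if tc.function.name:
                entry["name"] = tc.function.name
            if tc.function.arguments:
                entry["arguments"] += tc.function.arguments
    # Parse the JSON only after all fragments have been concatenated
    return {
        i: {"name": c["name"], "arguments": json.loads(c["arguments"] or "{}")}
        for i, c in calls.items()
    }


def fake_chunk(name, arguments):
    """Build a stand-in for one streamed chunk (simulation only)."""
    call = NS(index=0, function=NS(name=name, arguments=arguments))
    return NS(choices=[NS(delta=NS(tool_calls=[call]))])


chunks = [
    fake_chunk("website_downloader", '{"url": "https://exa'),
    fake_chunk(None, 'mple.com", "max_pages": 10}'),
]
calls = collect_tool_calls(chunks)
print(calls[0])
```

With the merged arguments in hand, the call can be dispatched as in the example above, for instance `website_downloader(**calls[0]["arguments"])`.
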
#### Direct Usage

```python
from website_downloader_tool import website_downloader

# Download a website
result = website_downloader(
    url="https://example.com",
    destination="./downloaded_site",       # Optional
    max_pages=50,                          # Max pages to crawl
    threads=6,                             # Concurrent downloads
    download_external_assets=False,        # Include CDN assets
    external_domains=["cdn.example.com"],  # Whitelist external domains
)

if result["success"]:
    print(f"Downloaded to: {result['output_directory']}")
    print(f"Pages: {result['stats']['pages_crawled']}")
    print(f"Assets: {result['stats']['assets_downloaded']}")
else:
    print(f"Error: {result['message']}")
```

#### Parameters

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `url` | string | Yes | - | Starting URL to crawl |
| `destination` | string | No | Derived from URL | Output folder path |
| `max_pages` | integer | No | 50 | Max HTML pages (1-1000) |
| `threads` | integer | No | 6 | Concurrent download threads (1-20) |
| `download_external_assets` | boolean | No | False | Download CDN assets |
| `external_domains` | array | No | None | Whitelist of external domains |

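Because the model supplies these arguments, it can help to check them against the ranges in the table before calling the tool. The helper below is a minimal sketch: `validate_args` is a hypothetical name, the http(s) prefix check is an assumption, and the tool itself may already perform equivalent checks.

```python
def validate_args(args):
    """Return a list of problems found in a tool-call argument dict."""
    problems = []

    url = args.get("url")
    # Assumption: only http(s) URLs are meaningful starting points
    if not isinstance(url, str) or not url.startswith(("http://", "https://")):
        problems.append("url must be an http(s) URL string")

    # Ranges taken from the parameter table: name -> (min, max)
    for name, (low, high) in {"max_pages": (1, 1000), "threads": (1, 20)}.items():
        if name in args:
            value = args[name]
            # bool is a subclass of int in Python, so reject it explicitly
            if isinstance(value, bool) or not isinstance(value, int) or not low <= value <= high:
                problems.append(f"{name} must be an integer in [{low}, {high}]")

    domains = args.get("external_domains")
    if domains is not None:
        if not (isinstance(domains, list) and all(isinstance(d, str) for d in domains)):
            problems.append("external_domains must be a list of strings")

    return problems


print(validate_args({"url": "https://example.com", "threads": 6}))
print(validate_args({"url": "example.com", "max_pages": 5000}))
```

An empty list means the arguments passed every check; otherwise the messages can be returned to the model so it can retry with corrected values.
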
#### Return Value

`website_downloader` returns a dictionary with the following shape:

```python
{
    "success": bool,
    "message": str,                 # Human-readable summary
    "stats": {
        "pages_crawled": int,
        "assets_downloaded": int,
        "failed_downloads": int,
        "elapsed_seconds": float,
        "output_directory": str,
        "pages": [...],             # List of downloaded pages
        "downloaded_items": [...]   # List of downloaded assets
    },
    "output_directory": str         # e.g. "/path/to/downloaded/site"
}
```

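As a usage sketch, the fields above can be folded into a one-line summary for logging. `summarize_result` is a hypothetical helper, and the sample dictionary is made-up data that only mirrors the documented shape.

```python
def summarize_result(result):
    """Format a website_downloader-style result dict as a single log line."""
    if not result.get("success"):
        return f"FAILED: {result.get('message', 'unknown error')}"
    stats = result.get("stats", {})
    return (
        f"OK: {stats.get('pages_crawled', 0)} pages, "
        f"{stats.get('assets_downloaded', 0)} assets, "
        f"{stats.get('failed_downloads', 0)} failed "
        f"in {stats.get('elapsed_seconds', 0.0):.1f}s "
        f"-> {result.get('output_directory', '?')}"
    )


# Made-up sample data in the documented shape
sample = {
    "success": True,
    "message": "Downloaded 3 pages",
    "stats": {
        "pages_crawled": 3,
        "assets_downloaded": 12,
        "failed_downloads": 1,
        "elapsed_seconds": 4.2,
    },
    "output_directory": "./downloaded_site",
}
print(summarize_result(sample))
```
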
### Website Downloader CLI

The original `website-downloader.py` can still be used as a standalone CLI tool:

```bash
python website-downloader.py --url https://example.com --max-pages 50 --threads 6
```

#### CLI Options

- `--url`: Starting URL to crawl (required)
- `--destination`: Output folder (optional, derived from URL if not provided)
- `--max-pages`: Maximum pages to crawl (default: 50)
- `--threads`: Number of download threads (default: 6)
- `--download-external-assets`: Enable external asset downloading
- `--external-domains`: Whitelist of external domains to download from

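To drive the CLI from another Python script, the documented flags can be assembled into an argument list for `subprocess.run`. This is a sketch under a few assumptions: `build_command` is a hypothetical helper, `website-downloader.py` is assumed to be in the current directory, `--download-external-assets` is assumed to be a bare switch with no value, and `--external-domains` is left out because the value format it expects is not described here.

```python
import sys


def build_command(url, destination=None, max_pages=50, threads=6,
                  download_external_assets=False):
    """Build an argv list for website-downloader.py from the documented flags."""
    cmd = [
        sys.executable, "website-downloader.py",
        "--url", url,
        "--max-pages", str(max_pages),
        "--threads", str(threads),
    ]
    if destination is not None:
        cmd += ["--destination", destination]
    if download_external_assets:
        # Assumed to be a switch that takes no value
        cmd.append("--download-external-assets")
    return cmd


cmd = build_command("https://example.com", destination="./site", max_pages=20)
print(" ".join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with URLs that contain characters such as `&` or `?`.
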
## Installation

```bash
pip install -r requirements.txt
```

## Project Structure

```
docrag/
├── website-downloader.py       # Core website downloader (CLI)
├── website_downloader_tool.py  # Tool wrapper for GLM-4.7-Flash
├── requirements.txt            # Python dependencies
└── README.md                   # This file
```

## Integration with RAG

The downloaded website content can be processed for RAG systems:

1. Use the tool to download website content
2. Parse the downloaded HTML files
3. Extract text content and metadata
4. Chunk and embed the content
5. Store in your vector database

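Steps 2 to 4 can be sketched with the standard library alone. The example below extracts visible text and the page title from an HTML string with `html.parser`, then splits the text into overlapping word-based chunks. Embedding and storage are left out because they depend on the model and vector database you choose. `TextExtractor` and `chunk_text` are hypothetical names, and the chunk sizes are arbitrary.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text and the <title>, skipping script/style content."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.title = ""
        self._skip = 0          # depth inside <script>/<style>
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data.strip()
        elif not self._skip and data.strip():
            self.parts.append(data.strip())


def chunk_text(text, size=50, overlap=10):
    """Split text into chunks of `size` words; neighbours share `overlap` words."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    if not words:
        return []
    step = size - overlap
    stop = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, stop, step)]


html = ("<html><head><title>Demo</title><style>p{}</style></head>"
        "<body><p>Hello offline world</p></body></html>")
parser = TextExtractor()
parser.feed(html)
text = " ".join(parser.parts)
chunks = [{"title": parser.title, "text": c} for c in chunk_text(text, size=2, overlap=1)]
print(chunks)
```

In practice you would read each `.html` file under the output directory (for example with `pathlib.Path(output_directory).rglob("*.html")`) and feed its contents to the parser, keeping the file path as additional metadata for each chunk.
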
## License

Private repository - All rights reserved.