# DocRAG - Custom RAG with Document Loader A custom RAG (Retrieval-Augmented Generation) system with a custom document loader that acts as a local OpenAI-compatible server using a remote LLM with custom tools. ## Components ### Website Downloader Tool The `website_downloader_tool.py` provides a tool interface for downloading and mirroring websites for offline use or RAG ingestion. It can be used by GLM-4.7-Flash via the z-ai-web-dev-sdk. #### Features - Downloads HTML pages and all linked assets (CSS, JS, images, fonts, etc.) - Rewrites links for offline viewing - Supports concurrent downloads with configurable thread count - Optional external asset downloading from CDNs - Domain whitelisting for external assets - Comprehensive error handling and statistics #### Tool Schema The tool follows the OpenAI function calling format: ```python from website_downloader_tool import get_tool_schema, website_downloader # Get the tool schema for registration schema = get_tool_schema() ``` #### Usage with GLM-4.7-Flash ```python from zai import ZaiClient from website_downloader_tool import get_tool_schema, website_downloader client = ZaiClient(api_key="your-api-key") # Define the tool tools = [get_tool_schema()] # Create a chat completion with tools response = client.chat.completions.create( model="glm-4.7", messages=[ { "role": "user", "content": "Please download https://example.com for offline use" } ], tools=tools, stream=True, ) # Handle tool calls in the response for chunk in response: if chunk.choices[0].delta.tool_calls: tool_call = chunk.choices[0].delta.tool_calls[0] if tool_call.function.name == "website_downloader": import json args = json.loads(tool_call.function.arguments) result = website_downloader(**args) print(result) ``` #### Direct Usage ```python from website_downloader_tool import website_downloader # Download a website result = website_downloader( url="https://example.com", destination="./downloaded_site", # Optional max_pages=50, # Max pages to crawl threads=6, # Concurrent downloads download_external_assets=False, # Include CDN assets external_domains=["cdn.example.com"] # Whitelist external domains ) if result["success"]: print(f"Downloaded to: {result['output_directory']}") print(f"Pages: {result['stats']['pages_crawled']}") print(f"Assets: {result['stats']['assets_downloaded']}") else: print(f"Error: {result['message']}") ``` #### Parameters | Parameter | Type | Required | Default | Description | |-----------|------|----------|---------|-------------| | `url` | string | Yes | - | Starting URL to crawl | | `destination` | string | No | Derived from URL | Output folder path | | `max_pages` | integer | No | 50 | Max HTML pages (1-1000) | | `threads` | integer | No | 6 | Concurrent download threads (1-20) | | `download_external_assets` | boolean | No | False | Download CDN assets | | `external_domains` | array | No | None | Whitelist of external domains | #### Return Value ```python { "success": True/False, "message": "Human-readable summary", "stats": { "pages_crawled": int, "assets_downloaded": int, "failed_downloads": int, "elapsed_seconds": float, "output_directory": str, "pages": [...], # List of downloaded pages "downloaded_items": [...] # List of downloaded assets }, "output_directory": "/path/to/downloaded/site" } ``` ### Website Downloader CLI The original `website-downloader.py` can still be used as a standalone CLI tool: ```bash python website-downloader.py --url https://example.com --max-pages 50 --threads 6 ``` #### CLI Options - `--url`: Starting URL to crawl (required) - `--destination`: Output folder (optional, derived from URL if not provided) - `--max-pages`: Maximum pages to crawl (default: 50) - `--threads`: Number of download threads (default: 6) - `--download-external-assets`: Enable external asset downloading - `--external-domains`: Whitelist of external domains to download from ## Installation ```bash pip install -r requirements.txt ``` ## Project Structure ``` docrag/ ├── website-downloader.py # Core website downloader (CLI) ├── website_downloader_tool.py # Tool wrapper for GLM-4.7-Flash ├── requirements.txt # Python dependencies └── README.md # This file ``` ## Integration with RAG The downloaded website content can be processed for RAG systems: 1. Use the tool to download website content 2. Parse the downloaded HTML files 3. Extract text content and metadata 4. Chunk and embed the content 5. Store in your vector database ## License Private repository - All rights reserved.