Add website downloader tool wrapper for GLM-4.7-Flash

- Create website_downloader_tool.py with OpenAI function calling schema
- Add comprehensive tool documentation
- Update README with usage examples
- Update requirements.txt with optional sdk dependency
2026-03-29 00:16:54 +00:00 · aa69b2f496
commit aa69b2f496
parent 1623ee8d2c
3 changed files with 739 additions and 2 deletions
--- a/README.md
+++ b/README.md
@@ -1,2 +1,164 @@
-# docrag
+# DocRAG - Custom RAG with Document Loader

+A custom RAG (Retrieval-Augmented Generation) system with its own document loader. It acts as a local OpenAI-compatible server and uses a remote LLM with custom tools.
+
+## Components
+
+### Website Downloader Tool
+
+The `website_downloader_tool.py` provides a tool interface for downloading and mirroring websites for offline use or RAG ingestion. It can be used by GLM-4.7-Flash via the z-ai-web-dev-sdk.
+
+#### Features
+
+- Downloads HTML pages and all linked assets (CSS, JS, images, fonts, etc.)
+- Rewrites links for offline viewing
+- Supports concurrent downloads with configurable thread count
+- Optional external asset downloading from CDNs
+- Domain whitelisting for external assets
+- Comprehensive error handling and statistics
+
+#### Tool Schema
+
+The tool follows the OpenAI function calling format:
+
+```python
+from website_downloader_tool import get_tool_schema, website_downloader
+
+# Get the tool schema for registration
+schema = get_tool_schema()
+```
+
+#### Usage with GLM-4.7-Flash
+
+```python
+from zai import ZaiClient
+from website_downloader_tool import get_tool_schema, website_downloader
+
+client = ZaiClient(api_key="your-api-key")
+
+# Define the tool
+tools = [get_tool_schema()]
+
+# Create a chat completion with tools
+response = client.chat.completions.create(
+    model="glm-4.7",
+    messages=[
+        {
+            "role": "user",
+            "content": "Please download https://example.com for offline use"
+        }
+    ],
+    tools=tools,
+    stream=False,  # a non-streamed response carries complete tool-call arguments
+)
+
+# Handle tool calls in the response
+import json
+
+message = response.choices[0].message
+for tool_call in message.tool_calls or []:
+    if tool_call.function.name == "website_downloader":
+        args = json.loads(tool_call.function.arguments)
+        result = website_downloader(**args)
+        print(result)
+```
+
+#### Direct Usage
+
+```python
+from website_downloader_tool import website_downloader
+
+# Download a website
+result = website_downloader(
+    url="https://example.com",
+    destination="./downloaded_site",  # Optional
+    max_pages=50,                     # Max pages to crawl
+    threads=6,                        # Concurrent downloads
+    download_external_assets=False,   # Include CDN assets (forced to True when external_domains is set)
+    external_domains=["cdn.example.com"]  # Whitelist external domains
+)
+
+if result["success"]:
+    print(f"Downloaded to: {result['output_directory']}")
+    print(f"Pages: {result['stats']['pages_crawled']}")
+    print(f"Assets: {result['stats']['assets_downloaded']}")
+else:
+    print(f"Error: {result['message']}")
+```
+
+#### Parameters
+
+| Parameter | Type | Required | Default | Description |
+|-----------|------|----------|---------|-------------|
+| `url` | string | Yes | - | Starting URL to crawl |
+| `destination` | string | No | Derived from URL | Output folder path |
+| `max_pages` | integer | No | 50 | Max HTML pages (1-1000) |
+| `threads` | integer | No | 6 | Concurrent download threads (1-20) |
+| `download_external_assets` | boolean | No | False | Download CDN assets |
+| `external_domains` | array | No | None | Whitelist of external domains; a non-empty list also enables `download_external_assets` |
+
+#### Return Value
+
+```python
+{
+    "success": True/False,
+    "message": "Human-readable summary",
+    "stats": {
+        "pages_crawled": int,
+        "assets_downloaded": int,
+        "failed_downloads": int,
+        "elapsed_seconds": float,
+        "output_directory": str,
+        "pages": [...],            # Crawled pages, each {"url", "local_path"}
+        "downloaded_items": [...]  # Downloaded assets (first 100 only), each {"url", "local_path"}
+    },
+    "output_directory": "/path/to/downloaded/site"
+}
+```
+
+### Website Downloader CLI
+
+The original `website-downloader.py` can still be used as a standalone CLI tool:
+
+```bash
+python website-downloader.py --url https://example.com --max-pages 50 --threads 6
+```
+
+#### CLI Options
+
+- `--url`: Starting URL to crawl (required)
+- `--destination`: Output folder (optional, derived from URL if not provided)
+- `--max-pages`: Maximum pages to crawl (default: 50)
+- `--threads`: Number of download threads (default: 6)
+- `--download-external-assets`: Enable external asset downloading
+- `--external-domains`: Whitelist of external domains to download from
+
+## Installation
+
+```bash
+pip install -r requirements.txt
+```
+
+## Project Structure
+
+```
+docrag/
+├── website-downloader.py      # Core website downloader (CLI)
+├── website_downloader_tool.py # Tool wrapper for GLM-4.7-Flash
+├── requirements.txt           # Python dependencies
+└── README.md                  # This file
+```
+
+## Integration with RAG
+
+The downloaded website content can be processed for RAG systems:
+
+1. Use the tool to download website content
+2. Parse the downloaded HTML files
+3. Extract text content and metadata
+4. Chunk and embed the content
+5. Store in your vector database
+
+## License
+
+Private repository - All rights reserved.
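
The RAG integration steps listed at the end of the README above (parse the downloaded HTML, extract text, chunk it) could be sketched roughly as follows. This is an illustration only and not part of the commit: `extract_text`, `chunk_text`, `iter_site_chunks`, the chunk size and the overlap are hypothetical choices, and the sketch uses the standard library's `html.parser` instead of BeautifulSoup so that it stays self-contained. Embedding and vector storage are left out because they depend on the model and database in use.

```python
from __future__ import annotations

from html.parser import HTMLParser
from pathlib import Path
from typing import Iterator


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def extract_text(html: str) -> str:
    """Return the visible text of an HTML document as one string."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks that overlap by `overlap`."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]


def iter_site_chunks(root: Path) -> Iterator[tuple[str, str]]:
    """Yield (relative_path, chunk) pairs for every .html file under root."""
    for html_file in sorted(root.rglob("*.html")):
        html = html_file.read_text(encoding="utf-8", errors="replace")
        for chunk in chunk_text(extract_text(html)):
            yield str(html_file.relative_to(root)), chunk
```

Passing `Path(result["output_directory"])` from a successful `website_downloader` call to `iter_site_chunks` would yield chunks ready for an embedding step. Character-based chunking is the simplest option; sentence- or token-aware splitting may suit a given embedding model better.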
--- a/requirements.txt
+++ b/requirements.txt
@@ -1,4 +1,7 @@
 requests~=2.32.4
 beautifulsoup4~=4.13.4
 wget~=3.2
-urllib3~=2.5.0
+urllib3~=2.5.0
+
+# Optional: For using z-ai-web-dev-sdk with GLM-4.7-Flash
+# z-ai-web-dev-sdk>=1.0.0
--- a/website_downloader_tool.py
+++ b/website_downloader_tool.py
@@ -0,0 +1,572 @@
+#!/usr/bin/env python3
+"""
+Website Downloader Tool for GLM-4.7-Flash
+
+This module provides a tool interface for the website-downloader functionality,
+allowing it to be used as a function/tool by the GLM-4.7-Flash model via the
+z-ai-web-dev-sdk.
+
+Usage:
+    The tool can be invoked by the LLM to download and mirror websites for
+    offline use or for ingesting into a RAG system.
+"""
+
+from __future__ import annotations
+
+import logging
+import queue
+import threading
+import time
+from pathlib import Path
+from typing import Any, Optional
+from urllib.parse import urlparse
+
+# Import the core functionality; website-downloader.py must be importable as `website_downloader` (a hyphenated filename cannot be imported directly)
+from website_downloader import (
+    SESSION,
+    TIMEOUT,
+    ASSET_EXTENSIONS,
+    CSS_URL_RE,
+    _canonical_netloc,
+    _protocol_fix,
+    canonicalize_url,
+    create_dir,
+    extract_css_assets,
+    fetch_binary,
+    is_httpish,
+    is_internal,
+    is_non_fetchable,
+    is_allowed_external,
+    normalize_url,
+    rewrite_links,
+    safe_write_text,
+    to_local_path,
+    to_local_asset_path,
+    cdn_local_path,
+    fetch_html,
+)
+
+# Configure logging for tool use
+log = logging.getLogger(__name__)
+
+
+# =============================================================================
+# Tool Schema Definition
+# =============================================================================
+
+TOOL_SCHEMA = {
+    "type": "function",
+    "function": {
+        "name": "website_downloader",
+        "description": (
+            "Download and mirror a website for offline use or RAG ingestion. "
+            "This tool crawls a website starting from a given URL, downloads HTML pages "
+            "and all linked assets (CSS, JavaScript, images, fonts, etc.), and saves them "
+            "locally with rewritten links for offline viewing. "
+            "Use this tool when you need to: "
+            "1) Archive a website for offline access, "
+            "2) Download website content for analysis or RAG systems, "
+            "3) Create a local mirror of a website."
+        ),
+        "parameters": {
+            "type": "object",
+            "properties": {
+                "url": {
+                    "type": "string",
+                    "description": (
+                        "The starting URL to crawl (e.g., 'https://example.com/'). "
+                        "Must be a valid HTTP or HTTPS URL."
+                    ),
+                },
+                "destination": {
+                    "type": "string",
+                    "description": (
+                        "Optional output folder path where the downloaded website "
+                        "will be saved. If not provided, a folder name will be derived "
+                        "from the URL's domain (e.g., 'example_com')."
+                    ),
+                    "default": None,
+                },
+                "max_pages": {
+                    "type": "integer",
+                    "description": (
+                        "Maximum number of HTML pages to crawl. "
+                        "Use lower values for quick downloads, higher for comprehensive archiving."
+                    ),
+                    "default": 50,
+                    "minimum": 1,
+                    "maximum": 1000,
+                },
+                "threads": {
+                    "type": "integer",
+                    "description": (
+                        "Number of concurrent download threads. "
+                        "Higher values can speed up downloads but may trigger rate limits."
+                    ),
+                    "default": 6,
+                    "minimum": 1,
+                    "maximum": 20,
+                },
+                "download_external_assets": {
+                    "type": "boolean",
+                    "description": (
+                        "Whether to download assets from external domains (CDNs, etc.). "
+                        "Enable for complete offline functionality, disable for faster downloads "
+                        "of only same-domain content."
+                    ),
+                    "default": False,
+                },
+                "external_domains": {
+                    "type": "array",
+                    "items": {"type": "string"},
+                    "description": (
+                        "Optional list of external domain names to allow downloading from. "
+                        "Useful for whitelisting specific CDN domains. "
+                        "Example: ['cdn.example.com', 'assets.example.com']"
+                    ),
+                    "default": None,
+                },
+            },
+            "required": ["url"],
+        },
+    }
+}
+
+
+# =============================================================================
+# Tool Implementation
+# =============================================================================
+
+def crawl_site_tool(
+    start_url: str,
+    root: Path,
+    max_pages: int,
+    threads: int,
+    download_external_assets: bool = False,
+    external_domains: Optional[set[str]] = None,
+) -> dict[str, Any]:
+    """
+    Internal crawl implementation that returns detailed results.
+
+    This is a modified version of crawl_site that collects statistics
+    and returns them in a structured format for the tool response.
+
+    Returns:
+        Dictionary containing crawl statistics and results
+    """
+    start_time = time.time()
+
+    # Statistics tracking
+    stats = {
+        "pages_crawled": 0,
+        "assets_downloaded": 0,
+        "failed_downloads": 0,
+        "pages": [],
+        "assets": [],
+        "errors": [],
+    }
+
+    q_pages: queue.Queue[str] = queue.Queue()
+    q_pages.put(start_url)
+
+    seen_pages: set[str] = set()
+    queued_pages: set[str] = {start_url}
+    queued_assets: set[str] = set()
+    download_q: queue.Queue[tuple[str, Path]] = queue.Queue()
+
+    root_netloc = _canonical_netloc(urlparse(start_url))
+
+    # Track successfully downloaded items
+    downloaded_items: list[dict[str, str]] = []
+    failed_items: list[dict[str, str]] = []
+
+    def worker() -> None:
+        """Download worker thread."""
+        while True:
+            url, dest = download_q.get()
+            try:
+                if is_non_fetchable(url) or not is_httpish(url):
+                    log.debug("Skip non-fetchable: %s", url)
+                    continue
+
+                if dest.exists():
+                    stats["assets_downloaded"] += 1
+                    continue
+
+                try:
+                    fetch_binary(
+                        url,
+                        dest,
+                        download_q,
+                        site_root=root,
+                        root_netloc=root_netloc,
+                        download_external_assets=download_external_assets,
+                        external_domains=external_domains,
+                    )
+                    if dest.exists():
+                        stats["assets_downloaded"] += 1
+                        downloaded_items.append({
+                            "url": url,
+                            "local_path": str(dest.relative_to(root))
+                        })
+                except Exception as e:
+                    stats["failed_downloads"] += 1
+                    failed_items.append({
+                        "url": url,
+                        "error": str(e)
+                    })
+                    log.debug("Failed to download %s: %s", url, e)
+            finally:
+                download_q.task_done()
+
+    # Spawn worker threads
+    worker_threads = []
+    for i in range(max(1, threads)):
+        t = threading.Thread(target=worker, name=f"DL-{i + 1}", daemon=True)
+        t.start()
+        worker_threads.append(t)
+
+    # Main crawl loop
+    while not q_pages.empty() and len(seen_pages) < max_pages:
+        page_url = q_pages.get()
+
+        if page_url in seen_pages:
+            continue
+
+        seen_pages.add(page_url)
+
+        log.info("Crawling page %d/%d: %s", len(seen_pages), max_pages, page_url)
+
+        soup = fetch_html(page_url)
+        if soup is None:
+            stats["errors"].append(f"Failed to fetch page: {page_url}")
+            continue
+        stats["pages_crawled"] += 1  # count only pages that were actually fetched
+
+        # Record page info
+        local_page_path = to_local_path(urlparse(page_url), root)
+        stats["pages"].append({
+            "url": page_url,
+            "local_path": str(local_page_path.relative_to(root)) if local_page_path.is_relative_to(root) else str(local_page_path)
+        })
+
+        # Find and queue all assets
+        for tag in soup.find_all(True):
+            # Handle various tag types and their URL attributes
+            tag_handlers = {
+                "img": ["src", "data-src", "srcset"],
+                "script": ["src"],
+                "link": ["href"],
+                "video": ["src", "poster"],
+                "audio": ["src"],
+                "source": ["src", "srcset"],
+                "iframe": ["src"],
+                "embed": ["src"],
+                "object": ["data"],
+            }
+
+            attrs_to_check = tag_handlers.get(tag.name, [])
+
+            # Also check for link tags with resource rel types
+            if tag.name == "link":
+                rel = tag.get("rel", [])
+                if isinstance(rel, str):
+                    rel = [rel]
+                rel_set = {r.lower() for r in rel}
+                resource_rels = {"stylesheet", "icon", "shortcut", "apple-touch-icon", "preload", "modulepreload", "manifest"}
+                if not rel_set & resource_rels:
+                    attrs_to_check = []
+
+            for attr in attrs_to_check:
+                if not tag.has_attr(attr):
+                    continue
+
+                if attr == "srcset":
+                    # Handle srcset specially
+                    for entry in str(tag["srcset"]).split(","):
+                        parts = entry.strip().split()
+                        if not parts:
+                            continue
+                        url_part = _protocol_fix(parts[0], page_url)
+                        process_asset_url(
+                            url_part, page_url, root, root_netloc,
+                            download_external_assets, external_domains,
+                            queued_assets, download_q, stats
+                        )
+                else:
+                    url_part = _protocol_fix(str(tag.get(attr, "")), page_url)
+                    process_asset_url(
+                        url_part, page_url, root, root_netloc,
+                        download_external_assets, external_domains,
+                        queued_assets, download_q, stats
+                    )
+
+            # Handle inline styles
+            if tag.has_attr("style"):
+                style = str(tag["style"])
+                for match in CSS_URL_RE.findall(style):
+                    url_part = _protocol_fix(match.strip().strip("'\""), page_url)
+                    process_asset_url(
+                        url_part, page_url, root, root_netloc,
+                        download_external_assets, external_domains,
+                        queued_assets, download_q, stats
+                    )
+
+            # Handle <style> blocks
+            if tag.name == "style":
+                css_text = tag.string or tag.get_text()
+                if css_text:
+                    for asset in extract_css_assets(css_text):
+                        asset = _protocol_fix(asset, page_url)
+                        process_asset_url(
+                            asset, page_url, root, root_netloc,
+                            download_external_assets, external_domains,
+                            queued_assets, download_q, stats
+                        )
+
+            # Find and queue internal links for further crawling
+            if tag.name == "a" and tag.has_attr("href"):
+                href = _protocol_fix(str(tag.get("href", "")), page_url)
+                if href and not href.startswith("#") and is_httpish(href) and not is_non_fetchable(href):
+                    abs_url = normalize_url(canonicalize_url(href, page_url))
+                    if is_internal(abs_url, root_netloc) and abs_url not in seen_pages and abs_url not in queued_pages:
+                        queued_pages.add(abs_url)
+                        q_pages.put(abs_url)
+
+        # Save the page with rewritten links
+        local_path = to_local_path(urlparse(page_url), root)
+        create_dir(local_path.parent)
+        rewrite_links(
+            soup,
+            page_url,
+            root,
+            local_path.parent,
+            download_external_assets,
+            external_domains,
+        )
+        safe_write_text(local_path, str(soup), encoding="utf-8")
+
+    # Wait for all downloads to complete
+    download_q.join()
+
+    elapsed = time.time() - start_time
+    stats["elapsed_seconds"] = round(elapsed, 2)
+    stats["output_directory"] = str(root.resolve())
+    stats["downloaded_items"] = downloaded_items[:100]  # Limit for response size
+    stats["failed_items"] = failed_items[:50]  # Limit for response size
+
+    return stats
+
+
+def process_asset_url(
+    url_part: str,
+    page_url: str,
+    root: Path,
+    root_netloc: str,
+    download_external_assets: bool,
+    external_domains: Optional[set[str]],
+    queued_assets: set[str],
+    download_q: queue.Queue[tuple[str, Path]],
+    stats: dict,
+) -> None:
+    """Process and queue an asset URL for download."""
+    if (
+        not url_part
+        or url_part.startswith("#")
+        or url_part.startswith(("data:", "javascript:", "about:"))
+        or is_non_fetchable(url_part)
+        or not is_httpish(url_part)
+    ):
+        return
+
+    abs_url = normalize_url(canonicalize_url(url_part, page_url))
+    parsed = urlparse(abs_url)
+
+    if not parsed.path.lower().endswith(ASSET_EXTENSIONS):
+        return
+
+    is_ext = not is_internal(abs_url, root_netloc)
+
+    if is_ext:
+        if not download_external_assets:
+            return
+        if external_domains and not is_allowed_external(abs_url, external_domains):
+            return
+        dest_path = cdn_local_path(parsed, root)
+    else:
+        dest_path = to_local_asset_path(parsed, root)
+
+    if abs_url not in queued_assets:
+        queued_assets.add(abs_url)
+        create_dir(dest_path.parent)
+        download_q.put((abs_url, dest_path))
+
+
+def make_root(url: str, custom: Optional[str]) -> Path:
+    """Derive output folder from URL if custom not supplied."""
+    return Path(custom) if custom else Path(urlparse(url).netloc.replace(".", "_"))
+
+
+def website_downloader(
+    url: str,
+    destination: Optional[str] = None,
+    max_pages: int = 50,
+    threads: int = 6,
+    download_external_assets: bool = False,
+    external_domains: Optional[list[str]] = None,
+) -> dict[str, Any]:
+    """
+    Download and mirror a website for offline use or RAG ingestion.
+
+    This is the main tool function that can be invoked by the GLM-4.7-Flash model.
+    It wraps the website-downloader functionality in a tool interface.
+
+    Args:
+        url: The starting URL to crawl (e.g., 'https://example.com/')
+        destination: Optional output folder path. If not provided, derived from URL domain.
+        max_pages: Maximum number of HTML pages to crawl (1-1000, default: 50)
+        threads: Number of concurrent download threads (1-20, default: 6)
+        download_external_assets: Whether to download assets from external domains (default: False)
+        external_domains: Optional list of external domain names to allow downloading from
+
+    Returns:
+        Dictionary containing:
+        - success: Boolean indicating if the operation was successful
+        - message: Human-readable summary of what was done
+        - stats: Detailed statistics about the crawl
+        - output_directory: Path to the downloaded website
+    """
+    try:
+        # Validate URL
+        parsed_url = urlparse(url)
+        if not parsed_url.scheme or parsed_url.scheme not in ("http", "https"):
+            return {
+                "success": False,
+                "message": f"Invalid URL: '{url}'. Must be a valid HTTP or HTTPS URL.",
+                "stats": None,
+                "output_directory": None,
+            }
+
+        # Validate parameters
+        if max_pages < 1 or max_pages > 1000:
+            return {
+                "success": False,
+                "message": f"max_pages must be between 1 and 1000, got {max_pages}",
+                "stats": None,
+                "output_directory": None,
+            }
+
+        if threads < 1 or threads > 20:
+            return {
+                "success": False,
+                "message": f"threads must be between 1 and 20, got {threads}",
+                "stats": None,
+                "output_directory": None,
+            }
+
+        # Prepare output directory
+        root = make_root(url, destination)
+
+        # Process external domains
+        ext_domains_set = None
+        if external_domains:
+            ext_domains_set = {
+                (urlparse(d).hostname or d).lower() if "://" in d else d.lower()
+                for d in external_domains
+            }
+            download_external_assets = True  # Auto-enable if domains specified
+
+        # Log the crawl start
+        log.info(
+            "Starting website download: url=%s, dest=%s, max_pages=%d, threads=%d, external=%s",
+            url, root, max_pages, threads, download_external_assets
+        )
+
+        # Run the crawl
+        stats = crawl_site_tool(
+            start_url=url,
+            root=root,
+            max_pages=max_pages,
+            threads=threads,
+            download_external_assets=download_external_assets,
+            external_domains=ext_domains_set,
+        )
+
+        # Build success response
+        message = (
+            f"Successfully downloaded website from {url}\n"
+            f"- Pages crawled: {stats['pages_crawled']}\n"
+            f"- Assets downloaded: {stats['assets_downloaded']}\n"
+            f"- Time elapsed: {stats['elapsed_seconds']}s\n"
+            f"- Output directory: {stats['output_directory']}"
+        )
+
+        if stats["failed_downloads"] > 0:
+            message += f"\n- Failed downloads: {stats['failed_downloads']}"
+
+        return {
+            "success": True,
+            "message": message,
+            "stats": stats,
+            "output_directory": stats["output_directory"],
+        }
+
+    except Exception as e:
+        log.exception("Website download failed")
+        return {
+            "success": False,
+            "message": f"Website download failed: {str(e)}",
+            "stats": None,
+            "output_directory": None,
+        }
+
+
+# =============================================================================
+# Tool Registration Helper
+# =============================================================================
+
+def get_tool_schema() -> dict[str, Any]:
+    """
+    Get the tool schema for registration with the LLM.
+
+    This schema follows the OpenAI function calling format and can be
+    used directly when creating chat completions with tools.
+
+    Returns:
+        The tool schema dictionary
+    """
+    return TOOL_SCHEMA
+
+
+def get_tool_function():
+    """
+    Get the tool function for invocation.
+
+    Returns:
+        The callable tool function
+    """
+    return website_downloader
+
+
+# =============================================================================
+# Example Usage
+# =============================================================================
+
+if __name__ == "__main__":
+    # Example: Direct invocation
+    import json
+
+    print("Website Downloader Tool for GLM-4.7-Flash")
+    print("=" * 50)
+    print("\nTool Schema:")
+    print(json.dumps(TOOL_SCHEMA, indent=2))
+
+    print("\n" + "=" * 50)
+    print("\nExample invocation:")
+    result = website_downloader(
+        url="https://example.com",
+        max_pages=5,
+        threads=4,
+        download_external_assets=False
+    )
+    print(json.dumps(result, indent=2))
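
When a chat completion is streamed in the OpenAI format, a tool call's `arguments` string may be split across several chunks, so it generally cannot be passed to `json.loads` until the fragments have been joined per tool-call `index`. The sketch below shows that accumulation on plain dictionaries standing in for the SDK's delta objects. Whether the zai SDK fragments arguments this way is an assumption, and `accumulate_tool_calls` is a hypothetical helper that is not part of this commit.

```python
from __future__ import annotations

import json
from typing import Any, Iterable


def accumulate_tool_calls(deltas: Iterable[dict[str, Any]]) -> list[dict[str, Any]]:
    """Join streamed tool-call fragments and parse each call's JSON arguments.

    Each delta is assumed to look like an OpenAI-style streaming delta:
    {"tool_calls": [{"index": 0, "function": {"name": ..., "arguments": ...}}]}
    """
    calls: dict[int, dict[str, str]] = {}
    for delta in deltas:
        for fragment in delta.get("tool_calls") or []:
            slot = calls.setdefault(
                fragment.get("index", 0), {"name": "", "arguments": ""}
            )
            function = fragment.get("function") or {}
            if function.get("name"):
                slot["name"] = function["name"]
            # Argument text arrives piecewise; concatenate in arrival order.
            slot["arguments"] += function.get("arguments") or ""
    return [
        {"name": slot["name"], "arguments": json.loads(slot["arguments"] or "{}")}
        for _, slot in sorted(calls.items())
    ]
```

Once the stream has ended, each accumulated call whose `name` is `"website_downloader"` could be dispatched as `website_downloader(**call["arguments"])`.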