No LLM needed for tool selection. Flow is now:
Request → run ALL tools in parallel → results into system prompt → 1 LLM call
- _run_all_tools: fires every tool concurrently (30s timeout each)
- No required args: run with schema defaults
- Query-like required args (query, topic, title, etc): use user message
- Specific args (symbol, url, pmid): skip (can't guess)
- _build_tool_results_text: formats all results into system prompt
- build_enhanced_messages: system prompt now has real-time data section
- call_llm: dead simple, just prompt → response (replaces generate_response)
- Removed: generate_response, _parse_tool_calls, _clean_tool_syntax,
_build_tool_descriptions (all dead code now)
- Streaming path: same flow, runs tools then streams the LLM response
- Both streaming and non-streaming use identical tool pipeline
Three fixes for the 'I apologize, couldnt generate a response' bug:
1. Safety net: if _clean_tool_syntax strips ALL content (e.g. the LLM
output only the JSON tool call block and nothing else), return the
original content instead of the useless error message.
2. Detailed logging: now logs the first 300 chars of every LLM response
so we can see exactly what the model outputs. Also logs which parse
pattern matched and which tool names were found.
3. Desperate fallback parser (Pattern 4): if none of the regex/brace
patterns match, tries to json.loads() the entire content and looks
for known tool names. Catches LLMs that output the array directly
or use slightly different formatting.
The upstream LLM only supports 2 native tool calls per response, but
the user needs to fire many tools at once. Solution: content-based
'mega tool call' where the LLM bundles ALL tool calls into a single
JSON array in its response text.
Key changes:
- System prompt: tells LLM to output {tool_calls: [...]} array
with ALL needed tools in one block (no native tools param)
- _parse_tool_calls: parses the tool_calls array format (with legacy
tool_call single-object fallback)
- generate_response: NO tools/tool_choice params to API, pure
content-based parsing
- generate_response: executes ALL tools concurrently via asyncio.gather
- generate_response: feeds ALL results back in one consolidated message
- _clean_tool_syntax: strips both tool_calls and tool_call blocks
Problems fixed:
- 'Mega tool call': LLM outputting multiple tool calls that got bundled
into one. Now uses native OpenAI tools parameter which handles multiple
tool calls properly via message.tool_calls array.
- 'Returning nothing': _clean_tool_syntax was too aggressive, stripping
the entire response. Now only strips code-fence-wrapped blocks.
- Tool results were appended to system message growing it unboundedly;
now uses proper 'tool' role messages in conversation history.
Key changes:
- generate_response: passes tools/tool_choice to OpenAI API (native
tool calling), with retry without tool_choice for unsupported models
- generate_response: handles multiple tool_calls per response natively
- generate_response: uses proper 'tool' role for results instead of
appending to system message
- _parse_tool_calls (was _parse_tool_call): now returns a list, supports
multiple tool calls, used as fallback for models without native tools
- _clean_tool_syntax: much less aggressive, only strips code-fence
blocks, no longer removes bare JSON (was eating valid responses)
- System prompt: removed JSON format instructions (native tools handles
format), simplified rules
Instead of passing tools to the OpenRouter API (limited to 10 tools):
- Tool descriptions are now embedded in the system prompt
- LLM outputs tool calls as JSON: {"tool_call": {"name": "...", "arguments": {...}}}
- We parse the response, execute tools, and feed results back
- Supports all 33 tools without hitting the API limit
Changes:
- Added _build_tool_descriptions() for tool docs in prompt
- Added _parse_tool_call() to extract tool requests from LLM output
- Added _clean_tool_syntax() to remove tool JSON from responses
- Rewrote generate_response() for context-based approach
- Updated system prompt with tool usage instructions
- Skip website download for Open WebUI automated tasks (title, tags, follow-ups)
- Check if site already downloaded before re-downloading
- Return cached site info if previously downloaded
- Reduces unnecessary network calls and processing time
- Pass all registered tools to LLM during chat completion
- Handle tool_calls from LLM response
- Execute tools and feed results back to LLM
- Loop until LLM returns final response
- Updated system prompt to encourage tool use
- Updated streaming to handle tool calls
- Increased MAX_TOOL_ITERATIONS to 5
Key changes:
- Add URL extraction and detection functions
- Download websites BEFORE RAG retrieval (not after)
- Expand trigger keywords to include common phrases like 'go to', 'headlines', etc.
- Update system prompt to tell LLM it CAN access websites
- Improve streaming response handling
Now when user asks 'go to orovillemr.com and give me the headlines':
1. System detects URL and access intent
2. Downloads and ingests website content
3. RAG retrieves relevant content
4. LLM generates response with actual website content
Features:
- RAG system now uses website_downloader_tool as primary content ingestion method
- download_and_ingest_website() method for complete website processing
- Stores page pointers (source_url, page_url, local_path) in vector store
- Site registry tracks all downloaded websites with metadata
- New API endpoints for website management:
- POST /v1/documents/website - Download and ingest a website
- GET /v1/documents/sites - List all downloaded sites
- GET /v1/documents/sites/{url} - Get site info
- DELETE /v1/documents/sites/{url} - Delete a site and its content
Changes:
- rag/__init__.py: Added download_and_ingest_website(), site registry
- rag/document_processor.py: Added extract_text_from_html() public method
- rag/vector_store.py: Added delete_by_source_url(), get_stats()
- main.py: New website endpoints, integrated tool with RAG system
Features:
- FastAPI server with OpenAI-compatible endpoints (/v1/chat/completions, /v1/models)
- RAG system with document processing and vector storage
- Support for multiple document formats (PDF, DOCX, HTML, text, code)
- Streaming response support
- Tool integration with website_downloader
- Document management API endpoints
- GLM-4.7-Flash integration via z-ai-web-dev-sdk
- Works transparently with Open WebUI and other OpenAI clients
Components:
- main.py: FastAPI application with OpenAI-compatible API
- rag/: RAG system (document processor, vector store, retriever)
- tools/: Tool manager with website_downloader integration
- .env.example: Configuration template