Building an Intelligent Tool Search System: How Fabric's Orchestrator Finds the Right Tool in Milliseconds
The first time we connected our orchestrator to a customer's full toolkit, we hit a wall.
They had 15 MCP servers configured. GitHub. Jira. Slack. Linear. Context7 for docs. Firecrawl for scraping. Custom internal tools. The works. Every time a user asked a simple question, we were stuffing 77,000 tokens of tool definitions into context.
The response was slow. The cost was astronomical. And half the time, the LLM picked the wrong tool anyway because it was drowning in options.
We needed a smarter approach.
The Challenge: Finding a Needle in a Haystack of Tools
Modern AI applications don't just generate text—they take action. They create Jira tickets, query databases, send Slack messages, and scrape websites. But as the number of available tools grows, a critical question emerges:
How do you efficiently find the right tool among hundreds of options without burning through your token budget?
At Fabric, our orchestrator manages tools from:
- MCP Servers (Model Context Protocol) - user-configured tool providers
- Registered Agents - specialized AI agents with specific capabilities
- Integrations - Slack, GitHub, Linear, Notion, and more
- Workflows - user-defined Temporal workflows
- Built-in Tools - RAG queries, web scraping, YouTube processing
Loading all tools into context for every request would consume ~77,000 tokens. That's expensive, slow, and often unnecessary when the user just wants to "send a Slack message."
Our Solution: Hybrid BM25 + Semantic Search
We implemented a multi-stage search pipeline inspired by Anthropic's Tool Search Tool pattern. The key insight: most queries can be resolved with fast keyword matching, reserving expensive semantic search for ambiguous cases.
This pipeline achieves 85% token reduction (77K → 8.7K tokens) while maintaining intelligent tool selection.
Deep Dive: Each Stage Explained
Stage 1: Always-Available Tools
Some tools are so frequently used that they should always be considered. We maintain a small set (3-5) of "always-available" capabilities that bypass the search entirely.
const ALWAYS_AVAILABLE_CAPABILITIES = [
{
name: "workspace_rag_query",
keywords: ["document", "knowledge", "search workspace", "find in docs"],
description: "Query workspace documents and knowledge base"
},
{
name: "web_search",
keywords: ["search web", "google", "look up", "find online"],
description: "Search the internet for information"
}
];
function matchAlwaysAvailableCapabilities(query: string): ToolMatch[] {
  const q = query.toLowerCase();
  return ALWAYS_AVAILABLE_CAPABILITIES
    .filter(cap => cap.keywords.some(kw => q.includes(kw)))
    // Shape the result as a ToolMatch so it merges cleanly in Stage 5
    .map(cap => ({ serverName: "builtin", toolName: cap.name, confidence: 1.0, source: "always" }));
}
Why it matters: For queries like "search my documents for the Q4 report," we skip all search overhead and directly return workspace_rag_query.
Stage 2: Explicit Server Detection
Users often know exactly which tool provider they want. We parse natural language patterns to detect explicit server mentions:
function detectExplicitServerMention(query: string): string | null {
const patterns = [
/\buse\s+(\w+(?:\s+mcp)?)\s+(?:to|for)\b/i,
/\bwith\s+(\w+)\b/i,
/\bvia\s+(\w+)\b/i,
/\bfrom\s+(\w+)\s+(?:mcp|server)\b/i,
/\busing\s+(\w+)\b/i
];
  for (const pattern of patterns) {
    const match = query.match(pattern);
    if (match) {
      const server = normalizeServerName(match[1]);
      // Broad patterns like "with X" false-positive often; only accept
      // names that normalize to a configured server
      if (server) return server;
    }
  }
  return null;
}
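normalizeServerName isn't shown above; here's a minimal sketch, assuming we can look up the user's configured server names (the getConfiguredServerNames helper is hypothetical):
// Sketch only: strips a trailing "mcp" qualifier and validates the candidate
// against the user's configured servers. getConfiguredServerNames is a
// hypothetical lookup; the real source of configured names may differ.
function normalizeServerName(raw: string): string | null {
  const candidate = raw.toLowerCase().replace(/\s+mcp$/, "").trim();
  const configured = getConfiguredServerNames(); // e.g. ["github", "context7", ...]
  return configured.includes(candidate) ? candidate : null;
}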
Why it matters: If the user says "use context7 to look up React hooks," we connect only to the Context7 MCP server, avoiding connections to potentially dozens of other configured servers.
Stage 3: BM25 Keyword Search
BM25 (Best Matching 25) is a battle-tested ranking function used by search engines. It's deterministic, fast, and surprisingly effective for tool matching.
Our BM25 implementation extracts keywords from multiple sources:
function extractKeywords(tool: MCPTool, serverName: string): string[] {
const keywords: string[] = [];
// 1. Server name (highest priority for "use X" queries)
keywords.push(serverName.toLowerCase());
keywords.push(...serverName.split(/[-_]/).filter(Boolean));
  // 2. Tool name decomposition
  const toolWords = tool.name
    .replace(/([a-z])([A-Z])/g, '$1 $2') // split camelCase
    .replace(/[_-]/g, ' ')               // split snake_case and kebab-case
    .toLowerCase()
    .split(/\s+/)
    .filter(Boolean);
  keywords.push(...toolWords);
  // 3. Description keywords (first 20 words, stop-word filtered)
  const descWords = tool.description
    ?.toLowerCase()
    .split(/\s+/)
    .slice(0, 20)
    .filter(word => !STOP_WORDS.has(word))
    ?? [];
  keywords.push(...descWords);
return [...new Set(keywords)]; // Deduplicate
}
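To make the ranking concrete, here's a self-contained Okapi BM25 scorer over those keyword lists. This is a sketch: our production index precomputes document frequencies and tunes k1 and b rather than rescanning every list per query.
const K1 = 1.2;  // term-frequency saturation
const B = 0.75;  // document-length normalization

// Score one tool's keyword list (docs[docIndex]) against the query terms.
// Each entry in docs is the output of extractKeywords for one tool.
function bm25Score(queryTerms: string[], docs: string[][], docIndex: number): number {
  const N = docs.length;
  const avgLen = docs.reduce((sum, d) => sum + d.length, 0) / N;
  const doc = docs[docIndex];
  let score = 0;
  for (const term of queryTerms) {
    const tf = doc.filter(t => t === term).length;
    if (tf === 0) continue;
    const df = docs.filter(d => d.includes(term)).length;
    const idf = Math.log(1 + (N - df + 0.5) / (df + 0.5));
    score += (idf * tf * (K1 + 1)) / (tf + K1 * (1 - B + (B * doc.length) / avgLen));
  }
  return score;
}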
Why it matters: BM25 resolves ~50% of queries with high confidence (>0.7), completely bypassing the more expensive semantic search.
Stage 4: Semantic Search with Qdrant
When keyword matching isn't confident enough, we fall back to semantic search using vector embeddings stored in Qdrant.
async function semanticSearch(
query: string,
userId: string,
organizationId?: string
): Promise<ToolMatch[]> {
const queryEmbedding = await generateEmbedding(query);
const results = await qdrantClient.search("tool_capabilities", {
vector: queryEmbedding,
limit: 10,
filter: {
must: [
// Multi-tenant filtering
organizationId
? { key: "organizationId", match: { value: organizationId } }
: { key: "userId", match: { value: userId } }
]
},
with_payload: true
});
return results.map(r => ({
serverName: r.payload.serverName,
toolName: r.payload.toolName,
confidence: r.score,
source: "semantic"
}));
}
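generateEmbedding isn't shown above. A minimal sketch, assuming OpenAI's text-embedding-3-small model; any embedding provider works, as long as the Qdrant collection was indexed with the same model and dimensionality:
import OpenAI from "openai";

const openai = new OpenAI();

// Sketch: embed the raw query text. The model choice here is an assumption
// and must match whatever model built the "tool_capabilities" collection.
async function generateEmbedding(text: string): Promise<number[]> {
  const response = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: text
  });
  return response.data[0].embedding;
}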
Why it matters: Semantic search catches conceptually similar queries that keyword matching misses. "Help me track my tasks" might not contain "kanban" or "card," but semantic similarity connects it to project management tools.
Stage 5: Hybrid Result Merging
The final stage combines results from all sources with weighted scoring:
function mergeResults(
alwaysAvailable: ToolMatch[],
explicitServer: ToolMatch[],
keywordMatches: ToolMatch[],
semanticMatches: ToolMatch[]
): ToolMatch[] {
const merged = new Map<string, ToolMatch>();
// Priority order: always-available > explicit > keyword > semantic
const allResults = [
...alwaysAvailable.map(t => ({ ...t, confidence: 1.0 })),
...explicitServer.map(t => ({ ...t, confidence: 0.95 })),
...keywordMatches,
...semanticMatches
];
  for (const tool of allResults) {
    const key = `${tool.serverName}:${tool.toolName}`;
    const existing = merged.get(key);
    if (!existing) {
      merged.set(key, tool);
    } else if (tool.source !== existing.source) {
      // Hybrid scoring when two sources match the same tool:
      // weight the stronger signal at 0.6 and the weaker at 0.4
      const stronger = tool.confidence > existing.confidence ? tool : existing;
      merged.set(key, {
        ...stronger,
        confidence:
          0.6 * Math.max(tool.confidence, existing.confidence) +
          0.4 * Math.min(tool.confidence, existing.confidence)
      });
    } else if (tool.confidence > existing.confidence) {
      merged.set(key, tool);
    }
  }
return [...merged.values()]
.sort((a, b) => b.confidence - a.confidence)
.slice(0, 10);
}
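For example, a tool scoring 0.72 from BM25 and 0.60 from semantic search merges to 0.6 × 0.72 + 0.4 × 0.60 = 0.672: the stronger signal dominates, and the weaker one still tempers the final rank.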
The Complete Architecture
Here's how all the pieces fit together in our orchestrator:
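The condensed sketch below shows the five stages end to end. bm25Search is a stand-in name for the BM25 ranking described above, and error handling, timeouts, and telemetry are omitted:
async function searchAvailableTools(
  query: string,
  userId: string,
  organizationId?: string
): Promise<ToolMatch[]> {
  // Stage 1: always-available capabilities bypass search entirely
  const alwaysAvailable = matchAlwaysAvailableCapabilities(query);

  // Stage 2: an explicit "use X ..." mention scopes the candidate set
  const explicitServer = detectExplicitServerMention(query);
  let candidates = await loadFromQdrant(userId, organizationId);
  if (explicitServer) {
    candidates = candidates.filter(t => t.serverName === explicitServer);
  }

  // Stage 3: BM25 keyword search over the cached candidates
  const keywordMatches = bm25Search(query, candidates);
  // When a server was named explicitly, its matches get the 0.95 boost in the merge
  const explicitMatches = explicitServer ? keywordMatches : [];

  // Stage 4: semantic search only when BM25 lacks a high-confidence hit
  const semanticMatches = keywordMatches.some(m => m.confidence > 0.7)
    ? []
    : await semanticSearch(query, userId, organizationId);

  // Stage 5: weighted merge, sorted and truncated to the top 10
  return mergeResults(alwaysAvailable, explicitMatches, keywordMatches, semanticMatches);
}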
Optimizations That Matter
1. Qdrant Cache Loading
Instead of connecting to MCP servers on every request, we cache tool metadata in Qdrant:
async function loadFromQdrant(
userId: string,
organizationId?: string
): Promise<ToolEntry[]> {
const results = await qdrantClient.scroll("tool_capabilities", {
filter: {
must: [
organizationId
? { key: "organizationId", match: { value: organizationId } }
: { key: "userId", match: { value: userId } }
]
},
limit: 1000,
with_payload: true
});
return results.points.map(p => ({
serverName: p.payload.serverName,
toolName: p.payload.toolName,
description: p.payload.description,
keywords: p.payload.keywords,
inputSchema: p.payload.inputSchema
}));
}
Impact: Eliminates MCP connection overhead for cached tools. A typical MCP connection takes 200-500ms; loading from Qdrant takes ~20ms.
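The write side of the cache isn't shown above. Here's a minimal sketch of how tool metadata might be upserted when a server is registered or refreshed; point ids and payload shape are assumptions:
// Sketch: embed each tool's description once at registration time and
// upsert into the same collection the search reads from. Point ids and
// payload shape are assumptions, not our exact schema.
async function indexServerTools(
  serverName: string,
  tools: MCPTool[],
  userId: string,
  organizationId?: string
): Promise<void> {
  const points = await Promise.all(
    tools.map(async tool => ({
      id: crypto.randomUUID(),
      vector: await generateEmbedding(`${tool.name}: ${tool.description ?? ""}`),
      payload: {
        serverName,
        toolName: tool.name,
        description: tool.description,
        keywords: extractKeywords(tool, serverName),
        inputSchema: tool.inputSchema,
        userId,
        organizationId
      }
    }))
  );
  await qdrantClient.upsert("tool_capabilities", { points });
}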
2. Lazy MCP Loading
We defer actual tool fetching until execution time:
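In sketch form (MCPClient, connectToMCPServer, and callTool are illustrative names, not our exact API):
// Sketch: search runs entirely against cached metadata; an MCP connection
// is opened only when a matched tool is actually invoked, then reused for
// the rest of the request.
const connectionCache = new Map<string, MCPClient>();

async function executeTool(match: ToolMatch, args: unknown): Promise<unknown> {
  let client = connectionCache.get(match.serverName);
  if (!client) {
    // First use of this server in the request: connect now, not at search time
    client = await connectToMCPServer(match.serverName);
    connectionCache.set(match.serverName, client);
  }
  return client.callTool(match.toolName, args);
}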
3. Result Reuse Across Phases
Matched tools from routing are passed directly to planning—no redundant searches:
// In analyze-and-route.ts
const matchedTools = await searchAvailableTools(query, userId, organizationId);
const routingDecision = await llmRoute(query, matchedTools);
// In create-task-plan.ts - REUSE matched tools!
async function createTaskPlan(
query: string,
routingDecision: RoutingDecision,
matchedTools: ToolMatch[] // Reused from routing!
) {
// No need to search again
const plan = await decomposeTasks(query, matchedTools);
return plan;
}
Impact: Saves roughly 500ms per request by avoiding redundant MCP connections and search operations.
4. Token Budget Enforcement
We track token usage per step and enforce limits:
interface TokenBudget {
  maxTotalTokens: number;
  maxTokensPerStep: number;
  warningThreshold: number;
  reserveForSynthesis: number;
}

const DEFAULT_BUDGET: TokenBudget = {
  maxTotalTokens: 100_000,
  maxTokensPerStep: 16_000,
  warningThreshold: 0.8, // warn at 80%
  reserveForSynthesis: 4_000
};
function checkBudget(
currentUsage: number,
config: TokenBudget
): BudgetStatus {
const remaining = config.maxTotalTokens - currentUsage - config.reserveForSynthesis;
const usagePercent = currentUsage / config.maxTotalTokens;
return {
allowed: remaining > config.maxTokensPerStep,
remainingTokens: remaining,
usagePercentage: usagePercent,
shouldWarn: usagePercent > config.warningThreshold
};
}
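In the step loop, the check might be used like this (logger, synthesizeFinalAnswer, and the surrounding variables are illustrative):
// Gate each step on the budget so the synthesis reserve is never consumed.
const status = checkBudget(tokensUsedSoFar, DEFAULT_BUDGET);
if (status.shouldWarn) {
  logger.warn(`Token budget at ${Math.round(status.usagePercentage * 100)}%`);
}
if (!status.allowed) {
  return synthesizeFinalAnswer(partialResults); // stop scheduling new steps
}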
Results: By the Numbers
| Metric | Before | After | Improvement |
|--------|--------|-------|-------------|
| Tokens per request | 77,000 | 8,700 | 85% reduction |
| Average latency | 3.2s | 0.8s | 75% faster |
| MCP connections | All servers | 1-2 servers | 90% fewer |
| Semantic searches | Every request | ~50% of requests | 50% reduction |
Key Takeaways
- Hybrid search beats pure semantic search. BM25 keyword matching resolves most queries faster and cheaper than embeddings.
- User intent detection saves resources. Parsing "use X to..." patterns eliminates unnecessary server connections.
- Caching is critical. Qdrant persistence lets us search tools without MCP connections on every request.
- Reuse across phases. Don't search twice: pass matched results from routing to planning.
- Lazy loading wins. Defer expensive operations (MCP connections) until absolutely necessary.
What's Next
We're exploring several improvements:
- Learning from usage patterns: Tools used together frequently should be suggested together
- Predictive pre-warming: Anticipate likely tools based on conversation context
- Federated search: Query multiple Qdrant collections in parallel
- Confidence calibration: ML-based threshold tuning for BM25 → semantic fallback
