Building an Intelligent Tool Search System: How Fabric's Orchestrator Finds the Right Tool in Milliseconds

January 15, 2026


The first time we connected our orchestrator to a customer's full toolkit, we hit a wall.

They had 15 MCP servers configured. GitHub. Jira. Slack. Linear. Context7 for docs. Firecrawl for scraping. Custom internal tools. The works. Every time a user asked a simple question, we were stuffing 77,000 tokens of tool definitions into context.

The response was slow. The cost was astronomical. And half the time, the LLM picked the wrong tool anyway because it was drowning in options.

We needed a smarter approach.


The Challenge: Finding a Needle in a Haystack of Tools

Modern AI applications don't just generate text—they take action. They create Jira tickets, query databases, send Slack messages, and scrape websites. But as the number of available tools grows, a critical question emerges:

How do you efficiently find the right tool among hundreds of options without burning through your token budget?

At Fabric, our orchestrator manages tools from:

  • MCP Servers (Model Context Protocol) - user-configured tool providers
  • Registered Agents - specialized AI agents with specific capabilities
  • Integrations - Slack, GitHub, Linear, Notion, and more
  • Workflows - user-defined Temporal workflows
  • Built-in Tools - RAG queries, web scraping, YouTube processing

Loading all tools into context for every request would consume ~77,000 tokens. That's expensive, slow, and often unnecessary when the user just wants to "send a Slack message."


Our Solution: A Multi-Stage Search Pipeline

We implemented a multi-stage search pipeline inspired by Anthropic's Tool Search Tool pattern. The key insight: most queries can be resolved with fast keyword matching, reserving expensive semantic search for ambiguous cases.


This pipeline achieves 85% token reduction (77K → 8.7K tokens) while maintaining intelligent tool selection.
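
Throughout the pipeline, matched tools are passed between stages as small records. The exact types aren't shown in the post, so here is a sketch of the shapes the snippets below assume (field names follow the code; the types themselves are illustrative):

// Sketch of the match record produced by each search stage (illustrative types).
interface ToolMatch {
  serverName: string;   // MCP server, integration, or "builtin" for built-in tools
  toolName: string;
  confidence: number;   // 0..1 score assigned by the stage that found the match
  source: "always" | "explicit" | "keyword" | "semantic";
}

// Sketch of the cached tool metadata stored in Qdrant (see loadFromQdrant below).
interface ToolEntry {
  serverName: string;
  toolName: string;
  description: string;
  keywords: string[];
  inputSchema: unknown;
}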


Deep Dive: Each Stage Explained

Stage 1: Always-Available Tools

Some tools are so frequently used that they should always be considered. We maintain a small set (3-5) of "always-available" capabilities that bypass the search entirely.

const ALWAYS_AVAILABLE_CAPABILITIES = [
  {
    name: "workspace_rag_query",
    keywords: ["document", "knowledge", "search workspace", "find in docs"],
    description: "Query workspace documents and knowledge base"
  },
  {
    name: "web_search",
    keywords: ["search web", "google", "look up", "find online"],
    description: "Search the internet for information"
  }
];

function matchAlwaysAvailableCapabilities(query: string): ToolMatch[] {
  const q = query.toLowerCase();
  return ALWAYS_AVAILABLE_CAPABILITIES
    .filter(cap => cap.keywords.some(kw => q.includes(kw)))
    // Map built-in capabilities onto the ToolMatch shape used by later stages
    .map(cap => ({ serverName: "builtin", toolName: cap.name, confidence: 1.0, source: "always" }));
}

Why it matters: For queries like "search my documents for the Q4 report," we skip all search overhead and directly return workspace_rag_query.
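
For example, with the matcher above (the output shape follows the ToolMatch sketch):

matchAlwaysAvailableCapabilities("search my documents for the Q4 report");
// -> [{ serverName: "builtin", toolName: "workspace_rag_query", confidence: 1.0, source: "always" }]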


Stage 2: Explicit Server Detection

Users often know exactly which tool provider they want. We parse natural language patterns to detect explicit server mentions:

function detectExplicitServerMention(query: string): string | null {
  const patterns = [
    /\buse\s+(\w+(?:\s+mcp)?)\s+(?:to|for)\b/i,
    /\bwith\s+(\w+)\b/i,
    /\bvia\s+(\w+)\b/i,
    /\bfrom\s+(\w+)\s+(?:mcp|server)\b/i,
    /\busing\s+(\w+)\b/i
  ];

  for (const pattern of patterns) {
    const match = query.match(pattern);
    if (match) {
      return normalizeServerName(match[1]);
    }
  }
  return null;
}
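
The normalizeServerName helper isn't shown in the post; a minimal sketch of what it might do (lowercase, strip an "mcp" suffix, and resolve a few aliases) looks like this:

// Hypothetical helper; the alias table is purely illustrative.
const SERVER_ALIASES: Record<string, string> = {
  gh: "github"
};

function normalizeServerName(raw: string): string {
  const name = raw.toLowerCase().replace(/\s+mcp$/, "").trim();
  return SERVER_ALIASES[name] ?? name;
}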

Why it matters: If the user says "use context7 to look up React hooks," we connect only to the Context7 MCP server, avoiding connections to potentially dozens of other configured servers.


Stage 3: BM25 Keyword Matching

BM25 (Best Matching 25) is a battle-tested ranking function used by search engines. It's deterministic, fast, and surprisingly effective for tool matching.


Our BM25 implementation extracts keywords from multiple sources:

function extractKeywords(tool: MCPTool, serverName: string): string[] {
  const keywords: string[] = [];

  // 1. Server name (highest priority for "use X" queries)
  keywords.push(serverName.toLowerCase());
  keywords.push(...serverName.split(/[-_]/).filter(Boolean));

  // 2. Tool name decomposition
  const toolWords = tool.name
    .replace(/([a-z])([A-Z])/g, '$1 $2')  // camelCase
    .replace(/[_-]/g, ' ')                  // snake_case
    .toLowerCase()
    .split(' ');
  keywords.push(...toolWords);

  // 3. Description keywords (first 20 words, stop-word filtered)
  const descWords = tool.description
    ?.split(/\s+/)
    .slice(0, 20)
    .filter(word => !STOP_WORDS.has(word.toLowerCase()))
    ?? [];
  keywords.push(...descWords);

  return [...new Set(keywords)];  // Deduplicate
}

Why it matters: BM25 resolves ~50% of queries with high confidence (>0.7), completely bypassing the more expensive semantic search.
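
The scoring step itself isn't shown in the post. For reference, here is a compact sketch of standard BM25 over the extracted keywords, using the conventional k1 and b defaults (treat it as illustrative rather than Fabric's exact implementation):

// Standard BM25: scores a tool's keyword "document" against the query terms.
function bm25Score(
  queryTerms: string[],
  toolKeywords: string[],
  docFreq: Map<string, number>,  // number of tools containing each term
  totalDocs: number,
  avgDocLength: number,
  k1 = 1.2,
  b = 0.75
): number {
  let score = 0;
  for (const term of queryTerms) {
    const tf = toolKeywords.filter(k => k === term).length;
    if (tf === 0) continue;
    const df = docFreq.get(term) ?? 0;
    const idf = Math.log(1 + (totalDocs - df + 0.5) / (df + 0.5));
    const lengthNorm = 1 - b + b * (toolKeywords.length / avgDocLength);
    score += idf * (tf * (k1 + 1)) / (tf + k1 * lengthNorm);
  }
  return score;
}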


Stage 4: Semantic Search with Qdrant

When keyword matching isn't confident enough, we fall back to semantic search using vector embeddings stored in Qdrant.

async function semanticSearch(
  query: string,
  userId: string,
  organizationId?: string
): Promise<ToolMatch[]> {
  const queryEmbedding = await generateEmbedding(query);

  const results = await qdrantClient.search("tool_capabilities", {
    vector: queryEmbedding,
    limit: 10,
    filter: {
      must: [
        // Multi-tenant filtering
        organizationId
          ? { key: "organizationId", match: { value: organizationId } }
          : { key: "userId", match: { value: userId } }
      ]
    },
    with_payload: true
  });

  return results.map(r => ({
    serverName: r.payload.serverName,
    toolName: r.payload.toolName,
    confidence: r.score,
    source: "semantic"
  }));
}

Why it matters: Semantic search catches conceptually similar queries that keyword matching misses. "Help me track my tasks" might not contain "kanban" or "card," but semantic similarity connects it to project management tools.
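
The post doesn't say what backs generateEmbedding; here is a minimal sketch assuming an OpenAI embedding model (the model choice is an assumption, not necessarily what Fabric uses):

import OpenAI from "openai";

const openai = new OpenAI();  // reads OPENAI_API_KEY from the environment

// Assumed helper: embeds the query for the Qdrant vector search above.
async function generateEmbedding(text: string): Promise<number[]> {
  const response = await openai.embeddings.create({
    model: "text-embedding-3-small",  // assumption; any model matching the collection's vectors works
    input: text
  });
  return response.data[0].embedding;
}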


Stage 5: Hybrid Result Merging

The final stage combines results from all sources with weighted scoring:

function mergeResults(
  alwaysAvailable: ToolMatch[],
  explicitServer: ToolMatch[],
  keywordMatches: ToolMatch[],
  semanticMatches: ToolMatch[]
): ToolMatch[] {
  const merged = new Map<string, ToolMatch>();

  // Priority order: always-available > explicit > keyword > semantic
  const allResults = [
    ...alwaysAvailable.map(t => ({ ...t, confidence: 1.0 })),
    ...explicitServer.map(t => ({ ...t, confidence: 0.95 })),
    ...keywordMatches,
    ...semanticMatches
  ];

  for (const tool of allResults) {
    const key = `${tool.serverName}:${tool.toolName}`;
    const existing = merged.get(key);

    if (!existing) {
      merged.set(key, tool);
    } else if (tool.source !== existing.source) {
      // Hybrid scoring when two different stages (e.g. keyword and semantic)
      // found the same tool: blend the scores, favoring the higher one
      const confidence =
        0.6 * Math.max(tool.confidence, existing.confidence) +
        0.4 * Math.min(tool.confidence, existing.confidence);
      merged.set(key, { ...existing, confidence });
    } else if (tool.confidence > existing.confidence) {
      merged.set(key, tool);
    }
  }

  return [...merged.values()]
    .sort((a, b) => b.confidence - a.confidence)
    .slice(0, 10);
}

The Complete Architecture

Here's how all the pieces fit together in our orchestrator:

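In code terms, the stages compose roughly like this. This is a simplified sketch: loadToolsForServer and rankWithBM25 are assumed helpers, and the 0.7 cutoff for skipping semantic search comes from the BM25 stage described earlier.

async function searchAvailableTools(
  query: string,
  userId: string,
  organizationId?: string
): Promise<ToolMatch[]> {
  // Stage 1: always-available capabilities, no search needed
  const alwaysAvailable = matchAlwaysAvailableCapabilities(query);

  // Stage 2: explicit "use X to ..." server mentions
  const explicitServer = detectExplicitServerMention(query);
  const explicitMatches = explicitServer
    ? await loadToolsForServer(explicitServer, userId, organizationId)  // assumed helper
    : [];

  // Stage 3: BM25 keyword matching over the Qdrant-cached tool metadata
  const cachedTools = await loadFromQdrant(userId, organizationId);
  const keywordMatches = rankWithBM25(query, cachedTools);  // assumed wrapper around bm25Score

  // Stage 4: semantic search only when keyword matching isn't confident enough
  const bestKeywordScore = keywordMatches[0]?.confidence ?? 0;
  const semanticMatches = bestKeywordScore >= 0.7
    ? []
    : await semanticSearch(query, userId, organizationId);

  // Stage 5: weighted merge across all sources
  return mergeResults(alwaysAvailable, explicitMatches, keywordMatches, semanticMatches);
}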

Optimizations That Matter

1. Qdrant Cache Loading

Instead of connecting to MCP servers on every request, we cache tool metadata in Qdrant:

async function loadFromQdrant(
  userId: string,
  organizationId?: string
): Promise<ToolEntry[]> {
  const results = await qdrantClient.scroll("tool_capabilities", {
    filter: {
      must: [
        organizationId
          ? { key: "organizationId", match: { value: organizationId } }
          : { key: "userId", match: { value: userId } }
      ]
    },
    limit: 1000,
    with_payload: true
  });

  return results.points.map(p => ({
    serverName: p.payload.serverName,
    toolName: p.payload.toolName,
    description: p.payload.description,
    keywords: p.payload.keywords,
    inputSchema: p.payload.inputSchema
  }));
}

Impact: Eliminates MCP connection overhead for cached tools. A typical MCP connection takes 200-500ms; loading from Qdrant takes ~20ms.
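
The post doesn't show how the cache gets populated. Here is a sketch of the indexing side, run when a server's tool list is discovered or refreshed (the point-ID scheme and embedding input are assumptions; the payload fields mirror loadFromQdrant above):

import { randomUUID } from "crypto";

// Assumed indexing step: writes one point per tool into the "tool_capabilities" collection.
async function cacheToolsInQdrant(
  serverName: string,
  tools: MCPTool[],
  userId: string,
  organizationId?: string
): Promise<void> {
  const points = await Promise.all(tools.map(async tool => ({
    id: randomUUID(),
    vector: await generateEmbedding(`${tool.name}: ${tool.description ?? ""}`),
    payload: {
      serverName,
      toolName: tool.name,
      description: tool.description,
      keywords: extractKeywords(tool, serverName),
      inputSchema: tool.inputSchema,
      userId,
      ...(organizationId ? { organizationId } : {})
    }
  })));

  await qdrantClient.upsert("tool_capabilities", { points });
}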

2. Lazy MCP Loading

We defer actual tool fetching until execution time:

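The search stages run entirely against the Qdrant cache, and an MCP connection is opened only when a matched tool is actually executed. A sketch of that deferral (connectToMCPServer, MCPClient, and callTool are illustrative names, not Fabric's actual API):

// Connections are opened on demand and reused for the rest of the request.
const mcpClients = new Map<string, MCPClient>();

async function executeTool(match: ToolMatch, args: unknown): Promise<unknown> {
  let client = mcpClients.get(match.serverName);
  if (!client) {
    // Only now do we pay the 200-500ms connection cost, and only for this one server.
    client = await connectToMCPServer(match.serverName);  // assumed helper
    mcpClients.set(match.serverName, client);
  }
  return client.callTool(match.toolName, args);
}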

3. Result Reuse Across Phases

Matched tools from routing are passed directly to planning—no redundant searches:

// In analyze-and-route.ts
const matchedTools = await searchAvailableTools(query, userId, organizationId);
const routingDecision = await llmRoute(query, matchedTools);

// In create-task-plan.ts - REUSE matched tools!
async function createTaskPlan(
  query: string,
  routingDecision: RoutingDecision,
  matchedTools: ToolMatch[]  // Reused from routing!
) {
  // No need to search again
  const plan = await decomposeTasks(query, matchedTools);
  return plan;
}

Impact: Saves ~500ms+ per request by avoiding redundant MCP connections and search operations.

4. Token Budget Enforcement

We track token usage per step and enforce limits:

interface TokenBudget {
  maxTotalTokens: number;
  maxTokensPerStep: number;
  warningThreshold: number;     // warn once usage passes this fraction
  reserveForSynthesis: number;  // tokens held back for the final answer
}

const DEFAULT_TOKEN_BUDGET: TokenBudget = {
  maxTotalTokens: 100_000,
  maxTokensPerStep: 16_000,
  warningThreshold: 0.8,        // warn at 80%
  reserveForSynthesis: 4_000
};

function checkBudget(
  currentUsage: number,
  config: TokenBudget
): BudgetStatus {
  const remaining = config.maxTotalTokens - currentUsage - config.reserveForSynthesis;
  const usagePercent = currentUsage / config.maxTotalTokens;

  return {
    allowed: remaining > config.maxTokensPerStep,
    remainingTokens: remaining,
    usagePercentage: usagePercent,
    shouldWarn: usagePercent > config.warningThreshold
  };
}
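
A sketch of how the check slots into the execution loop (the loop, executeStep, and the plan object are illustrative; DEFAULT_TOKEN_BUDGET is the config sketched above):

// Illustrative: enforcing the budget between plan steps.
let tokensUsed = 0;
for (const step of plan.steps) {  // plan comes from createTaskPlan above
  const budget = checkBudget(tokensUsed, DEFAULT_TOKEN_BUDGET);
  if (!budget.allowed) break;     // stop early; keep the synthesis reserve intact
  if (budget.shouldWarn) console.warn(`Token usage at ${Math.round(budget.usagePercentage * 100)}%`);
  tokensUsed += await executeStep(step);  // assumed helper returning tokens consumed by the step
}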

Results: By the Numbers

| Metric | Before | After | Improvement |
|--------|--------|-------|-------------|
| Tokens per request | 77,000 | 8,700 | 85% reduction |
| Average latency | 3.2s | 0.8s | 75% faster |
| MCP connections | All servers | 1-2 servers | 90% fewer |
| Semantic searches | Every request | ~50% of requests | 50% reduction |


Key Takeaways

  1. Hybrid search beats pure semantic search. BM25 keyword matching resolves about half of all queries faster and more cheaply than embeddings.

  2. User intent detection saves resources. Parsing "use X to..." patterns eliminates unnecessary server connections.

  3. Caching is critical. Qdrant persistence lets us search tools without MCP connections on every request.

  4. Reuse across phases. Don't search twice—pass matched results from routing to planning.

  5. Lazy loading wins. Defer expensive operations (MCP connections) until absolutely necessary.


What's Next

We're exploring several improvements:

  • Learning from usage patterns: Tools used together frequently should be suggested together
  • Predictive pre-warming: Anticipate likely tools based on conversation context
  • Federated search: Query multiple Qdrant collections in parallel
  • Confidence calibration: ML-based threshold tuning for BM25 → semantic fallback