Deep Dive: How Fabric AI's RAG Architecture Unlocks Enterprise Knowledge
In our introduction to Fabric AI, we discussed how the platform compresses the software development lifecycle through intelligent automation. Today, we're diving deep into the technical foundation that makes this possible: our Retrieval-Augmented Generation (RAG) architecture.
RAG is more than a buzzword—it's the critical technology that enables AI agents to work with your specific organizational context, standards, and accumulated knowledge. But implementing RAG at enterprise scale requires solving challenges that most off-the-shelf solutions ignore:
- How do you extract clean, structured text from messy real-world documents?
- How do you chunk documents to preserve semantic meaning?
- How do you retrieve the right context without overwhelming the LLM?
- How do you ensure multi-tenant data isolation in vector databases?
- How do you make the entire pipeline fault-tolerant and observable?
Let's explore how Fabric AI solves each of these challenges.
The RAG Problem: Context is Everything
Modern large language models (LLMs) are incredibly capable, but they have a fundamental limitation: they don't know about your organization's specific data, processes, or standards. Ask GPT-4 to generate a PRD, and you'll get a generic template. Ask it to write API documentation, and it won't know your naming conventions or architectural patterns.
The naive solution is to fine-tune a model on your data. But fine-tuning is:
- Expensive: Thousands of dollars per training run
- Slow: Days or weeks to complete
- Static: Models become outdated as soon as your documentation changes
- Opaque: Hard to understand what the model "learned"
- Risky: Training data leakage concerns with sensitive enterprise data
RAG provides a better approach: Keep the base model unchanged, but dynamically inject relevant context into each request. This way, the model applies its general reasoning abilities to your specific organizational knowledge.
Architecture Overview: Five Stages of RAG
Fabric AI's RAG pipeline consists of five interconnected stages, each addressing a specific technical challenge:
- Stage 1: Multi-format document extraction
- Stage 2: Semantic chunking
- Stage 3: Vector embedding generation
- Stage 4: Vector storage with Qdrant
- Stage 5: Intelligent retrieval

Let's examine each stage in detail.
Stage 1: Multi-Format Document Extraction
Enterprises don't store knowledge in neat Markdown files. Real-world data comes in:
- Scanned PDFs with complex layouts
- PowerPoint presentations with embedded diagrams
- Word documents with tables and images
- Confluence pages with nested content
- Legacy formats (older Office versions, proprietary formats)
- Code repositories with inline documentation
Each format requires specialized extraction logic. Fabric AI solves this with a multi-extractor strategy.
The Extraction Factory Pattern
We've implemented an abstraction layer that routes each document to the appropriate extractor, based on the criteria below (a simplified routing sketch follows the list):
- File format: MIME type and extension detection
- Content complexity: Layout analysis, table detection
- Cost constraints: User or organization budget limits
- Extraction strategy: Speed vs. accuracy tradeoffs
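As a deliberately simplified sketch of that routing decision (the names DocumentProfile and routeDocument are illustrative, and the real factory weighs more signals than shown here):

```typescript
// Illustrative sketch only; the production factory considers more signals.
type ExtractorChoice =
  | "unstructured-io"
  | "llamaparse"
  | "azure-document-intelligence"
  | "aws-textract"
  | "hybrid-pdf";

interface DocumentProfile {
  mimeType: string;            // from MIME type and extension detection
  hasComplexLayout: boolean;   // from a cheap layout/table pre-scan
  isCodeHeavy: boolean;        // repositories, markdown-heavy technical docs
  isStructuredForm: boolean;   // forms, invoices, repeated structure
  requiresCompliance: boolean; // enterprise governance constraints
}

function routeDocument(doc: DocumentProfile): ExtractorChoice {
  if (doc.requiresCompliance) return "azure-document-intelligence"; // governance path
  if (doc.isCodeHeavy) return "llamaparse";                         // preserves code formatting
  if (doc.isStructuredForm) return "aws-textract";                  // high-volume forms
  if (doc.hasComplexLayout) return "unstructured-io";               // complex layouts, tables
  return "hybrid-pdf";                                              // simple docs handled locally
}
```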
Extractor Selection Logic
Unstructured.io (Priority 1):
- Best for: Complex layouts, academic papers, mixed-format documents
- Pros: Excellent table extraction, understands document structure
- Cons: Higher cost, slower processing
- Use when: Accuracy is critical, cost is not a constraint
LlamaParse (Priority 2):
- Best for: Technical documentation, code comments, markdown-heavy content
- Pros: Optimized for developer content, preserves code formatting
- Cons: Less effective with visual layouts
- Use when: Processing code repositories or technical specs
Azure Document Intelligence (Priority 3):
- Best for: Enterprise compliance requirements, forms, invoices
- Pros: SOC 2 compliant, excellent OCR, layout understanding
- Cons: Requires Azure subscription, regional availability
- Use when: Enterprise governance is required
AWS Textract (Priority 4):
- Best for: High-volume form processing, structured documents
- Pros: Fast, cost-effective at scale, good table extraction
- Cons: Less effective with unstructured content
- Use when: Processing large batches of similar documents
Hybrid PDF Extractor (Fallback):
- Best for: Simple PDFs, local processing
- Combines: pdf-parse for text + Tesseract OCR for images
- Use when: All other extractors fail or are unavailable
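In code, the priority order above behaves like a fallback chain: try the preferred extractor first, and on failure or empty output move down the list. A minimal sketch, assuming each extractor client exposes an extract method:

```typescript
// Sketch of the priority fallback chain; extractors are passed in the priority
// order described above, and each exposes an assumed extract() method.
async function extractWithFallback(
  content: Uint8Array,
  extractors: Array<{ name: string; extract: (data: Uint8Array) => Promise<string> }>
): Promise<{ text: string; extractor: string }> {
  const errors: string[] = [];

  for (const extractor of extractors) {
    try {
      const text = await extractor.extract(content);
      if (text.trim().length > 0) {
        return { text, extractor: extractor.name };
      }
      errors.push(`${extractor.name}: empty output`);
    } catch (err) {
      errors.push(`${extractor.name}: ${(err as Error).message}`);
      // Fall through to the next extractor in priority order
    }
  }

  throw new Error(`All extractors failed: ${errors.join("; ")}`);
}
```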
Cost-Aware Extraction
Organizations can configure extraction strategies:
```
// Local-only strategy (free, fast, lower accuracy)
strategy: "local-only"

// Hybrid strategy (fallback to cloud if local fails)
strategy: "hybrid"

// Cloud-first strategy (best accuracy, higher cost)
strategy: "cloud-first"

// Budget-aware (automatic provider selection based on cost)
strategy: "budget-aware"
maxCost: 0.10 // Maximum cost per document in USD
```

The system tracks usage per user and organization, preventing runaway costs while maximizing extraction quality.
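As a rough illustration of how a budget guard might sit in front of the cloud extractors, here is a sketch with assumed shapes (BudgetPolicy and the spend-lookup function are hypothetical names, not the actual API):

```typescript
// Illustrative budget guard; the spend lookup is assumed to be backed by the
// per-user/per-organization usage tracking described above.
interface BudgetPolicy {
  maxCostPerDocumentUsd: number;  // e.g., 0.10 from the "budget-aware" strategy
  monthlyOrgBudgetUsd: number;
}

async function enforceBudget(
  orgId: string,
  estimatedCostUsd: number,
  policy: BudgetPolicy,
  getMonthlySpend: (orgId: string) => Promise<number>
): Promise<void> {
  if (estimatedCostUsd > policy.maxCostPerDocumentUsd) {
    throw new Error("Estimated extraction cost exceeds the per-document limit");
  }
  const spentThisMonth = await getMonthlySpend(orgId);
  if (spentThisMonth + estimatedCostUsd > policy.monthlyOrgBudgetUsd) {
    throw new Error("Organization extraction budget exhausted for this month");
  }
}
```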
Stage 2: Semantic Chunking
Once text is extracted, we face a new challenge: how do we split it into manageable pieces for embedding and retrieval?
Naive approaches fail:
- Fixed-size chunks (e.g., every 500 tokens) break semantic units mid-sentence
- Paragraph-based chunks create uneven sizes and lose cross-paragraph context
- Section-based chunks can be too large for embedding models (which have token limits)
The Semantic Chunking Algorithm
Fabric AI implements recursive semantic chunking (a simplified sketch follows the list) that:
- Respects document structure: Preserves headings, lists, code blocks
- Maintains semantic coherence: Never splits mid-sentence or mid-thought
- Creates overlapping windows: Chunks share context to prevent information loss
- Adapts to content type: Different strategies for code vs. prose vs. tables
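A minimal sketch of the core idea: split on paragraph and sentence boundaries, pack sentences into token-budgeted chunks, and carry a small sentence overlap between chunks. The production chunker also special-cases code, tables, and lists; countTokens is an assumed helper.

```typescript
// Simplified recursive chunker: structural boundaries first, then sentence packing.
function chunkText(
  text: string,
  maxTokens: number,
  overlapSentences: number,
  countTokens: (s: string) => number
): string[] {
  // Split on blank lines (paragraphs), then on sentence boundaries,
  // so we never cut mid-sentence.
  const sentences = text
    .split(/\n{2,}/)
    .flatMap(p => p.match(/[^.!?]+[.!?]+(\s|$)|[^.!?]+$/g) ?? [p]);

  const chunks: string[] = [];
  let current: string[] = [];
  let currentTokens = 0;

  for (const sentence of sentences) {
    const tokens = countTokens(sentence);
    if (currentTokens + tokens > maxTokens && current.length > 0) {
      chunks.push(current.join(" ").trim());
      // Overlapping window: carry the last few sentences into the next chunk
      current = current.slice(-overlapSentences);
      currentTokens = countTokens(current.join(" "));
    }
    current.push(sentence);
    currentTokens += tokens;
  }
  if (current.length > 0) chunks.push(current.join(" ").trim());
  return chunks;
}
```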
Chunk Metadata Enrichment
Each chunk is enriched with metadata for better retrieval:
```typescript
interface SemanticChunk {
  text: string;             // The actual chunk text
  tokens: number;           // Token count

  // Position metadata
  documentId: string;
  chunkIndex: number;       // Position in document
  totalChunks: number;

  // Structural metadata
  headingPath: string[];    // ["Introduction", "Architecture", "RAG Pipeline"]
  sectionType: string;      // "heading", "paragraph", "code", "list", "table"
  depth: number;            // Heading depth (h1=1, h2=2, etc.)

  // Semantic metadata
  keywords: string[];       // Extracted keywords
  entities: string[];       // Named entities (people, orgs, tech)
  language: string;         // Detected language

  // Context metadata
  previousChunk?: string;   // ID of previous chunk
  nextChunk?: string;       // ID of next chunk
  relatedChunks: string[];  // Cross-references

  // Tenant metadata
  organizationId: string;
  projectId?: string;
  tags: string[];
}
```

This rich metadata enables sophisticated filtering during retrieval.
Stage 3: Vector Embedding Generation
With semantically coherent chunks, we generate vector embeddings—high-dimensional numerical representations that capture semantic meaning.
Embedding Model Selection
Fabric AI supports multiple embedding providers:
| Provider | Model | Dimensions | Max Tokens | Use Case |
|----------|-------|------------|------------|----------|
| OpenAI | text-embedding-3-large | 3072 | 8191 | Best overall quality, expensive |
| OpenAI | text-embedding-3-small | 1536 | 8191 | Good balance of cost/quality |
| Cohere | embed-multilingual-v3.0 | 1024 | 512 | Multilingual support |
| Azure OpenAI | text-embedding-ada-002 | 1536 | 8191 | Enterprise compliance |
| Custom | (user-provided) | Variable | Variable | On-premise requirements |
The choice of embedding model is critical and affects:
- Retrieval quality: Better embeddings = more relevant results
- Cost: Embeddings are generated once but retrieved many times
- Latency: Larger embedding dimensions increase search time
- Storage: Higher dimensions require more vector database storage
Batch Processing for Efficiency
Embedding generation is optimized for throughput:
```typescript
// Naive approach: One API call per chunk (slow, expensive)
for (const chunk of chunks) {
  const embedding = await embedChunk(chunk);
  await storeEmbedding(embedding);
}

// Optimized approach: Batch processing
const BATCH_SIZE = 100; // Provider-dependent
for (let i = 0; i < chunks.length; i += BATCH_SIZE) {
  const batch = chunks.slice(i, i + BATCH_SIZE);
  const embeddings = await embedBatch(batch); // Single API call
  await storeBatch(embeddings);               // Batch database write
}
```

This reduces:
- API calls: 100x fewer calls for 10,000 chunks
- Latency: Parallel processing within batches
- Cost: Batch pricing discounts from providers
Embedding Quality Validation
Not all embeddings are equal. We validate quality through several checks, illustrated in the sketch after this list:
- Magnitude check: Ensure vectors are normalized (unit vectors)
- Similarity sanity tests: Related chunks should have high cosine similarity
- Outlier detection: Flag chunks with unusual embedding patterns
- Cross-validation: Verify retrieval quality on sample queries
Poor-quality embeddings are flagged for re-processing or manual review.
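For illustration, the first two checks above reduce to a few lines of vector math (the thresholds shown are placeholders, not the production cutoffs):

```typescript
// Magnitude (normalization) check and cosine-similarity sanity test.
function magnitude(v: number[]): number {
  return Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
}

function cosineSimilarity(a: number[], b: number[]): number {
  const dot = a.reduce((sum, x, i) => sum + x * b[i], 0);
  return dot / (magnitude(a) * magnitude(b));
}

function isNormalized(v: number[], tolerance = 1e-3): boolean {
  return Math.abs(magnitude(v) - 1) < tolerance;
}

// Sanity test: two chunks from the same section should score well above a
// random pair; the 0.5 threshold is illustrative only.
function passesSimilaritySanityTest(related: [number[], number[]]): boolean {
  return cosineSimilarity(related[0], related[1]) > 0.5;
}
```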
Stage 4: Vector Storage with Qdrant
With embeddings generated, we need fast, scalable storage with sophisticated filtering capabilities. Enter Qdrant—a vector database built specifically for production RAG systems.
Why Qdrant Over Alternatives?
We evaluated several vector databases:
| Feature | Qdrant | Pinecone | Weaviate | Milvus |
|---------|--------|----------|----------|--------|
| Performance | Excellent | Excellent | Good | Excellent |
| Multi-tenancy | Native filtering | Namespace-based | GraphQL filters | Partitions |
| Metadata filtering | Rich (nested JSON) | Limited | Rich | Limited |
| Self-hosted | Yes | No | Yes | Yes |
| Managed option | Yes (Qdrant Cloud) | Yes | Yes | Yes (Zilliz) |
| Hybrid search | Yes (dense + sparse) | No | Yes | Yes |
| Payload size | Unlimited | 40KB limit | Large | Large |
Qdrant wins for enterprise RAG because:
- Flexible filtering: Query by organization, project, document type, date range, etc.
- Self-hosted option: Keep sensitive data in your VPC
- Performance: Sub-millisecond search on millions of vectors
- Payload flexibility: Store full chunk metadata without size limits
Multi-Tenant Data Isolation
Ensuring data isolation in a shared vector database is non-trivial. Fabric AI implements layered isolation:
Security layers:
- Payload encryption: Organization-specific encryption keys
- Mandatory filters: All queries include an organizationId filter
- Row-level security: Database-level enforcement in PostgreSQL metadata
- Audit logging: All access logged with user and organization context
- API authentication: JWT tokens with organization claims
A misconfigured query literally cannot return results from other organizations—the database prevents it.
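A sketch of how the mandatory-filter layer can be enforced in code: every search goes through a query builder that injects the tenant filter, so callers never construct raw filters themselves (names here are illustrative):

```typescript
// Illustrative query-builder for the "mandatory filters" layer.
interface TenantContext {
  organizationId: string;
  projectId?: string;
}

function buildTenantFilter(ctx: TenantContext, extraConditions: object[] = []) {
  return {
    must: [
      { key: "organizationId", match: { value: ctx.organizationId } },
      ...(ctx.projectId ? [{ key: "projectId", match: { value: ctx.projectId } }] : []),
      ...extraConditions
    ]
  };
}

// Callers pass the tenant context, never a hand-written filter:
// await qdrant.search({ collection: "chunks", vector, filter: buildTenantFilter(ctx) });
```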
Optimizing Vector Indexes
Qdrant uses HNSW (Hierarchical Navigable Small World) graphs for approximate nearest neighbor search. We tune several parameters:
```typescript
// Collection configuration
{
  vectors: {
    size: 1536,          // Embedding dimensions
    distance: "Cosine"   // Similarity metric
  },
  hnsw_config: {
    m: 16,               // Number of edges per node (tradeoff: accuracy vs. memory)
    ef_construct: 100,   // Build-time quality (higher = better but slower indexing)
    ef_search: 50        // Query-time quality (higher = better but slower queries)
  },
  optimizer_config: {
    indexing_threshold: 20000 // Trigger indexing after N vectors
  }
}
```

These settings balance:
- Accuracy: How close to the true nearest neighbors
- Speed: Query latency
- Memory: Index size
- Build time: How long indexing takes
Different workloads require different tuning—Fabric AI adapts based on collection size and usage patterns.
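As a rough illustration of that adaptation, a size-based heuristic might look like the following (the specific values are placeholders, not Fabric AI's actual tuning table):

```typescript
// Illustrative heuristic only: pick HNSW parameters from collection size.
interface HnswParams {
  m: number;
  ef_construct: number;
  ef_search: number;
}

function hnswParamsForCollection(vectorCount: number): HnswParams {
  if (vectorCount < 100_000) {
    return { m: 16, ef_construct: 100, ef_search: 50 };  // defaults shown above
  }
  if (vectorCount < 1_000_000) {
    return { m: 24, ef_construct: 200, ef_search: 100 }; // more edges for better recall
  }
  return { m: 32, ef_construct: 400, ef_search: 128 };   // large collections: favor recall
}
```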
Stage 5: Intelligent Retrieval
The final stage—and arguably the most important—is retrieval: given a user query, which chunks should we send to the LLM?
Naive RAG systems just do vector similarity search and call it a day. This fails because:
- Too much context: Overwhelming the LLM with 50 chunks hurts performance
- Too little context: Missing critical information leads to hallucinations
- Irrelevant context: Similar vectors aren't always semantically relevant
- Stale context: Old documents shouldn't outrank recent ones
Fabric AI implements multi-stage retrieval to solve these problems.
Stage 5.1: Query Understanding
Before searching, we analyze the user query:
```typescript
interface QueryAnalysis {
  intent: "generate" | "search" | "summarize" | "compare";
  entities: string[];   // Extracted entities
  keywords: string[];   // Key terms
  timeframe?: string;   // "recent", "last quarter", etc.
  scope: "project" | "org" | "system";
}
```

This analysis informs (a sketch of the mapping follows the list):
- How many chunks to retrieve: Summarization needs more context than generation
- Which filters to apply: Recent documents for "what's new" queries
- How to rank results: Prioritize specific projects or document types
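A minimal sketch of that mapping, using the QueryAnalysis interface above (the limits, filter names, and resolveTimeframe helper are illustrative assumptions):

```typescript
// Illustrative mapping from query analysis to a retrieval plan.
interface RetrievalPlan {
  limit: number;                    // how many chunks to retrieve
  filters: Record<string, unknown>; // metadata filters to apply
  recencyBoost: boolean;            // rank recent documents higher
}

function planRetrieval(analysis: QueryAnalysis, orgId: string, projectId?: string): RetrievalPlan {
  return {
    limit: analysis.intent === "summarize" ? 40 : 15,
    filters: {
      organizationId: orgId,
      ...(analysis.scope === "project" && projectId ? { projectId } : {}),
      ...(analysis.timeframe ? { createdAfter: resolveTimeframe(analysis.timeframe) } : {})
    },
    recencyBoost: analysis.timeframe !== undefined
  };
}

// Converts phrases like "last quarter" to a cutoff date; assumed helper.
declare function resolveTimeframe(timeframe: string): string;
```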
Stage 5.2: Hybrid Search
We combine multiple search strategies:
- Dense vector search: Semantic similarity using embeddings
- Sparse keyword search: BM25 full-text search on chunk text
- Metadata filtering: Exact matches on organizationId, projectId, tags, dates
```typescript
// Dense vector search
const vectorResults = await qdrant.search({
  collection: "chunks",
  vector: queryEmbedding,
  limit: 100, // Over-retrieve for reranking
  filter: {
    must: [
      { key: "organizationId", match: { value: orgId } },
      { key: "projectId", match: { value: projectId } }
    ]
  }
});

// Sparse keyword search (via PostgreSQL)
const keywordResults = await db.chunk.findMany({
  where: {
    organizationId: orgId,
    projectId: projectId,
    text: { search: queryKeywords } // Full-text search
  },
  take: 100
});

// Merge and deduplicate
const combinedResults = mergeResults(vectorResults, keywordResults);
```
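The mergeResults helper isn't shown above; one common way to implement it is reciprocal rank fusion, sketched here with illustrative field names:

```typescript
// Minimal merge-and-deduplicate sketch using reciprocal rank fusion.
interface RankedChunk {
  id: string;
  [key: string]: unknown;
}

function mergeResults(vectorResults: RankedChunk[], keywordResults: RankedChunk[], k = 60): RankedChunk[] {
  const scores = new Map<string, { chunk: RankedChunk; score: number }>();

  const addList = (list: RankedChunk[]) => {
    list.forEach((chunk, rank) => {
      const entry = scores.get(chunk.id) ?? { chunk, score: 0 };
      entry.score += 1 / (k + rank + 1); // higher-ranked items contribute more
      scores.set(chunk.id, entry);
    });
  };

  addList(vectorResults);
  addList(keywordResults);

  return [...scores.values()]
    .sort((a, b) => b.score - a.score)
    .map(entry => entry.chunk);
}
```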
Hybrid search catches:
- Semantic matches: "user authentication" matches "login flow"
- Exact matches: "API version 2.1" matches precisely
- Acronyms and abbreviations: "SDLC" matches "Software Development Lifecycle"
Stage 5.3: Reranking for Relevance
Raw retrieval results are reranked using multiple signals:
Reranking formula:
```
final_score = (
  0.4  * vector_similarity +
  0.2  * keyword_relevance +
  0.15 * recency_score +
  0.1  * usage_score +
  0.1  * approval_score +
  0.05 * structure_score
)
```
Weights are tuned based on query type and user preferences.
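Translated directly into code, the formula looks like this (each signal is assumed to be normalized to [0, 1] upstream, and the weights would come from query-type configuration):

```typescript
// Weighted reranking score, mirroring the formula above.
interface RerankSignals {
  vectorSimilarity: number;
  keywordRelevance: number;
  recencyScore: number;
  usageScore: number;
  approvalScore: number;
  structureScore: number;
}

function rerankScore(s: RerankSignals): number {
  return (
    0.4  * s.vectorSimilarity +
    0.2  * s.keywordRelevance +
    0.15 * s.recencyScore +
    0.1  * s.usageScore +
    0.1  * s.approvalScore +
    0.05 * s.structureScore
  );
}

// Usage: candidates.sort((a, b) => rerankScore(b.signals) - rerankScore(a.signals));
```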
Stage 5.4: Context Assembly
The final step is assembling retrieved chunks into a coherent context for the LLM:
- Deduplication: Remove overlapping chunks (remember our overlapping window strategy?)
- Ordering: Sort by position in source documents to maintain narrative flow
- Formatting: Add citations, headings, and metadata for LLM understanding
- Token budget management: Fit context within the model's context window
```typescript
function assembleContext(chunks: Chunk[], maxTokens: number): string {
  let context = "# Relevant Context from Past Documents\n\n";
  let tokenCount = countTokens(context);
  const included: Chunk[] = []; // track only the chunks that fit the budget

  for (const chunk of chunks) {
    const citation = `## [${chunk.documentName}](${chunk.documentUrl})\n`;
    const chunkText = `${chunk.text}\n\n`;
    const chunkTokens = countTokens(citation + chunkText);

    if (tokenCount + chunkTokens > maxTokens) {
      break; // Stop if we'd exceed the token budget
    }

    context += citation + chunkText;
    tokenCount += chunkTokens;
    included.push(chunk);
  }

  context += `\n---\nTotal context: ${included.length} chunks from ${uniqueDocuments(included).length} documents\n`;
  return context;
}
```

The assembled context is prepended to the user's prompt, giving the LLM access to your organization's specific knowledge.
Observability: Debugging RAG Systems
RAG systems are complex, and things can go wrong:
- Low-quality extractions produce garbage embeddings
- Retrieval returns irrelevant context
- LLMs generate content that contradicts source documents
- Performance degrades as the vector database grows
Fabric AI provides comprehensive observability:
Extraction Metrics
Tracked metrics (a sample record shape is sketched after this list):
- Success rate per extractor
- Average processing time by document size
- Cost per page/megabyte
- Error patterns (timeouts, API failures, format issues)
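For concreteness, a per-extraction metrics record might be shaped like this (a hypothetical schema, not the exact production model):

```typescript
// Hypothetical per-extraction metrics record covering the items above.
interface ExtractionMetrics {
  extractor: string;          // e.g., "unstructured-io"
  success: boolean;
  durationMs: number;
  documentBytes: number;
  pages?: number;
  costUsd: number;
  errorType?: "timeout" | "api_failure" | "unsupported_format";
  organizationId: string;
  recordedAt: Date;
}
```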
Retrieval Quality Metrics
Key questions we answer:
- Are users approving AI-generated content (high approval = good retrieval)?
- Do users edit specific sections (indicates missing or wrong context)?
- Are certain document types under-represented in results?
- How does retrieval quality degrade as the vector database grows?
End-to-End Tracing
Every RAG operation is traced through the entire pipeline:
```
[Workflow: doc_processing_abc123]
├─ [Activity: extract_document] 2.3s
│  ├─ Try Unstructured.io
│  │  └─ Success (2.1s, $0.05)
│  └─ Extracted 15,342 characters
├─ [Activity: chunk_document] 0.4s
│  └─ Generated 47 chunks (avg 326 tokens)
├─ [Activity: generate_embeddings] 1.8s
│  └─ 47 embeddings (batch size: 25)
├─ [Activity: store_vectors] 0.3s
│  └─ Stored in Qdrant collection: org_xyz_chunks
└─ [Activity: update_status] 0.1s
   └─ Document status: READY

Total duration: 4.9s
Total cost: $0.07
```
This level of detail enables:
- Performance debugging: Identify slow steps
- Cost attribution: Track spending by organization and project
- Error diagnosis: Pinpoint exactly where failures occur
- Audit trails: Comply with enterprise governance requirements
Performance Optimization Techniques
As your knowledge base grows, maintaining performance requires ongoing optimization:
1. Lazy Loading and Pagination
Don't load all chunks into memory:
```typescript
// Bad: Loads all results into memory
const results = await qdrant.search({ limit: 1000 });

// Good: Paginate through results
for (let offset = 0; offset < totalResults; offset += 100) {
  const batch = await qdrant.search({ limit: 100, offset });
  await processBatch(batch);
}
```

2. Caching Frequently Accessed Embeddings
Popular documents are retrieved repeatedly—cache their embeddings:
```typescript
const embeddingCache = new LRU({ max: 10000 });

async function getEmbedding(chunkId: string) {
  const cached = embeddingCache.get(chunkId);
  if (cached) return cached;

  const embedding = await fetchFromQdrant(chunkId);
  embeddingCache.set(chunkId, embedding);
  return embedding;
}
```

3. Precomputed Aggregations
For common queries, precompute results:
```sql
-- Materialized view for popular documents
CREATE MATERIALIZED VIEW popular_documents AS
SELECT
  document_id,
  COUNT(*) AS retrieval_count,
  AVG(relevance_score) AS avg_relevance
FROM retrieval_logs
WHERE created_at > NOW() - INTERVAL '30 days'
GROUP BY document_id
ORDER BY retrieval_count DESC;
```

4. Batch Operations
Minimize database round-trips:
```typescript
// Bad: N+1 queries
for (const chunkId of chunkIds) {
  const metadata = await db.chunk.findUnique({ where: { id: chunkId } });
  await enrichWithMetadata(metadata);
}

// Good: Single batch query
const metadata = await db.chunk.findMany({
  where: { id: { in: chunkIds } }
});
const metadataMap = new Map(metadata.map(m => [m.id, m]));
```

5. Index Optimization
Monitor and optimize PostgreSQL indexes:
```sql
-- Composite index for common filter patterns
CREATE INDEX idx_chunks_org_project_date
ON chunks (organization_id, project_id, created_at DESC);

-- GiST index for full-text search
CREATE INDEX idx_chunks_text_search
ON chunks USING GiST (to_tsvector('english', text));
```

Security Considerations
RAG systems handle sensitive enterprise data—security is paramount:
1. Encryption at Rest
- Vector database: Qdrant collections encrypted with organization-specific keys
- PostgreSQL: Transparent data encryption (TDE) enabled
- S3: Server-side encryption (SSE-KMS) with customer-managed keys
2. Encryption in Transit
- TLS 1.3: All inter-service communication encrypted
- mTLS: Mutual authentication between services
- VPC peering: Database and vector store in private subnets
3. Access Controls
```sql
-- Row-level security in PostgreSQL
CREATE POLICY chunk_isolation ON chunks
USING (organization_id = current_setting('app.current_org_id')::text);
```

```typescript
// Qdrant filter enforcement
const results = await qdrant.search({
  filter: {
    must: [
      { key: "organizationId", match: { value: getOrgId() } }
    ]
  }
});
```

4. Audit Logging
Every operation is logged:
```typescript
interface AuditLog {
  timestamp: Date;
  userId: string;
  organizationId: string;
  action: "search" | "retrieve" | "generate";
  resourceId: string;
  metadata: {
    query?: string;
    resultsCount?: number;
    documentsAccessed?: string[];
  };
  ipAddress: string;
  userAgent: string;
}
```

Logs are:
- Immutable: Append-only, tamper-proof
- Retained: Per compliance requirements (1-7 years)
- Monitored: Automated anomaly detection
The Road Ahead: Future RAG Enhancements
We're continuously improving our RAG architecture:
1. Multimodal RAG
Extend beyond text to handle:
- Images and diagrams: Architecture diagrams, UI mockups, charts
- Videos and audio: Meeting recordings, presentation videos
- Code: Semantic code search and understanding
2. GraphRAG
Represent document relationships as knowledge graphs:
- Entity linking: Connect mentions across documents
- Relationship extraction: Understand how concepts relate
- Path-based retrieval: Find documents via relationship chains
3. Adaptive Retrieval
Learn from user feedback:
- Reinforcement learning: Optimize ranking based on approvals
- Personalization: Learn user preferences for context
- Query expansion: Automatically add related terms
4. Cross-lingual RAG
Support multilingual organizations:
- Language detection: Automatically identify chunk languages
- Cross-lingual retrieval: Query in English, retrieve Spanish
- Translation: Seamless context assembly across languages
Conclusion: RAG as a Competitive Advantage
RAG is not just a technical implementation detail—it's the foundation that enables Fabric AI to transform enterprise knowledge into actionable intelligence. By solving the hard problems of:
- Extraction from messy real-world documents
- Chunking that preserves semantic meaning
- Embedding with quality validation
- Storage with multi-tenant isolation
- Retrieval with intelligent ranking
...we've built a system that turns your accumulated documentation into a competitive advantage.
Your competitors are using generic AI. You're using AI grounded in your specific organizational knowledge.
That's the difference between 10x faster and 10x better.
Want to see how Fabric AI's RAG architecture works with your data? Request a demo and we'll show you a live extraction, embedding, and retrieval pipeline with your actual documents.
Next up: In our third post, we'll dive into how Temporal workflows enable durable, fault-tolerant execution for long-running AI operations. Stay tuned!
