Deep Dive: How Fabric AI's RAG Architecture Unlocks Enterprise Knowledge
In our introduction to Fabric AI, we discussed how the platform compresses the software development lifecycle through intelligent automation. Today, we're diving deep into the technical foundation that makes this possible: our Retrieval-Augmented Generation (RAG) architecture.
RAG is more than a buzzword—it's the critical technology that enables AI agents to work with your specific organizational context, standards, and accumulated knowledge. But implementing RAG at enterprise scale requires solving challenges that most off-the-shelf solutions ignore:
- How do you extract clean, structured text from messy real-world documents?
- How do you chunk documents to preserve semantic meaning?
- How do you retrieve the right context without overwhelming the LLM?
- How do you ensure multi-tenant data isolation in vector databases?
- How do you make the entire pipeline fault-tolerant and observable?
Let's explore how Fabric AI solves each of these challenges.
The RAG Problem: Context is Everything
Modern large language models (LLMs) are incredibly capable, but they have a fundamental limitation: they don't know about your organization's specific data, processes, or standards. Ask GPT-4 to generate a PRD, and you'll get a generic template. Ask it to write API documentation, and it won't know your naming conventions or architectural patterns.
The naive solution is to fine-tune a model on your data. But fine-tuning is:
- Expensive: Thousands of dollars per training run
- Slow: Days or weeks to complete
- Static: Models become outdated as soon as your documentation changes
- Opaque: Hard to understand what the model "learned"
- Risky: Training data leakage concerns with sensitive enterprise data
RAG provides a better approach: Keep the base model unchanged, but dynamically inject relevant context into each request. This way, the model applies its general reasoning abilities to your specific organizational knowledge.
Architecture Overview: Five Stages of RAG
Fabric AI's RAG pipeline consists of five interconnected stages, each addressing a specific technical challenge:
- Stage 1: Multi-format document extraction
- Stage 2: Semantic chunking
- Stage 3: Vector embedding generation
- Stage 4: Vector storage with Qdrant
- Stage 5: Intelligent retrieval

Let's examine each stage in detail.
Stage 1: Multi-Format Document Extraction
Enterprises don't store knowledge in neat Markdown files. Real-world data comes in:
- Scanned PDFs with complex layouts
- PowerPoint presentations with embedded diagrams
- Word documents with tables and images
- Confluence pages with nested content
- Legacy formats (older Office versions, proprietary formats)
- Code repositories with inline documentation
Each format requires specialized extraction logic. Fabric AI solves this with a multi-extractor strategy.
The Extraction Factory Pattern
We've implemented an abstraction layer that routes each document to the appropriate extractor, based on the criteria below (a simplified routing sketch follows the list):
- File format: MIME type and extension detection
- Content complexity: Layout analysis, table detection
- Cost constraints: User or organization budget limits
- Extraction strategy: Speed vs. accuracy tradeoffs
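As a deliberately simplified sketch of that routing decision (the names DocumentProfile and routeDocument are illustrative, and the real factory weighs more signals than shown here):

```typescript
// Illustrative sketch only; the production factory considers more signals.
type ExtractorChoice =
  | "unstructured-io"
  | "llamaparse"
  | "azure-document-intelligence"
  | "aws-textract"
  | "hybrid-pdf";

interface DocumentProfile {
  mimeType: string;            // from MIME type and extension detection
  hasComplexLayout: boolean;   // from a cheap layout/table pre-scan
  isCodeHeavy: boolean;        // repositories, markdown-heavy technical docs
  isStructuredForm: boolean;   // forms, invoices, repeated structure
  requiresCompliance: boolean; // enterprise governance constraints
}

function routeDocument(doc: DocumentProfile): ExtractorChoice {
  if (doc.requiresCompliance) return "azure-document-intelligence"; // governance path
  if (doc.isCodeHeavy) return "llamaparse";                         // preserves code formatting
  if (doc.isStructuredForm) return "aws-textract";                  // high-volume forms
  if (doc.hasComplexLayout) return "unstructured-io";               // complex layouts, tables
  return "hybrid-pdf";                                              // simple docs handled locally
}
```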
Extractor Selection Logic
Unstructured.io (Priority 1):
- Best for: Complex layouts, academic papers, mixed-format documents
- Pros: Excellent table extraction, understands document structure
- Cons: Higher cost, slower processing
- Use when: Accuracy is critical, cost is not a constraint
LlamaParse (Priority 2):
- Best for: Technical documentation, code comments, markdown-heavy content
- Pros: Optimized for developer content, preserves code formatting
- Cons: Less effective with visual layouts
- Use when: Processing code repositories or technical specs
Azure Document Intelligence (Priority 3):
- Best for: Enterprise compliance requirements, forms, invoices
- Pros: SOC 2 compliant, excellent OCR, layout understanding
- Cons: Requires Azure subscription, regional availability
- Use when: Enterprise governance is required
AWS Textract (Priority 4):
- Best for: High-volume form processing, structured documents
- Pros: Fast, cost-effective at scale, good table extraction
- Cons: Less effective with unstructured content
- Use when: Processing large batches of similar documents
Hybrid PDF Extractor (Fallback):
- Best for: Simple PDFs, local processing
- Combines: pdf-parse for text + Tesseract OCR for images
- Use when: All other extractors fail or are unavailable
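In code, the priority order above behaves like a fallback chain: try the preferred extractor first, and on failure or empty output move down the list. A minimal sketch, assuming each extractor client exposes an extract method:

```typescript
// Sketch of the priority fallback chain; extractors are passed in the priority
// order described above, and each exposes an assumed extract() method.
async function extractWithFallback(
  content: Uint8Array,
  extractors: Array<{ name: string; extract: (data: Uint8Array) => Promise<string> }>
): Promise<{ text: string; extractor: string }> {
  const errors: string[] = [];

  for (const extractor of extractors) {
    try {
      const text = await extractor.extract(content);
      if (text.trim().length > 0) {
        return { text, extractor: extractor.name };
      }
      errors.push(`${extractor.name}: empty output`);
    } catch (err) {
      errors.push(`${extractor.name}: ${(err as Error).message}`);
      // Fall through to the next extractor in priority order
    }
  }

  throw new Error(`All extractors failed: ${errors.join("; ")}`);
}
```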
Cost-Aware Extraction
Organizations can configure extraction strategies:
```
// Local-only strategy (free, fast, lower accuracy)
strategy: "local-only"

// Hybrid strategy (fallback to cloud if local fails)
strategy: "hybrid"

// Cloud-first strategy (best accuracy, higher cost)
strategy: "cloud-first"

// Budget-aware (automatic provider selection based on cost)
strategy: "budget-aware"
maxCost: 0.10 // Maximum cost per document in USD
```

The system tracks usage per user and organization, preventing runaway costs while maximizing extraction quality.
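As a rough illustration of how a budget guard might sit in front of the cloud extractors, here is a sketch with assumed shapes (BudgetPolicy and the spend-lookup function are hypothetical names, not the actual API):

```typescript
// Illustrative budget guard; the spend lookup is assumed to be backed by the
// per-user/per-organization usage tracking described above.
interface BudgetPolicy {
  maxCostPerDocumentUsd: number;  // e.g., 0.10 from the "budget-aware" strategy
  monthlyOrgBudgetUsd: number;
}

async function enforceBudget(
  orgId: string,
  estimatedCostUsd: number,
  policy: BudgetPolicy,
  getMonthlySpend: (orgId: string) => Promise<number>
): Promise<void> {
  if (estimatedCostUsd > policy.maxCostPerDocumentUsd) {
    throw new Error("Estimated extraction cost exceeds the per-document limit");
  }
  const spentThisMonth = await getMonthlySpend(orgId);
  if (spentThisMonth + estimatedCostUsd > policy.monthlyOrgBudgetUsd) {
    throw new Error("Organization extraction budget exhausted for this month");
  }
}
```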
Stage 2: Semantic Chunking
Once text is extracted, we face a new challenge: how do we split it into manageable pieces for embedding and retrieval?
Naive approaches fail:
- Fixed-size chunks (e.g., every 500 tokens) break semantic units mid-sentence
- Paragraph-based chunks create uneven sizes and lose cross-paragraph context
- Section-based chunks can be too large for embedding models (which have token limits)
The Semantic Chunking Algorithm
Fabric AI implements recursive semantic chunking (a simplified sketch follows the list) that:
- Respects document structure: Preserves headings, lists, code blocks
- Maintains semantic coherence: Never splits mid-sentence or mid-thought
- Creates overlapping windows: Chunks share context to prevent information loss
- Adapts to content type: Different strategies for code vs. prose vs. tables
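A minimal sketch of the core idea: split on paragraph and sentence boundaries, pack sentences into token-budgeted chunks, and carry a small sentence overlap between chunks. The production chunker also special-cases code, tables, and lists; countTokens is an assumed helper.

```typescript
// Simplified recursive chunker: structural boundaries first, then sentence packing.
function chunkText(
  text: string,
  maxTokens: number,
  overlapSentences: number,
  countTokens: (s: string) => number
): string[] {
  // Split on blank lines (paragraphs), then on sentence boundaries,
  // so we never cut mid-sentence.
  const sentences = text
    .split(/\n{2,}/)
    .flatMap(p => p.match(/[^.!?]+[.!?]+(\s|$)|[^.!?]+$/g) ?? [p]);

  const chunks: string[] = [];
  let current: string[] = [];
  let currentTokens = 0;

  for (const sentence of sentences) {
    const tokens = countTokens(sentence);
    if (currentTokens + tokens > maxTokens && current.length > 0) {
      chunks.push(current.join(" ").trim());
      // Overlapping window: carry the last few sentences into the next chunk
      current = current.slice(-overlapSentences);
      currentTokens = countTokens(current.join(" "));
    }
    current.push(sentence);
    currentTokens += tokens;
  }
  if (current.length > 0) chunks.push(current.join(" ").trim());
  return chunks;
}
```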
Chunk Metadata Enrichment
Each chunk is enriched with metadata for better retrieval:
```typescript
interface SemanticChunk {
  text: string;             // The actual chunk text
  tokens: number;           // Token count

  // Position metadata
  documentId: string;
  chunkIndex: number;       // Position in document
  totalChunks: number;

  // Structural metadata
  headingPath: string[];    // ["Introduction", "Architecture", "RAG Pipeline"]
  sectionType: string;      // "heading", "paragraph", "code", "list", "table"
  depth: number;            // Heading depth (h1=1, h2=2, etc.)

  // Semantic metadata
  keywords: string[];       // Extracted keywords
  entities: string[];       // Named entities (people, orgs, tech)
  language: string;         // Detected language

  // Context metadata
  previousChunk?: string;   // ID of previous chunk
  nextChunk?: string;       // ID of next chunk
  relatedChunks: string[];  // Cross-references

  // Tenant metadata
  organizationId: string;
  projectId?: string;
  tags: string[];
}
```

This rich metadata enables sophisticated filtering during retrieval.
Stage 3: Vector Embedding Generation
With semantically coherent chunks, we generate vector embeddings—high-dimensional numerical representations that capture semantic meaning.
Embedding Model Selection
Fabric AI supports multiple embedding providers:
| Provider | Model | Dimensions | Max Tokens | Use Case |
|----------|-------|------------|------------|----------|
| OpenAI | text-embedding-3-large | 3072 | 8191 | Best overall quality, expensive |
| OpenAI | text-embedding-3-small | 1536 | 8191 | Good balance of cost/quality |
| Cohere | embed-multilingual-v3.0 | 1024 | 512 | Multilingual support |
| Azure OpenAI | text-embedding-ada-002 | 1536 | 8191 | Enterprise compliance |
| Custom | (user-provided) | Variable | Variable | On-premise requirements |
The choice of embedding model is critical and affects:
- Retrieval quality: Better embeddings = more relevant results
- Cost: Embeddings are generated once but retrieved many times
- Latency: Larger embedding dimensions increase search time
- Storage: Higher dimensions require more vector database storage
Batch Processing for Efficiency
Embedding generation is optimized for throughput:
```typescript
// Naive approach: One API call per chunk (slow, expensive)
for (const chunk of chunks) {
  const embedding = await embedChunk(chunk);
  await storeEmbedding(embedding);
}

// Optimized approach: Batch processing
const BATCH_SIZE = 100; // Provider-dependent
for (let i = 0; i < chunks.length; i += BATCH_SIZE) {
  const batch = chunks.slice(i, i + BATCH_SIZE);
  const embeddings = await embedBatch(batch); // Single API call
  await storeBatch(embeddings);               // Batch database write
}
```

This reduces:
- API calls: 100x fewer calls for 10,000 chunks
- Latency: Parallel processing within batches
- Cost: Batch pricing discounts from providers
Embedding Quality Validation
Not all embeddings are equal. We validate quality through several checks, illustrated in the sketch after this list:
- Magnitude check: Ensure vectors are normalized (unit vectors)
- Similarity sanity tests: Related chunks should have high cosine similarity
- Outlier detection: Flag chunks with unusual embedding patterns
- Cross-validation: Verify retrieval quality on sample queries
Poor-quality embeddings are flagged for re-processing or manual review.
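For illustration, the first two checks above reduce to a few lines of vector math (the thresholds shown are placeholders, not the production cutoffs):

```typescript
// Magnitude (normalization) check and cosine-similarity sanity test.
function magnitude(v: number[]): number {
  return Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
}

function cosineSimilarity(a: number[], b: number[]): number {
  const dot = a.reduce((sum, x, i) => sum + x * b[i], 0);
  return dot / (magnitude(a) * magnitude(b));
}

function isNormalized(v: number[], tolerance = 1e-3): boolean {
  return Math.abs(magnitude(v) - 1) < tolerance;
}

// Sanity test: two chunks from the same section should score well above a
// random pair; the 0.5 threshold is illustrative only.
function passesSimilaritySanityTest(related: [number[], number[]]): boolean {
  return cosineSimilarity(related[0], related[1]) > 0.5;
}
```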
Stage 4: Vector Storage with Qdrant
With embeddings generated, we need fast, scalable storage with sophisticated filtering capabilities. Enter Qdrant—a vector database built specifically for production RAG systems.
Why Qdrant Over Alternatives?
We evaluated several vector databases:
| Feature | Qdrant | Pinecone | Weaviate | Milvus |
|---------|--------|----------|----------|--------|
| Performance | Excellent | Excellent | Good | Excellent |
| Multi-tenancy | Native filtering | Namespace-based | GraphQL filters | Partitions |
| Metadata filtering | Rich (nested JSON) | Limited | Rich | Limited |
| Self-hosted | Yes | No | Yes | Yes |
| Managed option | Yes (Qdrant Cloud) | Yes | Yes | Yes (Zilliz) |
| Hybrid search | Yes (dense + sparse) | No | Yes | Yes |
| Payload size | Unlimited | 40KB limit | Large | Large |
Qdrant wins for enterprise RAG because:
- Flexible filtering: Query by organization, project, document type, date range, etc.
- Self-hosted option: Keep sensitive data in your VPC
- Performance: Sub-millisecond search on millions of vectors
- Payload flexibility: Store full chunk metadata without size limits
Multi-Tenant Data Isolation
Ensuring data isolation in a shared vector database is non-trivial. Fabric AI implements layered isolation:
Security layers:
- Payload encryption: Organization-specific encryption keys
- Mandatory filters: All queries include an organizationId filter
- Row-level security: Database-level enforcement in PostgreSQL metadata
- Audit logging: All access logged with user and organization context
- API authentication: JWT tokens with organization claims
A misconfigured query literally cannot return results from other organizations—the database prevents it.
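A sketch of how the mandatory-filter layer can be enforced in code: every search goes through a query builder that injects the tenant filter, so callers never construct raw filters themselves (names here are illustrative):

```typescript
// Illustrative query-builder for the "mandatory filters" layer.
interface TenantContext {
  organizationId: string;
  projectId?: string;
}

function buildTenantFilter(ctx: TenantContext, extraConditions: object[] = []) {
  return {
    must: [
      { key: "organizationId", match: { value: ctx.organizationId } },
      ...(ctx.projectId ? [{ key: "projectId", match: { value: ctx.projectId } }] : []),
      ...extraConditions
    ]
  };
}

// Callers pass the tenant context, never a hand-written filter:
// await qdrant.search({ collection: "chunks", vector, filter: buildTenantFilter(ctx) });
```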
Optimizing Vector Indexes
Qdrant uses HNSW (Hierarchical Navigable Small World) graphs for approximate nearest neighbor search. We tune several parameters:
```typescript
// Collection configuration
{
  vectors: {
    size: 1536,          // Embedding dimensions
    distance: "Cosine"   // Similarity metric
  },
  hnsw_config: {
    m: 16,               // Number of edges per node (tradeoff: accuracy vs. memory)
    ef_construct: 100,   // Build-time quality (higher = better but slower indexing)
    ef_search: 50        // Query-time quality (higher = better but slower queries)
  },
  optimizer_config: {
    indexing_threshold: 20000 // Trigger indexing after N vectors
  }
}
```

These settings balance:
- Accuracy: How close to the true nearest neighbors
- Speed: Query latency
- Memory: Index size
- Build time: How long indexing takes
Different workloads require different tuning—Fabric AI adapts based on collection size and usage patterns.
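As a rough illustration of that adaptation, a size-based heuristic might look like the following (the specific values are placeholders, not Fabric AI's actual tuning table):

```typescript
// Illustrative heuristic only: pick HNSW parameters from collection size.
interface HnswParams {
  m: number;
  ef_construct: number;
  ef_search: number;
}

function hnswParamsForCollection(vectorCount: number): HnswParams {
  if (vectorCount < 100_000) {
    return { m: 16, ef_construct: 100, ef_search: 50 };  // defaults shown above
  }
  if (vectorCount < 1_000_000) {
    return { m: 24, ef_construct: 200, ef_search: 100 }; // more edges for better recall
  }
  return { m: 32, ef_construct: 400, ef_search: 128 };   // large collections: favor recall
}
```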
Stage 5: Intelligent Retrieval
The final stage—and arguably the most important—is retrieval: given a user query, which chunks should we send to the LLM?
Naive RAG systems just do vector similarity search and call it a day. This fails because:
- Too much context: Overwhelming the LLM with 50 chunks hurts performance
- Too little context: Missing critical information leads to hallucinations
- Irrelevant context: Similar vectors aren't always semantically relevant
- Stale context: Old documents shouldn't outrank recent ones
Fabric AI implements multi-stage retrieval to solve these problems.
Stage 5.1: Query Understanding
Before searching, we analyze the user query:
```typescript
interface QueryAnalysis {
  intent: "generate" | "search" | "summarize" | "compare";
  entities: string[];   // Extracted entities
  keywords: string[];   // Key terms
  timeframe?: string;   // "recent", "last quarter", etc.
  scope: "project" | "org" | "system";
}
```

This analysis informs (a sketch of the mapping follows the list):
- How many chunks to retrieve: Summarization needs more context than generation
- Which filters to apply: Recent documents for "what's new" queries
- How to rank results: Prioritize specific projects or document types
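A minimal sketch of that mapping, using the QueryAnalysis interface above (the limits, filter names, and resolveTimeframe helper are illustrative assumptions):

```typescript
// Illustrative mapping from query analysis to a retrieval plan.
interface RetrievalPlan {
  limit: number;                    // how many chunks to retrieve
  filters: Record<string, unknown>; // metadata filters to apply
  recencyBoost: boolean;            // rank recent documents higher
}

function planRetrieval(analysis: QueryAnalysis, orgId: string, projectId?: string): RetrievalPlan {
  return {
    limit: analysis.intent === "summarize" ? 40 : 15,
    filters: {
      organizationId: orgId,
      ...(analysis.scope === "project" && projectId ? { projectId } : {}),
      ...(analysis.timeframe ? { createdAfter: resolveTimeframe(analysis.timeframe) } : {})
    },
    recencyBoost: analysis.timeframe !== undefined
  };
}

// Converts phrases like "last quarter" to a cutoff date; assumed helper.
declare function resolveTimeframe(timeframe: string): string;
```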
Stage 5.2: Hybrid Search
We combine multiple search strategies:
- Dense vector search: Semantic similarity using embeddings
- Sparse keyword search: BM25 full-text search on chunk text
- Metadata filtering: Exact matches on organizationId, projectId, tags, dates
```typescript
// Dense vector search
const vectorResults = await qdrant.search({
  collection: "chunks",
  vector: queryEmbedding,
  limit: 100, // Over-retrieve for reranking
  filter: {
    must: [
      { key: "organizationId", match: { value: orgId } },
      { key: "projectId", match: { value: projectId } }
    ]
  }
});

// Sparse keyword search (via PostgreSQL)
const keywordResults = await db.chunk.findMany({
  where: {
    organizationId: orgId,
    projectId: projectId,
    text: { search: queryKeywords } // Full-text search
  },
  take: 100
});

// Merge and deduplicate
const combinedResults = mergeResults(vectorResults, keywordResults);
```
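The mergeResults helper isn't shown above; one common way to implement it is reciprocal rank fusion, sketched here with illustrative field names:

```typescript
// Minimal merge-and-deduplicate sketch using reciprocal rank fusion.
interface RankedChunk {
  id: string;
  [key: string]: unknown;
}

function mergeResults(vectorResults: RankedChunk[], keywordResults: RankedChunk[], k = 60): RankedChunk[] {
  const scores = new Map<string, { chunk: RankedChunk; score: number }>();

  const addList = (list: RankedChunk[]) => {
    list.forEach((chunk, rank) => {
      const entry = scores.get(chunk.id) ?? { chunk, score: 0 };
      entry.score += 1 / (k + rank + 1); // higher-ranked items contribute more
      scores.set(chunk.id, entry);
    });
  };

  addList(vectorResults);
  addList(keywordResults);

  return [...scores.values()]
    .sort((a, b) => b.score - a.score)
    .map(entry => entry.chunk);
}
```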
Hybrid search catches:
- Semantic matches: "user authentication" matches "login flow"
- Exact matches: "API version 2.1" matches precisely
- Acronyms and abbreviations: "SDLC" matches "Software Development Lifecycle"
Stage 5.3: Reranking for Relevance
Raw retrieval results are reranked using multiple signals:
Reranking formula:
```
final_score = (
  0.4  * vector_similarity +
  0.2  * keyword_relevance +
  0.15 * recency_score +
  0.1  * usage_score +
  0.1  * approval_score +
  0.05 * structure_score
)
```
Weights are tuned based on query type and user preferences.
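Translated directly into code, the formula looks like this (each signal is assumed to be normalized to [0, 1] upstream, and the weights would come from query-type configuration):

```typescript
// Weighted reranking score, mirroring the formula above.
interface RerankSignals {
  vectorSimilarity: number;
  keywordRelevance: number;
  recencyScore: number;
  usageScore: number;
  approvalScore: number;
  structureScore: number;
}

function rerankScore(s: RerankSignals): number {
  return (
    0.4  * s.vectorSimilarity +
    0.2  * s.keywordRelevance +
    0.15 * s.recencyScore +
    0.1  * s.usageScore +
    0.1  * s.approvalScore +
    0.05 * s.structureScore
  );
}

// Usage: candidates.sort((a, b) => rerankScore(b.signals) - rerankScore(a.signals));
```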
Stage 5.4: Context Assembly
The final step is assembling retrieved chunks into a coherent context for the LLM:
- Deduplication: Remove overlapping chunks (remember our overlapping window strategy?)
- Ordering: Sort by position in source documents to maintain narrative flow
- Formatting: Add citations, headings, and metadata for LLM understanding
- Token budget management: Fit context within the model's context window
```typescript
function assembleContext(chunks: Chunk[], maxTokens: number): string {
  let context = "# Relevant Context from Past Documents\n\n";
  let tokenCount = countTokens(context);
  const included: Chunk[] = []; // track only the chunks that fit the budget

  for (const chunk of chunks) {
    const citation = `## [${chunk.documentName}](${chunk.documentUrl})\n`;
    const chunkText = `${chunk.text}\n\n`;
    const chunkTokens = countTokens(citation + chunkText);

    if (tokenCount + chunkTokens > maxTokens) {
      break; // Stop if we'd exceed the token budget
    }

    context += citation + chunkText;
    tokenCount += chunkTokens;
    included.push(chunk);
  }

  context += `\n---\nTotal context: ${included.length} chunks from ${uniqueDocuments(included).length} documents\n`;
  return context;
}
```

The assembled context is prepended to the user's prompt, giving the LLM access to your organization's specific knowledge.
Observability: Debugging RAG Systems
RAG systems are complex, and things can go wrong:
- Low-quality extractions produce garbage embeddings
- Retrieval returns irrelevant context
- LLMs generate content that contradicts source documents
- Performance degrades as the vector database grows
Fabric AI provides comprehensive observability:
Extraction Metrics
Tracked metrics (a sample record shape is sketched after this list):
- Success rate per extractor
- Average processing time by document size
- Cost per page/megabyte
- Error patterns (timeouts, API failures, format issues)
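For concreteness, a per-extraction metrics record might be shaped like this (a hypothetical schema, not the exact production model):

```typescript
// Hypothetical per-extraction metrics record covering the items above.
interface ExtractionMetrics {
  extractor: string;          // e.g., "unstructured-io"
  success: boolean;
  durationMs: number;
  documentBytes: number;
  pages?: number;
  costUsd: number;
  errorType?: "timeout" | "api_failure" | "unsupported_format";
  organizationId: string;
  recordedAt: Date;
}
```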
Retrieval Quality Metrics
Key questions we answer:
- Are users approving AI-generated content (high approval = good retrieval)?
- Do users edit specific sections (indicates missing or wrong context)?
- Are certain document types under-represented in results?
- How does retrieval quality degrade as the vector database grows?
End-to-End Tracing
Every RAG operation is traced through the entire pipeline:
```
[Workflow: doc_processing_abc123]
├─ [Activity: extract_document] 2.3s
│  ├─ Try Unstructured.io
│  │  └─ Success (2.1s, $0.05)
│  └─ Extracted 15,342 characters
├─ [Activity: chunk_document] 0.4s
│  └─ Generated 47 chunks (avg 326 tokens)
├─ [Activity: generate_embeddings] 1.8s
│  └─ 47 embeddings (batch size: 25)
├─ [Activity: store_vectors] 0.3s
│  └─ Stored in Qdrant collection: org_xyz_chunks
└─ [Activity: update_status] 0.1s
   └─ Document status: READY

Total duration: 4.9s
Total cost: $0.07
```
This level of detail enables:
- Performance debugging: Identify slow steps
- Cost attribution: Track spending by organization and project
- Error diagnosis: Pinpoint exactly where failures occur
- Audit trails: Comply with enterprise governance requirements
Performance Optimization Techniques
As your knowledge base grows, maintaining performance requires ongoing optimization:
1. Lazy Loading and Pagination
Don't load all chunks into memory:
```typescript
// Bad: Loads all results into memory
const results = await qdrant.search({ limit: 1000 });

// Good: Paginate through results
for (let offset = 0; offset < totalResults; offset += 100) {
  const batch = await qdrant.search({ limit: 100, offset });
  await processBatch(batch);
}
```

2. Caching Frequently Accessed Embeddings
Popular documents are retrieved repeatedly—cache their embeddings:
```typescript
const embeddingCache = new LRU({ max: 10000 });

async function getEmbedding(chunkId: string) {
  const cached = embeddingCache.get(chunkId);
  if (cached) return cached;

  const embedding = await fetchFromQdrant(chunkId);
  embeddingCache.set(chunkId, embedding);
  return embedding;
}
```

3. Precomputed Aggregations
For common queries, precompute results:
```sql
-- Materialized view for popular documents
CREATE MATERIALIZED VIEW popular_documents AS
SELECT
  document_id,
  COUNT(*) AS retrieval_count,
  AVG(relevance_score) AS avg_relevance
FROM retrieval_logs
WHERE created_at > NOW() - INTERVAL '30 days'
GROUP BY document_id
ORDER BY retrieval_count DESC;
```

4. Batch Operations
Minimize database round-trips:
```typescript
// Bad: N+1 queries
for (const chunkId of chunkIds) {
  const metadata = await db.chunk.findUnique({ where: { id: chunkId } });
  await enrichWithMetadata(metadata);
}

// Good: Single batch query
const metadata = await db.chunk.findMany({
  where: { id: { in: chunkIds } }
});
const metadataMap = new Map(metadata.map(m => [m.id, m]));
```

5. Index Optimization
Monitor and optimize PostgreSQL indexes:
```sql
-- Composite index for common filter patterns
CREATE INDEX idx_chunks_org_project_date
ON chunks (organization_id, project_id, created_at DESC);

-- GiST index for full-text search
CREATE INDEX idx_chunks_text_search
ON chunks USING GiST (to_tsvector('english', text));
```

Security Considerations
RAG systems handle sensitive enterprise data—security is paramount:
1. Encryption at Rest
- Vector database: Qdrant collections encrypted with organization-specific keys
- PostgreSQL: Transparent data encryption (TDE) enabled
- S3: Server-side encryption (SSE-KMS) with customer-managed keys
2. Encryption in Transit
- TLS 1.3: All inter-service communication encrypted
- mTLS: Mutual authentication between services
- VPC peering: Database and vector store in private subnets
3. Access Controls
```sql
-- Row-level security in PostgreSQL
CREATE POLICY chunk_isolation ON chunks
USING (organization_id = current_setting('app.current_org_id')::text);
```

```typescript
// Qdrant filter enforcement
const results = await qdrant.search({
  filter: {
    must: [
      { key: "organizationId", match: { value: getOrgId() } }
    ]
  }
});
```

4. Audit Logging
Every operation is logged:
```typescript
interface AuditLog {
  timestamp: Date;
  userId: string;
  organizationId: string;
  action: "search" | "retrieve" | "generate";
  resourceId: string;
  metadata: {
    query?: string;
    resultsCount?: number;
    documentsAccessed?: string[];
  };
  ipAddress: string;
  userAgent: string;
}
```

Logs are:
- Immutable: Append-only, tamper-proof
- Retained: Per compliance requirements (1-7 years)
- Monitored: Automated anomaly detection
The Road Ahead: Future RAG Enhancements
We're continuously improving our RAG architecture:
1. Multimodal RAG
Extend beyond text to handle:
- Images and diagrams: Architecture diagrams, UI mockups, charts
- Videos and audio: Meeting recordings, presentation videos
- Code: Semantic code search and understanding
2. GraphRAG
Represent document relationships as knowledge graphs:
- Entity linking: Connect mentions across documents
- Relationship extraction: Understand how concepts relate
- Path-based retrieval: Find documents via relationship chains
3. Adaptive Retrieval
Learn from user feedback:
- Reinforcement learning: Optimize ranking based on approvals
- Personalization: Learn user preferences for context
- Query expansion: Automatically add related terms
4. Cross-lingual RAG
Support multilingual organizations:
- Language detection: Automatically identify chunk languages
- Cross-lingual retrieval: Query in English, retrieve Spanish
- Translation: Seamless context assembly across languages
Conclusion: RAG as a Competitive Advantage
RAG is not just a technical implementation detail—it's the foundation that enables Fabric AI to transform enterprise knowledge into actionable intelligence. By solving the hard problems of:
- Extraction from messy real-world documents
- Chunking that preserves semantic meaning
- Embedding with quality validation
- Storage with multi-tenant isolation
- Retrieval with intelligent ranking
...we've built a system that turns your accumulated documentation into a competitive advantage.
Your competitors are using generic AI. You're using AI grounded in your specific organizational knowledge.
That's the difference between 10x faster and 10x better.
Want to see how Fabric AI's RAG architecture works with your data? Request a demo and we'll show you a live extraction, embedding, and retrieval pipeline with your actual documents.
Next up: In our third post, we'll dive into how Temporal workflows enable durable, fault-tolerant execution for long-running AI operations. Stay tuned!
