Durable Workflows
How Fabric AI uses Temporal for fault-tolerant, resumable workflow execution.
Fabric AI uses Temporal to power its workflow engine, ensuring that complex multi-step operations complete reliably—even when things go wrong.
Why Workflows Matter
The Problem with Traditional Execution
Traditional request-response systems have limitations:
User Request → Server Process → Response
│
├── Network timeout? ❌ Lost work
├── Server crash? ❌ Lost work
├── API rate limit? ❌ Lost work
└── User closes browser? ❌ Lost work
When generating a PRD that takes 5 minutes, or creating 20 Jira tickets, any interruption means starting over.
The Workflow Solution
Workflows persist state at every step:
User Request → Workflow Started → Step 1 ✓ → Step 2 ✓ → Step 3...
                      │              │          │
                      │              │          └── State saved
                      │              └── State saved
                      └── State saved
If anything fails:
...→ Step 3 (resume from here) → Step 4 → Complete!
How Temporal Works
Key Concepts
- Workflows: Long-running, fault-tolerant processes that orchestrate activities.
- Activities: Individual units of work such as API calls, AI generation, or file operations.
- Workers: Processes that poll for tasks and execute workflow and activity code.
- Signals: External events sent to a running workflow that can change its behavior.
- Queries: Read-only requests that return the current state of a running workflow.
Workflow Lifecycle
┌──────────────────────────────────────────────────────────────────┐
│                        Workflow Lifecycle                        │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐        │
│  │ PENDING │ -> │ RUNNING │ -> │COMPLETED│ or │ FAILED  │        │
│  └─────────┘    └────┬────┘    └─────────┘    └─────────┘        │
│                      │                                           │
│                      │                                           │
│                      ▼                                           │
│              ┌────────────────┐                                  │
│              │    WAITING     │  (for signal/approval)           │
│              └───────┬────────┘                                  │
│                      │                                           │
│                      ▼                                           │
│              ┌────────────────┐                                  │
│              │    RESUMED     │  (after signal received)         │
│              └────────────────┘                                  │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘
Workflows in Fabric
Document Generation Workflow
When you ask an agent to generate a document, this workflow executes:
Initialize Context
Load user preferences, organization settings, and conversation history.
Retrieve RAG Context
Query Qdrant for relevant document chunks based on the request.
Generate Content
Call the AI model with context to generate the document.
Apply Formatting
Format the output according to document type (PRD, spec, etc.).
Save and Return
Store the document in PostgreSQL and return to the user.
Each step is an activity that:
- Automatically retries on failure
- Has configurable timeouts
- Saves state before and after
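The "save state after every step, resume from the first unfinished step" behaviour described above can be illustrated with a small in-memory sketch. This is not Temporal code and not Fabric's implementation: `Checkpoint`, `runPipeline`, and the demo steps are hypothetical names, and in a real deployment the checkpoint role is played by Temporal's persisted workflow state.

```typescript
// Hypothetical, in-memory illustration of step-level checkpointing.
type PipelineState = Record<string, string>;
type StepFn = (state: PipelineState) => PipelineState;

interface Checkpoint {
  completedSteps: number; // how many steps have finished successfully
  state: PipelineState;   // output accumulated so far
}

// Runs steps in order, starting at the first step that has not completed.
// The checkpoint is updated after each successful step, so when a step
// throws, the checkpoint still points at that step for the next run.
function runPipeline(steps: StepFn[], checkpoint: Checkpoint): Checkpoint {
  for (let i = checkpoint.completedSteps; i < steps.length; i++) {
    checkpoint.state = steps[i](checkpoint.state);
    checkpoint.completedSteps = i + 1;
  }
  return checkpoint;
}

// Demo: the third step (content generation) fails on its first call only.
let contextLoads = 0;
let generateCalls = 0;
const demoSteps: StepFn[] = [
  (s) => { contextLoads++; return { ...s, context: "loaded" }; },
  (s) => ({ ...s, chunks: "retrieved" }),
  (s) => {
    generateCalls++;
    if (generateCalls === 1) throw new Error("model timeout");
    return { ...s, draft: "generated" };
  },
  (s) => ({ ...s, formatted: "yes" }),
  (s) => ({ ...s, saved: "yes" }),
];

const demoCheckpoint: Checkpoint = { completedSteps: 0, state: {} };
try {
  runPipeline(demoSteps, demoCheckpoint); // stops at step 3
} catch {
  // Steps 1 and 2 are already recorded in the checkpoint.
}
runPipeline(demoSteps, demoCheckpoint);   // resumes at step 3, not step 1
```

In the demo, the second call does not repeat the context and retrieval steps, which is the property that makes a five-minute generation safe to interrupt.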
Orchestrator Workflow
The Fabric Orchestrator runs a more complex workflow:
┌─────────────────────────────────────────────────────────────────┐
│ Orchestrator Workflow │
├─────────────────────────────────────────────────────────────────┤
│ │
│ 1. INITIALIZE │
│ • Load memory and context │
│ • Retrieve workspace documents │
│ │
│ 2. ROUTE │
│ • Analyze request semantically │
│ • Find matching agents/tools │
│ │
│ 3. PLAN │
│ • Decompose into steps │
│ • Identify dependencies │
│ • Check approval requirements │
│ │
│ 4. EXECUTE (loop for each step) │
│ ├── Step needs approval? → Signal: await_approval │
│ ├── Execute MCP tool or delegate to agent │
│ ├── Handle errors (retry/skip/fail) │
│ └── Record result in context bag │
│ │
│ 5. LEARN │
│ • Record success/failure patterns │
│ • Update memory for future executions │
│ │
└─────────────────────────────────────────────────────────────────┘
Reliability Features
Automatic Retries
Activities retry automatically with exponential backoff:
Attempt 1: Execute immediately
↓ Failed
Attempt 2: Wait 1 second, retry
↓ Failed
Attempt 3: Wait 2 seconds, retry
↓ Failed
Attempt 4: Wait 4 seconds, retry
↓ Success!
Configuration:
- Initial interval — 1 second
- Maximum interval — 5 minutes
- Maximum attempts — 5 (configurable)
- Backoff coefficient — 2.0
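To make the arithmetic behind these settings concrete, here is a minimal sketch that computes the wait before each retry. It assumes the usual exponential-backoff rule (initial interval multiplied by the coefficient once per retry, capped at the maximum interval); `RetrySettings` and `retryDelays` are illustrative names, not part of Temporal or Fabric.

```typescript
// Illustrative helper, not an SDK API: computes the wait before each retry.
interface RetrySettings {
  initialIntervalSec: number;
  maximumIntervalSec: number;
  maximumAttempts: number;
  backoffCoefficient: number;
}

// With N maximum attempts there are at most N - 1 retries. The wait before
// retry k (counting from 0) is initial * coefficient^k, capped at the maximum.
function retryDelays(settings: RetrySettings): number[] {
  const delays: number[] = [];
  for (let retry = 0; retry < settings.maximumAttempts - 1; retry++) {
    const raw = settings.initialIntervalSec * Math.pow(settings.backoffCoefficient, retry);
    delays.push(Math.min(raw, settings.maximumIntervalSec));
  }
  return delays;
}

// The settings listed above: 1 second, 5 minutes, 5 attempts, coefficient 2.0.
const defaultDelays = retryDelays({
  initialIntervalSec: 1,
  maximumIntervalSec: 300,
  maximumAttempts: 5,
  backoffCoefficient: 2.0,
});
// defaultDelays is [1, 2, 4, 8] (seconds)
```

With five attempts the 5-minute cap is never reached; it only matters when the maximum attempt count is raised.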
Heartbeats
Long-running activities send heartbeats to indicate they're still alive:
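// Example activity with heartbeats
import { heartbeat } from '@temporalio/activity';
// The shape of DocumentInput is assumed here for illustration
interface DocumentInput { sections: { name: string }[] }
export async function generateLargeDocument(input: DocumentInput): Promise<string[]> {
  const document: string[] = [];
  for (const [index, section] of input.sections.entries()) {
    // Process section... (placeholder for the real generation call)
    document.push(`Content for ${section.name}`);
    // Send heartbeat to indicate progress
    const progress = Math.round(((index + 1) / input.sections.length) * 100);
    heartbeat({ section: section.name, progress });
  }
  return document;
}
If no heartbeat arrives within the heartbeat timeout, Temporal treats the attempt as failed and can retry the activity, possibly on a different worker.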
Timeouts
Multiple timeout types protect against hanging operations:
| Timeout Type | Purpose | Default |
|---|---|---|
| Start-to-close | Max time for single attempt | 5 minutes |
| Schedule-to-close | Max time including retries | 30 minutes |
| Heartbeat | Max time between heartbeats | 1 minute |
| Schedule-to-start | Max time in queue | 10 minutes |
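The defaults in the table can be written down as an options object. In the sketch below the field names mirror those of the Temporal TypeScript SDK's `ActivityOptions`, but the `ActivityTimeouts` interface is declared locally so the example runs without the SDK. The `toMs` and `timeoutProblems` helpers are hypothetical, and the two consistency checks are a suggested sanity test rather than rules Fabric is known to enforce.

```typescript
// Field names follow Temporal's ActivityOptions; the interface itself is
// local to this sketch.
interface ActivityTimeouts {
  startToCloseTimeout: string;
  scheduleToCloseTimeout: string;
  heartbeatTimeout: string;
  scheduleToStartTimeout: string;
}

const defaultTimeouts: ActivityTimeouts = {
  startToCloseTimeout: "5 minutes",
  scheduleToCloseTimeout: "30 minutes",
  heartbeatTimeout: "1 minute",
  scheduleToStartTimeout: "10 minutes",
};

const UNIT_MS: Record<string, number> = { second: 1000, minute: 60000, hour: 3600000 };

// Hypothetical helper: parses strings such as "5 minutes" into milliseconds.
function toMs(duration: string): number {
  const match = /^(\d+) (second|minute|hour)s?$/.exec(duration);
  if (!match) throw new Error(`Unsupported duration: ${duration}`);
  return Number(match[1]) * UNIT_MS[match[2]];
}

// Hypothetical sanity check: a single attempt should fit inside the overall
// budget, and the heartbeat window should be shorter than a single attempt.
function timeoutProblems(t: ActivityTimeouts): string[] {
  const problems: string[] = [];
  if (toMs(t.startToCloseTimeout) > toMs(t.scheduleToCloseTimeout)) {
    problems.push("start-to-close exceeds schedule-to-close");
  }
  if (toMs(t.heartbeatTimeout) >= toMs(t.startToCloseTimeout)) {
    problems.push("heartbeat timeout is not shorter than start-to-close");
  }
  return problems;
}

const defaultProblems = timeoutProblems(defaultTimeouts); // empty for the defaults
```

In a real workflow an object like `defaultTimeouts` would typically be passed to the SDK's `proxyActivities` call together with the retry settings.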
Human-in-the-Loop
Workflows can pause for human approval:
How It Works
Workflow running...
│
├── Detects high-risk operation (e.g., delete 50 records)
│
▼
┌─────────────────────────────────────────────────────┐
│ AWAITING APPROVAL │
│ │
│ "The workflow wants to delete 50 Jira tickets. │
│ Do you want to proceed?" │
│ │
│ [Approve] [Reject] [Modify] │
└─────────────────────────────────────────────────────┘
│
├── User clicks "Approve"
│
▼
Workflow continues...
Signal-Based Approvals
Approvals are implemented using Temporal signals:
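// Inside the workflow function. The signal name and payload shape are illustrative.
import { condition, defineSignal, setHandler } from '@temporalio/workflow';
const approvalSignal = defineSignal<[{ approved: boolean }]>('approval');
let decision: { approved: boolean } | undefined;
setHandler(approvalSignal, (payload) => { decision = payload; });
// Wait for approval signal; condition() resolves to false if 24 hours pass first
const received = await condition(() => decision !== undefined, '24 hours');
if (received && decision?.approved) {
  // Continue with operation
} else {
  // Skip or fail gracefully
}
Key Features: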
- Workflow pauses indefinitely (or until timeout)
- State is preserved while waiting
- User can approve/reject anytime
- Workflow resumes immediately after signal
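The approval dialog offers three choices (Approve, Reject, Modify), and the wait can also time out. The sketch below shows one way those four outcomes could map onto the workflow's next action. The source does not say how Fabric treats Modify or a timeout, so the behaviour here (merge the user's changes; skip on timeout) is an assumption, and `ApprovalDecision`, `NextAction`, and `resolveApproval` are hypothetical names.

```typescript
type OperationParams = Record<string, unknown>;

// Hypothetical payload carried by the approval signal.
type ApprovalDecision =
  | { kind: "approve" }
  | { kind: "reject"; reason: string }
  | { kind: "modify"; changes: OperationParams };

type NextAction =
  | { action: "execute"; params: OperationParams }
  | { action: "skip"; reason: string };

// `decision` is null when the approval window elapsed without a signal.
function resolveApproval(params: OperationParams, decision: ApprovalDecision | null): NextAction {
  if (decision === null) {
    return { action: "skip", reason: "approval timed out" };
  }
  switch (decision.kind) {
    case "approve":
      return { action: "execute", params };
    case "modify":
      // Assumed behaviour: the user's changes override the original parameters.
      return { action: "execute", params: { ...params, ...decision.changes } };
    case "reject":
      return { action: "skip", reason: decision.reason };
  }
}

// Demo: the user trims a 50-ticket delete down to 10 tickets.
const deleteRequest: OperationParams = { project: "DEMO", ticketCount: 50 };
const modifiedAction = resolveApproval(deleteRequest, { kind: "modify", changes: { ticketCount: 10 } });
const timedOutAction = resolveApproval(deleteRequest, null);
```

Modelling the decision as a tagged union keeps every branch explicit, so adding a new choice later produces a compile-time error until it is handled.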
Observability
Temporal UI
Access the Temporal UI to monitor workflows:
- List all workflows — See running, completed, and failed
- View timeline — Step-by-step execution visualization
- Inspect state — Current workflow variables
- Replay — Re-execute failed workflows
Access: http://localhost:8233 (local development)
Workflow Queries
Query running workflow state:
// Get current progress
const progress = await handle.query('progress');
// → { currentStep: 3, totalSteps: 5, status: 'executing' }
// Get generated artifacts
const artifacts = await handle.query('artifacts');
// → [{ type: 'prd', name: 'Authentication PRD' }]
Event History
Every workflow maintains a complete event history:
Event 1: WorkflowExecutionStarted
Event 2: ActivityTaskScheduled (retrieveContext)
Event 3: ActivityTaskStarted
Event 4: ActivityTaskCompleted
Event 5: ActivityTaskScheduled (generateDocument)
Event 6: ActivityTaskStarted
Event 7: ActivityTaskCompleted
Event 8: WorkflowExecutionCompleted
This history enables:
- Debugging — See exactly what happened
- Replay — Re-execute with same inputs
- Audit — Complete compliance trail
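The sketch below illustrates why a recorded history is enough to resume work: walking the events in order reveals which activities already completed, so only the remaining ones need to run. It is a deliberately simplified model that assumes activities run one at a time and covers only the event types listed above. Real Temporal histories contain more event types and link events by ID. `HistoryEvent` and `completedActivities` are hypothetical names.

```typescript
// Simplified event model covering only the event types shown above.
interface HistoryEvent {
  type:
    | "WorkflowExecutionStarted"
    | "ActivityTaskScheduled"
    | "ActivityTaskStarted"
    | "ActivityTaskCompleted"
    | "WorkflowExecutionCompleted";
  activity?: string; // set on ActivityTaskScheduled events
}

// Returns the activities that finished, in order. Assumes sequential
// activities, so a Completed event belongs to the latest Scheduled one.
function completedActivities(events: HistoryEvent[]): string[] {
  const done: string[] = [];
  let current: string | undefined;
  for (const e of events) {
    if (e.type === "ActivityTaskScheduled") {
      current = e.activity;
    } else if (e.type === "ActivityTaskCompleted" && current !== undefined) {
      done.push(current);
      current = undefined;
    }
  }
  return done;
}

// Demo: the worker crashed while generateDocument was running (events 1 to 6).
const interruptedRun: HistoryEvent[] = [
  { type: "WorkflowExecutionStarted" },
  { type: "ActivityTaskScheduled", activity: "retrieveContext" },
  { type: "ActivityTaskStarted" },
  { type: "ActivityTaskCompleted" },
  { type: "ActivityTaskScheduled", activity: "generateDocument" },
  { type: "ActivityTaskStarted" },
];
const alreadyDone = completedActivities(interruptedRun);
// alreadyDone is ["retrieveContext"], so only generateDocument must run again
```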
Trust-Based Approvals
The Orchestrator learns from your approval patterns:
How It Works
Week 1:
Operation: Post to Slack → Request approval → Approved
Operation: Create Jira ticket → Request approval → Approved
Operation: Post to Slack → Request approval → Approved
Week 2:
Operation: Post to Slack → Auto-approved (you always approve)
Operation: Create Jira ticket → Request approval → Approved
Week 4:
Operation: Post to Slack → Auto-approved
Operation: Create Jira ticket → Auto-approved
Operation: DELETE 100 records → Request approval (always ask for deletes)
Risk Levels
| Risk Level | Examples | Default Behavior |
|---|---|---|
| Low | Read, list, search | Auto-approve |
| Medium | Create, update | Learn from patterns |
| High | Bulk operations | Usually request approval |
| Critical | Delete, financial | Always request approval |
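The table can be read as a small decision function over the risk level and the user's past decisions for that operation. The sketch below is one possible reading, not the Orchestrator's actual policy: the thresholds (2 clean approvals for medium risk, 10 for high risk) and the rule that any past rejection disables auto-approval are assumptions, and `needsApproval` and `ApprovalRecord` are hypothetical names.

```typescript
type RiskLevel = "low" | "medium" | "high" | "critical";

// Past user decisions for one kind of operation (for example "Post to Slack").
interface ApprovalRecord {
  approved: number;
  rejected: number;
}

// Assumed thresholds, chosen only for illustration.
const MEDIUM_RISK_THRESHOLD = 2;
const HIGH_RISK_THRESHOLD = 10;

function needsApproval(risk: RiskLevel, past: ApprovalRecord): boolean {
  const cleanRecord = past.rejected === 0;
  switch (risk) {
    case "low":
      return false; // auto-approve
    case "medium":
      return !(cleanRecord && past.approved >= MEDIUM_RISK_THRESHOLD);
    case "high":
      return !(cleanRecord && past.approved >= HIGH_RISK_THRESHOLD);
    case "critical":
      return true;  // always ask, regardless of the record
  }
}

// Demo: two clean approvals are enough for a medium-risk operation,
// but no amount of approvals unlocks a critical one.
const slackPostNeedsApproval = needsApproval("medium", { approved: 2, rejected: 0 });
const bulkDeleteNeedsApproval = needsApproval("critical", { approved: 100, rejected: 0 });
```

A threshold of two matches the timeline above, where an operation is auto-approved once it has been approved twice.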
Best Practices
Designing for Reliability
Do:
- Break work into small activities
- Use idempotent operations when possible
- Handle partial success gracefully
- Set appropriate timeouts
Don't:
- Put too much logic in a single activity
- Assume network calls will succeed
- Skip error handling
- Use infinite timeouts
Monitoring
- Check Temporal UI regularly for failed workflows
- Set up alerts for workflow failures
- Review execution times for optimization
- Audit approval patterns