Coordination Patterns

How agents in a swarm communicate and coordinate.

Pattern 1: Orchestrator-Worker

A central orchestrator (Lima) directs worker agents.

flowchart TB
Orch[Lima Orchestrator]

Orch --> A[Agent A]
Orch --> B[Agent B]
Orch --> C[Agent C]

A --> Orch
B --> Orch
C --> Orch

Use Case: Most common pattern in Crella Engine.

Pros:

  • Clear command structure
  • Easy to debug
  • Centralized state

Cons:

  • Orchestrator is a single point of failure
  • Orchestrator can become a bottleneck at scale
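The following is a minimal sketch of the orchestrator-worker flow, not Lima's actual implementation. The `workers` registry, the `orchestrate` function, and the shape of an assignment are all hypothetical names chosen for illustration.

```javascript
// Hypothetical worker registry: each worker is an async function keyed by name.
const workers = {
  agentA: async (task) => `A handled ${task}`,
  agentB: async (task) => `B handled ${task}`,
};

// The orchestrator is the only component that talks to workers,
// and it keeps all collected results in one place (centralized state).
async function orchestrate(assignments) {
  const state = {};
  for (const { worker, task } of assignments) {
    const handler = workers[worker];
    if (!handler) throw new Error(`Unknown worker: ${worker}`);
    state[worker] = await handler(task); // result flows back to the orchestrator
  }
  return state;
}
```

Because every call passes through `orchestrate`, one log statement there covers the whole swarm, which is the "easy to debug" property listed above. It is also why the orchestrator becomes the bottleneck.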

Pattern 2: Pipeline (Sequential)

Agents process in sequence, each passing to the next.

flowchart LR
A[Alpha] --> B[Bravo] --> C[Charlie] --> D[Delta]

Use Case: Linear workflows like document processing.

Pros:

  • Simple to understand
  • Clear dependencies
  • Easy error tracking

Cons:

  • Slowest pattern: no parallelism, so total latency is the sum of every stage
  • One failure stops the whole pipeline
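A sequential pipeline can be sketched as a loop that feeds each stage's output into the next. The stage names follow the diagram above, but what each stage does (`validated`, `enriched`, `summarized`) is made up for illustration, as are `documentStages` and `runPipeline`.

```javascript
// Hypothetical stages; each receives the previous stage's output.
const documentStages = [
  { name: 'Alpha', run: async (doc) => ({ ...doc, validated: true }) },
  { name: 'Bravo', run: async (doc) => ({ ...doc, enriched: true }) },
  { name: 'Charlie', run: async (doc) => ({ ...doc, summarized: true }) },
];

async function runPipeline(stages, input) {
  let current = input;
  for (const stage of stages) {
    try {
      current = await stage.run(current);
    } catch (err) {
      // One failure stops the pipeline; naming the stage keeps the error easy to trace.
      throw new Error(`Pipeline failed at stage ${stage.name}: ${err.message}`);
    }
  }
  return current;
}
```

Wrapping the error with the stage name is one way to get the "easy error tracking" benefit: the caller sees exactly where the chain broke.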

Pattern 3: Fan-Out / Fan-In

One agent distributes work, another collects results.

flowchart TB
Distribute[Distribute Task]

Distribute --> A[Worker A]
Distribute --> B[Worker B]
Distribute --> C[Worker C]

A --> Collect[Collect Results]
B --> Collect
C --> Collect

Use Case: Parallel research, batch processing.

Pros:

  • Maximum parallelism
  • Fastest for independent tasks

Cons:

  • Requires aggregation logic
  • Complex error handling
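One way to handle both the aggregation and the error-handling concerns is `Promise.allSettled`, which waits for every worker and reports each outcome individually. This is a sketch under the assumption that `worker` is an async function; `fanOutFanIn` and the `{ results, failures }` shape are illustrative names.

```javascript
async function fanOutFanIn(tasks, worker) {
  // Fan-out: start every task at once.
  const settled = await Promise.allSettled(tasks.map((task) => worker(task)));

  // Fan-in: aggregate successes and failures separately so that
  // one failed worker does not discard the other workers' results.
  const results = [];
  const failures = [];
  settled.forEach((outcome, i) => {
    if (outcome.status === 'fulfilled') results.push(outcome.value);
    else failures.push({ task: tasks[i], error: outcome.reason });
  });
  return { results, failures };
}
```

Using `Promise.all` instead would reject as soon as any worker fails, which is simpler but loses the partial results.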

Pattern 4: Peer-to-Peer

Agents communicate directly with each other.

flowchart LR
A[Agent A] <--> B[Agent B]
B <--> C[Agent C]
A <--> C

Use Case: Collaborative tasks, real-time adjustments.

Pros:

  • Flexible
  • No single point of failure
  • Low latency (no hop through a central coordinator)

Cons:

  • Complex coordination
  • Harder to debug
  • State management is challenging

Pattern 5: Event-Driven

Agents respond to events, not direct commands.

flowchart TB
Event[Event Bus]

Event --> |document.uploaded| Charlie[Charlie]
Event --> |lead.qualified| Echo[Echo]
Event --> |response.received| India[India]

Charlie --> |document.processed| Event
Echo --> |lead.enriched| Event
India --> |escalation.created| Event

Use Case: Decoupled systems, async processing.

Pros:

  • Highly scalable
  • Loose coupling
  • Easy to add agents

Cons:

  • Eventually consistent
  • Harder to debug: one request is spread across events handled by different agents
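The idea can be illustrated with a tiny in-process event bus. This is only a sketch: the `EventBus` class is hypothetical, it dispatches synchronously, and a production bus would normally be an external broker with asynchronous delivery (which is where the eventual consistency comes from). The event names are taken from the diagram above; the payload fields are invented.

```javascript
class EventBus {
  constructor() {
    this.handlers = new Map(); // event name -> array of handlers
  }

  subscribe(eventName, handler) {
    const list = this.handlers.get(eventName) ?? [];
    list.push(handler);
    this.handlers.set(eventName, list);
  }

  publish(eventName, payload) {
    for (const handler of this.handlers.get(eventName) ?? []) {
      handler(payload);
    }
  }
}

const bus = new EventBus();
const log = [];

// Charlie reacts to uploads and emits its own event; it never calls another agent directly.
bus.subscribe('document.uploaded', (doc) => {
  bus.publish('document.processed', { id: doc.id, pages: doc.pages });
});

// Any number of agents can listen without Charlie knowing about them.
bus.subscribe('document.processed', (doc) => log.push(`processed ${doc.id}`));

bus.publish('document.uploaded', { id: 'doc-1', pages: 3 });
```

Adding a new agent is one more `subscribe` call, with no change to the publishers.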

Communication Protocols

Synchronous (Request-Response)

// Agent A calls Agent B directly
const result = await agentB.process(data);

When to use: The caller needs the result immediately, or the operation is simple and fast.

Asynchronous (Message Queue)

// Agent A publishes to queue
await queue.publish('charlie.tasks', { document: data });

// Charlie subscribes and processes
queue.subscribe('charlie.tasks', async (msg) => {
  const result = await process(msg.document);
  await queue.publish('charlie.results', result);
});

When to use: High volume, fire-and-forget work, or when resilience is needed.

State Management

Centralized State (Redis)

flowchart TB
subgraph agents [Agents]
A[Alpha]
B[Bravo]
C[Charlie]
end

Redis[(Redis State Store)]

A --> Redis
B --> Redis
C --> Redis

Use: Shared counters, session data, rate limits.

Distributed State (Per-Agent)

flowchart LR
A[Alpha + State A]
B[Bravo + State B]
C[Charlie + State C]

Use: Agent-specific state, no sharing needed.

Workflow State (Lima)

{
  "workflow_id": "wf-123",
  "current_step": 3,
  "state": {
    "lead_validated": true,
    "enrichment_complete": true,
    "email_generated": false
  },
  "history": [
    {"agent": "ALPHA001", "result": "success"},
    {"agent": "ECHO001", "result": "success"}
  ]
}

Error Handling

Retry Pattern

flowchart TB
Task[Execute Task] --> Check{Success?}
Check -->|Yes| Done[Complete]
Check -->|No| Retry{Attempts < Max?}
Retry -->|Yes| Wait[Backoff Wait]
Wait --> Task
Retry -->|No| Escalate[Escalate/Fail]
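The flowchart above can be sketched as a retry loop. The diagram does not say how long the backoff wait is, so exponential backoff is an assumption here, as are the names `withRetry`, `backoffDelay`, `maxAttempts`, and `baseMs`.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Exponential backoff (assumed policy): baseMs, 2 * baseMs, 4 * baseMs, ...
function backoffDelay(attempt, baseMs) {
  return baseMs * 2 ** (attempt - 1);
}

async function withRetry(task, { maxAttempts = 3, baseMs = 100 } = {}) {
  for (let attempt = 1; ; attempt++) {
    try {
      return await task(); // Success? -> Complete
    } catch (err) {
      if (attempt >= maxAttempts) throw err; // Attempts exhausted -> Escalate/Fail
      await sleep(backoffDelay(attempt, baseMs)); // Backoff Wait, then try again
    }
  }
}
```

A call might look like `await withRetry(() => agentB.process(data), { maxAttempts: 5 })`. Retrying is only safe when the task is idempotent.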

Circuit Breaker

stateDiagram-v2
[*] --> Closed
Closed --> Open: Failures > Threshold
Open --> HalfOpen: Timeout Elapsed
HalfOpen --> Closed: Success
HalfOpen --> Open: Failure
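A minimal sketch of the state machine above follows. The class name, option names, and defaults are hypothetical. Two details are assumptions not stated in the diagram: failures are counted consecutively (any success resets the count), and the clock is injectable through `now` so the timeout can be tested without waiting.

```javascript
class CircuitBreaker {
  constructor({ failureThreshold = 3, resetTimeoutMs = 30000, now = Date.now } = {}) {
    this.failureThreshold = failureThreshold;
    this.resetTimeoutMs = resetTimeoutMs;
    this.now = now;
    this.state = 'closed';
    this.failures = 0;
    this.openedAt = 0;
  }

  async call(fn) {
    if (this.state === 'open') {
      if (this.now() - this.openedAt < this.resetTimeoutMs) {
        throw new Error('Circuit open: call rejected'); // fail fast, fn is not invoked
      }
      this.state = 'half-open'; // Open -> HalfOpen: timeout elapsed, allow a trial call
    }
    try {
      const result = await fn();
      this.state = 'closed'; // HalfOpen -> Closed on success
      this.failures = 0;
      return result;
    } catch (err) {
      this.failures += 1;
      // HalfOpen -> Open on any failure; Closed -> Open when failures exceed the threshold
      if (this.state === 'half-open' || this.failures > this.failureThreshold) {
        this.state = 'open';
        this.openedAt = this.now();
      }
      throw err;
    }
  }
}
```

This sketch does not limit concurrent trial calls while half-open; a production breaker would usually let only one through.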

Dead Letter Queue

Messages that still fail after all retries are moved to a dead letter queue (DLQ) for manual review:

Main Queue → Agent → Success ✓
               ↓
     Failure (after retries)
               ↓
Dead Letter Queue → Human Review
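The flow above can be sketched with an in-memory array standing in for the dead letter queue. Real brokers usually provide DLQ routing as a configuration option, so `consume`, `deadLetterQueue`, and the fields of the parked record are purely illustrative assumptions.

```javascript
// In-memory stand-in for a broker-managed dead letter queue (assumption for illustration).
const deadLetterQueue = [];

async function consume(message, handler, maxAttempts = 3) {
  let lastError;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await handler(message); // Success: nothing reaches the DLQ
    } catch (err) {
      lastError = err;
    }
  }
  // Retries exhausted: park the message with enough context for a human reviewer.
  deadLetterQueue.push({
    message,
    error: String(lastError),
    attempts: maxAttempts,
    failedAt: new Date().toISOString(),
  });
  return undefined;
}
```

Storing the original message unchanged means it can be replayed onto the main queue once the underlying problem is fixed.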

Best Practices

  1. Start with Orchestrator-Worker: it is the simplest pattern, and you can optimize later
  2. Use async for high volume: don't block on slow operations
  3. Implement idempotency: make every operation safe to retry
  4. Log everything: propagate a correlation ID across agents
  5. Set timeouts: don't let agents hang forever
  6. Design for failure: assume every call can fail
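Practices 3 and 5 can each be sketched in a few lines. Both helpers are hypothetical. `handleOnce` assumes each message carries a unique idempotency key and that duplicates arrive one after another rather than concurrently; a shared store would replace the `Map` when several processes consume the same queue.

```javascript
// Idempotency: remember results by key so a redelivered message is not processed twice.
const processed = new Map(); // idempotency key -> stored result

async function handleOnce(key, operation) {
  if (processed.has(key)) return processed.get(key); // duplicate delivery: reuse the result
  const result = await operation();
  processed.set(key, result);
  return result;
}

// Timeout: reject if the wrapped promise has not settled within `ms` milliseconds.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`Timed out after ${ms} ms`)), ms);
  });
  // Whichever settles first wins; the timer is always cleared afterwards.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}
```

Note that `withTimeout` only stops waiting; it does not cancel the underlying work, which is one more reason the operation itself should be idempotent.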