Architecture & Guarantees
This page explains how Hatchet is put together, what the main components do, and what reliability guarantees you should design your workers around.
Architecture overview
Hatchet has three main moving pieces:
- API server: the HTTP surface area for triggering workflows, querying state, and powering the UI
- Engine: schedules and dispatches work, enforces dependencies/policies, and records state transitions durably
- Workers: your processes that run the actual task code
State is stored durably, with PostgreSQL as the source of truth. In many deployments that is enough: no separate broker is required. Self-hosted, high-throughput setups can add additional components (for example, RabbitMQ) based on their needs.
Hatchet Cloud and self-hosted Hatchet share the same architecture; the difference is who runs and operates the Hatchet services.
Core components
API server
The API server is the front door to Hatchet. It’s what your application and the Hatchet UI talk to in order to:
- trigger workflows with input data
- query workflow/task state (and, where supported, subscribe to updates)
- manage resources like schedules and settings
- ingest webhooks/events (optional)
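To make the trigger path concrete, here is a minimal, self-contained sketch (plain Python, not the Hatchet SDK) of what a "trigger workflow" endpoint conceptually does: validate the input, persist a run record, and hand back a run ID. The in-memory dict stands in for the durable store.

```python
import json
import uuid

# In-memory stand-in for the durable store (PostgreSQL in Hatchet).
RUNS: dict[str, dict] = {}

def trigger_workflow(workflow_name: str, input_data: dict) -> str:
    """Conceptual trigger path: validate, persist a run record, return its ID."""
    # Validate that the input is JSON-serializable before accepting it.
    payload = json.dumps(input_data)
    run_id = str(uuid.uuid4())
    RUNS[run_id] = {
        "workflow": workflow_name,
        "input": payload,
        "status": "QUEUED",  # the engine later moves this through RUNNING/COMPLETED
    }
    return run_id

run_id = trigger_workflow("send-welcome-email", {"user_id": 42})
print(RUNS[run_id]["status"])  # QUEUED
```

The important property is that the run is recorded durably before anything executes, which is what lets the rest of the system guarantee work isn't lost.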
Engine
The engine is responsible for turning “a workflow should run” into “these tasks are ready and should be executed.” In practice, it:
- evaluates workflow dependencies
- enforces policies like concurrency limits, rate limits, and priorities
- schedules ready tasks and dispatches them to connected workers
- records state transitions durably and applies retry/timeout behavior
- runs scheduled/cron workflows
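The dependency-evaluation step above can be sketched as a toy model: a task is "ready" once all of its parents have completed. The task names and graph here are illustrative, not a Hatchet API.

```python
# Toy model of dependency resolution: a task becomes ready when every
# task it depends on has completed.
graph = {
    "fetch": [],                 # no dependencies
    "transform": ["fetch"],
    "load": ["transform"],
    "notify": ["transform"],
}

def ready_tasks(completed: set[str]) -> list[str]:
    """Tasks whose dependencies are all complete and that haven't run yet."""
    return sorted(
        task for task, deps in graph.items()
        if task not in completed and all(d in completed for d in deps)
    )

completed: set[str] = set()
order = []
while len(completed) < len(graph):
    batch = ready_tasks(completed)
    order.append(batch)          # everything in a batch could run in parallel
    completed.update(batch)

print(order)  # [['fetch'], ['transform'], ['load', 'notify']]
```

Note that `load` and `notify` land in the same batch: once their shared dependency completes, the engine can dispatch both concurrently.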
Workers connect to the engine over bidirectional gRPC, which allows low-latency dispatch and frequent status updates.
Workers
Workers are your processes. They connect to the engine, receive tasks, run your code, and report progress/results back to Hatchet.
Workers are intentionally flexible: you can run them locally, in containers, or on VMs, and you can scale workers independently from the Hatchet services. You can also run different “types” of workers (and even different languages) depending on what your system needs.
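A worker's job loop is conceptually simple: pull a task, run the registered handler, report the result. The sketch below models that loop with an in-memory queue standing in for the gRPC connection; the handler registry and task shape are illustrative.

```python
import queue

# Illustrative worker loop. In Hatchet the transport is bidirectional gRPC
# to the engine; here a queue.Queue stands in for it.
tasks = queue.Queue()
results = {}

HANDLERS = {
    "greet": lambda data: f"hello, {data['name']}",
}

def worker_loop():
    while True:
        task = tasks.get()
        if task is None:          # sentinel: shut down
            break
        try:
            output = HANDLERS[task["name"]](task["input"])
            results[task["id"]] = ("COMPLETED", output)
        except Exception as exc:  # report failures so the engine can retry
            results[task["id"]] = ("FAILED", str(exc))

tasks.put({"id": "t1", "name": "greet", "input": {"name": "ada"}})
tasks.put(None)
worker_loop()
print(results["t1"])  # ('COMPLETED', 'hello, ada')
```

Because a failure is reported rather than swallowed, the engine stays in charge of retry and timeout policy; the worker just runs code and reports back.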
Storage (and optional messaging)
PostgreSQL is the durable store for workflow definitions and execution state (queued/running/completed, inputs/outputs, retries, etc.). In self-hosted deployments, you can start with PostgreSQL-only and add components like RabbitMQ if you need higher throughput.
Guarantees & tradeoffs
Hatchet aims to sit in the middle: more structure than a simple queue, but simpler to operate than a full distributed workflow system.
Good fit for
- Workflow orchestration with dependencies, retries, and timeouts
- Durable background jobs where “don’t lose work” matters
- Moderate- to high-throughput systems, with a path to higher scale via tuning and sharding (if you're pushing the limits, contact us)
- Multi-language / polyglot workers
- Teams already on PostgreSQL who want operational simplicity
- Cloud or air-gapped environments (Hatchet Cloud or self-hosting)
Not a good fit for
- Extremely high throughput without sharding/custom tuning (for example, sustaining 10,000+ tasks/sec)
- Sub-millisecond dispatch latency requirements
- In-memory-only queuing where durability is unnecessary
- Serverless-only runtimes (e.g. AWS Lambda / Cloud Functions) as your primary worker model
Core reliability guarantees
At-least-once task execution
Hatchet provides at-least-once execution: tasks are not silently dropped, and failures retry according to your configuration. This also means a task can run more than once, so your task code should be idempotent (or otherwise safe to retry).
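A common way to make a handler safe under at-least-once delivery is to key side effects on a unique identifier. In this sketch, a processed-key set stands in for a uniqueness constraint in your own database (for example, a UNIQUE index on an idempotency key):

```python
# Because a task can be delivered more than once, make the handler idempotent:
# repeat deliveries with the same key become no-ops.
processed: set[str] = set()
charges: list[int] = []

def charge_customer(task_id: str, amount: int) -> None:
    if task_id in processed:      # duplicate delivery: do nothing
        return
    charges.append(amount)        # the side effect we must not repeat
    processed.add(task_id)

charge_customer("run-123", 500)
charge_customer("run-123", 500)   # retried delivery of the same task
print(charges)  # [500] -- the charge happened exactly once
```

In production you would use a durable store rather than an in-process set, so the check survives worker restarts.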
Durable state transitions
Workflow state is persisted in PostgreSQL, and state transitions are performed transactionally. This helps keep dependency resolution consistent and makes the system resilient to restarts and transient failures.
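A transactional state transition can be sketched as a compare-and-set update: the row only moves to the new state if it is still in the expected one. This uses SQLite in place of PostgreSQL, and the schema and state names are illustrative, not Hatchet's actual ones.

```python
import sqlite3

# Sketch of a transactional state transition against a durable store.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE task_runs (id TEXT PRIMARY KEY, status TEXT)")
db.execute("INSERT INTO task_runs VALUES ('t1', 'QUEUED')")
db.commit()

def transition(task_id: str, expected: str, new: str) -> bool:
    """Move a task between states only if it is still in the expected state."""
    with db:  # one transaction: either the transition commits or nothing does
        cur = db.execute(
            "UPDATE task_runs SET status = ? WHERE id = ? AND status = ?",
            (new, task_id, expected),
        )
        return cur.rowcount == 1

print(transition("t1", "QUEUED", "RUNNING"))    # True
print(transition("t1", "QUEUED", "RUNNING"))    # False: already RUNNING
```

Because the transition is conditional and transactional, two competing engine instances cannot both move the same task forward, which is what keeps dependency resolution consistent across restarts.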
Execution policies are explicit
By default, task assignment is FIFO, and you can change execution behavior using explicit policies such as concurrency limits, rate limits, and priorities.
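As a toy illustration of a concurrency-limit policy, the sketch below caps how many tasks run at once with a semaphore; the limit, task bodies, and bookkeeping are all illustrative stand-ins for what the engine enforces for you.

```python
import threading

# Toy concurrency policy: at most MAX_CONCURRENT tasks run at once;
# the rest block until a slot frees up.
MAX_CONCURRENT = 2
slots = threading.BoundedSemaphore(MAX_CONCURRENT)
lock = threading.Lock()
running = 0
peak = 0

def run_task(i: int) -> None:
    global running, peak
    with slots:                      # blocks while the limit is saturated
        with lock:
            running += 1
            peak = max(peak, running)
        # ... task body would run here ...
        with lock:
            running -= 1

threads = [threading.Thread(target=run_task, args=(i,)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(peak <= MAX_CONCURRENT)  # True: the cap was never exceeded
```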
Stateless services; resilient workers
The engine and API server are designed to restart without losing state, which also enables horizontal scaling by running multiple instances. Workers reconnect after network interruptions and can run close to your services (or close to Hatchet) depending on your latency goals.
Performance expectations
Real-world performance depends heavily on topology (worker ↔ engine network latency), database sizing, and workload shape.
- Dispatch latency: often sub-50ms with PostgreSQL-backed storage; in optimized, “hot worker” setups it can be closer to ~25ms P95.
- Throughput: varies by setup. PostgreSQL-only deployments often handle hundreds of tasks/sec per engine instance; higher throughput typically requires additional tuning and/or components like RabbitMQ. With tuning and sharding, Hatchet can scale into the high tens of thousands of tasks/sec—contact us if you want to design for that.
- Common bottlenecks: DB connection limits, large payloads (e.g. > 1MB), complex dependency graphs, and cross-region latency.
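A rough back-of-envelope capacity estimate can help you see which side is your bottleneck: steady-state throughput is bounded both by total worker slots divided by average task duration and by how fast the engine/database can schedule. All the numbers below are made-up inputs, not Hatchet benchmarks.

```python
# Rough capacity estimate (not a benchmark). Every input here is illustrative.
workers = 4
slots_per_worker = 100          # concurrent tasks each worker can run
avg_task_seconds = 2.0
engine_dispatch_per_sec = 500   # what the engine/DB can schedule per second

worker_bound = workers * slots_per_worker / avg_task_seconds
throughput = min(worker_bound, engine_dispatch_per_sec)
print(throughput)  # 200.0 tasks/sec: worker capacity is the bottleneck here
```

If the engine bound is the smaller number instead, scaling workers won't help; that is when database tuning or additional components become relevant.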
Not seeing expected performance?
If you’re not seeing the performance you expect, please reach out to us or join our community to explore tuning options.
Next Steps
- Quick Start: set up your first Hatchet worker
- Self-Hosting: deploy Hatchet on your own infrastructure