Prometheus Metrics for Hatchet
This document provides an overview of the Prometheus metrics exposed by Hatchet, setup instructions for the metrics endpoint, and example PromQL queries to analyze them.
Setup
To enable and configure the Prometheus metrics endpoint in your Hatchet server, set the following environment variables (bound to Viper keys as shown):
-
SERVER_PROMETHEUS_ENABLED
(prometheus.enabled
)- Type: boolean
- Default:
false
- Description: Enables or disables the Prometheus metrics HTTP server.
-
SERVER_PROMETHEUS_ADDRESS
(prometheus.address
)- Type: string
- Default:
":9090"
- Description: The network address and port to bind the Prometheus metrics server to.
-
SERVER_PROMETHEUS_PATH
(prometheus.path
)- Type: string
- Default:
"/metrics"
- Description: The HTTP path at which metrics will be exposed.
If you have set up a Prometheus instance to scrape Hatchet metrics, you can enable the tenant API endpoint by setting the following variables:
-
SERVER_PROMETHEUS_SERVER_URL
(prometheus.prometheusServerURL
)- Type: string
- Description: The Prometheus server URL.
-
SERVER_PROMETHEUS_SERVER_USERNAME
(prometheus.prometheusServerUsername
)- Type: string
- Description: The username to access the Prometheus instance via HTTP basic auth.
-
SERVER_PROMETHEUS_SERVER_PASSWORD
(prometheus.prometheusServerPassword
)- Type: string
- Description: The password to access the Prometheus instance via HTTP basic auth.
Example environment setup:
export SERVER_PROMETHEUS_ENABLED=true
export SERVER_PROMETHEUS_ADDRESS=":9999"
export SERVER_PROMETHEUS_PATH="/custom-metrics"
Restart your Hatchet server after setting these variables to apply the changes.
Global Metrics
Metric Name | Type | Description |
---|---|---|
hatchet_queue_invocations_total | Counter | The total number of invocations of the queuer function |
hatchet_created_tasks_total | Counter | The total number of tasks created |
hatchet_retried_tasks_total | Counter | The total number of tasks retried |
hatchet_succeeded_tasks_total | Counter | The total number of tasks that succeeded |
hatchet_failed_tasks_total | Counter | The total number of tasks that failed (in a final state, not including retries) |
hatchet_skipped_tasks_total | Counter | The total number of tasks that were skipped |
hatchet_cancelled_tasks_total | Counter | The total number of tasks cancelled |
hatchet_assigned_tasks_total | Counter | The total number of tasks assigned to a worker |
hatchet_scheduling_timed_out_total | Counter | The total number of tasks that timed out while waiting to be scheduled |
hatchet_rate_limited_total | Counter | The total number of tasks that were rate limited |
hatchet_queued_to_assigned_total | Counter | The total number of unique tasks that were queued and later assigned to a worker |
hatchet_queued_to_assigned_seconds | Histogram | Buckets of time (in seconds) spent in the queue before being assigned to a worker |
Example PromQL Queries
1. Rate of calls to the queuer method
rate(hatchet_queue_invocations_total[5m])
2. Average queue time in milliseconds
# Calculates average queue time over the past 5 minutes, converted to ms
rate(hatchet_queued_to_assigned_seconds_sum[5m])
/ rate(hatchet_queued_to_assigned_seconds_count[5m])
* 1e3
3. Success and failure rates
rate(hatchet_succeeded_tasks_total[5m])
rate(hatchet_failed_tasks_total[5m])
4. Queue time distribution (histogram)
sum by (le) (
rate(hatchet_queued_to_assigned_seconds_bucket[5m])
)
5. Rate of tasks created vs. retried
rate(hatchet_created_tasks_total[5m])
rate(hatchet_retried_tasks_total[5m])
6. Task Assignment Rate
rate(hatchet_assigned_tasks_total[5m])
7. Scheduling Timeout Rate
rate(hatchet_scheduling_timed_out_total[5m])
8. Rate Limiting Impact
rate(hatchet_rate_limited_total[5m])
9. Task Completion Ratio (Success vs Total)
rate(hatchet_succeeded_tasks_total[5m])
/
(rate(hatchet_succeeded_tasks_total[5m]) + rate(hatchet_failed_tasks_total[5m]))
10. Task Cancellation Rate
rate(hatchet_cancelled_tasks_total[5m])
11. Task Skip Rate
rate(hatchet_skipped_tasks_total[5m])
12. Queue Processing Efficiency (Assigned vs Created)
rate(hatchet_assigned_tasks_total[5m]) / rate(hatchet_created_tasks_total[5m])
Tenant Metrics
Metric Name | Type | Description |
---|---|---|
hatchet_tenant_workflow_duration_milliseconds | Histogram | Duration of workflow execution in milliseconds (DAGs and single tasks) |
hatchet_tenant_queue_invocations_total | Counter | The total number of invocations of the queuer function |
hatchet_tenant_created_tasks_total | Counter | The total number of tasks created |
hatchet_tenant_retried_tasks_total | Counter | The total number of tasks retried |
hatchet_tenant_succeeded_tasks_total | Counter | The total number of tasks that succeeded |
hatchet_tenant_failed_tasks_total | Counter | The total number of tasks that failed (in a final state, not including retries) |
hatchet_tenant_skipped_tasks_total | Counter | The total number of tasks that were skipped |
hatchet_tenant_cancelled_tasks_total | Counter | The total number of tasks cancelled |
hatchet_tenant_assigned_tasks | Counter | The total number of tasks assigned to a worker |
hatchet_tenant_scheduling_timed_out | Counter | The total number of tasks that timed out while waiting to be scheduled |
hatchet_tenant_rate_limited | Counter | The total number of tasks that were rate limited |
hatchet_tenant_queued_to_assigned | Counter | The total number of unique tasks that were queued and later got assigned to a worker |
hatchet_tenant_queued_to_assigned_time_seconds | Histogram | Buckets of time in seconds spent in the queue before being assigned to a worker |
Example PromQL Queries
1. Workflow Duration by Tenant and Status
rate(hatchet_tenant_workflow_duration_milliseconds_sum[5m])
by (tenant_id, workflow_name, status)
/
rate(hatchet_tenant_workflow_duration_milliseconds_count[5m])
by (tenant_id, workflow_name, status)
2. Tenant Queue Performance (95th percentile)
histogram_quantile(0.95,
rate(hatchet_tenant_queued_to_assigned_time_seconds_bucket[5m])
) by (tenant_id)
3. Tenant Error Rate by Workflow
rate(hatchet_tenant_failed_tasks_total[5m]) by (tenant_id)
/
rate(hatchet_tenant_created_tasks_total[5m]) by (tenant_id)
4. Tenant Task Throughput
rate(hatchet_tenant_succeeded_tasks_total[5m]) by (tenant_id)
5. Tenant Retry Rate
rate(hatchet_tenant_retried_tasks_total[5m]) by (tenant_id)
/
rate(hatchet_tenant_created_tasks_total[5m]) by (tenant_id)
6. Workflow Duration Distribution by Tenant
sum by (tenant_id, le) (
rate(hatchet_tenant_workflow_duration_milliseconds_bucket[5m])
)
7. Tenant Rate Limiting Impact
rate(hatchet_tenant_rate_limited[5m]) by (tenant_id)
8. Per-Tenant Queue Utilization
rate(hatchet_tenant_queue_invocations_total[5m]) by (tenant_id)
9. Tenant Scheduling Timeouts
rate(hatchet_tenant_scheduling_timed_out[5m]) by (tenant_id)
10. Tenant Task Assignment Success Rate
rate(hatchet_tenant_assigned_tasks[5m]) by (tenant_id)
/
rate(hatchet_tenant_created_tasks_total[5m]) by (tenant_id)
Cross-Tenant Analysis
Example PromQL Queries
1. Top 5 Tenants by Task Volume
topk(5,
sum by (tenant_id) (
rate(hatchet_tenant_created_tasks_total[1h])
)
)
2. Slowest Workflows Across All Tenants
topk(10,
rate(hatchet_tenant_workflow_duration_milliseconds_sum[5m])
/
rate(hatchet_tenant_workflow_duration_milliseconds_count[5m])
) by (tenant_id, workflow_name)
3. Tenant Resource Consumption Comparison
sum by (tenant_id) (
rate(hatchet_tenant_workflow_duration_milliseconds_sum[1h])
)
/ 1000 / 60 # Convert to minutes
Integration with Prometheus
This endpoint can be used to configure Prometheus to scrape tenant-specific metrics:
scrape_configs:
- job_name: "hatchet-tenant-metrics"
static_configs:
- targets: ["cloud.onhatchet.run"]
metrics_path: "/api/v1/tenants/707d0855-80ab-4e1f-a156-f1c4546cbf52/prometheus-metrics"
scheme: "https"
authorization:
credentials: "your-api-token-here"
Note: Replace cloud.onhatchet.run
with the URL where your Hatchet instance is hosted.
This provides tenant-isolated metrics that can be scraped directly by Prometheus or consumed by other monitoring tools that support the Prometheus text format.