Prometheus Metrics - Hatchet Documentation

Prometheus Metrics for Hatchet

⚠️ Only works with v1 tenants.

This document provides an overview of the Prometheus metrics exposed by Hatchet, setup instructions for the metrics endpoint, and example PromQL queries to analyze them.

Setup

To enable and configure the Prometheus metrics endpoint in your Hatchet server, set the following environment variables (bound to Viper keys as shown):

SERVER_PROMETHEUS_ENABLED (prometheus.enabled)
- Type: boolean
- Default: false
- Description: Enables or disables the Prometheus metrics HTTP server.
SERVER_PROMETHEUS_ADDRESS (prometheus.address)
- Type: string
- Default: ":9090"
- Description: The network address and port to bind the Prometheus metrics server to.
SERVER_PROMETHEUS_PATH (prometheus.path)
- Type: string
- Default: "/metrics"
- Description: The HTTP path at which metrics will be exposed.

If you have set up a Prometheus instance to scrape Hatchet metrics, you can enable the tenant API endpoint by setting the following variables:

SERVER_PROMETHEUS_SERVER_URL (prometheus.prometheusServerURL)
- Type: string
- Description: The Prometheus server URL.
SERVER_PROMETHEUS_SERVER_USERNAME (prometheus.prometheusServerUsername)
- Type: string
- Description: The username to access the Prometheus instance via HTTP basic auth.
SERVER_PROMETHEUS_SERVER_PASSWORD (prometheus.prometheusServerPassword)
- Type: string
- Description: The password to access the Prometheus instance via HTTP basic auth.

Example environment setup:

export SERVER_PROMETHEUS_ENABLED=true
export SERVER_PROMETHEUS_ADDRESS=":9999"
export SERVER_PROMETHEUS_PATH="/custom-metrics"

Restart your Hatchet server after setting these variables to apply the changes.
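After the restart, it can be useful to confirm that the endpoint answers and returns Hatchet series. The following is a minimal Python sketch, not part of Hatchet itself. The helper names `parse_metrics` and `fetch_metrics` are hypothetical, the sample text is made up, and the parser only handles plain sample lines of the Prometheus text format.

```python
import urllib.request


def parse_metrics(text):
    """Parse simple Prometheus text-format samples into {series: value}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        if "}" in line:
            # Series with labels: split right after the closing brace.
            end = line.rindex("}") + 1
            series, rest = line[:end], line[end:].split()
        else:
            parts = line.split()
            series, rest = parts[0], parts[1:]
        # rest[0] is the value; an optional timestamp may follow it.
        samples[series] = float(rest[0])
    return samples


def fetch_metrics(url):
    """Fetch the raw metrics text from a running metrics endpoint."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.read().decode("utf-8")


# Made-up sample in the text exposition format, for illustration only.
sample = """# TYPE hatchet_created_tasks_total counter
hatchet_created_tasks_total 42
hatchet_queued_to_assigned_seconds_bucket{le="+Inf"} 7
"""
print(parse_metrics(sample))
```

Against a live server using the example environment above, and assuming it runs on localhost, the call would be `parse_metrics(fetch_metrics("http://localhost:9999/custom-metrics"))`.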

Global Metrics

Metric Name	Type	Description
`hatchet_queue_invocations_total`	Counter	The total number of invocations of the queuer function
`hatchet_created_tasks_total`	Counter	The total number of tasks created
`hatchet_retried_tasks_total`	Counter	The total number of tasks retried
`hatchet_succeeded_tasks_total`	Counter	The total number of tasks that succeeded
`hatchet_failed_tasks_total`	Counter	The total number of tasks that failed (in a final state, not including retries)
`hatchet_skipped_tasks_total`	Counter	The total number of tasks that were skipped
`hatchet_cancelled_tasks_total`	Counter	The total number of tasks cancelled
`hatchet_assigned_tasks_total`	Counter	The total number of tasks assigned to a worker
`hatchet_scheduling_timed_out_total`	Counter	The total number of tasks that timed out while waiting to be scheduled
`hatchet_rate_limited_total`	Counter	The total number of tasks that were rate limited
`hatchet_queued_to_assigned_total`	Counter	The total number of unique tasks that were queued and later assigned to a worker
`hatchet_queued_to_assigned_seconds`	Histogram	Buckets of time (in seconds) spent in the queue before being assigned to a worker
`hatchet_reassigned_tasks_total`	Counter	The total number of tasks that were reassigned to a worker

Example PromQL Queries

1. Rate of calls to the queuer method

rate(hatchet_queue_invocations_total[5m])

2. Average queue time in milliseconds

# Calculates average queue time over the past 5 minutes, converted to ms
rate(hatchet_queued_to_assigned_seconds_sum[5m])
  / rate(hatchet_queued_to_assigned_seconds_count[5m])
  * 1e3
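To make the arithmetic concrete: both `rate()` calls divide by the same window length, so the query reduces to the increase in `_sum` divided by the increase in `_count`. The Python sketch below reproduces that calculation with made-up scrape values; the helper name `average_queue_ms` and the numbers are illustrative only.

```python
def average_queue_ms(sum_start, sum_end, count_start, count_end):
    """Average queue time in ms between two scrapes of the histogram's
    _sum (in seconds) and _count series."""
    observed = count_end - count_start
    if observed == 0:
        # No tasks were assigned in the window; PromQL yields NaN for 0/0.
        return float("nan")
    return (sum_end - sum_start) / observed * 1e3


# Hypothetical values: 200 tasks spent a combined 50 s in the queue.
print(average_queue_ms(120.0, 170.0, 1000, 1200))  # 250.0
```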

3. Success and failure rates

# Two separate queries: successes per second, then failures per second
rate(hatchet_succeeded_tasks_total[5m])
rate(hatchet_failed_tasks_total[5m])

4. Queue time distribution (histogram)

sum by (le) (
  rate(hatchet_queued_to_assigned_seconds_bucket[5m])
)

5. Rate of tasks created vs. retried

# Two separate queries: tasks created per second, then tasks retried per second
rate(hatchet_created_tasks_total[5m])
rate(hatchet_retried_tasks_total[5m])

6. Task Assignment Rate

rate(hatchet_assigned_tasks_total[5m])

7. Scheduling Timeout Rate

rate(hatchet_scheduling_timed_out_total[5m])

8. Rate Limiting Impact

rate(hatchet_rate_limited_total[5m])

9. Task Completion Ratio (Success vs Total)

rate(hatchet_succeeded_tasks_total[5m])
/
(rate(hatchet_succeeded_tasks_total[5m]) + rate(hatchet_failed_tasks_total[5m]))

10. Task Cancellation Rate

rate(hatchet_cancelled_tasks_total[5m])

11. Task Skip Rate

rate(hatchet_skipped_tasks_total[5m])

12. Queue Processing Efficiency (Assigned vs Created)

rate(hatchet_assigned_tasks_total[5m]) / rate(hatchet_created_tasks_total[5m])

13. Task Reassignment Rate

rate(hatchet_reassigned_tasks_total[5m])
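The queries above can also be evaluated programmatically through the Prometheus HTTP API (`GET /api/v1/query`). A minimal Python sketch follows. The base URL `http://prometheus.example:9090` and the helper names are assumptions for illustration; substitute the address of the Prometheus server that scrapes Hatchet.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    params = urllib.parse.urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + params


def extract_values(payload):
    """Return [(labels, value)] from an instant-vector query response."""
    if payload.get("status") != "success":
        raise ValueError("query failed: %s" % payload.get("error"))
    return [(r["metric"], float(r["value"][1])) for r in payload["data"]["result"]]


def run_query(base_url, promql):
    """Execute the query against a live Prometheus server."""
    with urllib.request.urlopen(build_query_url(base_url, promql), timeout=10) as resp:
        return extract_values(json.load(resp))


# Hypothetical Prometheus address; the expression is query 13 above.
url = build_query_url(
    "http://prometheus.example:9090",
    "rate(hatchet_reassigned_tasks_total[5m])",
)
print(url)
```

`run_query` is defined but not called here, since it needs a reachable Prometheus server.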

Tenant Metrics

Metric Name	Type	Description
`hatchet_tenant_workflow_duration_milliseconds`	Histogram	Duration of workflow execution in milliseconds (DAGs and single tasks)
`hatchet_tenant_queue_invocations_total`	Counter	The total number of invocations of the queuer function
`hatchet_tenant_created_tasks_total`	Counter	The total number of tasks created
`hatchet_tenant_retried_tasks_total`	Counter	The total number of tasks retried
`hatchet_tenant_succeeded_tasks_total`	Counter	The total number of tasks that succeeded
`hatchet_tenant_failed_tasks_total`	Counter	The total number of tasks that failed (in a final state, not including retries)
`hatchet_tenant_skipped_tasks_total`	Counter	The total number of tasks that were skipped
`hatchet_tenant_cancelled_tasks_total`	Counter	The total number of tasks cancelled
`hatchet_tenant_assigned_tasks`	Counter	The total number of tasks assigned to a worker
`hatchet_tenant_scheduling_timed_out`	Counter	The total number of tasks that timed out while waiting to be scheduled
`hatchet_tenant_rate_limited`	Counter	The total number of tasks that were rate limited
`hatchet_tenant_queued_to_assigned`	Counter	The total number of unique tasks that were queued and later got assigned to a worker
`hatchet_tenant_queued_to_assigned_time_seconds`	Histogram	Buckets of time in seconds spent in the queue before being assigned to a worker
`hatchet_tenant_reassigned_tasks`	Counter	The total number of tasks that were reassigned to a worker

Example PromQL Queries

1. Workflow Duration by Tenant and Status

sum by (tenant_id, workflow_name, status) (
  rate(hatchet_tenant_workflow_duration_milliseconds_sum[5m])
)
/
sum by (tenant_id, workflow_name, status) (
  rate(hatchet_tenant_workflow_duration_milliseconds_count[5m])
)

2. Tenant Queue Performance (95th percentile)

histogram_quantile(0.95,
  sum by (tenant_id, le) (
    rate(hatchet_tenant_queued_to_assigned_time_seconds_bucket[5m])
  )
)
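For intuition about what `histogram_quantile` returns, the Python sketch below estimates a quantile from cumulative bucket counts using linear interpolation inside the matching bucket. It is a simplified illustration of the idea, not Prometheus' actual implementation. The bucket boundaries and counts are made up, the helper name `estimate_quantile` is hypothetical, and it assumes 0 < q <= 1 and at least one finite bucket.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) pairs that ends with
    the +Inf bucket, mirroring the *_bucket series and their `le` label.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    for i, (upper, cumulative) in enumerate(buckets):
        if cumulative >= rank:
            break
    if math.isinf(upper):
        # The quantile falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    lower = buckets[i - 1][0] if i > 0 else 0.0
    below = buckets[i - 1][1] if i > 0 else 0.0
    # Linear interpolation inside the bucket that contains the rank.
    return lower + (upper - lower) * (rank - below) / (cumulative - below)


# Made-up cumulative counts for bounds of 0.1 s, 0.5 s, 1 s and +Inf.
example = [(0.1, 10), (0.5, 60), (1.0, 90), (math.inf, 100)]
print(estimate_quantile(0.5, example))   # roughly 0.42
print(estimate_quantile(0.95, example))  # 1.0
```

The second result shows a practical limit: when the requested quantile lands in the `+Inf` bucket, the estimate is capped at the largest finite bucket bound.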

3. Tenant Error Rate

sum by (tenant_id) (rate(hatchet_tenant_failed_tasks_total[5m]))
/
sum by (tenant_id) (rate(hatchet_tenant_created_tasks_total[5m]))

4. Tenant Task Throughput

sum by (tenant_id) (rate(hatchet_tenant_succeeded_tasks_total[5m]))

5. Tenant Retry Rate

sum by (tenant_id) (rate(hatchet_tenant_retried_tasks_total[5m]))
/
sum by (tenant_id) (rate(hatchet_tenant_created_tasks_total[5m]))

6. Workflow Duration Distribution by Tenant

sum by (tenant_id, le) (
  rate(hatchet_tenant_workflow_duration_milliseconds_bucket[5m])
)

7. Tenant Rate Limiting Impact

sum by (tenant_id) (rate(hatchet_tenant_rate_limited[5m]))

8. Per-Tenant Queue Utilization

sum by (tenant_id) (rate(hatchet_tenant_queue_invocations_total[5m]))

9. Tenant Scheduling Timeouts

sum by (tenant_id) (rate(hatchet_tenant_scheduling_timed_out[5m]))

10. Tenant Task Assignment Success Rate

sum by (tenant_id) (rate(hatchet_tenant_assigned_tasks[5m]))
/
sum by (tenant_id) (rate(hatchet_tenant_created_tasks_total[5m]))

11. Tenant Task Reassignment Rate

sum by (tenant_id) (rate(hatchet_tenant_reassigned_tasks[5m]))

Cross-Tenant Analysis

Example PromQL Queries

1. Top 5 Tenants by Task Volume

topk(5,
  sum by (tenant_id) (
    rate(hatchet_tenant_created_tasks_total[1h])
  )
)

2. Slowest Workflows Across All Tenants

topk(10,
  sum by (tenant_id, workflow_name) (
    rate(hatchet_tenant_workflow_duration_milliseconds_sum[5m])
  )
  /
  sum by (tenant_id, workflow_name) (
    rate(hatchet_tenant_workflow_duration_milliseconds_count[5m])
  )
)

3. Tenant Resource Consumption Comparison

sum by (tenant_id) (
  rate(hatchet_tenant_workflow_duration_milliseconds_sum[1h])
)
/ 1000 / 60  # Convert to minutes

Integration with Prometheus

Prometheus can scrape tenant-specific metrics from the tenant API endpoint:

scrape_configs:
  - job_name: "hatchet-tenant-metrics"
    static_configs:
      - targets: ["cloud.onhatchet.run"]
    metrics_path: "/api/v1/tenants/707d0855-80ab-4e1f-a156-f1c4546cbf52/prometheus-metrics"
    scheme: "https"
    authorization:
      credentials: "your-api-token-here"

Note: Replace cloud.onhatchet.run with the host where your Hatchet instance is running, and substitute your own tenant ID and API token.

This provides tenant-isolated metrics that can be scraped directly by Prometheus or consumed by other monitoring tools that support the Prometheus text format.
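For tools other than Prometheus, the same endpoint can be read with a plain HTTP request. The Python sketch below builds such a request from the placeholder values in the scrape configuration. It assumes the endpoint accepts the token as a Bearer credential, which is how Prometheus sends `authorization.credentials` by default; the helper name `build_tenant_metrics_request` is hypothetical.

```python
import urllib.request


def build_tenant_metrics_request(host, tenant_id, api_token):
    """Build an authenticated GET request for the tenant metrics endpoint."""
    url = "https://%s/api/v1/tenants/%s/prometheus-metrics" % (host, tenant_id)
    req = urllib.request.Request(url)
    # Same bearer credentials as the `authorization` block in the scrape config.
    req.add_header("Authorization", "Bearer " + api_token)
    return req


# Placeholder values copied from the scrape configuration; replace all three.
req = build_tenant_metrics_request(
    "cloud.onhatchet.run",
    "707d0855-80ab-4e1f-a156-f1c4546cbf52",
    "your-api-token-here",
)
print(req.full_url)
# To fetch for real: urllib.request.urlopen(req, timeout=10).read().decode()
```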
