We use cookies

We use cookies to ensure you get the best experience on our website. For more information on how we use cookies, please see our cookie policy.

By clicking "Accept", you agree to our use of cookies.
Learn more.

Self HostingPrometheus Metrics

Prometheus Metrics for Hatchet

This document provides an overview of the Prometheus metrics exposed by Hatchet, setup instructions for the metrics endpoint, and example PromQL queries to analyze them.

Setup

To enable and configure the Prometheus metrics endpoint in your Hatchet server, set the following environment variables (bound to Viper keys as shown):

  • SERVER_PROMETHEUS_ENABLED (prometheus.enabled)

    • Type: boolean
    • Default: false
    • Description: Enables or disables the Prometheus metrics HTTP server.
  • SERVER_PROMETHEUS_ADDRESS (prometheus.address)

    • Type: string
    • Default: ":9090"
    • Description: The network address and port to bind the Prometheus metrics server to.
  • SERVER_PROMETHEUS_PATH (prometheus.path)

    • Type: string
    • Default: "/metrics"
    • Description: The HTTP path at which metrics will be exposed.

If you have set up a Prometheus instance to scrape Hatchet metrics, you can enable the tenant API endpoint by setting the following variables:

  • SERVER_PROMETHEUS_SERVER_URL (prometheus.prometheusServerURL)

    • Type: string
    • Description: The Prometheus server URL.
  • SERVER_PROMETHEUS_SERVER_USERNAME (prometheus.prometheusServerUsername)

    • Type: string
    • Description: The username to access the Prometheus instance via HTTP basic auth.
  • SERVER_PROMETHEUS_SERVER_PASSWORD (prometheus.prometheusServerPassword)

    • Type: string
    • Description: The password to access the Prometheus instance via HTTP basic auth.

Example environment setup:

export SERVER_PROMETHEUS_ENABLED=true
export SERVER_PROMETHEUS_ADDRESS=":9999"
export SERVER_PROMETHEUS_PATH="/custom-metrics"

Restart your Hatchet server after setting these variables to apply the changes.


Global Metrics

Metric NameTypeDescription
hatchet_queue_invocations_totalCounterThe total number of invocations of the queuer function
hatchet_created_tasks_totalCounterThe total number of tasks created
hatchet_retried_tasks_totalCounterThe total number of tasks retried
hatchet_succeeded_tasks_totalCounterThe total number of tasks that succeeded
hatchet_failed_tasks_totalCounterThe total number of tasks that failed (in a final state, not including retries)
hatchet_skipped_tasks_totalCounterThe total number of tasks that were skipped
hatchet_cancelled_tasks_totalCounterThe total number of tasks cancelled
hatchet_assigned_tasks_totalCounterThe total number of tasks assigned to a worker
hatchet_scheduling_timed_out_totalCounterThe total number of tasks that timed out while waiting to be scheduled
hatchet_rate_limited_totalCounterThe total number of tasks that were rate limited
hatchet_queued_to_assigned_totalCounterThe total number of unique tasks that were queued and later assigned to a worker
hatchet_queued_to_assigned_secondsHistogramBuckets of time (in seconds) spent in the queue before being assigned to a worker

Example PromQL Queries

1. Rate of calls to the queuer method
rate(hatchet_queue_invocations_total[5m])
2. Average queue time in milliseconds
# Calculates average queue time over the past 5 minutes, converted to ms
rate(hatchet_queued_to_assigned_seconds_sum[5m])
  / rate(hatchet_queued_to_assigned_seconds_count[5m])
  * 1e3
3. Success and failure rates
rate(hatchet_succeeded_tasks_total[5m])
rate(hatchet_failed_tasks_total[5m])
4. Queue time distribution (histogram)
sum by (le) (
  rate(hatchet_queued_to_assigned_seconds_bucket[5m])
)
5. Rate of tasks created vs. retried
rate(hatchet_created_tasks_total[5m])
rate(hatchet_retried_tasks_total[5m])
6. Task Assignment Rate
rate(hatchet_assigned_tasks_total[5m])
7. Scheduling Timeout Rate
rate(hatchet_scheduling_timed_out_total[5m])
8. Rate Limiting Impact
rate(hatchet_rate_limited_total[5m])
9. Task Completion Ratio (Success vs Total)
rate(hatchet_succeeded_tasks_total[5m])
/
(rate(hatchet_succeeded_tasks_total[5m]) + rate(hatchet_failed_tasks_total[5m]))
10. Task Cancellation Rate
rate(hatchet_cancelled_tasks_total[5m])
11. Task Skip Rate
rate(hatchet_skipped_tasks_total[5m])
12. Queue Processing Efficiency (Assigned vs Created)
rate(hatchet_assigned_tasks_total[5m]) / rate(hatchet_created_tasks_total[5m])

Tenant Metrics

Metric NameTypeDescription
hatchet_tenant_workflow_duration_millisecondsHistogramDuration of workflow execution in milliseconds (DAGs and single tasks)
hatchet_tenant_queue_invocations_totalCounterThe total number of invocations of the queuer function
hatchet_tenant_created_tasks_totalCounterThe total number of tasks created
hatchet_tenant_retried_tasks_totalCounterThe total number of tasks retried
hatchet_tenant_succeeded_tasks_totalCounterThe total number of tasks that succeeded
hatchet_tenant_failed_tasks_totalCounterThe total number of tasks that failed (in a final state, not including retries)
hatchet_tenant_skipped_tasks_totalCounterThe total number of tasks that were skipped
hatchet_tenant_cancelled_tasks_totalCounterThe total number of tasks cancelled
hatchet_tenant_assigned_tasksCounterThe total number of tasks assigned to a worker
hatchet_tenant_scheduling_timed_outCounterThe total number of tasks that timed out while waiting to be scheduled
hatchet_tenant_rate_limitedCounterThe total number of tasks that were rate limited
hatchet_tenant_queued_to_assignedCounterThe total number of unique tasks that were queued and later got assigned to a worker
hatchet_tenant_queued_to_assigned_time_secondsHistogramBuckets of time in seconds spent in the queue before being assigned to a worker

Example PromQL Queries

1. Workflow Duration by Tenant and Status
rate(hatchet_tenant_workflow_duration_milliseconds_sum[5m])
by (tenant_id, workflow_name, status)
/
rate(hatchet_tenant_workflow_duration_milliseconds_count[5m])
by (tenant_id, workflow_name, status)
2. Tenant Queue Performance (95th percentile)
histogram_quantile(0.95,
  rate(hatchet_tenant_queued_to_assigned_time_seconds_bucket[5m])
) by (tenant_id)
3. Tenant Error Rate by Workflow
rate(hatchet_tenant_failed_tasks_total[5m]) by (tenant_id)
/
rate(hatchet_tenant_created_tasks_total[5m]) by (tenant_id)
4. Tenant Task Throughput
rate(hatchet_tenant_succeeded_tasks_total[5m]) by (tenant_id)
5. Tenant Retry Rate
rate(hatchet_tenant_retried_tasks_total[5m]) by (tenant_id)
/
rate(hatchet_tenant_created_tasks_total[5m]) by (tenant_id)
6. Workflow Duration Distribution by Tenant
sum by (tenant_id, le) (
  rate(hatchet_tenant_workflow_duration_milliseconds_bucket[5m])
)
7. Tenant Rate Limiting Impact
rate(hatchet_tenant_rate_limited[5m]) by (tenant_id)
8. Per-Tenant Queue Utilization
rate(hatchet_tenant_queue_invocations_total[5m]) by (tenant_id)
9. Tenant Scheduling Timeouts
rate(hatchet_tenant_scheduling_timed_out[5m]) by (tenant_id)
10. Tenant Task Assignment Success Rate
rate(hatchet_tenant_assigned_tasks[5m]) by (tenant_id)
/
rate(hatchet_tenant_created_tasks_total[5m]) by (tenant_id)

Cross-Tenant Analysis

Example PromQL Queries

1. Top 5 Tenants by Task Volume
topk(5,
  sum by (tenant_id) (
    rate(hatchet_tenant_created_tasks_total[1h])
  )
)
2. Slowest Workflows Across All Tenants
topk(10,
  rate(hatchet_tenant_workflow_duration_milliseconds_sum[5m])
  /
  rate(hatchet_tenant_workflow_duration_milliseconds_count[5m])
) by (tenant_id, workflow_name)
3. Tenant Resource Consumption Comparison
sum by (tenant_id) (
  rate(hatchet_tenant_workflow_duration_milliseconds_sum[1h])
)
/ 1000 / 60  # Convert to minutes

Integration with Prometheus

This endpoint can be used to configure Prometheus to scrape tenant-specific metrics:

scrape_configs:
  - job_name: "hatchet-tenant-metrics"
    static_configs:
      - targets: ["cloud.onhatchet.run"]
    metrics_path: "/api/v1/tenants/707d0855-80ab-4e1f-a156-f1c4546cbf52/prometheus-metrics"
    scheme: "https"
    authorization:
      credentials: "your-api-token-here"

Note: Replace cloud.onhatchet.run with the URL where your Hatchet instance is hosted.

This provides tenant-isolated metrics that can be scraped directly by Prometheus or consumed by other monitoring tools that support the Prometheus text format.