Recovery Config

Tasks can become stale when:

  • Worker process crashes mid-execution
  • Network partition prevents heartbeats
  • Worker machine goes down

Horsies automatically detects and recovers these tasks.

use horsies::{AppConfig, RecoveryConfig};

let config = AppConfig {
    recovery: RecoveryConfig {
        auto_requeue_stale_claimed: true,
        claimed_stale_threshold_ms: 120_000,
        auto_fail_stale_running: true,
        running_stale_threshold_ms: 300_000,
        ..RecoveryConfig::default()
    },
    ..AppConfig::for_database_url("postgresql://...")
};
| Field | Type | Default | Description |
|---|---|---|---|
| auto_requeue_stale_claimed | bool | true | Requeue tasks stuck in CLAIMED |
| claimed_stale_threshold_ms | u64 | 120,000 | Ms before a CLAIMED task is stale |
| auto_fail_stale_running | bool | true | Fail tasks stuck in RUNNING |
| running_stale_threshold_ms | u64 | 300,000 | Ms before a RUNNING task is stale |
| check_interval_ms | u64 | 30,000 | How often to check for stale tasks |
| runner_heartbeat_interval_ms | u64 | 30,000 | RUNNING task heartbeat frequency |
| claimer_heartbeat_interval_ms | u64 | 30,000 | CLAIMED task heartbeat frequency |
| heartbeat_retention_hours | Option<u32> | Some(24) | Hours to keep heartbeat rows; None disables pruning |
| worker_state_retention_hours | Option<u32> | Some(168) (7 days) | Hours to keep worker_state snapshots; None disables pruning |
| terminal_record_retention_hours | Option<u32> | Some(720) (30 days) | Hours to keep terminal task/workflow rows; None disables pruning |

All time values for thresholds and intervals are in milliseconds. Retention values are in hours.

When a task is CLAIMED but the claimer heartbeat stops:

  • Safe to requeue: User code never started executing
  • Task is reset to PENDING for another worker to claim
  • Original worker may have crashed before dispatching
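Conceptually, the requeue step is a pure state transition. A minimal sketch, using illustrative types rather than the library's internals:

```rust
// Illustrative task states; the real library persists these in Postgres.
#[derive(Debug, PartialEq)]
enum TaskStatus {
    Pending,
    Claimed,
}

// Sketch of the requeue rule: a stale CLAIMED task goes back to PENDING so
// another worker can claim it. `stale` would come from comparing the claimer
// heartbeat age against claimed_stale_threshold_ms.
fn requeue_if_stale(status: TaskStatus, stale: bool) -> TaskStatus {
    match (status, stale) {
        (TaskStatus::Claimed, true) => TaskStatus::Pending,
        (s, _) => s,
    }
}
```

The transition is safe precisely because CLAIMED means no user code has run yet, so requeueing cannot duplicate side effects.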

When a regular task is RUNNING but the runner heartbeat stops:

  • Not safe to blindly requeue: User code was executing and may have left partial side effects
  • If the task has a retry policy with WORKER_CRASHED in auto_retry_for and retries remaining: scheduled for retry (returns to PENDING with next_retry_at)
  • Otherwise: marked as FAILED with WORKER_CRASHED error
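The retry-or-fail decision above reduces to two conditions. A minimal sketch with hypothetical names (the real decision also sets next_retry_at and records the error):

```rust
#[derive(Debug, PartialEq)]
enum StaleRunningOutcome {
    ScheduleRetry,     // returns to PENDING with next_retry_at set
    FailWorkerCrashed, // marked FAILED with a WORKER_CRASHED error
}

// Retry only when the task's retry policy opts into WORKER_CRASHED via
// auto_retry_for and retries remain; otherwise the task is failed.
fn resolve_stale_running(crash_retry_enabled: bool, retries_remaining: u32) -> StaleRunningOutcome {
    if crash_retry_enabled && retries_remaining > 0 {
        StaleRunningOutcome::ScheduleRetry
    } else {
        StaleRunningOutcome::FailWorkerCrashed
    }
}
```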

For workflow tasks, the recovery loop also detects when workflow_tasks is stuck non-terminal while the underlying task is already terminal, and triggers the normal completion path. See Heartbeats & Recovery for details.

Two heartbeat types:

  1. Claimer heartbeat: Sent by the worker for CLAIMED tasks (not yet running)
  2. Runner heartbeat: Sent by the spawned task for RUNNING tasks

The reaper (running as a tokio task in each worker) checks for missing heartbeats.
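The staleness test itself is a simple duration comparison. A sketch of the check the reaper performs (assumed logic, not the library's code):

```rust
use std::time::Duration;

// A task is considered stale when the time since its last heartbeat
// exceeds the configured threshold for its state (CLAIMED or RUNNING).
fn is_stale(since_last_heartbeat: Duration, threshold_ms: u64) -> bool {
    since_last_heartbeat > Duration::from_millis(threshold_ms)
}
```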

| Threshold | Constraint |
|---|---|
| Stale threshold | Must be >= 2x heartbeat interval |
| Claimed stale | 1 second to 1 hour |
| Running stale | 1 second to 2 hours |
| Check interval | 1 second to 10 minutes |
| Heartbeat intervals | 1 second to 2 minutes |
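The ">= 2x heartbeat interval" rule guarantees at least one missed heartbeat cycle before a task is declared stale. A sketch of that check (the function name is illustrative, not the library's API):

```rust
// Reject a stale threshold that is less than twice the heartbeat interval,
// mirroring the constraint in the table above.
fn check_stale_threshold(heartbeat_interval_ms: u64, stale_threshold_ms: u64) -> Result<(), String> {
    if stale_threshold_ms < 2 * heartbeat_interval_ms {
        return Err(format!(
            "stale threshold {stale_threshold_ms}ms must be >= {}ms (2x heartbeat interval)",
            2 * heartbeat_interval_ms
        ));
    }
    Ok(())
}
```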

Long-running blocking tasks may delay the heartbeat:

RecoveryConfig {
    runner_heartbeat_interval_ms: 60_000, // Heartbeat every minute
    running_stale_threshold_ms: 600_000,  // 10 minutes before considered stale
    ..Default::default()
}

Fast tasks can use tighter thresholds:

RecoveryConfig {
    runner_heartbeat_interval_ms: 10_000, // Heartbeat every 10s
    running_stale_threshold_ms: 30_000,   // 30s before considered stale
    ..Default::default()
}

The config validates that thresholds are safe:

// This will produce a validation error:
RecoveryConfig {
    runner_heartbeat_interval_ms: 30_000,
    running_stale_threshold_ms: 30_000, // Must be >= 60_000 (2x heartbeat)
    ..Default::default()
}

The reaper loop automatically prunes old rows every hour. Three categories are cleaned independently:

| Category | Config field | Default | What gets deleted |
|---|---|---|---|
| Heartbeats | heartbeat_retention_hours | 24h | horsies_heartbeats rows older than threshold |
| Worker states | worker_state_retention_hours | 7 days | horsies_worker_states snapshots older than threshold |
| Terminal records | terminal_record_retention_hours | 30 days | horsies_tasks, horsies_workflows, and horsies_workflow_tasks rows in COMPLETED/FAILED/CANCELLED status older than threshold |

Set any field to None to disable pruning for that category.
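Each pruning pass only needs a deletion cutoff derived from the retention setting. A sketch of that derivation (assumed logic; the real pass runs SQL deletes against the cutoff):

```rust
use std::time::{Duration, SystemTime};

// None means "never prune"; Some(h) deletes rows older than h hours.
fn prune_cutoff(now: SystemTime, retention_hours: Option<u32>) -> Option<SystemTime> {
    retention_hours.map(|h| now - Duration::from_secs(u64::from(h) * 3600))
}
```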

RecoveryConfig {
    heartbeat_retention_hours: Some(48),            // Keep heartbeats for 2 days
    worker_state_retention_hours: Some(24 * 14),    // Keep worker snapshots for 2 weeks
    terminal_record_retention_hours: Some(24 * 90), // Keep terminal records for 90 days
    ..Default::default()
}

To disable all automatic cleanup:

RecoveryConfig {
    heartbeat_retention_hours: None,
    worker_state_retention_hours: None,
    terminal_record_retention_hours: None,
    ..Default::default()
}

To disable automatic recovery (not recommended):

RecoveryConfig {
    auto_requeue_stale_claimed: false,
    auto_fail_stale_running: false,
    ..Default::default()
}

Tasks will remain stuck until manually resolved.

For stale CLAIMED and stale RUNNING tasks, the Rust API does not expose dedicated public recovery helpers on Horsies or PostgresBroker.

The supported paths are:

  • let the worker reaper handle stale-task recovery automatically
  • use targeted operational SQL for manual intervention
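As an illustration of the second path, a manual requeue might look like the statement below, issued from psql or any SQL client. The table name comes from the retention table above; the column names (id, status) are assumptions, so check the actual schema before running anything:

```rust
// Hypothetical operational SQL for manually requeueing one stale CLAIMED
// task; column names are assumed, only the table name is documented.
const REQUEUE_STALE_CLAIMED: &str = "
    UPDATE horsies_tasks
    SET status = 'PENDING'
    WHERE id = $1 AND status = 'CLAIMED'
";
```

Guarding on the current status in the WHERE clause mirrors what automatic recovery does: only a task still stuck in CLAIMED is reset to PENDING.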

For workflow-level reconciliation, there is a separate public helper:

horsies::recover_stuck_workflows(&pool, &registry).await?;