
Heartbeats & Recovery

When a worker dies mid-task, the task would otherwise be stuck forever, with nothing to detect the failure. Heartbeats solve this:

  1. Workers send periodic heartbeats for their tasks
  2. A reaper checks for missing heartbeats
  3. Stale tasks are automatically recovered

The recovery path depends on the task state:

State when worker disappears        Recovery action             Why
CLAIMED                             requeue to PENDING          user code never started
RUNNING                             retry or fail               user code may have partially run
workflow bookkeeping out of sync    reconcile workflow state    parent workflow still needs completion handling

Claimer heartbeats are sent by the worker for CLAIMED tasks:

  • Indicates worker is alive and will soon start the task
  • Sent at claimer_heartbeat_interval_ms interval
  • Covers the gap between claim and execution start

Runner heartbeats are sent by a spawned tokio task for RUNNING tasks:

  • Indicates task is actively executing
  • Sent at runner_heartbeat_interval_ms interval
  • From a background tokio task spawned alongside the task execution
CLAIMED                              RUNNING
   |                                    |
   |  Claimer heartbeat                 |  Runner heartbeat
   |  (from worker loop)                |  (from spawned task)
   |                                    |
   +---- HB ----+                       +---- HB ----+
   |            |                       |            |
   +---- HB ----+   30s interval        +---- HB ----+   30s interval
   |            |                       |            |
   +------------+-----------------------+------------+--->

The reaper periodically checks for stale tasks:

for each task with status = CLAIMED:
    last_hb = latest heartbeat(task, role = "claimer")
    if now - last_hb > claimed_stale_threshold:
        requeue(task)  // Safe - code never ran

for each task with status = RUNNING:
    last_hb = latest heartbeat(task, role = "runner")
    if now - last_hb > running_stale_threshold:
        retry_or_fail(task)  // Retry if policy allows, otherwise fail

When a CLAIMED task has no recent claimer heartbeat:

  • Safe to requeue: User code never started
  • Task reset to PENDING
  • Another worker will claim it
  • No data corruption risk

This is the cleanest recovery path because execution never began.

When a RUNNING task has no recent runner heartbeat:

  • Not safe to blindly requeue: Code was executing, may have partial side effects
  • If the task has a retry policy with WORKER_CRASHED in auto_retry_for and retries remaining: scheduled for retry (returns to PENDING with next_retry_at)
  • Otherwise: marked as FAILED with WORKER_CRASHED error

This is why idempotent task design still matters: crash recovery can retry work that may have partially completed before the worker died.

When a worker crashes during a workflow task, the reaper marks tasks.status = FAILED, but the worker dies before calling the workflow task completion handler. This leaves workflow_tasks.status stuck in RUNNING while the underlying task is already terminal.

The recovery loop detects this mismatch automatically:

  1. Finds workflow_tasks rows in non-terminal status where the linked tasks row is terminal (COMPLETED, FAILED, or CANCELLED)
  2. Deserializes the TaskResult from the task’s stored result
  3. If no result is stored, synthesizes an error result:
    • WORKER_CRASHED for failed tasks
    • TASK_CANCELLED for cancelled tasks
    • RESULT_NOT_AVAILABLE for completed tasks with missing results
  4. Triggers the normal completion path: updates workflow_tasks status, applies on_error policy, propagates to dependents, and checks workflow completion

This runs before workflow finalization, so dependents are resolved in the same recovery pass.

The worker reaper handles stale task recovery automatically during normal worker operation.

The public Rust API does not expose dedicated helpers for:

  • requeueing stale CLAIMED tasks
  • failing stale RUNNING tasks

For those paths, the supported approach is to let the worker reaper handle them or to use targeted operational SQL if you are doing manual intervention.
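As an illustration of what targeted operational SQL for the CLAIMED case might look like. The table and column names below (horsies_tasks, claimed_by) are assumptions for the sketch; verify them against your actual schema, and match the interval to your configured claimed_stale_threshold_ms, before running anything like this.

```sql
-- Hypothetical: requeue CLAIMED tasks with no claimer heartbeat in the last 2 minutes.
UPDATE horsies_tasks t
SET status = 'PENDING', claimed_by = NULL
WHERE t.status = 'CLAIMED'
  AND NOT EXISTS (
    SELECT 1 FROM horsies_heartbeats hb
    WHERE hb.task_id = t.id
      AND hb.role = 'claimer'
      AND hb.sent_at > NOW() - INTERVAL '2 minutes'
  );
```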

For workflow-level reconciliation, there is a separate public helper:

horsies::recover_stuck_workflows(&pool, &registry).await?;
Recovery behavior is configured through RecoveryConfig:

use horsies::{AppConfig, RecoveryConfig};

let config = AppConfig {
    recovery: RecoveryConfig {
        // Claimer detection
        auto_requeue_stale_claimed: true,
        claimed_stale_threshold_ms: 120_000,   // 2 minutes
        claimer_heartbeat_interval_ms: 30_000, // 30 seconds

        // Runner detection
        auto_fail_stale_running: true,
        running_stale_threshold_ms: 300_000,   // 5 minutes
        runner_heartbeat_interval_ms: 30_000,  // 30 seconds

        // Check frequency
        check_interval_ms: 30_000,             // 30 seconds

        ..Default::default()
    },
    ..AppConfig::for_database_url("postgresql://...")
};

Stale thresholds must be at least 2x the heartbeat interval:

// Valid
RecoveryConfig {
    runner_heartbeat_interval_ms: 30_000, // 30s
    running_stale_threshold_ms: 60_000,   // 60s (2x)
    ..Default::default()
}

// Invalid - will produce a validation error
RecoveryConfig {
    runner_heartbeat_interval_ms: 30_000, // 30s
    running_stale_threshold_ms: 30_000,   // 30s (too tight!)
    ..Default::default()
}

Long-running blocking tasks (via #[blocking_task]) may delay the heartbeat:

RecoveryConfig {
    runner_heartbeat_interval_ms: 60_000, // Heartbeat every minute
    running_stale_threshold_ms: 300_000,  // 5 minutes before stale
    ..Default::default()
}

Fast tasks can use tighter detection:

RecoveryConfig {
    runner_heartbeat_interval_ms: 10_000, // 10 seconds
    running_stale_threshold_ms: 30_000,   // 30 seconds
    ..Default::default()
}

Heartbeats are stored in the horsies_heartbeats table:

Column      Type        Description
id          int         Auto-increment ID
task_id     str         Task being tracked
sender_id   str         Worker identifier
role        str         'claimer' or 'runner'
sent_at     datetime    Heartbeat timestamp
hostname    str         Machine hostname
pid         int         Process ID

Automatic recovery can also be disabled entirely. Not recommended, but possible:

RecoveryConfig {
    auto_requeue_stale_claimed: false,
    auto_fail_stale_running: false,
    ..Default::default()
}

Tasks will remain stuck until manually resolved.

Heartbeat cleanup is automatic by default. The worker reaper deletes expired heartbeats on every tick based on RecoveryConfig.heartbeat_retention_hours (default: Some(24), set to None to disable).
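For example, to keep two days of heartbeat history instead of the default 24 hours:

```rust
RecoveryConfig {
    heartbeat_retention_hours: Some(48), // keep 48h of heartbeat rows
    // heartbeat_retention_hours: None,  // or disable automatic cleanup
    ..Default::default()
}
```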

For manual cleanup:

-- Delete heartbeats older than 24 hours
DELETE FROM horsies_heartbeats WHERE sent_at < NOW() - INTERVAL '24 hours';

False Positives (Tasks Marked Stale But Running)


If healthy tasks are being marked stale, increase the thresholds:

RecoveryConfig {
    running_stale_threshold_ms: 600_000, // 10 minutes
    ..Default::default()
}

Common causes:

  • Blocking tasks monopolizing the tokio runtime
  • Network latency to database
  • Database contention

If stale tasks are not being recovered at all, check:

  • auto_requeue_stale_claimed / auto_fail_stale_running enabled?
  • Reaper loop running? (Check worker logs)
  • Database connectivity?