Skip to content

Recovery Config

Tasks can become stale when:

  • Worker process crashes mid-execution
  • Network partition prevents heartbeats
  • Worker machine goes down

Horsies automatically detects and recovers these tasks.

from horsies.core.models.recovery import RecoveryConfig
config = AppConfig(
broker=PostgresConfig(...),
recovery=RecoveryConfig(
auto_requeue_stale_claimed=True,
claimed_stale_threshold_ms=120_000,
auto_fail_stale_running=True,
running_stale_threshold_ms=300_000,
finalizing_stale_threshold_ms=300_000,
),
)
FieldTypeDefaultDescription
auto_requeue_stale_claimedboolTrueRequeue tasks stuck in CLAIMED
claimed_stale_threshold_msint120,000Ms before CLAIMED task is stale
auto_fail_stale_runningboolTrueFail tasks stuck in RUNNING
running_stale_threshold_msint300,000Ms before RUNNING task is stale
finalizing_stale_threshold_msint300,000Ms a completed child may remain in finalization handoff before recovery
crashed_worker_recovery_grace_msint10,000Grace before recovering a workflow task whose underlying task is terminal but whose workflow progression was not applied; 0 disables
check_interval_msint30,000How often to check for stale tasks
runner_heartbeat_interval_msint30,000RUNNING task heartbeat frequency
claimer_heartbeat_interval_msint30,000CLAIMED task heartbeat frequency
heartbeat_retention_hoursint | None24Hours to keep heartbeat rows; None disables pruning
worker_state_retention_hoursint | None168 (7 days)Hours to keep worker_state snapshots; None disables pruning
terminal_record_retention_hoursint | None720 (30 days)Hours to keep terminal task/workflow rows; None disables pruning

All time values for thresholds and intervals are in milliseconds. Retention values are in hours.

When a task is CLAIMED but the claimer heartbeat stops:

  • Safe to requeue: User code never started executing
  • Task is reset to PENDING for another worker to claim
  • Original worker may have crashed before dispatching

When a regular task is RUNNING but the runner heartbeat stops:

  • Not safe to blindly requeue: User code was executing, could have partial side effects
  • If the task has a retry policy with WORKER_CRASHED in auto_retry_for and retries remaining: scheduled for retry (returns to PENDING with next_retry_at)
  • Otherwise: marked as FAILED with WORKER_CRASHED error

running_stale_threshold_ms is based on missing runner heartbeats, not total task duration. Long tasks remain healthy as long as they continue heartbeating. After user code returns, the child marks the task as finalizing before the parent writes the terminal result. The reaper will not recover that task until finalizing_stale_threshold_ms has also elapsed.

For workflow tasks, the recovery loop also detects when workflow_tasks is stuck non-terminal while the underlying task is already terminal, and triggers the normal completion path. Task finalization is two transactions — the task is marked terminal first, then the workflow DAG is advanced — so a task can be terminal for a brief moment while its workflow progression is still in flight. The reaper waits crashed_worker_recovery_grace_ms (default 10s) before recovering such a task, so it does not race a healthy finalizer; only a genuine crash in that gap is recovered (after the grace plus one reaper sweep). This grace is independent of the heartbeat-coupled thresholds. See Heartbeats & Recovery for details.

Two heartbeat types:

  1. Claimer heartbeat: Sent by worker for CLAIMED tasks (not yet running)
  2. Runner heartbeat: Sent by child process for RUNNING tasks

The reaper (running in each worker) checks for missing heartbeats.

ThresholdConstraint
Stale thresholdMust be >= 2x heartbeat interval
Finalizing staleMust be >= 2x runner heartbeat interval
Claimed stale1 second to 1 hour
Running stale1 second to 2 hours
Check interval1 second to 10 minutes
Heartbeat intervals1 second to 2 minutes

Long-running CPU tasks may block the heartbeat thread:

RecoveryConfig(
runner_heartbeat_interval_ms=60_000, # Heartbeat every minute
running_stale_threshold_ms=600_000, # 10 minutes before considered stale
)

Fast tasks can use tighter thresholds:

RecoveryConfig(
runner_heartbeat_interval_ms=10_000, # Heartbeat every 10s
running_stale_threshold_ms=30_000, # 30s before considered stale
)

The config validates that thresholds are safe:

# This will raise ValueError:
RecoveryConfig(
runner_heartbeat_interval_ms=30_000,
running_stale_threshold_ms=30_000, # Must be >= 60_000 (2x heartbeat)
)

The reaper loop automatically prunes old rows every hour. Three categories are cleaned independently:

CategoryConfig fieldDefaultWhat gets deleted
Heartbeatsheartbeat_retention_hours24hhorsies_heartbeats rows older than threshold
Worker statesworker_state_retention_hours7 dayshorsies_worker_states snapshots older than threshold
Terminal recordsterminal_record_retention_hours30 dayshorsies_tasks, horsies_workflows, and horsies_workflow_tasks rows in COMPLETED/FAILED/CANCELLED status older than threshold

Set any field to None to disable pruning for that category.

RecoveryConfig(
heartbeat_retention_hours=48, # Keep heartbeats for 2 days
worker_state_retention_hours=24 * 14, # Keep worker snapshots for 2 weeks
terminal_record_retention_hours=24 * 90, # Keep terminal records for 90 days
)

To disable all automatic cleanup:

RecoveryConfig(
heartbeat_retention_hours=None,
worker_state_retention_hours=None,
terminal_record_retention_hours=None,
)

To disable automatic recovery (not recommended):

RecoveryConfig(
auto_requeue_stale_claimed=False,
auto_fail_stale_running=False,
)

Tasks will remain stuck until manually resolved.

Query stale tasks via broker:

broker = app.get_broker()
# Find stale tasks
stale = await broker.get_stale_tasks(stale_threshold_minutes=5)
# Manually fail stale RUNNING tasks
failed_count = await broker.mark_stale_tasks_as_failed(stale_threshold_ms=300_000)
# Manually requeue stale CLAIMED tasks
requeued_count = await broker.requeue_stale_claimed(stale_threshold_ms=120_000)