Skip to content

Recovery Config

Tasks can become stale when:

  • Worker process crashes mid-execution
  • Network partition prevents heartbeats
  • Worker machine goes down

Horsies automatically detects and recovers these tasks.

from horsies.core.models.recovery import RecoveryConfig
config = AppConfig(
broker=PostgresConfig(...),
recovery=RecoveryConfig(
auto_requeue_stale_claimed=True,
claimed_stale_threshold_ms=120_000,
auto_fail_stale_running=True,
running_stale_threshold_ms=300_000,
),
)
FieldTypeDefaultDescription
auto_requeue_stale_claimedboolTrueRequeue tasks stuck in CLAIMED
claimed_stale_threshold_msint120,000Ms before CLAIMED task is stale
auto_fail_stale_runningboolTrueFail tasks stuck in RUNNING
running_stale_threshold_msint300,000Ms before RUNNING task is stale
check_interval_msint30,000How often to check for stale tasks
runner_heartbeat_interval_msint30,000RUNNING task heartbeat frequency
claimer_heartbeat_interval_msint30,000CLAIMED task heartbeat frequency
heartbeat_retention_hoursint | None24Hours to keep heartbeat rows; None disables pruning
worker_state_retention_hoursint | None168 (7 days)Hours to keep worker_state snapshots; None disables pruning
terminal_record_retention_hoursint | None720 (30 days)Hours to keep terminal task/workflow rows; None disables pruning

All time values for thresholds and intervals are in milliseconds. Retention values are in hours.

When a task is CLAIMED but the claimer heartbeat stops:

  • Safe to requeue: User code never started executing
  • Task is reset to PENDING for another worker to claim
  • Original worker may have crashed before dispatching

When a regular task is RUNNING but the runner heartbeat stops:

  • Not safe to blindly requeue: User code was executing, could have partial side effects
  • If the task has a retry policy with WORKER_CRASHED in auto_retry_for and retries remaining: scheduled for retry (returns to PENDING with next_retry_at)
  • Otherwise: marked as FAILED with WORKER_CRASHED error

For workflow tasks, the recovery loop also detects when workflow_tasks is stuck non-terminal while the underlying task is already terminal, and triggers the normal completion path. See Heartbeats & Recovery for details.

Two heartbeat types:

  1. Claimer heartbeat: Sent by worker for CLAIMED tasks (not yet running)
  2. Runner heartbeat: Sent by child process for RUNNING tasks

The reaper (running in each worker) checks for missing heartbeats.

ThresholdConstraint
Stale thresholdMust be >= 2x heartbeat interval
Claimed stale1 second to 1 hour
Running stale1 second to 2 hours
Check interval1 second to 10 minutes
Heartbeat intervals1 second to 2 minutes

Long-running CPU tasks may block the heartbeat thread:

RecoveryConfig(
runner_heartbeat_interval_ms=60_000, # Heartbeat every minute
running_stale_threshold_ms=600_000, # 10 minutes before considered stale
)

Fast tasks can use tighter thresholds:

RecoveryConfig(
runner_heartbeat_interval_ms=10_000, # Heartbeat every 10s
running_stale_threshold_ms=30_000, # 30s before considered stale
)

The config validates that thresholds are safe:

# This will raise ValueError:
RecoveryConfig(
runner_heartbeat_interval_ms=30_000,
running_stale_threshold_ms=30_000, # Must be >= 60_000 (2x heartbeat)
)

The reaper loop automatically prunes old rows every hour. Three categories are cleaned independently:

CategoryConfig fieldDefaultWhat gets deleted
Heartbeatsheartbeat_retention_hours24hhorsies_heartbeats rows older than threshold
Worker statesworker_state_retention_hours7 dayshorsies_worker_states snapshots older than threshold
Terminal recordsterminal_record_retention_hours30 dayshorsies_tasks, horsies_workflows, and horsies_workflow_tasks rows in COMPLETED/FAILED/CANCELLED status older than threshold

Set any field to None to disable pruning for that category.

RecoveryConfig(
heartbeat_retention_hours=48, # Keep heartbeats for 2 days
worker_state_retention_hours=24 * 14, # Keep worker snapshots for 2 weeks
terminal_record_retention_hours=24 * 90, # Keep terminal records for 90 days
)

To disable all automatic cleanup:

RecoveryConfig(
heartbeat_retention_hours=None,
worker_state_retention_hours=None,
terminal_record_retention_hours=None,
)

To disable automatic recovery (not recommended):

RecoveryConfig(
auto_requeue_stale_claimed=False,
auto_fail_stale_running=False,
)

Tasks will remain stuck until manually resolved.

Query stale tasks via broker:

broker = app.get_broker()
# Find stale tasks
stale = await broker.get_stale_tasks(stale_threshold_minutes=5)
# Manually fail stale RUNNING tasks
failed_count = await broker.mark_stale_tasks_as_failed(stale_threshold_ms=300_000)
# Manually requeue stale CLAIMED tasks
requeued_count = await broker.requeue_stale_claimed(stale_threshold_ms=120_000)