Recovery Config
Overview
Tasks can become stale when:
- Worker process crashes mid-execution
- Network partition prevents heartbeats
- Worker machine goes down
Horsies automatically detects and recovers these tasks.
Basic Usage
```python
from horsies.core.models.recovery import RecoveryConfig

config = AppConfig(
    broker=PostgresConfig(...),
    recovery=RecoveryConfig(
        auto_requeue_stale_claimed=True,
        claimed_stale_threshold_ms=120_000,
        auto_fail_stale_running=True,
        running_stale_threshold_ms=300_000,
        finalizing_stale_threshold_ms=300_000,
    ),
)
```

Fields

| Field | Type | Default | Description |
|---|---|---|---|
| auto_requeue_stale_claimed | bool | True | Requeue tasks stuck in CLAIMED |
| claimed_stale_threshold_ms | int | 120,000 | Ms before CLAIMED task is stale |
| auto_fail_stale_running | bool | True | Fail tasks stuck in RUNNING |
| running_stale_threshold_ms | int | 300,000 | Ms before RUNNING task is stale |
| finalizing_stale_threshold_ms | int | 300,000 | Ms a completed child may remain in finalization handoff before recovery |
| crashed_worker_recovery_grace_ms | int | 10,000 | Grace before recovering a workflow task whose underlying task is terminal but whose workflow progression was not applied; 0 disables |
| check_interval_ms | int | 30,000 | How often to check for stale tasks |
| runner_heartbeat_interval_ms | int | 30,000 | RUNNING task heartbeat frequency |
| claimer_heartbeat_interval_ms | int | 30,000 | CLAIMED task heartbeat frequency |
| heartbeat_retention_hours | int \| None | 24 | Hours to keep heartbeat rows; None disables pruning |
| worker_state_retention_hours | int \| None | 168 (7 days) | Hours to keep worker_state snapshots; None disables pruning |
| terminal_record_retention_hours | int \| None | 720 (30 days) | Hours to keep terminal task/workflow rows; None disables pruning |

All time values for thresholds and intervals are in milliseconds. Retention values are in hours.
Recovery Behaviors
Stale CLAIMED Tasks
When a task is CLAIMED but the claimer heartbeat stops:
- Safe to requeue: User code never started executing
- Task is reset to PENDING for another worker to claim
- Original worker may have crashed before dispatching
Stale RUNNING Tasks
When a regular task is RUNNING but the runner heartbeat stops:
- Not safe to blindly requeue: User code was executing, could have partial side effects
- If the task has a retry policy with WORKER_CRASHED in auto_retry_for and retries remaining: scheduled for retry (returns to PENDING with next_retry_at)
- Otherwise: marked as FAILED with WORKER_CRASHED error
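
The rule in the list above can be modelled as a small pure function. This is only an illustrative sketch, not Horsies' implementation: the function name, the `retries_remaining` parameter, and the string return values are assumptions made for the example.

```python
from typing import Iterable

WORKER_CRASHED = "WORKER_CRASHED"


def resolve_stale_running(auto_retry_for: Iterable[str], retries_remaining: int) -> str:
    """Return the status a stale RUNNING task would move to.

    Hypothetical helper that mirrors the documented rule; not library code.
    """
    if WORKER_CRASHED in set(auto_retry_for) and retries_remaining > 0:
        # The retry policy opts in to crash retries and still has budget left.
        return "PENDING"
    # No opt-in, or no retries left: the task fails with WORKER_CRASHED.
    return "FAILED"
```

A task whose policy lists WORKER_CRASHED but has used all of its retries takes the second branch, the same as a task with no retry policy at all.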
running_stale_threshold_ms is based on missing runner heartbeats, not total
task duration. Long tasks remain healthy as long as they continue heartbeating.
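
To make the difference between heartbeat age and task duration concrete, here is a minimal sketch of a staleness check. The function name and the strict "older than" comparison are assumptions; the library's exact comparison may differ.

```python
from datetime import datetime, timedelta, timezone


def is_heartbeat_stale(last_heartbeat_at: datetime, now: datetime, threshold_ms: int) -> bool:
    """True when the newest heartbeat is older than the threshold (hypothetical helper)."""
    return now - last_heartbeat_at > timedelta(milliseconds=threshold_ms)


now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)

# A task that started an hour ago but heartbeated 20 seconds ago is healthy:
# only the heartbeat timestamp is passed in, the start time is irrelevant.
long_but_alive = is_heartbeat_stale(now - timedelta(seconds=20), now, 300_000)

# A task whose last heartbeat is 6 minutes old exceeds the 300_000 ms threshold.
silent = is_heartbeat_stale(now - timedelta(minutes=6), now, 300_000)
```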
After user code returns, the child marks the task as finalizing before the
parent writes the terminal result. The reaper will not recover that task until
finalizing_stale_threshold_ms has also elapsed.
For workflow tasks, the recovery loop also detects when a workflow_tasks row is stuck non-terminal while the underlying task is already terminal, and triggers the normal completion path.

Task finalization is two transactions: the task is marked terminal first, then the workflow DAG is advanced. A task can therefore be terminal for a brief moment while its workflow progression is still in flight. The reaper waits crashed_worker_recovery_grace_ms (default 10s) before recovering such a task, so it does not race a healthy finalizer; only a genuine crash in that gap is recovered (after the grace plus one reaper sweep). This grace is independent of the heartbeat-coupled thresholds. See Heartbeats & Recovery for details.
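
As a rough worked example of "the grace plus one reaper sweep", the sketch below adds the two documented defaults. The function name is hypothetical, and the result is an approximate upper bound for a nonzero grace that ignores query and scheduling time.

```python
def workflow_recovery_delay_bound_ms(grace_ms: int, check_interval_ms: int) -> int:
    """Approximate worst-case wait: the grace period plus one full reaper sweep."""
    return grace_ms + check_interval_ms


# Defaults from the Fields table: 10_000 ms grace, 30_000 ms check interval.
default_bound_ms = workflow_recovery_delay_bound_ms(10_000, 30_000)
```

With the defaults this comes to about 40 seconds between a crash in the gap and the workflow being advanced.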
Heartbeat System
Two heartbeat types:
- Claimer heartbeat: Sent by worker for CLAIMED tasks (not yet running)
- Runner heartbeat: Sent by child process for RUNNING tasks
The reaper (running in each worker) checks for missing heartbeats.
Threshold Guidelines
| Threshold | Constraint |
|---|---|
| Stale threshold | Must be >= 2x heartbeat interval |
| Finalizing stale | Must be >= 2x runner heartbeat interval |
| Claimed stale | 1 second to 1 hour |
| Running stale | 1 second to 2 hours |
| Check interval | 1 second to 10 minutes |
| Heartbeat intervals | 1 second to 2 minutes |
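
The constraints in the table can be checked ahead of time with a small stand-alone script. This is a sketch under stated assumptions: `RecoverySettings` is a hypothetical stand-in for RecoveryConfig that carries the documented defaults, the ranges are treated as inclusive, and the real validation may differ in detail.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RecoverySettings:
    """Hypothetical stand-in for RecoveryConfig using the documented defaults."""
    claimed_stale_threshold_ms: int = 120_000
    running_stale_threshold_ms: int = 300_000
    finalizing_stale_threshold_ms: int = 300_000
    check_interval_ms: int = 30_000
    runner_heartbeat_interval_ms: int = 30_000
    claimer_heartbeat_interval_ms: int = 30_000


# (min_ms, max_ms) ranges taken from the table above, assumed inclusive.
RANGES = {
    "claimed_stale_threshold_ms": (1_000, 3_600_000),   # 1 second to 1 hour
    "running_stale_threshold_ms": (1_000, 7_200_000),   # 1 second to 2 hours
    "check_interval_ms": (1_000, 600_000),              # 1 second to 10 minutes
    "runner_heartbeat_interval_ms": (1_000, 120_000),   # 1 second to 2 minutes
    "claimer_heartbeat_interval_ms": (1_000, 120_000),  # 1 second to 2 minutes
}


def find_violations(s: RecoverySettings) -> List[str]:
    """Return a human-readable message for every constraint the settings break."""
    problems: List[str] = []
    for name, (low, high) in RANGES.items():
        value = getattr(s, name)
        if not low <= value <= high:
            problems.append(f"{name}={value} is outside [{low}, {high}]")
    # Each stale threshold must be at least twice its heartbeat interval.
    if s.claimed_stale_threshold_ms < 2 * s.claimer_heartbeat_interval_ms:
        problems.append("claimed_stale_threshold_ms < 2x claimer_heartbeat_interval_ms")
    if s.running_stale_threshold_ms < 2 * s.runner_heartbeat_interval_ms:
        problems.append("running_stale_threshold_ms < 2x runner_heartbeat_interval_ms")
    if s.finalizing_stale_threshold_ms < 2 * s.runner_heartbeat_interval_ms:
        problems.append("finalizing_stale_threshold_ms < 2x runner_heartbeat_interval_ms")
    return problems
```

Calling `find_violations(RecoverySettings())` returns an empty list for the defaults, while lowering running_stale_threshold_ms to 30_000 reports the 2x rule.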
For CPU-Heavy Tasks
Long-running CPU tasks may block the heartbeat thread:

```python
RecoveryConfig(
    runner_heartbeat_interval_ms=60_000,   # Heartbeat every minute
    running_stale_threshold_ms=600_000,    # 10 minutes before considered stale
)
```

For Quick Tasks
Fast tasks can use tighter thresholds:

```python
RecoveryConfig(
    runner_heartbeat_interval_ms=10_000,  # Heartbeat every 10s
    running_stale_threshold_ms=30_000,    # 30s before considered stale
)
```

Validation
The config validates that thresholds are safe:

```python
# This will raise ValueError:
RecoveryConfig(
    runner_heartbeat_interval_ms=30_000,
    running_stale_threshold_ms=30_000,  # Must be >= 60_000 (2x heartbeat)
)
```

Retention Cleanup
The reaper loop automatically prunes old rows every hour. Three categories are cleaned independently:
| Category | Config field | Default | What gets deleted |
|---|---|---|---|
| Heartbeats | heartbeat_retention_hours | 24h | horsies_heartbeats rows older than threshold |
| Worker states | worker_state_retention_hours | 7 days | horsies_worker_states snapshots older than threshold |
| Terminal records | terminal_record_retention_hours | 30 days | horsies_tasks, horsies_workflows, and horsies_workflow_tasks rows in COMPLETED/FAILED/CANCELLED status older than threshold |
Set any field to None to disable pruning for that category.
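
One way to picture "older than threshold" is as a cutoff timestamp derived from the retention setting. The helper below is a hypothetical illustration of that arithmetic, not the query Horsies actually runs.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def retention_cutoff(now: datetime, retention_hours: Optional[int]) -> Optional[datetime]:
    """Rows older than the returned timestamp would be pruned; None means keep everything."""
    if retention_hours is None:
        return None
    return now - timedelta(hours=retention_hours)


now = datetime(2024, 3, 1, tzinfo=timezone.utc)

heartbeat_cutoff = retention_cutoff(now, 24)    # default heartbeat retention
terminal_cutoff = retention_cutoff(now, 720)    # default terminal record retention
disabled_cutoff = retention_cutoff(now, None)   # pruning disabled for this category
```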
```python
RecoveryConfig(
    heartbeat_retention_hours=48,             # Keep heartbeats for 2 days
    worker_state_retention_hours=24 * 14,     # Keep worker snapshots for 2 weeks
    terminal_record_retention_hours=24 * 90,  # Keep terminal records for 90 days
)
```

To disable all automatic cleanup:
```python
RecoveryConfig(
    heartbeat_retention_hours=None,
    worker_state_retention_hours=None,
    terminal_record_retention_hours=None,
)
```

Disabling Recovery
To disable automatic recovery (not recommended):

```python
RecoveryConfig(
    auto_requeue_stale_claimed=False,
    auto_fail_stale_running=False,
)
```

Tasks will remain stuck until manually resolved.
Manual Recovery
Query stale tasks via the broker:

```python
broker = app.get_broker()

# Find stale tasks
stale = await broker.get_stale_tasks(stale_threshold_minutes=5)

# Manually fail stale RUNNING tasks
failed_count = await broker.mark_stale_tasks_as_failed(stale_threshold_ms=300_000)

# Manually requeue stale CLAIMED tasks
requeued_count = await broker.requeue_stale_claimed(stale_threshold_ms=120_000)
```
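
The broker calls above are awaited, so they have to run inside a coroutine. A possible wrapper is sketched below; `recover_stale_tasks` and its parameter names are hypothetical, and it assumes only the two broker methods shown above, each returning a count.

```python
import asyncio
from typing import Any, Dict


async def recover_stale_tasks(
    broker: Any,
    claimed_threshold_ms: int = 120_000,
    running_threshold_ms: int = 300_000,
) -> Dict[str, int]:
    """Requeue stale CLAIMED tasks, then fail stale RUNNING tasks (hypothetical helper)."""
    requeued = await broker.requeue_stale_claimed(stale_threshold_ms=claimed_threshold_ms)
    failed = await broker.mark_stale_tasks_as_failed(stale_threshold_ms=running_threshold_ms)
    return {"requeued": requeued, "failed": failed}


# From synchronous code, with `app` being your configured application:
# counts = asyncio.run(recover_stale_tasks(app.get_broker()))
```

Taking the broker as a parameter lets the helper be exercised with a stub object before it is pointed at a real database.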