Recovery Config

Overview

Tasks can become stale when:

Worker process crashes mid-execution
Network partition prevents heartbeats
Worker machine goes down

Horsies automatically detects and recovers these tasks.

Basic Usage

from horsies.core.models.recovery import RecoveryConfig

config = AppConfig(
    broker=PostgresConfig(...),
    recovery=RecoveryConfig(
        auto_requeue_stale_claimed=True,
        claimed_stale_threshold_ms=120_000,
        auto_fail_stale_running=True,
        running_stale_threshold_ms=300_000,
        finalizing_stale_threshold_ms=300_000,
    ),
)

Fields

Field	Type	Default	Description
`auto_requeue_stale_claimed`	`bool`	`True`	Requeue tasks stuck in CLAIMED
`claimed_stale_threshold_ms`	`int`	120,000	Ms before CLAIMED task is stale
`auto_fail_stale_running`	`bool`	`True`	Fail tasks stuck in RUNNING
`running_stale_threshold_ms`	`int`	300,000	Ms before RUNNING task is stale
`finalizing_stale_threshold_ms`	`int`	300,000	Ms a completed child may remain in finalization handoff before recovery
`crashed_worker_recovery_grace_ms`	`int`	10,000	Grace before recovering a workflow task whose underlying task is terminal but whose workflow progression was not applied; `0` disables
`check_interval_ms`	`int`	30,000	How often to check for stale tasks
`runner_heartbeat_interval_ms`	`int`	30,000	RUNNING task heartbeat frequency
`claimer_heartbeat_interval_ms`	`int`	30,000	CLAIMED task heartbeat frequency
`heartbeat_retention_hours`	`int \| None`	24	Hours to keep heartbeat rows; `None` disables pruning
`worker_state_retention_hours`	`int \| None`	168 (7 days)	Hours to keep worker_state snapshots; `None` disables pruning
`terminal_record_retention_hours`	`int \| None`	720 (30 days)	Hours to keep terminal task/workflow rows; `None` disables pruning

All time values for thresholds and intervals are in milliseconds. Retention values are in hours.

Recovery Behaviors

Stale CLAIMED Tasks

When a task is CLAIMED but the claimer heartbeat stops:

Safe to requeue: User code never started executing
Task is reset to PENDING for another worker to claim
Original worker may have crashed before dispatching

Stale RUNNING Tasks

When a regular task is RUNNING but the runner heartbeat stops:

Not safe to blindly requeue: User code was executing, could have partial side effects
If the task has a retry policy with WORKER_CRASHED in auto_retry_for and retries remaining: scheduled for retry (returns to PENDING with next_retry_at)
Otherwise: marked as FAILED with WORKER_CRASHED error

running_stale_threshold_ms is based on missing runner heartbeats, not total task duration. Long tasks remain healthy as long as they continue heartbeating. After user code returns, the child marks the task as finalizing before the parent writes the terminal result. The reaper will not recover that task until finalizing_stale_threshold_ms has also elapsed.

For workflow tasks, the recovery loop also detects when workflow_tasks is stuck non-terminal while the underlying task is already terminal, and triggers the normal completion path. Task finalization is two transactions — the task is marked terminal first, then the workflow DAG is advanced — so a task can be terminal for a brief moment while its workflow progression is still in flight. The reaper waits crashed_worker_recovery_grace_ms (default 10s) before recovering such a task, so it does not race a healthy finalizer; only a genuine crash in that gap is recovered (after the grace plus one reaper sweep). This grace is independent of the heartbeat-coupled thresholds. See Heartbeats & Recovery for details.

Heartbeat System

Two heartbeat types:

Claimer heartbeat: Sent by worker for CLAIMED tasks (not yet running)
Runner heartbeat: Sent by child process for RUNNING tasks

The reaper (running in each worker) checks for missing heartbeats.

Threshold Guidelines

Threshold	Constraint
Stale threshold	Must be >= 2x heartbeat interval
Finalizing stale	Must be >= 2x runner heartbeat interval
Claimed stale	1 second to 1 hour
Running stale	1 second to 2 hours
Check interval	1 second to 10 minutes
Heartbeat intervals	1 second to 2 minutes

For CPU-Heavy Tasks

Long-running CPU tasks may block the heartbeat thread:

RecoveryConfig(
    runner_heartbeat_interval_ms=60_000,    # Heartbeat every minute
    running_stale_threshold_ms=600_000,     # 10 minutes before considered stale
)

For Quick Tasks

Fast tasks can use tighter thresholds:

RecoveryConfig(
    runner_heartbeat_interval_ms=10_000,    # Heartbeat every 10s
    running_stale_threshold_ms=30_000,      # 30s before considered stale
)

Validation

The config validates that thresholds are safe:

# This will raise ValueError:
RecoveryConfig(
    runner_heartbeat_interval_ms=30_000,
    running_stale_threshold_ms=30_000,  # Must be >= 60_000 (2x heartbeat)
)

Retention Cleanup

The reaper loop automatically prunes old rows every hour. Three categories are cleaned independently:

Category	Config field	Default	What gets deleted
Heartbeats	`heartbeat_retention_hours`	24h	`horsies_heartbeats` rows older than threshold
Worker states	`worker_state_retention_hours`	7 days	`horsies_worker_states` snapshots older than threshold
Terminal records	`terminal_record_retention_hours`	30 days	`horsies_tasks`, `horsies_workflows`, and `horsies_workflow_tasks` rows in COMPLETED/FAILED/CANCELLED status older than threshold

Set any field to None to disable pruning for that category.

RecoveryConfig(
    heartbeat_retention_hours=48,              # Keep heartbeats for 2 days
    worker_state_retention_hours=24 * 14,      # Keep worker snapshots for 2 weeks
    terminal_record_retention_hours=24 * 90,   # Keep terminal records for 90 days
)

To disable all automatic cleanup:

RecoveryConfig(
    heartbeat_retention_hours=None,
    worker_state_retention_hours=None,
    terminal_record_retention_hours=None,
)

Disabling Recovery

To disable automatic recovery (not recommended):

RecoveryConfig(
    auto_requeue_stale_claimed=False,
    auto_fail_stale_running=False,
)

Tasks will remain stuck until manually resolved.

Manual Recovery

Query stale tasks via broker:

broker = app.get_broker()

# Find stale tasks
stale = await broker.get_stale_tasks(stale_threshold_minutes=5)

# Manually fail stale RUNNING tasks
failed_count = await broker.mark_stale_tasks_as_failed(stale_threshold_ms=300_000)

# Manually requeue stale CLAIMED tasks
requeued_count = await broker.requeue_stale_claimed(stale_threshold_ms=120_000)