# Heartbeats & Recovery

## Overview

When a worker dies mid-task, the task would be stuck forever without detection. Heartbeats solve this:
- Workers send periodic heartbeats for their tasks
- A reaper checks for missing heartbeats
- Stale tasks are automatically recovered
## Heartbeat Types

### Claimer Heartbeat

Sent by the main worker process for CLAIMED tasks:

- Indicates the worker is alive and will soon start the task
- Sent at the `claimer_heartbeat_interval_ms` interval
- Covers the gap between claim and execution start
### Runner Heartbeat

Sent by the child process for RUNNING tasks:

- Indicates the task is actively executing
- Sent at the `runner_heartbeat_interval_ms` interval
- Sent from a separate thread within the task process
## Heartbeat Flow

```
CLAIMED                            RUNNING
│                                  │
│ Claimer heartbeat                │ Runner heartbeat
│ (from main process)              │ (from task process)
│                                  │
├──── HB ────┐                     ├──── HB ────┐
│            │                     │            │
├──── HB ────┤ 30s interval        ├──── HB ────┤ 30s interval
│            │                     │            │
└────────────┴─────────────────────┴────────────┴───>
```

## Stale Detection

The reaper periodically checks for stale tasks:
```python
# Simplified logic
for task in tasks.filter(status='CLAIMED'):
    last_hb = get_latest_heartbeat(task, role='claimer')
    if now - last_hb > claimed_stale_threshold:
        requeue(task)  # Safe - code never ran

for task in tasks.filter(status='RUNNING'):
    last_hb = get_latest_heartbeat(task, role='runner')
    if now - last_hb > running_stale_threshold:
        fail(task)  # Not safe to requeue
```

## Recovery Actions
### Stale CLAIMED → PENDING

When a CLAIMED task has no recent claimer heartbeat:
- Safe to requeue: User code never started
- Task reset to PENDING
- Another worker will claim it
- No data corruption risk
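The requeue decision can be sketched in-memory as follows. The dict-based task and field names are hypothetical; the real broker performs this as a guarded SQL UPDATE:

```python
from datetime import datetime, timedelta, timezone

# Matches claimed_stale_threshold_ms in the configuration below
CLAIMED_STALE_THRESHOLD = timedelta(milliseconds=120_000)

def requeue_if_stale(task: dict, last_heartbeat_at: datetime, now: datetime) -> bool:
    """Reset a stale CLAIMED task to PENDING so another worker can claim it."""
    if task['status'] == 'CLAIMED' and now - last_heartbeat_at > CLAIMED_STALE_THRESHOLD:
        task['status'] = 'PENDING'
        task['claimed_by'] = None  # release ownership; hypothetical field name
        return True
    return False
```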
### Stale RUNNING Recovery

When a RUNNING task has no recent runner heartbeat:

- Not safe to blindly requeue: code was executing and may have left partial side effects
- If the task has a retry policy with `WORKER_CRASHED` in `auto_retry_for` and retries remaining: scheduled for retry (returns to PENDING with `next_retry_at`)
- Otherwise: marked as FAILED with a `WORKER_CRASHED` error
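The retry-vs-fail decision can be sketched as below. Field names (`retry_policy`, `attempts`, `max_retries`) are illustrative, not the actual horsies schema:

```python
def resolve_stale_running(task: dict) -> str:
    """Return the next status for a RUNNING task with a stale runner heartbeat."""
    policy = task.get('retry_policy')
    if (
        policy is not None
        and 'WORKER_CRASHED' in policy.get('auto_retry_for', ())
        and task.get('attempts', 0) < policy.get('max_retries', 0)
    ):
        return 'PENDING'  # retried later, once next_retry_at passes
    return 'FAILED'       # terminal: marked with a WORKER_CRASHED error
```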
## Workflow Task Recovery

When a worker crashes during a workflow task, the reaper marks `tasks.status = FAILED`, but the worker died before calling `on_workflow_task_complete()`. This leaves `workflow_tasks.status` stuck in RUNNING even though the underlying task is already terminal.
The recovery loop detects this mismatch automatically:
- Finds
workflow_tasksrows in non-terminal status where the linkedtasksrow is terminal (COMPLETED,FAILED, orCANCELLED) - Deserializes the
TaskResultfrom the task’s stored result - If no result is stored, synthesizes an error result:
WORKER_CRASHEDfor failed tasksTASK_CANCELLEDfor cancelled tasksRESULT_NOT_AVAILABLEfor completed tasks with missing results
- Triggers the normal completion path: updates
workflow_tasksstatus, applieson_errorpolicy, propagates to dependents, and checks workflow completion
This runs before workflow finalization, so dependents are resolved in the same recovery pass.
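The mismatch detection and result synthesis can be sketched as follows. The dict rows and error-dict shape are illustrative; the real code deserializes a `TaskResult`:

```python
TERMINAL = {'COMPLETED', 'FAILED', 'CANCELLED'}

def find_orphaned(workflow_tasks: list[dict], tasks: dict[str, dict]) -> list[dict]:
    """workflow_tasks rows stuck non-terminal while the linked task already finished."""
    return [
        wt for wt in workflow_tasks
        if wt['status'] not in TERMINAL
        and tasks[wt['task_id']]['status'] in TERMINAL
    ]

def synthesize_result(task: dict) -> dict:
    """Use the stored result if present; otherwise synthesize an error result."""
    if task.get('result') is not None:
        return task['result']
    return {'error': {
        'FAILED': 'WORKER_CRASHED',
        'CANCELLED': 'TASK_CANCELLED',
        'COMPLETED': 'RESULT_NOT_AVAILABLE',
    }[task['status']]}
```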
## Configuration

```python
from horsies.core.models.recovery import RecoveryConfig

config = AppConfig(
    broker=PostgresConfig(...),
    recovery=RecoveryConfig(
        # Claimer detection
        auto_requeue_stale_claimed=True,
        claimed_stale_threshold_ms=120_000,    # 2 minutes
        claimer_heartbeat_interval_ms=30_000,  # 30 seconds

        # Runner detection
        auto_fail_stale_running=True,
        running_stale_threshold_ms=300_000,    # 5 minutes
        runner_heartbeat_interval_ms=30_000,   # 30 seconds

        # Check frequency
        check_interval_ms=30_000,  # 30 seconds
    ),
)
```

## Timing Guidelines
Section titled “Timing Guidelines”Rule: Threshold >= 2x Interval
Section titled “Rule: Threshold >= 2x Interval”Stale thresholds must be at least 2x the heartbeat interval:
# ValidRecoveryConfig( runner_heartbeat_interval_ms=30_000, # 30s running_stale_threshold_ms=60_000, # 60s (2x))
# Invalid - will raise ValueErrorRecoveryConfig( runner_heartbeat_interval_ms=30_000, # 30s running_stale_threshold_ms=30_000, # 30s (too tight!))For CPU-Heavy Tasks
Long-running CPU tasks may block the heartbeat thread:

```python
RecoveryConfig(
    runner_heartbeat_interval_ms=60_000,  # Heartbeat every minute
    running_stale_threshold_ms=300_000,   # 5 minutes before stale
)
```

### For Quick Tasks

Fast tasks can use tighter detection:

```python
RecoveryConfig(
    runner_heartbeat_interval_ms=10_000,  # 10 seconds
    running_stale_threshold_ms=30_000,    # 30 seconds
)
```

## Database Schema

Heartbeats are stored in the `horsies_heartbeats` table:
| Column | Type | Description |
|---|---|---|
| `id` | int | Auto-increment ID |
| `task_id` | str | Task being tracked |
| `sender_id` | str | Worker/process identifier |
| `role` | str | `'claimer'` or `'runner'` |
| `sent_at` | datetime | Heartbeat timestamp |
| `hostname` | str | Machine hostname |
| `pid` | int | Process ID |
## Manual Recovery

Query stale tasks:

```python
broker = app.get_broker()

# Find stale RUNNING tasks
stale = await broker.get_stale_tasks(stale_threshold_minutes=5)
for task in stale:
    print(f"Stale: {task['id']} on {task['worker_hostname']}")
```

Force recovery:

```python
# Manually fail stale RUNNING
failed = await broker.mark_stale_tasks_as_failed(stale_threshold_ms=300_000)
print(f"Failed {failed} stale tasks")

# Manually requeue stale CLAIMED
requeued = await broker.requeue_stale_claimed(stale_threshold_ms=120_000)
print(f"Requeued {requeued} stale tasks")
```

## Disabling Recovery
Not recommended, but possible:

```python
RecoveryConfig(
    auto_requeue_stale_claimed=False,
    auto_fail_stale_running=False,
)
```

Tasks will remain stuck until manually resolved.
## Table Cleanup

Heartbeat cleanup is automatic by default. The worker reaper deletes expired heartbeats on every tick, based on `RecoveryConfig.heartbeat_retention_hours` (default: 24; set to None to disable).
For manual cleanup:

```sql
-- Delete heartbeats older than 24 hours
DELETE FROM horsies_heartbeats
WHERE sent_at < NOW() - INTERVAL '24 hours';
```

## Troubleshooting

### False Positives (Tasks Marked Stale But Running)

Increase the thresholds:

```python
RecoveryConfig(
    running_stale_threshold_ms=600_000,  # 10 minutes
)
```

Common causes:
- CPU-bound tasks blocking heartbeat thread
- Network latency to database
- Database contention
### Tasks Not Recovering

Check:

- Are `auto_requeue_stale_claimed` / `auto_fail_stale_running` enabled?
- Is the reaper loop running? (Check worker logs)
- Is the database reachable?