OpenClaw cron jobs enqueue, but nothing executes: full diagnosis and permanent fix
Problem statement: your cron appears healthy and a manual trigger reports success, but no real work starts. This failure is dangerous because it looks like a minor delay while your automations are already dead.
- Issue #42997 (2026-03-11): manual run enqueues but never executes.
- Issue #42960 (2026-03-11): stale lane marker loop prevents dispatch.
- Issue #42883 (2026-03-11): cron flows break after update.
Why this incident costs more than it looks
Silent cron failure is one of the most expensive OpenClaw failure modes for growing teams. You usually depend on cron for monitoring summaries, lead routing, customer follow-ups, scheduled reporting, and daily housekeeping. When queueing works but execution is blocked, alerts can stay green while business actions never happen.
In practice, teams lose hours because they debug the wrong layer: model provider, API keys, channel permissions, or webhook routing. The bottleneck is often local scheduler state and lane ownership. This page gives you a deterministic, low-risk sequence to recover service and prevent repeat outages.
Typical failure pattern you can recognize fast
- Manual trigger returns accepted or enqueued response.
- No worker startup log for the expected job ID.
- runningAt-like markers stay set much longer than one run window.
- Next schedule appears to queue behind stale state.
- Restart temporarily helps, then the issue returns after load spikes.
Root causes behind “enqueued but idle”
1) Stale running marker loop
If runtime state marks a lane as active before execution actually begins, new runs can be permanently blocked behind a state that never clears. This is especially common after abrupt restarts or regressions in queue state handling.
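A quick probe for this condition is to look for lane markers that have outlived a full run window. The sketch below is illustrative only: the state directory path and the `*.running` marker convention are assumptions, not documented OpenClaw paths, so substitute the layout of your real install.

```shell
#!/bin/sh
# Hypothetical layout: one *.running file per lane containing the epoch
# seconds at which the run started. Adjust STATE_DIR to your install.
STATE_DIR="${STATE_DIR:-$HOME/.openclaw/cron/lanes}"
MAX_AGE="${MAX_AGE:-900}"   # one run window, in seconds

# is_stale STARTED_EPOCH NOW_EPOCH -> exit 0 when the marker outlived a run window
is_stale() {
  [ $(( $2 - $1 )) -gt "$MAX_AGE" ]
}

now=$(date +%s)
for marker in "$STATE_DIR"/*.running; do
  [ -f "$marker" ] || continue          # glob matched nothing; skip
  if is_stale "$(cat "$marker")" "$now"; then
    echo "stale lane marker: $marker"
  fi
done
```

If this prints a marker while no matching worker process exists, you are looking at the stale-marker loop rather than a provider or credentials problem.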
2) Multiple runtime processes on one cron store
Two gateway processes competing for one state directory can produce non-deterministic dispatch behavior. One instance records queue transitions while the other tries to read stale snapshots. That race can look exactly like a scheduler bug.
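A cheap guard against this race is to count gateway processes before trusting any scheduler diagnosis. A minimal sketch, assuming the process appears in `ps` output under a name containing "openclaw gateway":

```shell
#!/bin/sh
# count_gateways LISTING -> number of lines that look like a gateway process
count_gateways() {
  printf '%s\n' "$1" | grep -ci "openclaw gateway" || true
}

listing=$(ps aux | grep -v grep)
n=$(count_gateways "$listing")
if [ "$n" -gt 1 ]; then
  echo "WARN: $n gateway processes may be sharing one cron state directory"
fi
```

More than one hit means the single-writer assumption is already violated, and any scheduler symptom you observe afterward is suspect.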
3) Upgrade handoff race conditions
During rapid upgrade/restart cycles, job metadata can survive while lane worker ownership does not. The result is a queue that appears valid but has no active worker path to consume it.
4) Heavy host pressure
On saturated VPS hosts, event-loop delays can make scheduler transitions miss expected timing. Queue writes succeed, but handoff to execution can starve under CPU or I/O contention.
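One quick way to rule saturation in or out is to compare the 1-minute load average against the core count. This uses Linux-specific paths, and the "load above core count" threshold is a rough heuristic, not an OpenClaw requirement:

```shell
#!/bin/sh
# over_loaded LOAD NCPU -> exit 0 when the 1-minute load exceeds the core count
over_loaded() {
  # scale by 100 to compare a fractional load in integer-only sh arithmetic
  scaled_load=$(printf '%s\n' "$1" | awk '{ printf "%d", $1 * 100 }')
  [ "$scaled_load" -gt $(( $2 * 100 )) ]
}

if [ -r /proc/loadavg ]; then
  load=$(cut -d' ' -f1 /proc/loadavg)
  ncpu=$(getconf _NPROCESSORS_ONLN)
  if over_loaded "$load" "$ncpu"; then
    echo "host saturated: load $load on $ncpu cores"
  fi
fi
```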
15-minute recovery runbook
- Capture evidence first. Save the logs around the enqueue event, the job ID, the lane identifier, and the runtime timestamp. You need this for the postmortem and to verify that your fix actually changed behavior.
- Check for duplicate gateway processes. Confirm that only one active OpenClaw gateway instance is attached to your cron state directory.
- Inspect scheduler state files. Verify whether lane markers indicate active execution with no matching worker process.
- Perform a clean gateway restart. Avoid force-killing first. Use a graceful restart so in-memory state can flush and lock ownership can reset.
- Run exactly one manual trigger. Observe the full lifecycle: enqueue, dispatch, execution start, completion, state clear.
- Watch one scheduled cycle end-to-end. Manual success alone is not enough; confirm timer-driven run behavior too.
Reference commands
# Check active gateway processes
ps aux | grep -i openclaw | grep -v grep
# Check container duplication (if Docker)
docker ps --format "table {{.Names}} {{.Status}}" | grep -i openclaw
# Verify runtime health
openclaw status
openclaw gateway status
# Restart cleanly
openclaw gateway restart
# or docker compose restart openclaw-gateway
# Re-test cron manually (example command shape)
openclaw cron run --id <job-id>
How to diagnose edge cases before they burn another day
Edge case: manual run works, scheduled run still fails
This usually indicates timer path state, not tool path state. Verify timezone alignment, scheduler clock drift, and whether job metadata was edited outside expected schema.
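To rule out the clock side quickly, render the same instant in the system zone and in the zone the schedule was written for. The job timezone below is a placeholder, and `date -d` is GNU coreutils syntax (BSD/macOS uses `date -r`):

```shell
#!/bin/sh
# render_in_tz EPOCH ZONE -> wall-clock time for that instant in that zone
render_in_tz() {
  TZ="$2" date -d "@$1" '+%H:%M %Z'   # GNU date; use `date -r EPOCH` on BSD
}

now=$(date +%s)
echo "system zone: $(date '+%H:%M %Z')"
echo "job zone:    $(render_in_tz "$now" "America/New_York")"  # placeholder zone

# On systemd hosts, also confirm the clock is NTP-synchronized:
if command -v timedatectl >/dev/null 2>&1; then
  timedatectl show -p NTPSynchronized || true
fi
```

If the two renderings disagree with what the schedule assumes, fix the timezone mismatch before touching scheduler state.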
Edge case: only one job fails while others succeed
Focus on that job’s payload complexity. Very long context, blocked tool execution, or provider retries can make it look like dispatch failure when execution is actually hanging at first step.
Edge case: failures after every deployment
Add graceful drain to your deploy script. Do not replace runtime while jobs are mid-flight. Deployment safety is often the hidden fix for “random” cron instability.
Verification checklist
- At least 3 consecutive manual runs completed with expected output.
- At least 2 scheduled runs completed on time.
- No stale running markers remain after completion.
- No duplicate runtime process found during 30-minute watch window.
- Runbook updated in your team docs with exact restart and validation steps.
Preventive controls for stable cron automation
- Enforce single-writer runtime policy in deployment scripts.
- Add scheduled health checks that validate dispatch latency, not only queue acceptance.
- Track queue-to-start time as an SLO and alert on sustained drift.
- Apply upgrade windows with rollback criteria before business-critical schedule windows.
- Document recovery ownership and escalation policy so incidents are not handled ad hoc.
When self-hosted cron overhead is no longer worth it
If your team repeatedly spends engineering time on queue state, restarts, and scheduler drift, the real issue is not one bug. It is operational load. Compare the true cost of recurring maintenance against a managed runtime that keeps scheduling, patching, and reliability guardrails consistent by default.
Fix once. Stop recurring cron queue stalls.
If this keeps coming back, you can move your existing setup to managed OpenClaw cloud hosting instead of rebuilding the same stack. Import your current instance, keep your context, and move onto a runtime with lower ops overhead.
- Import flow in ~1 minute
- Keep your current instance context
- Run with managed security and reliability defaults
If you would rather compare options first, review OpenClaw cloud hosting or see the best OpenClaw hosting options before deciding.
Deep diagnostics when the basic fix is not enough
Inspect queue-to-start latency as a first-class metric
Many teams track only success and failure counts, which hides the critical leading signal: queue-to-start delay. Add a metric that captures time between enqueue timestamp and actual worker start. When this number grows before outright failure, you can catch scheduler instability hours earlier and avoid full outage windows.
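A minimal sketch of that metric, assuming you can extract the enqueue and worker-start epochs from your logs (the epochs and the 30-second budget below are placeholders):

```shell
#!/bin/sh
# queue_to_start ENQUEUE_EPOCH START_EPOCH -> delay in seconds
queue_to_start() {
  echo $(( $2 - $1 ))
}

SLO=30                 # hypothetical budget: worker must start within 30 s
enqueued=1700000000    # placeholder epochs pulled from your logs
started=1700000042
delay=$(queue_to_start "$enqueued" "$started")
if [ "$delay" -gt "$SLO" ]; then
  echo "ALERT: queue-to-start ${delay}s exceeds ${SLO}s SLO"
fi
```

Trend this number over time; sustained growth is the leading indicator the section above describes.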
Classify failures by stage, not by final error text
A single “job failed” event can represent very different incidents: dispatch never started, worker started but tool call hung, model responded but delivery failed, or output persisted too late. Split your logging by stage and keep one correlation ID through the full path. This reduces false debugging and gets on-call engineers to root cause quickly.
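A sketch of stage-tagged logging with one correlation ID carried through the whole path. The stage names mirror the lifecycle described above; the log line format itself is an assumption, not an OpenClaw convention:

```shell
#!/bin/sh
# log_stage CORR_ID STAGE MESSAGE -> one greppable line per lifecycle stage
log_stage() {
  printf '%s corr=%s stage=%s msg="%s"\n' \
    "$(date -u '+%FT%TZ')" "$1" "$2" "$3"
}

cid="job-42-$$"        # one correlation ID for the whole run
log_stage "$cid" enqueue  "accepted by scheduler"
log_stage "$cid" dispatch "worker claimed lane"
log_stage "$cid" execute  "tool call started"
log_stage "$cid" deliver  "output persisted"
```

With this shape, `grep "corr=$cid"` reconstructs the full path of one job, and the first missing stage names the failing layer.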
Build an execution canary job
Add one lightweight canary cron that runs every 10–15 minutes and writes a predictable output. If that job is enqueued but not executed within a small SLA window, alert immediately. Canary jobs are cheap and dramatically improve incident detection for teams with business-critical automations.
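One way to wire that up: the canary job writes the current epoch to a file on every run, and a separate checker alerts when that file falls behind the SLA window. The file path and the 20-minute window are assumptions; adjust both to your environment:

```shell
#!/bin/sh
CANARY_FILE="${CANARY_FILE:-/var/tmp/openclaw-canary}"  # hypothetical path
SLA=1200   # 20 min: one 15-min interval plus slack

# canary_fresh LAST_RUN_EPOCH NOW_EPOCH -> exit 0 when inside the SLA window
canary_fresh() {
  [ $(( $2 - $1 )) -le "$SLA" ]
}

# What the canary job itself does on every run:
date +%s > "$CANARY_FILE"

# What the checker does on its own schedule:
if ! canary_fresh "$(cat "$CANARY_FILE")" "$(date +%s)"; then
  echo "ALERT: canary enqueued but not executed within ${SLA}s"
fi
```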
Common implementation mistakes and exact corrections
- Mistake: restarting repeatedly without evidence capture. Correction: always capture one pre-restart snapshot of queue and process state.
- Mistake: assuming accepted means running. Correction: require explicit worker-start and worker-complete events in your dashboard.
- Mistake: mixing experiment jobs and production jobs in one lane. Correction: isolate high-risk tests from business-critical automations.
- Mistake: upgrading right before critical schedule windows. Correction: move upgrades to low-risk windows and require one successful scheduled cycle before closing out the change.
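The evidence-capture correction can be one small script run before any restart. The cron state directory path here is an assumption; substitute your real one:

```shell
#!/bin/sh
# One snapshot directory per incident, captured before touching anything.
snap="/tmp/openclaw-snap-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$snap"

ps aux | grep -i openclaw | grep -v grep > "$snap/processes.txt" || true
openclaw status > "$snap/status.txt" 2>&1 || true            # ok if CLI absent
cp -r "$HOME/.openclaw/cron" "$snap/cron-state" 2>/dev/null || true  # assumed path

echo "snapshot saved to $snap"
```

Attach the snapshot path to the incident ticket so the post-restart state can be diffed against it.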
FAQ
Should I delete all cron jobs if they stop executing?
No. Start with process and state diagnostics first. Full cron reset is disruptive and can hide the root cause.
Is this always caused by an OpenClaw regression?
Not always. Host saturation, duplicate processes, and unsafe deploy handoffs can trigger the same symptoms.
Where should I start if I am setting up cron from scratch?
Start with the secure deployment baseline at /openclaw-setup/, then evaluate managed options at /openclaw-cloud-hosting/ and decision criteria at /compare/.