Blog

OpenClaw cron websocket fix: when status looks healthy but jobs fail

Problem statement: openclaw status says the system is up, but cron commands like list, run, disable, or add fail with WebSocket handshake errors, RPC close events, or disconnects with close code 1000. This mismatch is dangerous because it creates the illusion of a healthy gateway while your actual scheduling workflow is still broken.

Evidence from the field
  • GitHub issue #45750 was opened on 2026-03-14 after cron commands failed with gateway close and handshake errors while openclaw status still worked.
  • The same report notes that the incident blocked normal cron operations and pushed the operator into manual edits of ~/.openclaw/cron/jobs.json.
  • Supporting issue #45222 from 2026-03-13 shows intermittent loopback handshake failures at ws://127.0.0.1:18789, again breaking cron flows while the environment otherwise looked mostly alive.
  • Across OpenClaw Setup’s hosted workflows, reliability comes from making the control surface boring: import first, review state clearly, and avoid hidden transport assumptions. Cron transport problems are usually easier to fix when you simplify the execution path the same way.

Why this incident confuses even experienced operators

When one command works and another fails on the same host, people naturally assume the failing command is the odd one out. But with OpenClaw, that is not a safe assumption. Different commands can touch different timing windows, connection pools, RPC paths, or websocket flows. A healthy-looking status check tells you something important, but it does not prove the cron command path is healthy.

This is why teams lose time on bad fixes. They restart everything, blame the scheduler, or change cron definitions that were never the problem. The real work is isolating the transport boundary: which exact path succeeds, which path fails, and whether loopback websocket handling is intermittently dropping the cron command flow.

What usually causes this specific mismatch

  1. Intermittent websocket handshake failures on loopback: the local bind exists, but specific command flows fail during handshake.
  2. Gateway close events under local load: status succeeds during a healthy moment while cron commands hit a close event.
  3. Regression in the RPC path used by cron: command-specific routing breaks even though a lighter status path still returns useful output.
  4. Daemon environment drift: gateway is running, but with stale config or stale process state that only shows up under cron operations.
  5. Temporary manual edits masking the real fault: once teams start bypassing normal flows, root-cause isolation gets messier.
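As a first triage aid, the failure buckets above can be sketched as a tiny classifier over raw error text. This is a hypothetical helper, and the substrings it matches on are assumptions based on common WebSocket and RPC failure messages, not exact OpenClaw output:

```python
# Hypothetical sketch: bucket raw CLI error text into likely cause categories.
# The matched substrings are assumptions, not exact OpenClaw error strings.

def classify_cron_error(error_text: str) -> str:
    text = error_text.lower()
    if "handshake" in text:
        return "websocket-handshake"   # cause 1: loopback handshake failure
    if "1000" in text or "close" in text:
        return "gateway-close"         # cause 2: close event under local load
    if "rpc" in text:
        return "rpc-path"              # cause 3: command-specific RPC regression
    return "unknown"                   # fall back to daemon/config inspection
```

Even a rough bucket like this keeps an incident channel focused on one suspect path instead of all five at once.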

Immediate triage: what to do in the first 15 minutes

1) Capture one successful status call and one failing cron call

You need both, side by side. Without that contrast, you are debugging a story instead of an incident. Save timestamps, exact commands, exact error strings, and the gateway bind details.

2) Confirm whether the failure is intermittent or constant

A constant failure points you toward config and process-state issues. An intermittent failure often points toward timing, handshake instability, or transport regression. The two feel similar in chat, but they should be debugged differently.
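The distinction above is mechanical enough to encode. Given a chronological list of attempt results, a minimal sketch of the decision might look like this (the category names map to the debugging directions described above):

```python
# Sketch: given chronological attempt results (True = success),
# decide whether a failure pattern looks constant or intermittent.

def failure_mode(attempts: list[bool]) -> str:
    if all(attempts):
        return "healthy"
    if not any(attempts):
        return "constant"       # points toward config / process-state issues
    return "intermittent"       # points toward timing, handshake, or transport
```

For example, ten straight failures suggest auditing config and daemon state, while a mix of successes and failures across the same window suggests handshake or transport instability.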

3) Freeze unnecessary changes

Do not edit jobs, schedule new jobs, change models, or restart unrelated services all at once. Preserve the incident window long enough to understand it.

Practical diagnostic flow

Step 1: isolate loopback as a suspect

Issue #45222 is especially useful because it narrows the failure to local loopback websocket handling. If your bind is on 127.0.0.1:18789 or a similar local control endpoint, treat that path as a prime suspect until proven otherwise.

  • Confirm the gateway bind address and port.
  • Check whether failures cluster around the same loopback endpoint.
  • Compare success and failure times instead of relying on one command attempt.
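Before chasing handshake-level behavior, rule out the simplest failure: nothing listening on the loopback endpoint at all. The host and port below match the endpoint from issue #45222; note that a successful TCP connect only proves a listener exists, it does not prove the websocket handshake on top of it is healthy. A minimal probe sketch:

```python
import socket

def can_connect(host: str = "127.0.0.1", port: int = 18789,
                timeout: float = 2.0) -> bool:
    """Return True if a plain TCP connection to the endpoint succeeds.

    This only confirms a listener exists; a websocket handshake can
    still fail on top of a successful TCP connect."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Run the probe repeatedly around failing command attempts: if TCP connects consistently succeed while handshakes intermittently fail, the suspect narrows to the websocket layer rather than the bind itself.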

Step 2: test the full cron surface, not just one command

If cron list fails but cron run works once, the incident is not resolved. Run the sequence that mirrors real use: list, add a disposable test job, disable it, and run it if appropriate. You are trying to prove the path is stable, not merely lucky.
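One way to make that sequence repeatable is a small harness that runs each cron command and records a per-step result. The runner is injected so the same harness can be dry-run; the command strings mirror the sequence above, but the exact flags are illustrative assumptions, and the harness itself is a sketch, not part of the OpenClaw CLI:

```python
from typing import Callable

# Commands mirroring the real-use sequence: list, add a disposable job,
# disable it, run it. Exact flag names here are illustrative assumptions.
CRON_SEQUENCE = [
    "openclaw cron list",
    "openclaw cron add --name probe-job --schedule '*/5 * * * *'",
    "openclaw cron disable probe-job",
    "openclaw cron run probe-job",
]

def exercise_cron_surface(run: Callable[[str], bool]) -> dict[str, bool]:
    """Run every command in the sequence and record success per command.

    Keep going after a failure: a partial pass/fail map is more useful
    for isolating the broken path than stopping at the first error."""
    return {cmd: run(cmd) for cmd in CRON_SEQUENCE}
```

In real use the runner would shell out to the CLI and report exit status; in a post-incident review, the pass/fail map shows exactly which part of the cron surface was lucky versus stable.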

Step 3: inspect daemon freshness

A stale daemon can keep enough state to answer status while failing on more active command flows. Restarting the gateway may help, but only after you record the evidence you need. Otherwise you risk turning an actionable regression into an anecdote you cannot reproduce.

Step 4: separate scheduler logic from transport logic

If cron commands do not reach the gateway reliably, changing the cron payload or expression will not fix the core issue. First recover transport stability. Then return to job-level validation.

Production-safe recovery steps

1) Use the least destructive restart path available

Restart the gateway cleanly instead of killing multiple processes blindly. The point is to refresh the control path without adding more uncertainty than necessary.

2) Re-test cron commands in a tight, controlled sequence

  • Run openclaw status.
  • Run openclaw cron list.
  • Create or inspect one disposable test job.
  • Disable or remove the test job after validation.

3) Avoid long-term dependence on direct file edits

Editing ~/.openclaw/cron/jobs.json can restore continuity in an emergency, but it is a bridge, not a home. Once the websocket path is stable again, move back to normal command-based management so future changes remain auditable.
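Once you are back on command-based management, it is worth confirming that the on-disk file and the CLI agree. This sketch assumes jobs.json is a JSON array of objects that each carry an "id" field, which may not match the real schema; treat both helpers as hypothetical:

```python
import json
from pathlib import Path

def job_ids_from_file(path: str = "~/.openclaw/cron/jobs.json") -> set[str]:
    """Read job ids straight from the on-disk file.

    Assumed schema: a JSON array of objects, each with an "id" key."""
    raw = json.loads(Path(path).expanduser().read_text())
    return {job["id"] for job in raw}

def drift(file_ids: set[str], cli_ids: set[str]) -> set[str]:
    """Jobs present in one view but not the other: candidates for
    emergency edits that were never re-registered through the CLI."""
    return file_ids ^ cli_ids
```

An empty drift set is a reasonable signal that the emergency bridge has been fully dismantled.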

Edge cases that are easy to miss

  • One command works because it is cached or lighter: do not mistake lightweight health output for full operational health.
  • Intermittent success after restart: one green command after restart is not closure.
  • Multiple users or instances on the same host: cron failures may be shared infrastructure pain, not one bad job set.
  • Stale environment in a long-running daemon: recent config changes may not be fully reflected in the live process.
  • Manual edits during panic: they can rescue a schedule today and confuse recovery tomorrow.

How to verify the fix properly

  1. Run repeated cron list calls and confirm none hit an intermittent handshake failure.
  2. Confirm add, disable, and run all succeed in the same maintenance window.
  3. Watch logs long enough to make sure the connection remains stable instead of failing again after a few minutes.
  4. Remove any emergency manual drift you introduced during recovery.
  5. Document the exact acceptance test so the next incident closes faster.
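Acceptance check 1 is easy to encode as "N consecutive successes, zero tolerated failures" rather than one lucky call. The runner is injected; in real use it would invoke openclaw cron list and report success or failure. A minimal sketch:

```python
import time
from typing import Callable

def stable_for(run: Callable[[], bool], attempts: int = 20,
               delay_s: float = 0.0) -> bool:
    """Return True only if every attempt succeeds.

    A single failure fails the whole check: one green call after a
    restart is not closure for an intermittent transport fault."""
    for _ in range(attempts):
        if not run():
            return False
        if delay_s:
            time.sleep(delay_s)
    return True
```

Writing the threshold down (for example, 20 consecutive passes spaced a few seconds apart) is exactly the kind of documented acceptance test that makes the next incident close faster.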

Typical mistakes that prolong this outage

  • Treating a successful status check as proof that cron is fine.
  • Changing job definitions before confirming transport stability.
  • Skipping logs because the failure appears “obvious.”
  • Assuming loopback means the network layer cannot be involved.
  • Leaving manual jobs.json edits in place after the gateway recovers.

When managed operations become the simpler answer

If your team depends on scheduled workflows, a flaky cron control path is not a minor annoyance. It is a direct tax on reliability. The hidden cost is engineer interruption: repeated triage, command re-testing, daemon babysitting, and uncertainty every time a scheduled job matters.

Start with /compare/ if you want a sober view of self-hosted versus managed tradeoffs. If the bigger problem is keeping a production OpenClaw instance boring and available, see /openclaw-cloud-hosting/ or import your current OpenClaw instance in 1 click.

Fix once. Stop recurring cron websocket and gateway-close incidents.

If this keeps coming back, you can move your existing setup to managed OpenClaw cloud hosting instead of rebuilding the same stack. Import your current instance, keep your context, and move to a runtime with lower ops overhead.

  • Import flow in ~1 minute
  • Keep your current instance context
  • Run with managed security and reliability defaults

If you would rather compare options first, review OpenClaw cloud hosting or see the best OpenClaw hosting options before deciding.

[Screenshot: OpenClaw import first screen in OpenClaw Setup dashboard]
1) Paste import payload
[Screenshot: OpenClaw import completed screen in OpenClaw Setup dashboard]
2) Review and launch

Move critical cron workflows to managed hosting

Native next step if cron reliability is business-critical

If scheduled workflows drive reporting, support, or customer-facing automations, reduce the number of transport layers you have to babysit. Hosted control plus clean import is often the fastest way to stop repeating the same recovery.

  • Import your current OpenClaw instance in 1 click
  • See the self-managed setup path

FAQ

Why does cron break first when the gateway is unstable?

Because cron management is an active control path. It often needs more than a shallow health response. That makes it more sensitive to handshake, RPC, and websocket instability than a simple status call.

Should I recreate all my jobs after recovery?

Not automatically. First confirm that command-based management is stable. Then audit job definitions and remove any emergency drift introduced during the outage.

Can this be caused by a bad cron expression?

A bad cron expression can break job behavior, but it does not usually explain websocket handshake or gateway close errors while openclaw status continues to work. Start with transport and daemon health first.
