web_fetch breaks on dual-stack DNS? Here is the production-safe fix path
Problem statement: after upgrading, web_fetch fails for public domains on
dual-stack networks where DNS returns both IPv4 and IPv6 records, and one IPv6 record belongs to a
blocked special-use range. Teams see tool failures on ordinary websites and lose critical agent workflows
(research, summaries, crawling pipelines, customer support automations).
A newly opened GitHub bug describes this as a regression in v2026.3.8, including a concrete root cause: address-policy checks reject the entire lookup result if any resolved address is blocked, even when another address in the same DNS response is valid for public access.
Why this failure happens
The intent of SSRF hardening is correct: prevent tools from reaching local/private addresses that could leak secrets or pivot into internal networks. The problem appears when the policy is applied with “all-or-nothing” logic to mixed DNS answers. In dual-stack reality, DNS responses frequently include multiple candidates, and some networks surface unusual IPv6 entries. If policy blocks the whole set instead of selecting an allowed route, legitimate outbound fetches fail.
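The difference between all-or-nothing rejection and per-address filtering can be sketched with the standard ipaddress module. This is an illustrative sketch, not OpenClaw's actual policy code; the blocked categories below are common SSRF-hardening choices and are assumptions here:

```python
import ipaddress

def filter_resolved_addresses(addresses):
    """Per-address policy sketch: drop blocked candidates from a mixed
    DNS answer set instead of rejecting the whole lookup result."""
    allowed = []
    for addr in addresses:
        ip = ipaddress.ip_address(addr)
        # Block private, loopback, link-local, multicast, and reserved
        # special-use ranges (illustrative category list).
        if (ip.is_private or ip.is_loopback or ip.is_link_local
                or ip.is_multicast or ip.is_reserved):
            continue
        allowed.append(addr)
    return allowed

# A mixed answer set: one public IPv4 address plus a unique-local IPv6
# entry. Per-address filtering keeps the public route available; the
# fetch should only fail if *no* candidate survives.
```

With this shape, a mixed answer like `["93.184.216.34", "fd00::1"]` still yields a usable public address, which is exactly the availability the all-or-nothing check gives up.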
This is a classic security-versus-availability edge case. You want strict SSRF controls and high reliability. Production-safe remediation should keep both, not trade one for the other.
Symptoms checklist
- web_fetch fails for public domains that should be reachable.
- Failure appears after the upgrade and did not occur in the prior build.
- Issue reproduces on one network but not another.
- DNS lookups return mixed A and AAAA records.
- Error messages point to the blocked-resolved-IP policy check.
Step-by-step diagnosis
1) Confirm baseline reachability outside OpenClaw
First verify that the host itself can access the target site through standard networking tools. If host-level connectivity is broken, you are debugging the wrong layer. Keep a simple baseline list of 3-5 test domains and record results before and after each change.
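A minimal baseline runner for that list might look like the following sketch. The domain list and probe details (TCP to port 443) are placeholder assumptions; swap in your own domains and whatever reachability check matches your environment:

```python
import socket
import time

# Placeholder baseline list -- substitute your own 3-5 test domains.
BASELINE_DOMAINS = ["example.com", "example.org", "example.net"]

def tcp_probe(host, port=443, timeout=5.0):
    """Default probe: can the host open a TCP connection at all?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def run_baseline(domains, probe=tcp_probe):
    """Record a timestamped pass/fail result per domain, so results
    can be compared before and after each change."""
    return [{"domain": d, "ok": probe(d), "ts": time.time()}
            for d in domains]
```

Injecting the probe keeps the runner testable offline and lets you swap in an HTTPS-level check later without changing the record-keeping.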
2) Compare DNS responses across environments
Capture A/AAAA answers from the affected environment and from a known-good environment. If affected DNS returns special-use IPv6 candidates in the same answer set, you have strong evidence for mixed-result rejection.
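One way to capture and classify an answer set is via `socket.getaddrinfo` plus `ipaddress.is_global` (a sketch; `is_global` approximates "publicly routable" and treats private, unique-local, CGNAT, and similar ranges as special-use):

```python
import ipaddress
import socket

def resolve_all(host):
    """Collect every A and AAAA answer the local resolver returns."""
    infos = socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})

def classify(addresses):
    """Split an answer set into public vs special-use candidates, to
    compare affected and known-good environments side by side."""
    out = {"public": [], "special": []}
    for addr in addresses:
        ip = ipaddress.ip_address(addr)
        out["public" if ip.is_global else "special"].append(addr)
    return out
```

Run `classify(resolve_all("target.example"))` in both environments; if only the affected one shows entries under `"special"`, you have strong evidence for mixed-result rejection.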
3) Reproduce with a minimal tool invocation
Use a tiny fetch target and minimal prompt to remove unrelated complexity. You want a deterministic pass/fail test that can run repeatedly during mitigation. Keep timestamped logs for each attempt.
4) Validate policy behavior, not only outcome
It is not enough to say "it failed." Validate whether policy rejected one address and then aborted the whole operation, versus selecting an allowed address. This distinction determines whether your mitigation should be DNS-path tuning, runtime patching, or routing override.
5) Track regression boundary
If you can reproduce on current build and not on prior build, capture that boundary in your incident notes. Maintainers can fix significantly faster when the boundary is explicit.
Safe mitigation options (without weakening security)
Option A: controlled DNS path
Route affected workloads through DNS resolvers that provide clean public answer sets for your target domains. This is often the fastest low-risk mitigation for production continuity. Document resolver choice and fallback.
Option B: network segmentation for fetch workloads
Isolate web_fetch-heavy workflows into environment profiles where dual-stack behavior is validated and stable. Keep sensitive internal workloads separate so a fetch incident cannot cascade into broader pipeline instability.
Option C: patch and staged rollout
Once a fix lands upstream, deploy first into staging with a mixed DNS test matrix, then canary into production. Reject "big bang" rollout for network-policy changes. Include an automatic rollback trigger if the failure rate rises.
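An automatic rollback trigger can be as simple as a failure-rate gate over recent canary fetch results. The threshold and minimum sample size below are illustrative assumptions; tune them to your traffic:

```python
def should_rollback(results, threshold=0.05, min_samples=50):
    """Trigger rollback when the canary failure rate exceeds the
    threshold; wait for enough samples to avoid noise-driven flapping.

    results: list of booleans, True for a successful fetch.
    """
    if len(results) < min_samples:
        return False  # not enough data to judge yet
    failures = sum(1 for ok in results if not ok)
    return failures / len(results) > threshold
```

Wire this into whatever emits per-fetch success/failure events in your deployment pipeline, and make the rollback action itself automated rather than a paged human.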
Option D: graceful fallback in business logic
For critical user flows, add a controlled fallback path (for example cached summaries or alternate retrieval source) so one failed fetch does not break end-user response. This keeps customer-facing reliability high while backend network issues are remediated.
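A controlled fallback path can be expressed as a thin wrapper around the fetch call. This is a sketch under the assumption that a cached summary is acceptable for your flow; the `fetch` and `cache` arguments are placeholders for your actual retrieval function and store:

```python
def fetch_with_fallback(url, fetch, cache):
    """Try the live fetch first; on failure, serve a cached result so
    one blocked fetch does not break the end-user response."""
    try:
        result = fetch(url)
        cache[url] = result       # refresh cache on every success
        return result, "live"
    except Exception:
        if url in cache:
            return cache[url], "cache"
        raise                     # no fallback available: surface the error
```

Returning the `"live"`/`"cache"` tag lets downstream code label stale content honestly instead of silently serving it as fresh.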
Operational response model for engineering leads
If you own platform reliability, treat this as a cross-functional event, not a local tooling bug. Coordinate network, security, and product teams in one short response loop: define user impact, choose a mitigation, assign an owner per action, and review every 4-6 hours until stability is proven. This avoids the common failure where one team "fixes" DNS while another unknowingly reintroduces risk through policy overrides.
Keep a single incident channel and one source of truth for test outcomes. Fragmented debugging across tickets is the fastest way to lose root-cause clarity.
What you should never do
- Do not globally disable SSRF policy checks to “make it work.”
- Do not assume all IPv6 answers are safe because the domain is public.
- Do not test only one domain and declare incident solved.
- Do not roll out policy changes without canary metrics.
- Do not ignore security team review for networking exceptions.
Edge cases that cause repeated outages
Resolver drift: teams update one DNS resolver but forget fallback resolvers used by autoscaled workers. Incidents reappear under load because part of the fleet still receives problematic answers.
Container-vs-host mismatch: host DNS path differs from container DNS path, leading to false confidence during manual testing. Always validate from the same runtime context as OpenClaw workers.
Country-specific routing behavior: regional DNS/CDN behavior can change address sets. If your users are international, test from at least two regions before closing the incident.
Verification checklist before closure
- Successful fetches for all baseline domains in affected environment.
- Successful fetches on both IPv4-heavy and dual-stack-heavy targets.
- No access to blocked internal/special-use addresses through tool path.
- Stable failure rate under burst load test.
- Post-change monitoring alerting on sudden fetch error increases.
Reduce network surprises in production OpenClaw
If your team spends too much time on DNS, proxy, and runtime edge cases, move to a setup designed for predictable agent networking and faster recoveries.
FAQ
Could this be caused by our proxy only?
Proxy configuration can amplify the problem, but the reported pattern is specifically about how resolved addresses are evaluated. Validate both DNS and proxy paths before narrowing root cause.
Should we force IPv4 everywhere?
Forcing IPv4 can be a temporary mitigation in some environments, but it is usually a blunt instrument. Prefer targeted policy/path correction that preserves healthy dual-stack operation long-term.
How does this relate to migration decisions?
Frequent network regressions are a strong signal to revisit operational ownership. If your product team keeps losing cycles to infrastructure troubleshooting, review managed vs self-hosted tradeoffs and the guided setup path.
Sources
- GitHub Issue #41993: web_fetch dual-stack IPv6 regression (updated 2026-03-10)
- Open issues feed used for recency validation
Fix once. Stop recurring web_fetch network regressions.
If this keeps coming back, you can move your existing setup to managed OpenClaw cloud hosting instead of rebuilding the same stack. Import your current instance, keep your context, and move onto a runtime with lower ops overhead.
- Import flow in ~1 minute
- Keep your current instance context
- Run with managed security and reliability defaults
If you would rather compare options first, review OpenClaw cloud hosting or see the best OpenClaw hosting options before deciding.