All HTTP goes through the Rust engine via agentos.http. The engine handles transport mechanics (HTTP/2, cookie jars, decompression, timeouts, logging). Headers are built in Python via http.headers() — the engine sets zero default headers.
Default rule: ALWAYS use http.headers(). Never construct headers dicts manually.
We are acting as a real browser (Brave/Chrome), so there is no reason not to send proper browser headers. Without http.headers() you get no User-Agent, no Sec-CH-* client hints, no Sec-Fetch-* headers — and some APIs silently reject you with a 403 or 500. Pass service-specific headers (CSRF tokens, session IDs) via the extra= parameter.
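The difference can be sketched with a hypothetical stand-in — `browser_headers` below only simulates the merge semantics of the real `http.headers()`; the actual defaults and UA strings live in sdk/agentos/http.py:

```python
# Hypothetical stand-in for http.headers(): full browser header set,
# merged with service-specific extras. Real defaults live in the SDK.
def browser_headers(extra=None):
    headers = {
        "User-Agent": "Mozilla/5.0 ... Chrome/131.0.0.0 ...",  # pinned in the SDK
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br, zstd",
        "Sec-Fetch-Site": "none",
    }
    headers.update(extra or {})  # extras override/extend the defaults
    return headers

# WRONG — a bare dict drops User-Agent, Sec-CH-*, and Sec-Fetch-*
bare = {"Accept": "application/json"}

# RIGHT — full browser set plus a service-specific CSRF token via extra=
full = browser_headers(extra={"x-csrf-token": "abc123"})
```

The point is the merge direction: you never hand-build the browser baseline, you only layer service-specific headers on top of it.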
AWS WAF, Cloudflare, and other CDNs compute a JA3/JA4 fingerprint from every TLS ClientHello and compare it to the claimed User-Agent. If the UA says “Chrome 131” but the TLS fingerprint says “rustls” or “urllib3,” the request gets flagged as a bot. Sensitive pages (Amazon orders, Chase banking, account settings) have higher anomaly thresholds than product pages — so the homepage works but the orders page redirects to login.
The engine uses wreq (a reqwest fork) backed by BoringSSL — the same TLS library Chrome uses. With Emulation::Chrome131, every request produces an authentic Chrome JA4 fingerprint (t13d1516h2_8daaf6152771), including correct HTTP/2 SETTINGS frames, pseudo-header order, and WINDOW_UPDATE values. This is not string-matching — wreq constructs the same ClientHello Chrome would, using the same library, and the fingerprint falls out naturally.
Verified (2026-04-01): Same cookies from Brave Browser. reqwest (rustls) → Amazon redirects to signin. wreq (BoringSSL, Chrome 131) → Amazon returns 7 orders. The only difference was the TLS fingerprint.
Python clients (requests, httpx) have similar issues — requests/urllib3 has a blocklisted JA3 hash (8d9f7747675e24454cd9b7ed35c58707). Skills don’t hit this because all HTTP goes through the engine’s wreq client, not Python libraries directly.
Vercel Security Checkpoint blocks HTTP/2 clients outright — every request
returns 429 with a JS challenge page, regardless of cookies or headers. But
HTTP/1.1 passes cleanly.
In http.headers(), this is handled by the waf= knob: waf="cf" sets http2=True (CloudFront/Cloudflare need HTTP/2), while waf="vercel" sets http2=False (the Vercel checkpoint rejects HTTP/2 clients).
The WAF template automatically sets the right http2 value. No need to remember which WAF needs what.
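A minimal sketch of what such a template table looks like — the real mapping is the _WAF dict in sdk/agentos/http.py; the waf="vercel" entry here is inferred from the checkpoint behavior described above:

```python
# Sketch of a WAF template table. Values follow the behavior documented
# in this section; the authoritative table is the SDK's _WAF dict.
WAF_TEMPLATES = {
    "cf": {"http2": True},       # CloudFront/Cloudflare require HTTP/2
    "vercel": {"http2": False},  # Vercel checkpoint 429s HTTP/2 clients
}

def resolve_transport(waf: str) -> dict:
    """Pick transport settings from the WAF template name."""
    return WAF_TEMPLATES[waf]
```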
Not every Vercel-hosted endpoint enables the checkpoint. During Exa testing,
auth.exa.ai (Vercel, no checkpoint) accepted h2; dashboard.exa.ai
(Vercel, checkpoint enabled) rejected it. The checkpoint is a per-project
Vercel Firewall setting — you have to test each subdomain.
Tested against dashboard.exa.ai (Vercel + Cloudflare):

| Cookies | http2=True | http2=False |
| --- | --- | --- |
| session + cf_clearance | 429 | 200 |
| session only | 429 | 200 |
| no cookies at all | 429 | 200 (empty session) |
Cookies and headers are irrelevant — the checkpoint triggers purely on
the HTTP/2 TLS fingerprint.
Rule of thumb: use waf="cf" for CloudFront/Cloudflare, waf="vercel" for Vercel. If you get 429 from Vercel, it’s the HTTP/2 fingerprint. If you get 403 from CloudFront, you need HTTP/2 + client hints.
When a request fails, don’t guess — isolate. Test each transport variable
independently to find the one that matters:
Step 1: Try httpx http2=True (default)
→ Works? Done.
→ 429/403? Continue.
Step 2: Try httpx http2=False
→ Works? Vercel Security Checkpoint. Use http2=False, done.
→ Still 403? Continue.
Step 3: Try with full browser-like headers (Sec-Fetch-*, Sec-CH-UA, etc.)
→ Works? WAF header check. Add headers, done.
→ Still 403? Continue.
Step 4: Try with valid session cookies
→ Works? Auth required. Handle login first.
→ Still 403? It's TLS fingerprint-level.
Step 5: Use curl_cffi with Chrome impersonation
→ Works? Strict JA3/JA4 enforcement. Use curl_cffi.
→ Still 403? Something non-standard (CAPTCHA, IP block).
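The ladder is pure decision logic, so it can be captured in a few lines. A sketch — the outcome flags and diagnosis strings are illustrative, not SDK API:

```python
def diagnose(h2_default_ok, h1_ok, headers_ok, cookies_ok, impersonation_ok):
    """Map debug-ladder outcomes (each step tried in order) to a diagnosis."""
    if h2_default_ok:
        return "works out of the box"
    if h1_ok:
        return "Vercel Security Checkpoint — use http2=False"
    if headers_ok:
        return "WAF header check — send full browser headers"
    if cookies_ok:
        return "auth required — handle login first"
    if impersonation_ok:
        return "strict JA3/JA4 enforcement — need Chrome TLS impersonation"
    return "non-standard block (CAPTCHA, IP reputation)"
```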
The key insight from the Exa reverse engineering session: test one variable
at a time. During Exa testing, we created a matrix of http2=True/False x
cookies/no-cookies x headers/no-headers and discovered that ONLY the h2
setting mattered. Cookies and headers were completely irrelevant to the
Vercel checkpoint. This prevented unnecessary complexity in the skill code.
The engine’s wreq client already emits Chrome’s exact TLS cipher suites, GREASE values, extension ordering, ALPN, and HTTP/2 SETTINGS frames. Skills should never use httpx, requests, or curl_cffi directly — agentos.http handles all of this automatically.
Skills must use agentos.http for all HTTP — never urllib, requests, httpx, or subprocess directly. All I/O goes through SDK modules (http.get/post, shell.run, sql.query) so the engine can log, gate, and manage requests.
Every http.headers() call sets User-Agent, Accept-Language, and Accept-Encoding. These are normal browser headers — not WAF-specific. Override via extra= if needed.
Plus device hints: Device-Memory, Downlink, DPR, ECT, RTT, Viewport-Width
Plus Cache-Control: max-age=0, Upgrade-Insecure-Requests: 1
Amazon’s Lightsaber bot detection checks these device hints. Without them, auth pages redirect to login. The mode="navigate" knob handles all of this automatically.
The Chrome version in Sec-CH-UA is pinned in sdk/agentos/http.py (_UA and _WAF dicts).
If you start getting unexpected 403s months later, the pinned version may be too old.
Update the version strings in the SDK to match the current stable Chrome release.
Use the Playwright skill’s capture_network or the fetch interceptor to see exactly
what headers a real browser sends on the same request. Compare with http.headers() output
and add any missing ones via extra=.
Some sites inject JavaScript-driven features via cookies. When you’re scraping
with HTTPX (no JS engine), these features produce unusable output. The fix:
strip the trigger cookies so the server falls back to plain HTML.
Amazon uses a system called SiegeClientSideDecryption to encrypt page content
client-side. When the csd-key cookie is present, Amazon sends encrypted HTML
blobs instead of readable content. The browser decrypts them with JavaScript;
HTTPX gets unreadable garbage.
Solution: strip the trigger cookies using skip_cookies= on http.client():
```python
with http.client(cookies=cookie_header, skip_cookies=_SKIP_COOKIES,
                 **http.headers(waf="cf", mode="navigate", accept="html")) as c:
    resp = c.get(url)
```
The engine filters these cookies out of the jar before sending. With csd-key stripped, Amazon serves plain, parseable HTML. The csm-hit and aws-waf-token cookies are also stripped — they’re telemetry/WAF cookies that can trigger additional client-side behavior.
When you send Accept-Encoding: gzip, deflate, br, zstd (as all browser-like profiles do), the server will compress its response. Your HTTP client must decompress it. If it doesn’t, you get raw binary garbage instead of HTML — and every parser returns zero results.
This is a silent failure. The HTTP status is 200, the headers look normal, and Content-Length is reasonable. But resp.text is garbled bytes. It looks like client-side encryption (see above), but the cause is much simpler: the response is compressed and you’re not decompressing it.
The Rust HTTP engine uses wreq with gzip, brotli, deflate, and zstd feature flags enabled. Decompression is automatic and transparent — resp["body"] is always plaintext.
Brotli (RFC 7932) is a compression algorithm designed by Google for the web. It compresses 20-26% better than gzip on HTML/CSS/JS. Every modern browser supports it, and servers aggressively use it for large pages. Amazon’s order history page, for example, returns ~168KB of brotli-compressed HTML. Without decompression, you get 168KB of binary noise and zero order cards.
The trap: small pages (homepages, API endpoints) may not be compressed or may use gzip which some clients handle by default. Large pages (order history, dashboards, search results) almost always use brotli. So your skill works on simple endpoints and silently fails on the important ones.
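The failure mode is easy to demonstrate with stdlib gzip (brotli behaves identically but needs a third-party module). The wire bytes decode to mojibake; only decompression restores the page:

```python
import gzip

html = b"<html><body>order card #1</body></html>"
wire_body = gzip.compress(html)  # what the server actually sends

# Naively treating wire bytes as text yields garbage, not HTML —
# this is what resp.text looks like when decompression is missing:
garbled = wire_body.decode("utf-8", errors="replace")

# Transparent decompression (what the engine does) restores the page:
restored = gzip.decompress(wire_body)
```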
Some services track request patterns and flag direct deep-links from an unknown
session as bot traffic. The fix: warm the session by visiting the homepage
first, then navigate to the target page.
```python
def _warm_session(client) -> None:
    """Visit homepage first to provision session cookies."""
    client.get("https://www.amazon.com/")  # homepage URL is site-specific
```
This establishes the session context (cookies, CSRF tokens, tracking state)
before hitting authenticated pages. Without it, Amazon redirects order history
and account pages to the login page even with valid session cookies.
When to warm:
Before any authenticated page fetch (order history, account settings)
When the first request to a deep URL returns a login redirect despite valid cookies
When you see WAF-level blocks only on direct navigation
When warming isn’t needed:
API endpoints (JSON responses) — they don’t use page-level session tracking
Public pages without authentication
Sites where direct deep-links work fine (test first)
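The warm-on-redirect pattern can be wrapped in a retry helper. A sketch — the client API here (dict responses with status/location keys) is a stand-in for illustration, not the SDK's response shape:

```python
def fetch_with_warming(client, homepage: str, target: str):
    """Fetch a deep link; on a login redirect, warm the session and retry once."""
    resp = client.get(target)
    if resp.get("status") in (301, 302) and "signin" in resp.get("location", ""):
        client.get(homepage)       # warm: provision cookies / session state
        resp = client.get(target)  # retry the deep link
    return resp

class FakeClient:
    """Test double: deep links redirect to login until the homepage is visited."""
    def __init__(self):
        self.warmed = False
    def get(self, url):
        if url == "https://example.com/":
            self.warmed = True
            return {"status": 200}
        if not self.warmed:
            return {"status": 302, "location": "/ap/signin"}
        return {"status": 200, "body": "orders"}
```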
Default Playwright/Chromium gets blocked by many sites (Goodreads returns 403,
Cloudflare serves challenge pages). The fix is a set of anti-fingerprinting settings.
Default Playwright against goodreads.com/book/show/4934 returns HTTP 403 with
one network request. With stealth settings, the page loads fully with 1400+ requests
including 4 AppSync GraphQL calls. See discover_via_browser() in skills/goodreads/public_graph.py for the implementation.
Even with the stealth settings above, Playwright is still detectable at the
Chrome DevTools Protocol (CDP) layer. These signals are invisible in
DevTools and unrelated to headers, cookies, or user-agent strings. They matter
most during reverse engineering sessions — if a site behaves differently under
Playwright than in your real browser, CDP leaks are likely the cause.
Playwright calls Runtime.Enable on every CDP session to receive execution
context events. Anti-bot systems (Cloudflare, DataDome) detect this with a few
lines of in-page JavaScript that only fire when Runtime.Enable is active.
This is the single most devastating detection vector — it works regardless of
all other stealth measures.
Playwright appends //# sourceURL=__playwright_evaluation_script__ to every
page.evaluate() call. Any page script can inspect error stack traces and see
these telltale URLs. This means your __NEXT_DATA__ extraction, DOM inspection,
or any other evaluate() call leaves a fingerprint.
Playwright creates an isolated world named __playwright_utility_world__ that
is visible in Chrome’s internal state and potentially to detection scripts.
These leaks are baked into Playwright’s source code — no launch flag or init
script fixes them. Two options:
For most RE work: The stealth settings above (flags, UA, viewport,
webdriver override) are enough. Most sites don’t check CDP-level signals.
If a site seems to behave differently under Playwright, check for these
leaks before adding complexity.
For strict sites (Cloudflare Bot Management, DataDome): Use
rebrowser-playwright
as a drop-in replacement. It patches Playwright’s source to eliminate
Runtime.Enable calls, randomize sourceURLs, and rename the utility
world. Install: npm install rebrowser-playwright and change your import.
This doesn’t affect production skills. Our architecture uses Playwright
only for discovery — production calls go through surf() / HTTPX, which has
zero CDP surface. The CDP leaks only matter during reverse engineering sessions
where you’re using the browser to investigate a protected site.
When a cookie provider (brave-browser, firefox) extracts cookies for a domain like .uber.com, it returns cookies from ALL subdomains: .uber.com, .riders.uber.com, .auth.uber.com, .www.uber.com. If the skill’s base_url is https://riders.uber.com, sending cookies from .auth.uber.com is wrong — the server picks the wrong csid and redirects to login.
The engine implements RFC 6265 domain matching: when resolving cookies, it extracts the host from connection.base_url and passes it to the cookie provider. The provider filters cookies so only matching ones are returned:
```
host = "riders.uber.com"
.uber.com      → riders.uber.com ends with .uber.com → KEEP (parent domain)
.auth.uber.com → riders.uber.com doesn't match       → DROP (sibling subdomain)
.www.uber.com  → riders.uber.com doesn't match       → DROP (sibling subdomain)
```
This is automatic — skills don’t need to do anything. The filtering happens in the cookie provider (brave-browser/get-cookie.py, firefox/firefox.py) based on the host parameter the engine passes from connection.base_url.
When it matters: Only when a domain has cookies on multiple subdomains with the same cookie name. Most skills are unaffected — Amazon, Goodreads, Chase all have cookies on a single domain. Uber is the first case where it matters.
The old workaround: Before RFC 6265 filtering, the Uber skill had a _filter_cookies() function that deduplicated by cookie name (last occurrence wins). This has been removed — the provider handles it correctly now.
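The domain-matching rule (RFC 6265 §5.1.3) is a one-liner. A standalone reimplementation for illustration — the production logic lives in the cookie providers, not skill code:

```python
def domain_matches(host: str, cookie_domain: str) -> bool:
    """RFC 6265 domain match: host equals the domain or is a subdomain of it."""
    domain = cookie_domain.lstrip(".")
    return host == domain or host.endswith("." + domain)

# The Uber case from above: only the parent-domain cookie survives.
cookies = {".uber.com": "sid=a", ".auth.uber.com": "csid=b", ".www.uber.com": "csid=c"}
kept = {d: v for d, v in cookies.items() if domain_matches("riders.uber.com", d)}
```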
http.cookies() uses the same auth resolver as connection-based auth: it tries all installed cookie providers (brave-browser, firefox, etc.), picks the best one, and returns a cookie header string. No hardcoded provider names in skill code.
When you’re stuck, use Playwright to intercept the actual XHR and log all headers, including those added by axios interceptors that aren’t visible in DevTools.
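Once you have the captured headers, the comparison step is mechanical. A stdlib sketch of diffing a browser capture against our builder's output (the example header names are illustrative):

```python
def missing_headers(browser: dict, ours: dict) -> dict:
    """Headers the browser sent that our builder didn't — candidates for extra=.

    Header names are compared case-insensitively."""
    have = {k.lower() for k in ours}
    return {k: v for k, v in browser.items() if k.lower() not in have}

# Illustrative capture: intercepted browser request vs http.headers() output.
captured = {"User-Agent": "...", "x-csrf-token": "abc", "x-uber-device": "web"}
built = {"user-agent": "...", "accept-language": "en-US"}
extra = missing_headers(captured, built)  # pass this dict via extra=
```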