Reverse Engineering — Discovery & Data Extraction
Once you can talk to the server (see Transport), how do you find and extract structured data?
Tool: browse capture (bin/browse-capture.py) is the primary discovery tool. It connects to your real browser (Brave/Chrome) via CDP and captures all network traffic with full headers and response bodies. For DOM inspection, use the browser’s own DevTools. See the overview for the full toolkit.
Why not Playwright? Playwright’s bundled Chromium has a detectable TLS fingerprint. Sites like Amazon and Cloudflare-protected services reject it. CDP to a real browser produces authentic fingerprints and uses existing sessions. See Transport.
Next.js + Apollo Cache Extraction
Many modern sites (Goodreads, Airbnb, etc.) use Next.js with Apollo Client. These pages ship a full serialized Apollo cache in the HTML: structured entity data that you can parse without scraping visible HTML.
Where to find it
    <script id="__NEXT_DATA__" type="application/json">{ ... }</script>

Inside that JSON:

    __NEXT_DATA__
      .props.pageProps
      .props.pageProps.apolloState             <-- the gold
      .props.pageProps.apolloState.ROOT_QUERY

How Apollo normalized cache works
Apollo stores GraphQL results as a flat dictionary keyed by entity type and ID.
Related entities are stored as {"__ref": "Book:kca://book/..."} pointers.

    import json
    import re

    def extract_next_data(html: str) -> dict:
        match = re.search(
            r'<script id="__NEXT_DATA__" type="application/json">(.*?)</script>',
            html,
            re.S,
        )
        if not match:
            raise RuntimeError("No __NEXT_DATA__ found")
        return json.loads(match.group(1))

    def deref(apollo: dict, value):
        """Resolve Apollo __ref pointers to their actual objects."""
        if isinstance(value, dict) and "__ref" in value:
            return apollo.get(value["__ref"])
        return value

Extraction pattern

    next_data = extract_next_data(html)
    apollo = next_data["props"]["pageProps"]["apolloState"]
    root_query = apollo["ROOT_QUERY"]

    # Find the entity by its query key
    book_ref = root_query['getBookByLegacyId({"legacyId":"4934"})']
    book = apollo[book_ref["__ref"]]

    # Dereference related entities
    work = deref(apollo, book.get("work"))
    primary_author = deref(apollo, book.get("primaryContributorEdge", {}).get("node"))

What you typically find in the Apollo cache
| Entity type | Common fields |
|---|---|
| Books | title, description, imageUrl, webUrl, legacyId, details (isbn, pages, publisher) |
| Contributors | name, legacyId, webUrl, profileImageUrl |
| Works | stats (averageRating, ratingsCount), details (originalTitle, publicationTime) |
| Social signals | shelf counts (CURRENTLY_READING, TO_READ) |
| Genres | name, webUrl |
| Series | title, webUrl |
The Apollo cache often contains more data than the visible page renders. Always
dump and inspect apolloState before assuming you need to make additional API calls.
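One way to do that inspection is to group the cache keys by entity type before reading individual fields. The sketch below assumes the usual Type:id key convention and an optional __typename field; the helper name summarize_apollo_state and the sample cache are hypothetical, not part of the repo.

```python
from collections import defaultdict


def summarize_apollo_state(apollo: dict) -> dict:
    """Group Apollo cache entries by entity type and collect the field names seen."""
    summary = defaultdict(lambda: {"count": 0, "fields": set()})
    for key, entity in apollo.items():
        if not isinstance(entity, dict):
            continue
        # Prefer the explicit __typename; otherwise use the key prefix ("Book:..." -> "Book")
        typename = entity.get("__typename") or key.split(":", 1)[0]
        summary[typename]["count"] += 1
        summary[typename]["fields"].update(f for f in entity if f != "__typename")
    return {
        name: {"count": info["count"], "fields": sorted(info["fields"])}
        for name, info in summary.items()
    }


# Hypothetical, hand-written cache fragment for illustration only
sample = {
    "ROOT_QUERY": {"__typename": "Query", "getBookByLegacyId": {"__ref": "Book:1"}},
    "Book:1": {"__typename": "Book", "title": "Example", "work": {"__ref": "Work:9"}},
    "Work:9": {"__typename": "Work", "stats": {"averageRating": 4.1}},
}
overview = summarize_apollo_state(sample)
```

Printing the overview for a real page shows at a glance which entity types and fields are already in the HTML.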
Real example: Goodreads
See skills/goodreads/public_graph.py functions load_book_page() and
map_book_payload() for a complete implementation that extracts 25+ fields from
the Apollo cache without any GraphQL calls.
JS Bundle Scanning
SPAs embed everything in their JavaScript bundles: config values, API keys, custom endpoints, and auth flow logic. Scanning bundles is one of the highest-value reverse engineering techniques. It works without login, reveals hidden endpoints that network capture misses, and exposes the exact contracts the frontend uses.
Two levels of bundle scanning
Level 1: Config extraction. Find API keys, endpoints, and tenant IDs with a standard search for known patterns.
Level 2: Endpoint and flow discovery. Find custom API endpoints that aren’t in the standard framework (e.g. /api/verify-otp), and understand what parameters they accept and how the frontend processes the response. This is how you crack custom auth flows.
General pattern

    import re
    from urllib.parse import urljoin

    import httpx

    def scan_bundles(page_url: str, search_terms: list[str]) -> dict:
        """Fetch a page, extract all JS bundle URLs, scan each for search terms."""
        with httpx.Client(http2=False, follow_redirects=True, timeout=30) as client:
            html = client.get(page_url).text

            # Extract all JS chunk URLs (Next.js / Turbopack pattern)
            js_urls = list(set(re.findall(
                r'["\'](/_next/static/[^"\' >]+\.js[^"\' >]*)', html
            )))

            results = {}
            for url in js_urls:
                # Chunk paths are root-relative, so resolve them against the page origin
                js = client.get(urljoin(page_url, url)).text
                js_lower = js.lower()
                for term in search_terms:
                    # Record the first occurrence of each term, with surrounding context
                    idx = js_lower.find(term.lower())
                    if idx != -1:
                        context = js[max(0, idx - 100):idx + 200]
                        results.setdefault(term, []).append({
                            "chunk": url[-40:],
                            "size": len(js),
                            "context": context,
                        })
            return results

Config patterns to search for
| What | Search terms |
|---|---|
| API keys | apiKey, api_key, X-Api-Key, widgetsApiKey |
| GraphQL endpoints | appsync-api, graphql |
| Tenant / namespace | host.split, subdomain |
| Cognito credentials | userPoolId, userPoolClientId |
| Auth endpoints | AuthFlow, InitiateAuth, cognito-idp |
Custom endpoint patterns to search for
| What | Search terms |
|---|---|
| Custom auth flows | verify-otp, verify-code, verify-token, confirm-code |
| Hidden API routes | fetch(, /api/ |
| Token construction | callback/email, hashedOtp, rawOtp, token= |
| Form submission handlers | submit, handleSubmit, onSubmit |
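Searching for fetch( tends to produce many hits, so it can help to pull out only the literal first argument of each call. The following is a minimal sketch of that idea; the helper name extract_fetch_targets and the sample snippet are hypothetical, and URLs that are built dynamically (concatenation, template expressions) will not be captured cleanly.

```python
import re

# Matches fetch("...") / fetch('...') / fetch(`...`) with a literal first argument
FETCH_CALL = re.compile(r"""fetch\(\s*(["'`])([^"'`]+)\1""")


def extract_fetch_targets(js: str) -> list[str]:
    """Return the distinct literal URLs or paths passed to fetch(), in first-seen order."""
    seen = []
    for match in FETCH_CALL.finditer(js):
        target = match.group(2)
        if target not in seen:
            seen.append(target)
    return seen


# Hypothetical minified snippet for illustration
snippet = 'a=fetch("/api/verify-otp",{method:"POST"});b=fetch(\'/api/session\');fetch("/api/verify-otp")'
targets = extract_fetch_targets(snippet)
```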
How we cracked Exa’s custom OTP flow
Exa’s login page uses a custom 6-digit OTP system built on top of NextAuth. The standard NextAuth callback failed with error=Verification. Scanning the JS bundles revealed the actual flow:

    # Search terms that found the hidden endpoint
    results = scan_bundles("https://auth.exa.ai", ["verify-otp", "verify-code", "callback/email"])

In a 573KB chunk, this surfaced:

    fetch("/api/verify-otp", {
      method: "POST",
      headers: {"Content-Type": "application/json"},
      body: JSON.stringify({email: e.toLowerCase(), otp: r})
    })
    // → response: {email, hashedOtp, rawOtp}
    // → constructs: token = hashedOtp + ":" + rawOtp
    // → redirects to: /api/auth/callback/email?token=...&email=...

This revealed the entire auth flow (custom endpoint, request/response shape, and token construction), all from static JS analysis.
Multi-environment configs
Many sites ship all environment configs in the same bundle. Goodreads ships four AppSync configurations with labeled environments:

    {"graphql":{"apiKey":"da2-...","endpoint":"https://...appsync-api...amazonaws.com/graphql","region":"us-east-1"},"showAds":false,"shortName":"Dev"}
    {"graphql":{"apiKey":"da2-...","endpoint":"https://...appsync-api...amazonaws.com/graphql","region":"us-east-1"},"showAds":false,"shortName":"Beta"}
    {"graphql":{"apiKey":"da2-...","endpoint":"https://...appsync-api...amazonaws.com/graphql","region":"us-east-1"},"showAds":true,"shortName":"Preprod"}
    {"graphql":{"apiKey":"da2-...","endpoint":"https://...appsync-api...amazonaws.com/graphql","region":"us-east-1"},"showAds":true,"shortName":"Prod"}

Pick the right one by looking for identifiers like shortName, showAds: true, publishWebVitalMetrics: true, or simply taking the last entry (Prod is typically last in webpack build output).
The “Authorization is the namespace” pattern
Some APIs use the Authorization header not for a JWT but for a tenant namespace extracted from the subdomain at runtime:

    Jl = () => host.split(".")[0]  // -> "boulderingproject"
    headers: { Authorization: Jl(), "X-Api-Key": widgetsApiKey }

If you see Authorization values that seem too short to be JWTs, look for the function that generates them near the axios/fetch client factory in the bundle.
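Replaying such a request from Python means reproducing the same derivation. This is a minimal sketch that assumes the namespace is simply the first label of the site hostname, as in the bundle snippet; the function name, hostname, and key are hypothetical.

```python
from urllib.parse import urlparse


def namespace_headers(site_url: str, api_key: str) -> dict[str, str]:
    """Build headers where Authorization carries the tenant subdomain, not a token."""
    host = urlparse(site_url).hostname or ""
    namespace = host.split(".")[0]  # mirrors host.split(".")[0] in the bundle
    return {"Authorization": namespace, "X-Api-Key": api_key}


# Hypothetical hostname and key, for illustration only
headers = namespace_headers("https://boulderingproject.example.com/schedule", "demo-key")
```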
Real examples
- Goodreads: skills/goodreads/public_graph.py discover_from_bundle(): extracts Prod AppSync config from the _app chunk
- Austin Boulder Project: skills/austin-boulder-project/abp.py: API key and namespace from Tilefive bundle
Error-Message Probing
When the bundle reveals the endpoint but not the exact request body, just make a wrong request and read the error. Server error messages often leak the missing field name directly, especially for Node/Express/Knex APIs, which pass SQL errors through mostly unsanitized.
Technique
- Mint a valid auth token (see 3-auth).
- Call the endpoint with an obviously-incomplete body.
- Read the error. Iterate.
Real example: Austin Boulder Project book_class
The Tilefive bundle showed the endpoint was POST /bookings/{id}/customers with some body t, but the body shape was buried in minified closures.

First attempt:

    curl -s -X POST "https://portal.api.prod.tilefive.com/bookings/837772/customers" \
      -H "Authorization: $IDTOKEN" \
      -H "Content-Type: application/json" \
      -d '{"numGuests": 0}'

Response:

    {"message":"Undefined binding(s) detected when compiling SELECT. Undefined column(s): [id] query: select `Customer`.* from `Customer` where `id` = ? limit ?"}

The server leaked its actual SQL: it’s running SELECT * FROM Customer WHERE id = ? and the ? is undefined. So the body needs a customer ID. Next attempt includes customerId:

    curl -s -X POST ... -d '{"customerId": 1128331, "numGuests": 0}'

Response changes to a business-rule error (Pass or Membership required), which means the body shape is correct and we’re past data validation. Total: two requests to reverse the payload.
When this fails
- Sanitized APIs (Stripe, Anthropic, most big vendors) return generic {error: "Bad Request"}; they don’t leak schema. Fall back to bundle scanning for the request builder.
- GraphQL APIs return schema-introspection errors that are even more informative. Try introspection first: {"query": "{__schema{types{name}}}"}.
- Errors with stack traces often expose the server file path (/var/task/api/Customer/booking.js:175992:41), which gives you the service name and sometimes hints at the data model.
Why it works on Node/Express
Unhandled database errors in Express often propagate to the default error handler with the full Knex/Sequelize error object intact. Production code should catch and sanitize, but most don’t. Tilefive, Discourse, and many mid-size platforms leak this way. Treat server errors as free documentation.
Error messages as a state machine
Distinct error messages are state transitions, so treat them that way. Each failure mode narrows the search space:
- "Undefined binding [id]": schema-level complaint → body missing a required field.
- "Pass or Membership required": body structure valid, business rule failed → check entitlements.
- "Reservation Already Made!": body + entitlements valid, duplicate-key failure → the first attempt already succeeded.
When you see a new distinct message, ask: “what state did I just enter?” and “what single body change gets me to the next?” Usually two or three iterations reveal the full required-field set. Log each attempt’s body + response so you can diff what changed.
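The state-machine view can be made concrete with a small classifier that a probing loop calls after every attempt. This is only a sketch: the state names and the keyword patterns are assumptions derived from the three Tilefive messages quoted in this section, and other APIs will need their own patterns.

```python
import re

# Ordered (state, pattern) pairs; the patterns come from the three messages above
STATES = [
    ("missing_field", re.compile(r"Undefined binding|Undefined column", re.I)),
    ("business_rule", re.compile(r"required", re.I)),
    ("already_done", re.compile(r"already", re.I)),
]


def classify_error(message: str) -> str:
    """Map a server error message to a coarse probing state."""
    for state, pattern in STATES:
        if pattern.search(message):
            return state
    return "unknown"


def missing_columns(message: str) -> list[str]:
    """Pull column names out of a Knex-style 'Undefined column(s): [a, b]' message."""
    match = re.search(r"Undefined column\(s\): \[([^\]]*)\]", message)
    return [c.strip() for c in match.group(1).split(",")] if match else []


knex_message = (
    "Undefined binding(s) detected when compiling SELECT. "
    "Undefined column(s): [id] query: select `Customer`.* from `Customer` where `id` = ? limit ?"
)
state = classify_error(knex_message)
columns = missing_columns(knex_message)
```

An "unknown" result is the cue to stop and read the raw message by hand before adding a new pattern.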
When to stop reading the bundle
Bundle scanning is great for static payload structure: API paths, header templates, config constants. It’s bad for runtime-dependent payload shape, like “what do I send when I have membership X vs pass Y.” That logic often lives in minified closures you can’t easily call.
Heuristic: if the request body depends on server state you can query (your memberships, your cart contents, your permissions), skip the bundle and probe with error-iteration — it’s usually 1-2 requests. If the body depends on client-only state (form data, computed tokens, hashing), the bundle is where the answer lives.
Navigation API Interception
When JS bundle scanning reveals what endpoint gets called but not what happens with the result (e.g. a client-side token construction), you need to see the actual values the browser produces. The Navigation API interceptor is the key technique.
The problem
Client-side JS often does: fetch → process response → set window.location.href.
Once the navigation fires, the page is gone and you can’t inspect the URL. Network
capture only catches the fetch, not the outbound navigation. And the processing
logic is buried in minified closures you can’t easily call.
The solution
Modern Chrome exposes the Navigation API. You can intercept navigation attempts, capture the destination URL, and prevent the actual navigation, all with a single evaluate call:

    evaluate { script: "navigation.addEventListener('navigate', (e) => { window.__intercepted_nav_url = e.destination.url; e.preventDefault(); }); 'interceptor installed'" }

Then trigger the action (click a button, submit a form), and read the captured URL:

    click { selector: "button#submit" }
    evaluate { script: "window.__intercepted_nav_url" }

The URL contains whatever the client-side JS constructed (tokens, hashes, callback parameters), fully assembled and ready to replay with HTTPX.
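Once the URL is captured, splitting it into its endpoint and decoded parameters makes the token structure easier to read. A minimal sketch using only the standard library; the function name and the captured value (with placeholder token parts) are hypothetical.

```python
from urllib.parse import parse_qs, urlsplit


def split_captured_url(url: str) -> dict:
    """Break an intercepted navigation URL into endpoint and decoded query parameters."""
    parts = urlsplit(url)
    # parse_qs returns lists; keep the first value of each parameter
    params = {k: v[0] for k, v in parse_qs(parts.query).items()}
    return {"endpoint": f"{parts.scheme}://{parts.netloc}{parts.path}", "params": params}


# Hypothetical captured value with placeholder token parts
captured = "https://auth.example.com/api/auth/callback/email?token=abc%3A123456&email=me%40example.com"
info = split_captured_url(captured)
```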
When to use this
| Situation | Technique |
|---|---|
| Button click makes a fetch() call | Fetch interceptor (see 3-auth) |
| Button click causes a page navigation | Navigation API interceptor |
| Form does a native POST (page reloads) | Inspect the <form> action + inputs |
| JS constructs a URL and redirects | Navigation API interceptor |
Real example: Exa OTP verification
The Exa auth page’s “VERIFY CODE” button calls /api/verify-otp, gets back
{hashedOtp, rawOtp}, then does window.location.href = callback_url_with_token.
The Navigation API interceptor captured the full callback URL, revealing the
token format is {bcrypt_hash}:{raw_code}.
This technique turned a “Playwright required” flow into a fully HTTPX-replayable one. See NextAuth OTP flow.
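With the token format known, the callback URL can be rebuilt without a browser. The sketch below assumes the callback takes only token and email, in the shape seen in the bundle (/api/auth/callback/email?token=...&email=...), and that the email is lowercased as in the verify-otp request; a real deployment may expect extra parameters. The function name and all values are placeholders.

```python
from urllib.parse import urlencode


def build_callback_url(base_url: str, email: str, hashed_otp: str, raw_otp: str) -> str:
    """Assemble the NextAuth email callback URL from a verify-otp response."""
    token = f"{hashed_otp}:{raw_otp}"  # token = hashedOtp + ":" + rawOtp
    # urlencode percent-escapes "$", "/", ":" and "@", which bcrypt hashes and emails contain
    query = urlencode({"token": token, "email": email.lower()})
    return f"{base_url}/api/auth/callback/email?{query}"


# Placeholder values; a real hashedOtp is a full bcrypt string
url = build_callback_url("https://auth.example.com", "Me@Example.com", "$2b$10$abc", "123456")
```

Encoding matters here because bcrypt output contains characters such as $ and / that would otherwise be misread in a query string.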
Combining with fetch interception
For complete visibility, install both interceptors before triggering an action:

    // Capture all fetch calls AND navigations
    window.__cap = { fetches: [], navigations: [] };

    // Fetch interceptor
    const origFetch = window.fetch;
    window.fetch = async (...args) => {
      const r = await origFetch(...args);
      const c = r.clone();
      window.__cap.fetches.push({
        url: typeof args[0] === 'string' ? args[0] : args[0]?.url,
        status: r.status,
        body: (await c.text()).substring(0, 3000),
      });
      return r;
    };

    // Navigation interceptor
    navigation.addEventListener('navigate', (e) => {
      window.__cap.navigations.push(e.destination.url);
      e.preventDefault();
    });

Read everything after: evaluate { script: "JSON.stringify(window.__cap)" }
Read the Source
When bundle scanning and interception give you the what but not the why, go read the library’s source code. This is especially valuable for well-known frameworks (NextAuth, Supabase, Clerk, Auth0) where the source is on GitHub.
Why this matters
Minified bundle code tells you what the client does. The library source tells you what the server expects. These are two halves of the same flow.
Example: NextAuth email callback
Bundle scanning revealed Exa calls /api/auth/callback/email?token=.... But what does the server do with that token? Reading the NextAuth callback source revealed the critical line:

    token: await createHash(`${paramToken}${secret}`)

The server SHA-256 hashes token + NEXTAUTH_SECRET and compares the result with the database. This told us the token format must be stable and deterministic: it can’t be a random value. Combined with the Navigation API interception that showed token = hashedOtp:rawOtp, we had the complete picture.
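The server-side check can be modeled in a few lines to see why the token has to be deterministic. This is a sketch of the comparison described above, not NextAuth’s code: it assumes the digest is hex-encoded, and the function names, token, and secret are placeholders.

```python
import hashlib
import hmac


def hash_token(token: str, secret: str) -> str:
    """SHA-256 hex digest of token + secret, mirroring createHash(`${paramToken}${secret}`)."""
    return hashlib.sha256(f"{token}{secret}".encode("utf-8")).hexdigest()


def token_matches(token: str, secret: str, stored_hash: str) -> bool:
    """Compare a freshly computed hash with the stored one in constant time."""
    return hmac.compare_digest(hash_token(token, secret), stored_hash)


# Placeholder token and secret; the real secret never leaves the server
stored = hash_token("hashed:123456", "not-a-real-secret")
```

Because the stored value is a hash of the exact token string, any change to the token (even reordering its two halves) fails the lookup, which matches the error=Verification result seen with the standard callback.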
When to read the source
| Signal | Action |
|---|---|
| Standard framework (NextAuth, Supabase, etc.) | Read the auth callback handler source |
| Custom error messages (e.g. error=Verification) | Search the library source for that error string |
| Token/hash format is unclear | Read the token verification logic |
| Framework does something “impossible” | The source always reveals how |
Where to find it
- NextAuth: github.com/nextauthjs/next-auth/tree/main/packages/core/src
- Supabase: github.com/supabase/auth
- Clerk: github.com/clerk/javascript
- Auth0: github.com/auth0/nextjs-auth0
Search the repo for the endpoint path (e.g. callback/email) or error message (e.g. Verification) to find the relevant handler quickly.
GraphQL Schema Discovery via JS Bundles
Production GraphQL endpoints almost never allow introspection queries. But the frontend JS bundles contain every query and mutation the app uses.
Technique: scan all JS chunks for operation names

    import re

    import httpx

    def discover_graphql_operations(html: str, base_url: str) -> set[str]:
        """Find all GraphQL operation names from the frontend JS bundles."""
        chunks = re.findall(r'(/_next/static/chunks/[a-zA-Z0-9/_%-]+\.js)', html)
        operations = set()
        for chunk in chunks:
            # Download the chunk source
            js = httpx.get(f"{base_url}{chunk}", follow_redirects=True, timeout=30).text
            # Find query/mutation declarations
            for m in re.finditer(r'(?:query|mutation)\s+([A-Za-z_]\w*)\s*[\(\{]', js):
                operations.add(m.group(1))
        return operations

What this finds
On Goodreads, scanning 18 JS chunks revealed 38 operations:
Queries (public reads): getReviews, getSimilarBooks, getSearchSuggestions,
getWorksByContributor, getWorksForSeries, getComments, getBookListsOfBook,
getSocialSignals, getWorkCommunityRatings, getWorkCommunitySignals, …
Queries (auth required): getUser, getViewer, getEditions,
getSocialReviews, getWorkSocialReviews, getWorkSocialShelvings, …
Mutations: RateBook, ShelveBook, UnshelveBook, TagBook, Like,
Unlike, CreateComment, DeleteComment
Extracting full query strings
Once you know the operation name, extract the full query with its variable shape:

    def extract_query(js: str, operation_name: str) -> str | None:
        idx = js.find(f"query {operation_name}")
        if idx == -1:
            return None
        snippet = js[idx:idx + 3000]
        depth = 0
        # Walk the braces until the operation's outermost block closes
        for i, c in enumerate(snippet):
            if c == "{":
                depth += 1
            elif c == "}":
                depth -= 1
                if depth == 0:
                    return snippet[:i + 1].replace("\\n", "\n")
        return None

This gives you copy-pasteable GraphQL documents you can replay directly via HTTP POST.
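A replay is an ordinary JSON POST carrying operationName, query, and variables. The sketch below only assembles the request so it can be inspected before sending; it assumes API-key auth through an X-Api-Key header, and the function name, endpoint, key, and operation are hypothetical.

```python
import json


def build_graphql_request(endpoint: str, api_key: str, operation_name: str,
                          query: str, variables: dict) -> dict:
    """Describe a GraphQL-over-HTTP POST; pass the parts to any HTTP client."""
    return {
        "url": endpoint,
        "headers": {
            "Content-Type": "application/json",
            "X-Api-Key": api_key,  # assumed API-key auth, as with the AppSync configs
        },
        "body": json.dumps({
            "operationName": operation_name,
            "query": query,
            "variables": variables,
        }),
    }


# Hypothetical operation, endpoint, and key for illustration
request = build_graphql_request(
    "https://example.invalid/graphql",
    "demo-key",
    "getThing",
    "query getThing($id: ID!) { getThing(id: $id) { title } }",
    {"id": "42"},
)
```

With httpx, the request could then be sent as httpx.post(request["url"], headers=request["headers"], content=request["body"]).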
Real example: Goodreads
See skills/goodreads/public_graph.py for the full set of proven GraphQL queries
including getReviews, getSimilarBooks, getSearchSuggestions,
getWorksForSeries, and getWorksByContributor.
Public vs Auth Boundary Mapping
After discovering operations, you need to determine which ones work anonymously (with just the public API key) and which require user session auth.
Technique: probe each operation and classify the error
Send each discovered operation to the public endpoint and classify the response:
| Response | Meaning |
|---|---|
| 200 with data | Public, works anonymously |
| 200 with errors: ["Not Authorized to access X on type Y"] | Partially public: the operation works but specific fields are viewer-scoped. Remove the blocked field and retry. |
| 200 with errors: ["MappingTemplate" / VTL error] | Requires auth: the AppSync resolver needs session context to even start |
| 403 or 401 | Requires auth at the transport level |
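The table translates fairly directly into a classifier. This sketch assumes errors arrive either as plain strings or as objects with message and an optional errorType field; the function name and the classification labels are hypothetical.

```python
import re


def classify_graphql_response(status: int, payload: dict) -> tuple[str, list[str]]:
    """Return (classification, blocked_fields) for one probed operation."""
    if status in (401, 403):
        return "auth_required_transport", []
    errors = payload.get("errors") or []
    blocked = []
    for err in errors:
        if isinstance(err, str):
            err = {"message": err}  # tolerate errors given as bare strings
        if err.get("errorType") == "MappingTemplate":
            return "auth_required_resolver", []
        match = re.search(r"Not Authorized to access (\w+) on type (\w+)", err.get("message", ""))
        if match:
            blocked.append(f"{match.group(2)}.{match.group(1)}")
    if blocked:
        return "partially_public", blocked
    if payload.get("data") is not None and not errors:
        return "public", []
    return "unknown", []


result = classify_graphql_response(
    200,
    {"data": {"getReviews": None},
     "errors": [{"message": "Not Authorized to access viewerHasLiked on type Review"}]},
)
```

The blocked list names the fields to drop from the query before retrying.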
AppSync VTL errors as a signal
AWS AppSync uses Velocity Template Language (VTL) resolvers. When a public request hits an auth-gated resolver, you get a distinctive error:

    {
      "errorType": "MappingTemplate",
      "message": "Error invoking method 'get(java.lang.Integer)' in [Ljava.lang.String; at velocity[line 20, column 55]"
    }

This means: “the resolver tried to read user context from the auth token and failed.” It reliably indicates the operation needs authentication.
Field-level authorization
GraphQL auth on AppSync is often field-level, not operation-level. A getReviews query might work, but including viewerHasLiked returns:

    { "message": "Not Authorized to access viewerHasLiked on type Review" }

The fix: remove the viewer-scoped field from your query. The rest works fine publicly.
Goodreads boundary scorecard
| Operation | Public? | Notes |
|---|---|---|
| getSearchSuggestions | Yes | Book search by title/author |
| getReviews | Yes | Except viewerHasLiked and viewerRelationshipStatus |
| getSimilarBooks | Yes | |
| getWorksForSeries | Yes | Series book listings |
| getWorksByContributor | Yes | Needs internal contributor ID (not legacy author ID) |
| getUser | No | VTL error, needs session |
| getEditions | No | VTL error, needs session |
| getViewer | No | Viewer-only by definition |
| getWorkSocialShelvings | Partial | May need session for full data |
Heterogeneous Page Stacks
Large sites migrating to modern frontends have mixed page types. You need to identify which pages use which stack and adjust your extraction strategy.
How to identify the stack
| Signal | Stack |
|---|---|
| <script id="__NEXT_DATA__"> in HTML | Next.js (server-rendered, may have Apollo cache) |
| GraphQL/AppSync XHR traffic after page load | Modern frontend with GraphQL backend |
| No __NEXT_DATA__, classic <div> structure, <meta> tags | Legacy server-rendered HTML |
| window.__INITIAL_STATE__ or similar | React SPA with custom state hydration |
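The HTML-visible signals can be checked automatically when crawling many page types. A minimal sketch, assuming only the static HTML is available (XHR traffic has to come from a network capture); the function name and the labels are hypothetical.

```python
import re

# (label, pattern) pairs taken from the HTML-visible signals in the table above
STACK_SIGNALS = [
    ("nextjs", re.compile(r'<script[^>]*id="__NEXT_DATA__"')),
    ("custom_state_hydration", re.compile(r"window\.__INITIAL_STATE__")),
]


def detect_stack(html: str) -> str:
    """Guess the page stack from static HTML; XHR traffic needs a network capture instead."""
    for label, pattern in STACK_SIGNALS:
        if pattern.search(html):
            return label
    return "legacy_html"


stack = detect_stack('<script id="__NEXT_DATA__" type="application/json">{}</script>')
```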
Goodreads example
| Page type | Stack | Extraction strategy |
|---|---|---|
| Book pages (/book/show/) | Next.js + Apollo + AppSync | __NEXT_DATA__ for main data, GraphQL for reviews/similar |
| Author pages (/author/show/) | Legacy HTML | Regex scraping |
| Profile pages (/user/show/) | Legacy HTML | Regex scraping |
| Search pages (/search) | Legacy HTML | Regex scraping |
Strategy: use structured extraction where available, fall back to HTML only where the site hasn’t migrated yet. As the site migrates pages, move your extractors to match.
Legacy HTML Scraping
When a page has no structured data surface, regex scraping is the fallback.
Principles
- Prefer specific anchors (IDs, class names, itemprop attributes) over positional matching
- Use re.S (dotall) for multi-line HTML patterns
- Extract sections first, then parse within the section to reduce false matches
- Always strip and unescape HTML entities
Section extraction pattern

    def section_between(html: str, start_marker: str, end_marker: str) -> str:
        start = html.find(start_marker)
        if start == -1:
            return ""
        end = html.find(end_marker, start)
        return html[start:end] if end != -1 else html[start:]

When to stop scraping
If you find yourself writing regex patterns longer than 3 lines, consider:
- Is there a __NEXT_DATA__ payload you missed?
- Does the page make XHR calls you could replay directly?
- Can you use a headless browser to get the rendered DOM instead?
HTML scraping should be the strategy of last resort, not the first attempt.
Real-World Examples in This Repo
| Skill | Discovery technique | Reference |
|---|---|---|
| skills/exa/ | JS bundle scanning for custom /api/verify-otp endpoint + Navigation API interception for token format + reading NextAuth source for server-side verification logic | exa.py, nextauth.md |
| skills/goodreads/ | Next.js Apollo cache + AppSync GraphQL + JS bundle scanning | public_graph.py |
| skills/austin-boulder-project/ | JS bundle config extraction (API key + namespace) | abp.py |
| skills/claude/ | Session cookie capture via Playwright | claude-login.py |