Please, Clean Captures Only. How to Fetch Only 200 OK Results with CDX API Filters.

When you’re querying the Wayback Machine through the CDX API, you’re not just pulling up a list of snapshots - you’re opening a door to everything archive.org has seen for a given domain or page: redirects, errors, incomplete loads, even spam captures.

It’s useful data. But if you’re trying to rebuild a site, audit content changes, or simply extract clean HTML for analysis, you don’t want to sift through every failed attempt or redirect loop.

You want pages that loaded successfully. That actually worked.
You want 200 OK.

In this guide, we’ll walk through how to use CDX API filters to return only successful captures - those that responded with HTTP status code 200. Whether you’re doing page comparisons, bulk archiving, or looking for reliable renderings, this is one of the most effective ways to clean up your dataset before you even start processing it.

And if you're planning to track how a page evolved over time, filtering by 200 status makes it easier to use tools like the Wayback Machine “Changes” feature without noise from broken or skipped snapshots.

Why Status Codes Matter in Archival Queries

The CDX API gives you a structured way to pull all available captures for a URL or domain. But unless you tell it otherwise, it returns everything: pages that failed to load, ones that redirected somewhere else, error pages, and even blank or malformed responses.

The result? Long, messy lists. And unless you're careful, you might build a timeline that includes snapshots of "404 Not Found" pages or redirects that point you in circles.

Filtering by statuscode:200 solves that.

HTTP 200 means the server successfully returned the content. No errors, no detours. Just a clean page - exactly what you want when reconstructing a digital footprint or extracting page text for analysis.
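To get a feel for how much noise an unfiltered query carries, you can tally the status codes in a result set before filtering. The sketch below is illustrative only: the sample rows are made up, and it assumes the JSON output layout in which the first row holds the field names.

```python
from collections import Counter

# Hypothetical sample in CDX JSON layout: header row first, then one row per capture.
sample = [
    ["urlkey", "timestamp", "original", "mimetype", "statuscode", "digest", "length"],
    ["com,example)/", "20200101000000", "http://example.com/", "text/html", "200", "AAA", "1200"],
    ["com,example)/", "20200201000000", "http://example.com/", "text/html", "301", "BBB", "300"],
    ["com,example)/", "20200301000000", "http://example.com/", "text/html", "404", "CCC", "450"],
    ["com,example)/", "20200401000000", "http://example.com/", "text/html", "200", "DDD", "1250"],
]

def status_counts(rows):
    """Count captures per status code, skipping the header row."""
    if not rows:
        return Counter()
    header, *captures = rows
    idx = header.index("statuscode")
    return Counter(row[idx] for row in captures)

counts = status_counts(sample)
print(counts)
```

In this made-up sample, half of the captures are redirects or errors, which is exactly the material the status filter removes.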

The Core Query

To request only 200 OK captures, add this to your CDX API call:

 
filter=statuscode:200

A full example might look like:

 
https://web.archive.org/cdx/search/cdx?url=example.com&output=json&filter=statuscode:200

This will return only those captures where archive.org recorded an HTTP 200 response. You’ll still get all the usual fields (timestamp, original URL, MIME type, status code, digest, and so on), but filtered down to what’s actually usable.
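Here is a minimal sketch of consuming that query from Python using only the standard library. The function names are hypothetical, and it assumes the JSON output puts the field names in the first row and that an empty body or an empty list means no captures.

```python
import json
from urllib.request import urlopen

# The example query from above, unchanged.
QUERY_URL = (
    "https://web.archive.org/cdx/search/cdx"
    "?url=example.com&output=json&filter=statuscode:200"
)

def parse_cdx_json(text):
    """Turn CDX JSON output (header row + data rows) into a list of dicts."""
    text = text.strip()
    if not text:
        return []  # assumption: an empty body means no matching captures
    rows = json.loads(text)
    if not rows:
        return []
    header, *captures = rows
    return [dict(zip(header, row)) for row in captures]

def fetch_ok_captures(query_url=QUERY_URL, timeout=30):
    """Fetch the query and parse it; performs a network request when called."""
    with urlopen(query_url, timeout=timeout) as resp:
        return parse_cdx_json(resp.read().decode("utf-8"))
```

Calling fetch_ok_captures() returns one dict per capture, so you can read fields by name (for example capture["timestamp"]) instead of by column position.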

This small addition can dramatically reduce the size and noise of your result set - especially for domains with long histories and lots of inconsistent traffic.

Combining with Other Filters

You can go further. The CDX API lets you stack multiple filters in a single query.

Want only successful HTML pages?

 
filter=statuscode:200&filter=mimetype:text/html
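If you build queries in code, repeating the filter parameter is easiest with a list of key/value pairs rather than a dict, since a dict cannot hold the same key twice. A small sketch, with a hypothetical helper name:

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

def cdx_url(url, filters=(), collapse=None):
    """Build a CDX query URL; each entry in `filters` becomes its own filter= parameter."""
    params = [("url", url), ("output", "json")]
    params += [("filter", f) for f in filters]
    if collapse:
        params.append(("collapse", collapse))
    # safe=":/" keeps "statuscode:200" and "text/html" readable instead of percent-encoded.
    return CDX_ENDPOINT + "?" + urlencode(params, safe=":/")

html_only = cdx_url("example.com", filters=["statuscode:200", "mimetype:text/html"])
print(html_only)
```

The same helper accepts a collapse value, so cdx_url("example.com", ["statuscode:200"], collapse="digest") produces a query with the collapse parameter appended.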

Need to collapse repeated captures of the same content, such as identical re-crawls over a short period?

 
filter=statuscode:200&collapse=digest

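This gives you one capture per run of identical content - handy for detecting actual updates, not just re-crawls. Note that only adjacent captures with the same digest are collapsed, so content that changes and later reverts will show up again.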
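To make the collapsing behavior concrete, here is a local imitation of it on made-up data. It is a sketch of the idea, not the server's implementation, and it assumes that collapsing applies to adjacent rows only, as the CDX server documentation describes.

```python
from itertools import groupby

def collapse_adjacent_digest(captures):
    """Keep the first capture of each run of consecutive identical digests."""
    return [next(group) for _, group in groupby(captures, key=lambda c: c["digest"])]

# Hypothetical captures, already sorted by timestamp.
captures = [
    {"timestamp": "20200101000000", "digest": "AAA"},
    {"timestamp": "20200105000000", "digest": "AAA"},  # re-crawl, same content
    {"timestamp": "20200201000000", "digest": "BBB"},  # real update
    {"timestamp": "20200301000000", "digest": "AAA"},  # reverted to old content
]

kept = collapse_adjacent_digest(captures)
print([c["timestamp"] for c in kept])
```

The January 5 re-crawl is dropped, while the March capture survives even though its digest was seen before, because the February update sits between them.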

Looking for one good snapshot per day or month? Add collapse=timestamp:8 to keep one capture per day (the first 8 digits of the timestamp, YYYYMMDD) or collapse=timestamp:6 to keep one per month (the first 6 digits, YYYYMM).
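The timestamp collapse is easy to picture as a prefix comparison on the 14-digit timestamp. The sketch below imitates it locally on made-up captures for a single URL; the function name is hypothetical and the input is assumed to be sorted by timestamp.

```python
def first_per_period(captures, digits):
    """Keep the first capture for each distinct timestamp prefix of length `digits`."""
    seen = set()
    kept = []
    for capture in captures:  # assumes captures are sorted by timestamp
        period = capture["timestamp"][:digits]
        if period not in seen:
            seen.add(period)
            kept.append(capture)
    return kept

# Hypothetical captures of one URL.
captures = [
    {"timestamp": "20200101080000"},
    {"timestamp": "20200101200000"},  # same day as the first
    {"timestamp": "20200102090000"},  # same month, new day
    {"timestamp": "20200215000000"},  # new month
]

daily = first_per_period(captures, 8)    # like collapse=timestamp:8
monthly = first_per_period(captures, 6)  # like collapse=timestamp:6
print(len(daily), len(monthly))
```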

These filters and collapse options give you precision - essential when you're not just collecting data, but trying to make sense of it.

Checking for Edge Cases

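Sometimes you’ll find snapshots that report 200 OK but still load poorly - missing CSS, broken images, or blank JavaScript-rendered content. The status code only describes the response for the captured URL itself, not the assets the page depends on. This is common for modern web apps and older dynamically loaded pages.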

To work around that:

  • Prefer older static pages or endpoints if possible

  • Use filter=mimetype:text/html to weed out raw file captures or misclassified binary responses

  • Always spot-check a few results manually before trusting a dataset at scale

And remember: archive.org doesn’t always capture supporting files alongside HTML. If you're rebuilding a site or auditing layout, you may need to fetch CSS and JS separately.
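For the manual spot-check suggested above, one approach is to draw a small random sample of captures and open their replay URLs in a browser. This is a sketch with hypothetical helper names; it assumes each capture is a dict with "timestamp" and "original" keys and uses the usual web.archive.org/web/<timestamp>/<url> replay address.

```python
import random

def replay_url(capture):
    """Build the Wayback Machine replay URL for one capture."""
    return f"https://web.archive.org/web/{capture['timestamp']}/{capture['original']}"

def spot_check_sample(captures, k=5, seed=0):
    """Pick up to k captures at random (reproducibly) and return their replay URLs."""
    rng = random.Random(seed)  # fixed seed so the same sample can be re-checked later
    k = min(k, len(captures))
    return [replay_url(c) for c in rng.sample(captures, k)]

# Hypothetical captures.
captures = [
    {"timestamp": f"2020{month:02d}01000000", "original": "http://example.com/"}
    for month in range(1, 13)
]

for link in spot_check_sample(captures, k=3):
    print(link)
```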

Ask Less But Get More!

The CDX API is powerful - but like any open dataset, it’s only useful when you know how to shape the response. Filtering by statuscode:200 is one of the simplest, most effective ways to clean up your queries and focus on what matters.

Because when you’re reconstructing a web page, building timelines, or comparing changes, failed loads just get in the way.

So trim the fat. Ask smart.