How to Use the CDX API to List All Captures of a Domain
Sometimes you don’t want just one snapshot. You want all of them. Every archived version of a homepage, every shift in a login screen, every removed subpage or forgotten download link.
That’s where the Wayback Machine’s CDX API comes in. It’s not flashy, but it’s one of the most powerful ways to extract full capture data from archive.org, especially when you’re doing OSINT, investigating a domain’s history, or trying to rebuild a dead site from the ground up.
The interface is plain. The results are raw. But what it gives you is control.
Here’s how to use the CDX API to list every capture of a domain - and why that list is more valuable than it looks.
What the CDX API Actually Is
The CDX API is a public, open endpoint offered by archive.org that returns structured information about archived content - URL, timestamp, status code, MIME type, and more. It’s what powers the Wayback Machine’s own interface behind the scenes.
With a single query, you can request every snapshot of a domain or page, sort it, filter it, and analyze it offline.
The format is simple and efficient. And because it skips all the interface overhead, it’s ideal for automation, scripting, or just digging without distractions.
In the world of OSINT, this kind of access is gold. As we outlined in our article on Archive.org's role in open-source intelligence, the ability to trace full URL histories is often what separates guesswork from verification.
How to Write a Basic Query
To start, just build a URL like this:

https://web.archive.org/cdx/search/cdx?url=example.com&output=json

Replace example.com with your target domain or page. The API returns a JSON array of all captures, with details like:
Timestamp (e.g., 20200405120030)
Original URL
MIME type (e.g., text/html)
HTTP status code (e.g., 200, 301, 404)
Digest (used for de-duplication)
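In Python, that query-and-parse flow can be sketched in a few lines. The endpoint is the public CDX API, but build_query and parse_rows are illustrative helper names, not part of any library:

```python
import urllib.parse

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

def build_query(target, **params):
    """Return a CDX API URL for a target domain or page.
    Extra parameters go in as keywords, e.g. limit=10."""
    query = {"url": target, "output": "json"}
    query.update(params)
    return CDX_ENDPOINT + "?" + urllib.parse.urlencode(query)

def parse_rows(rows):
    """The JSON output is a list of lists; row 0 names the fields."""
    if not rows:
        return []
    header = rows[0]
    return [dict(zip(header, row)) for row in rows[1:]]
```

Fetch build_query("example.com", limit=10) with urllib.request.urlopen (or requests), decode the JSON, and pass it to parse_rows to get one dict per capture.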
It’s minimal, but extremely precise. Want fewer results? Add &limit=5. Want only captures after a certain date? Add &from=20150101. Want only full HTML pages? Filter by &filter=mimetype:text/html.
You can pull hundreds or even thousands of entries this way. The output can be piped into a spreadsheet, a parser, or a comparison tool.
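Getting the output into a spreadsheet takes almost no code, since the JSON rows map directly onto CSV rows. A minimal sketch using a hard-coded sample response (the digest value is a placeholder):

```python
import csv
import io

# Sample rows in the shape the CDX API returns with output=json:
# a header row followed by one row per capture.
rows = [
    ["urlkey", "timestamp", "original", "mimetype", "statuscode", "digest", "length"],
    ["com,example)/", "20200405120030", "http://example.com/",
     "text/html", "200", "EXAMPLEDIGEST", "1234"],
]

buf = io.StringIO()
csv.writer(buf).writerows(rows)
csv_text = buf.getvalue()  # write this to a .csv file, open in a spreadsheet
```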
Collapse, Offset, and Pagination Tricks
For large domains, you’ll hit thousands of captures. That’s where the collapse parameter helps. Use collapse=urlkey to list only unique URLs, or collapse=digest to reduce duplicate captures of identical content.
Add offset=1000 and limit=1000 to paginate through long result sets. This helps if you’re scripting a complete scan or rebuilding an archived site structure.
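Scripted pagination boils down to a loop that advances the offset until a short or empty page comes back. A sketch, where fetch_page stands in for whatever function performs the actual CDX request:

```python
def paginate(fetch_page, page_size=1000):
    """Yield every capture by walking offset/limit windows.
    fetch_page(offset, limit) should return a list of parsed rows."""
    offset = 0
    while True:
        page = fetch_page(offset=offset, limit=page_size)
        yield from page
        if len(page) < page_size:  # a short page means we've hit the end
            return
        offset += page_size
```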
For deep content - like subdomains or hidden assets - combining &matchType=prefix (or &matchType=domain, which sweeps in subdomains) with smart filtering lets you zero in on forgotten corners.
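Once a broad match query returns its result set, client-side filtering does the zeroing in. For example, listing every distinct hostname that appears in the original column surfaces subdomains at a glance (a sketch; unique_hosts is an illustrative name):

```python
from urllib.parse import urlsplit

def unique_hosts(originals):
    """Distinct hostnames from a list of archived 'original' URLs -
    a quick way to surface subdomains in a broad result set."""
    hosts = {urlsplit(u).hostname for u in originals}
    return sorted(h for h in hosts if h)
```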
This kind of granular control is essential when you're archiving fast-moving or media-heavy content. Take Instagram, for instance - one of the hardest platforms to preserve. We covered some of the headaches (and fixes) in our piece on archiving Instagram content, where CDX-based scraping often plays a behind-the-scenes role in preserving off-platform media.
Use Cases That Actually Matter
Here’s where CDX data becomes more than just a nerd trick.
You can:
Track the first appearance of a domain in the archive
See when a site was most active by capture volume
Reconstruct full page histories, including now-deleted subpages
Discover vanity URLs, short-lived beta pages, or forgotten admin paths
Verify timing in legal or forensic contexts
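Several of these use cases reduce to simple operations on the timestamp column, because the 14-digit format (YYYYMMDDhhmmss) sorts chronologically as plain text. A sketch with illustrative helper names:

```python
from collections import Counter

def first_capture(timestamps):
    """Earliest capture - lexicographic min works because the
    14-digit timestamps sort chronologically as strings."""
    return min(timestamps)

def captures_per_year(timestamps):
    """Capture volume by year: a rough activity profile for a domain."""
    return Counter(ts[:4] for ts in timestamps)
```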
It’s especially valuable when doing long-term timeline work or when archive.org’s interface is too slow or incomplete for what you’re after.
Many analysts, journalists, and digital historians start with the Wayback UI but finish with CDX.
Final Thought: Structured Data Is Memory with a Backbone
A saved webpage is good. A list of every version it ever had is better.
That’s what the CDX API gives you: a clean, exportable way to see a domain’s past laid out in sequence. Not just as screenshots or artifacts, but as structured history you can work with.
You don’t need to be a developer to use it. Just a little curiosity, a browser bar, and the willingness to dig past the surface.
Because once you know what changed - and when - you can start asking the real questions. And more often than not, the answers are already there, just waiting in the list.