Handling Huge CDX Results Without Losing Your Mind: Smart Ways to Use limit, offset, and collapse

Some domains are lightweights in the archive: ten captures, maybe twenty. Others? You hit them with a CDX query and they dump tens of thousands of results back at you like a firehose with no valve.

If you’re working with old news sites, blogs, or anything that’s been online for more than a few years, the CDX API can return so much data that it becomes unusable. That’s when you need to tame the output, and the best way to do that is to learn how to use limit, offset, and collapse together.


The Problem with Large CDX Queries

Let’s say you’re scanning an old music blog that started in 2008 and posted daily for a decade. Every page, every tag, every redirect - archive.org saved most of it. Now you want to see what’s still retrievable, but the API keeps timing out or your browser grinds to a halt.

That’s not just annoying; it makes deep archival work feel impossible. Even structured tools or scripts hit the wall eventually if you don’t break the problem down. The issue isn’t the data itself. It’s how you ask for it.

Use limit and offset to Slice the Archive

The limit parameter sets how many entries the CDX API should return at once. By default, it gives you a lot. That’s great if you’re only calling it once, but risky if you’re dealing with big domains.

Setting a lower limit, like 1000 or even 500, gives you more control. Pair that with offset, and you can move through the archive in manageable chunks.

It’s like paging through an old photo album one spread at a time, instead of dumping the whole thing onto the table.

Start with:

limit=1000&offset=0

Then next run:

limit=1000&offset=1000

…and so on.

You’ll still get everything, but now it’s split into pieces your tools (and the server) can handle.
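If you’re scripting this, the paging pattern is only a few lines. Here’s a minimal sketch in Python with the requests library; the endpoint is the public Wayback CDX server, while example.com and the fetch_chunk helper are placeholders for illustration, not a fixed recipe.

import requests

CDX = "https://web.archive.org/cdx/search/cdx"

def fetch_chunk(domain, limit=1000, offset=0):
    # Ask for one slice of the index as JSON (the first row of the response is a header).
    params = {
        "url": f"{domain}/*",   # trailing wildcard = prefix match: every capture under the domain
        "output": "json",
        "limit": limit,
        "offset": offset,
    }
    resp = requests.get(CDX, params=params, timeout=60)
    resp.raise_for_status()
    if not resp.text.strip():
        return []               # an empty body means nothing matched this slice
    return resp.json()[1:]      # drop the header row

offset = 0
while True:
    chunk = fetch_chunk("example.com", limit=1000, offset=offset)
    if not chunk:
        break                   # empty slice: we’ve walked off the end of the results
    # ...inspect or store the rows here...
    offset += 1000

Each pass of the loop pulls one 1,000-row slice, exactly like the manual offset bumps above, so a timeout or crash only ever costs you the chunk you were working on.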

Collapse Repetitions and Noise

One reason CDX dumps feel overwhelming is repetition. A single popular URL might’ve been captured 50 times in one day and 5,000 times over a year. You don’t always need every version of it.

The collapse parameter helps clean this up. It groups similar captures together based on a field you choose, like the URL key or the content digest, so you don’t get flooded with near-duplicates.

The most common use:

collapse=urlkey

This will reduce the list to one record per unique URL, which is often enough when you're surveying structure or trying to identify which pages existed.

You can get more aggressive with:

collapse=digest

which keeps a capture only when its content digest differs from the previous one, so unchanged snapshots get filtered out no matter how many times they were saved. Use that carefully, though. Sometimes pages really did change slightly, and you might miss that nuance.
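To make that concrete, here are both variants as full queries (example.com is just a stand-in for whatever domain you’re surveying):

https://web.archive.org/cdx/search/cdx?url=example.com/*&collapse=urlkey&limit=1000

https://web.archive.org/cdx/search/cdx?url=example.com/*&collapse=digest&limit=1000

The first returns one row per unique URL; the second drops a row whenever its digest matches the capture before it.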

When to Combine These (and Why)

Here’s where it all comes together. If you’re looking at a large domain and want to analyze it cleanly:

  • Use limit to cap the output

  • Use offset to page through chunks

  • Use collapse=urlkey to avoid getting buried in repetitive URLs

By combining all three, you get a clearer picture of what was captured, without overwhelming your tools or yourself.
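As a rough template (example.com again standing in for your target), a single pass looks something like this, with offset bumped by the limit on each follow-up call:

https://web.archive.org/cdx/search/cdx?url=example.com/*&collapse=urlkey&limit=1000&offset=0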

This is the approach our Expired Domain Scanner takes when you want to extract pages from massive sites. It fetches data in stages, then lets you extract or inspect specific content afterward.

Things to Watch Out For

CDX queries don’t always behave how you expect, especially when you start paging with offsets or combining filters. If the archive is uneven (lots of captures one year, very few the next), your slices might come back lopsided. One page returns 1000 results, the next barely 100.

And if you add filters like filter=statuscode:200 or date ranges (from=2015&to=2019), they apply after pagination and not before. This can make it look like your later pages are empty, when in reality the filtering already removed everything from that chunk.

So don’t panic if offset 4000 suddenly returns nothing. It may just be a side effect of your filters narrowing things too much.
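To picture what that looks like in practice, here’s a filtered slice deep into a hypothetical archive (placeholder domain, arbitrary date range), the kind of request that can legitimately come back empty without meaning the archive has run dry:

https://web.archive.org/cdx/search/cdx?url=example.com/*&filter=statuscode:200&from=2015&to=2019&limit=1000&offset=4000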

Where to Go Next

The best thing about using limit, offset, and collapse is that you stay in control. You can explore even the most bloated archive safely, and target only the parts that matter.

Want to get fancy? Combine this with MIME-type filters (filter=mimetype:text/html), specific URL paths (url=example.com/articles/*), or timestamp ranges. But always start with the basics. Get a feel for how CDX pagination works before adding complexity.
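As one sketch of that fancier end, with everything except the parameter names being a placeholder, here’s HTML-only captures from a single section of a site, collapsed to one row per URL:

https://web.archive.org/cdx/search/cdx?url=example.com/articles/*&filter=mimetype:text/html&collapse=urlkey&limit=500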

For a deeper dive into fine-tuning these parameters and avoiding common pitfalls, we’ve got a full write-up here: How to control result limits in Wayback Machine CDX queries

It’s a bit like learning how to drive a stick shift - slower at first, but once you master it, you’ll be unstoppable...