The Invisible Block: How to Check if robots.txt Affects Wayback Captures

You find the domain. You see the calendar. You click the year.

But instead of the archived page, you get the message:

“This page has been excluded from the Wayback Machine.”

It’s one of the more frustrating dead ends in archival work. You know the site existed. You can see the URLs. But the content? Gone - or more precisely, hidden. Not because it was deleted or never captured, but because a little text file told the Wayback Machine to stay out.

That file is called robots.txt, and it plays a surprisingly powerful role in shaping what gets archived - and what doesn’t.

This article walks you through how robots.txt works, why it sometimes blocks access to archived material, and most importantly, how to check whether it’s the reason you’re seeing gaps in the web’s memory.

What Is robots.txt, and Why Does It Matter?

robots.txt is a plain text file placed at the root of a website (like https://example.com/robots.txt) that tells web crawlers which parts of the site they are allowed - or not allowed - to access.

It was originally designed to help search engines behave politely. But the Wayback Machine also respects robots.txt... usually. And that’s where things get complicated.

If a domain’s robots.txt file says “don’t crawl /blog/,” the Wayback crawler will skip it. And if that robots.txt is added or updated later, it can even cause previously captured pages to become unavailable - retroactively.
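For instance, a rule like the following (the /blog/ path is just an illustration) is all it takes to keep compliant crawlers out of an entire section of a site:

User-agent: *
Disallow: /blog/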

It’s like someone drawing a curtain over a window you thought was open.

Signs That robots.txt Might Be Blocking a Capture

The most obvious sign is a message like:

“Page cannot be displayed due to robots.txt restrictions.”

This means the snapshot exists - but access has been blocked.

You might also notice:

  • Certain folders or file types (like /css/, /js/, or /admin/) are consistently unavailable, even when the rest of the site loads fine.

  • Archived pages that used to load now return errors.

  • HTML pages load, but their stylesheets or scripts don’t - resulting in broken or unstyled views.

That last one is common when styles or scripts were excluded from crawling. If that sounds familiar, our guide on how to extract CSS and JavaScript from archived snapshots explains workarounds and recovery methods.

How to Check a Site’s robots.txt File

To inspect a site’s current crawling rules, just visit:

https://example.com/robots.txt

You’ll see something like this:

User-agent: *
Disallow: /private/
Disallow: /downloads/

This tells all crawlers (User-agent: *) to avoid the listed paths. The Wayback Machine typically honors this file for its own bot, ia_archiver.
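If you’d rather check this programmatically, Python’s standard library includes a robots.txt parser. Here is a minimal sketch - the domain and page are placeholders, and it only reflects whatever robots.txt is live right now:

from urllib.robotparser import RobotFileParser

# Placeholder domain and page - swap in the site and URL you are investigating.
site = "https://example.com"
page = site + "/private/some-page.html"

parser = RobotFileParser()
parser.set_url(site + "/robots.txt")
parser.read()  # downloads and parses the live robots.txt

# "ia_archiver" is the crawler name mentioned above; "User-agent: *" rules apply to it as well.
if parser.can_fetch("ia_archiver", page):
    print("Today's robots.txt would allow this page to be crawled.")
else:
    print("Today's robots.txt disallows this page - a likely source of blocked or missing captures.")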

But here’s the twist: archive.org doesn’t always go by the current version of robots.txt - it may refer to historical versions, or obey a file that was added later, depending on the policy in effect at the time.
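One way to see what the crawler actually saw is to look up archived copies of the robots.txt file itself - the Wayback Machine stores it like any other URL. Browsing the capture calendar for it (example.com is a placeholder here) shows how the rules changed over time:

https://web.archive.org/web/*/example.com/robots.txt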

The Policy Has Changed (But the Effect Lingers)

For years, the Internet Archive honored robots.txt as a strict rule. If a domain blocked / - that is, the entire site - even previously captured pages were hidden from view.

But that policy shifted in 2017. Now archive.org may ignore robots.txt retroactively, especially if the domain has expired or the file blocks everything globally.

That said, some pages still remain inaccessible, either due to:

  • Permanent robots.txt entries still respected by the crawler

  • Legal takedowns or copyright requests

  • Archival gaps caused by robots.txt exclusions during active crawling years

This means that even if a domain is dead, the snapshot may still be missing due to past restrictions.

Want to Know If Your Target Was Blocked?

Use this trick:

  1. Visit the snapshot URL that should exist.

  2. If it shows a robots.txt block message, replace the timestamp in the snapshot URL with * to open the capture calendar for that page.

  3. Try different years - older versions may be accessible even if newer ones are blocked.

  4. Use archive.org’s CDX API to query all captures for a given URL and see if any exist - a minimal example follows this list.

  5. Finally, test the live robots.txt directly from the root - if the domain is still active.
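Here is the sketch promised in step 4 - a minimal query against archive.org’s public CDX endpoint, using only Python’s standard library (the target URL is a placeholder). It can confirm whether snapshots exist even when playback shows an error:

import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Placeholder target - swap in the page you are investigating.
target = "example.com/some/page.html"

query = urlencode({
    "url": target,
    "output": "json",  # first row is a header, the rest are captures
    "limit": "50",
})

with urlopen("https://web.archive.org/cdx/search/cdx?" + query) as resp:
    rows = json.load(resp)

if len(rows) <= 1:
    print("No captures recorded for this URL.")
else:
    header, captures = rows[0], rows[1:]
    ts, status = header.index("timestamp"), header.index("statuscode")
    for cap in captures:
        print(cap[ts], cap[status])

If the index lists captures but playback still shows the exclusion message, the snapshot likely exists and the block is an access restriction rather than a gap in crawling.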

This process won’t always restore access, but it will confirm whether the block is real or a side effect of crawler behavior.

The Softest Lock in the World

Robots.txt isn’t a wall. It’s a suggestion - one that most crawlers choose to respect.

But in the world of digital preservation, that tiny file has had an outsized impact. It has hidden sites, broken layouts, and silenced valuable versions of the past. All with a few lines of text.

The Wayback Machine does what it can. But it’s not immune to instructions left by site owners - or developers long gone.

So when you run into an invisible barrier in the archive, don’t just shrug. Check robots.txt.