What Is a CDX File and How It Helps You Work with Archive.org

The Wayback Machine is a fascinating tool. You type in a URL, and suddenly you’re back in 2005 looking at an old homepage you forgot ever existed. But behind that magical experience is a quiet, invisible structure that makes it all possible — the CDX file.

This article explains what a CDX file is, how it works, and why it matters if you’re trying to analyze or rebuild content from archive.org.

What is a CDX file?

A CDX file is a plain-text index that archive.org uses to keep track of archived web pages. Every snapshot saved by the Wayback Machine is recorded as one line in that index.

Think of it as a database — only it’s a plain text list, where each line contains information about one archived capture: the URL, the time it was captured, the file type, status code, and more.

This file is what allows archive.org to instantly tell you which versions of a page are available and when they were captured.

Why should you care?

If you’re just casually browsing archive.org, you don’t need to worry about CDX. But if you:

  • work with web history,
  • analyze lost websites,
  • restore old content,
  • or build tools that rely on archived web data,

then understanding CDX is essential.

It gives you direct access to structured metadata about what was archived, when, and under what conditions — without needing to click through snapshots manually.

What does a CDX line look like?

Here’s an example of a single line from a CDX file:

com,example)/index 20210101120000 http://example.com/index text/html 200 ABC123 -

Let’s break it down:

Field                       Description
com,example)/index          The reversed domain and page path (used for sorting)
20210101120000              The capture timestamp (format: YYYYMMDDhhmmss)
http://example.com/index    The original URL
text/html                   MIME type of the content
200                         HTTP status code returned at the time of capture
ABC123                      Digest or checksum of the content
-                           Placeholder for a field that has no recorded value

This line tells us that on January 1, 2021 at 12:00:00 UTC, the page http://example.com/index was captured, was served as HTML, and returned status code 200.
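To make the breakdown concrete, here is a minimal Python sketch that splits the example line into named fields. The `Capture` type and the `parse_cdx_line` function are hypothetical names chosen for illustration, and the sketch assumes space-separated fields in the order shown above, with the timestamp read as UTC. Real CDX files can carry more fields, so anything after the digest is kept unparsed.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class Capture(NamedTuple):
    """One CDX line, using the six fields described in the article."""
    urlkey: str
    timestamp: datetime
    original: str
    mimetype: str
    statuscode: str  # kept as text, since a field may hold "-" instead of a number
    digest: str
    extra: tuple     # any trailing fields, left unparsed


def parse_cdx_line(line: str) -> Capture:
    """Split one space-separated CDX line into named fields."""
    parts = line.split()
    if len(parts) < 6:
        raise ValueError(f"expected at least 6 fields, got {len(parts)}")
    urlkey, ts, original, mimetype, statuscode, digest, *rest = parts
    # Assumption: the 14-digit timestamp is YYYYMMDDhhmmss in UTC.
    captured = datetime.strptime(ts, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    return Capture(urlkey, captured, original, mimetype, statuscode, digest, tuple(rest))


capture = parse_cdx_line(
    "com,example)/index 20210101120000 http://example.com/index text/html 200 ABC123 -"
)
print(capture.timestamp.isoformat(), capture.original, capture.statuscode)
```

The status code stays a string on purpose, so a line with a missing value does not break the parser.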

How can you access CDX data?

The Internet Archive provides a public CDX Server API, which you can query to retrieve capture data for any site.

Here’s a basic example:

https://web.archive.org/cdx/search/cdx?url=example.com&output=json&limit=5

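This query asks archive.org for up to five captures of example.com, formatted as JSON. A positive limit returns the first matching captures in the index, which for a single URL are the oldest ones; to get the five most recent captures instead, use a negative value such as limit=-5.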
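With output=json, the server replies with a list of rows in which the first row names the fields. The sketch below shows one way to turn such a response into dictionaries. The `rows_to_dicts` helper is a hypothetical name, and the sample response uses placeholder values rather than real captures, so the example runs without a network connection.

```python
import json

# Illustrative response text shaped like the server's JSON output.
# The values are placeholders, not real captures.
sample_response = """[
  ["urlkey", "timestamp", "original", "mimetype", "statuscode", "digest", "length"],
  ["com,example)/", "20210101120000", "http://example.com/", "text/html", "200", "ABC123", "1024"],
  ["com,example)/", "20210201120000", "http://example.com/", "text/html", "200", "DEF456", "2048"]
]"""


def rows_to_dicts(rows):
    """Pair each data row with the field names from the header row."""
    if not rows:
        return []
    header, *data = rows
    return [dict(zip(header, row)) for row in data]


captures = rows_to_dicts(json.loads(sample_response))
for c in captures:
    print(c["timestamp"], c["original"], c["statuscode"])
```

In a real script, the response text would come from an HTTP request to the query URL, for example with `urllib.request.urlopen`.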

You can also use optional parameters like:

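  • matchType=domain: include the domain and all of its subdomains
  • from=20150101: return only captures made on or after that date (add to= to set an end date)
  • filter=statuscode:200: include only captures that returned a successful response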

For developers and researchers, this means full control over how you access and filter archived data — without manually browsing.
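As a rough sketch of how these parameters combine, the following Python snippet assembles a query URL from the endpoint and options mentioned above. The `build_cdx_query` helper is a hypothetical name, and it only builds the URL; it does not send the request.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(url, **params):
    """Return a CDX Server query URL for `url` plus any extra parameters."""
    query = {"url": url, **params}
    # urlencode percent-encodes reserved characters such as ":" in filter values.
    return f"{CDX_ENDPOINT}?{urlencode(query)}"


query_url = build_cdx_query(
    "example.com",
    output="json",
    matchType="domain",
    filter="statuscode:200",
    limit=5,
    **{"from": "20150101"},  # `from` is a Python keyword, so it is passed via a dict
)
print(query_url)
```

Keeping the parameters in a dictionary makes it easy to add or drop a filter without hand-editing the query string.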

What can you do with CDX?

Here are just a few examples of how CDX data can be used:

  • Restore a deleted website: Get the list of archived pages, timestamps, and URLs, then rebuild the site from snapshots.
  • Analyze site history: Track how a site evolved over time by comparing captures.
  • Recover lost content: Find and extract text from pages that no longer exist online.
  • SEO forensics: Identify when a site had broken links, redirects, or changes in content types.
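Two of these tasks, restoring pages and tracking changes, can be sketched in a few lines of Python. The example assumes the usual Wayback Machine replay address of the form https://web.archive.org/web/TIMESTAMP/URL, and treats two captures with the same digest as the same content. The function names and the sample captures are hypothetical placeholders, not real archive data.

```python
def snapshot_url(timestamp, original, raw=False):
    """Build a Wayback Machine replay URL for one capture.

    With raw=True the `id_` modifier is appended to the timestamp, which
    asks for the archived content without the Wayback toolbar and link rewriting.
    """
    modifier = "id_" if raw else ""
    return f"https://web.archive.org/web/{timestamp}{modifier}/{original}"


def distinct_versions(captures):
    """Keep only captures whose digest differs from the previous one for the same URL."""
    last_digest = {}
    kept = []
    for c in captures:
        if last_digest.get(c["original"]) != c["digest"]:
            kept.append(c)
        last_digest[c["original"]] = c["digest"]
    return kept


# Placeholder captures, oldest first; not real archive data.
captures = [
    {"timestamp": "20210101120000", "original": "http://example.com/", "digest": "ABC123"},
    {"timestamp": "20210201120000", "original": "http://example.com/", "digest": "ABC123"},
    {"timestamp": "20210301120000", "original": "http://example.com/", "digest": "DEF456"},
]
for c in distinct_versions(captures):
    print(snapshot_url(c["timestamp"], c["original"], raw=True))
```

Skipping captures with an unchanged digest avoids downloading the same page repeatedly when only a handful of versions actually differ.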

Understanding CDX is good for your health

CDX files may not be well known, but they are the backbone of the Wayback Machine. If you’re serious about working with web archives, understanding CDX is not optional — it’s the most efficient way to interact with archived data at scale.

At Smartial.net, we build tools that use CDX data behind the scenes, so you don’t have to. But if you’re curious and want to dig deeper, learning how CDX works is a great place to start.
