How to Filter CDX API Results in the Internet Archive’s Wayback Machine

Kaudo

The CDX API behind the Wayback Machine offers powerful filtering tools to refine your results. Whether you’re looking for snapshots within a specific date range or only interested in certain response codes or content types, filters help you target exactly what you need.

Here’s how to use them effectively.

Filtering by Date Range

To narrow results to a specific time period, use the from= and to= parameters. These accept timestamps in the format used by the Wayback Machine:
yyyyMMddhhmmss

You can use as few digits as needed:

from=2010 starts the range at the beginning of 2010.
to=2011 ends the range at the end of 2011, so all captures up through that year are included.

Example:

`url=archive.org&from=2010&to=2010`

This retrieves all captures of archive.org during 2010.

Both boundaries are inclusive.
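As a minimal sketch of the date-range query described above, the snippet below assembles the full request URL in Python. The endpoint `http://web.archive.org/cdx/search/cdx` and the helper name `date_range_url` are assumptions made for illustration; adjust them to match the service you actually query.

```python
import re
from urllib.parse import urlencode

# Assumed public endpoint of the Wayback Machine CDX server.
CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def date_range_url(target, start, end):
    """Build a CDX query URL for captures of `target` between start and end.

    `start` and `end` are timestamp prefixes in yyyyMMddhhmmss form,
    truncated to as few digits as needed (for example "2010").
    """
    for stamp in (start, end):
        if not re.fullmatch(r"[0-9]{1,14}", stamp):
            raise ValueError(f"not a valid timestamp prefix: {stamp!r}")
    query = {"url": target, "from": start, "to": end}
    return f"{CDX_ENDPOINT}?{urlencode(query)}"


# Captures of archive.org during 2010 (both boundaries inclusive).
print(date_range_url("archive.org", "2010", "2010"))
```

Validating the prefixes before building the URL catches typos such as a stray letter in a timestamp before any request is sent.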

Filtering by Field Value

You can also filter based on the values in specific CDX fields, such as:

statuscode (HTTP status code like 200, 404, etc.)
mimetype (e.g., text/html, image/png)
digest (a hash of the captured content; identical content produces the same digest)
Any field listed in the CDX output schema

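The basic syntax is:

`filter=field:regex`

Here field is the name of a CDX field and regex is the pattern its value must match.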

You can specify multiple filters in a single query, and they will all be applied together.

Example:

`filter=statuscode:200`

This returns only results where the HTTP status code was 200 (OK).
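The following sketch shows one way to attach several field filters to a single query by repeating the `filter=` parameter. The endpoint constant and the helper `filtered_url` are assumptions for illustration, not part of the CDX API itself.

```python
from urllib.parse import urlencode

# Assumed public endpoint of the Wayback Machine CDX server.
CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def filtered_url(target, filters):
    """Build a CDX query URL with one filter= parameter per (field, regex) pair."""
    params = [("url", target)]
    for field, pattern in filters:
        params.append(("filter", f"{field}:{pattern}"))
    # Keep ':' and '/' unescaped so the filters stay readable in the URL.
    return f"{CDX_ENDPOINT}?{urlencode(params, safe=':/')}"


# Only captures that returned 200 and were served as text/html.
print(filtered_url("archive.org", [("statuscode", "200"), ("mimetype", "text/html")]))
```

Passing the parameters as a list of pairs, rather than a dictionary, is what allows the same key (`filter`) to appear more than once.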

Excluding Matches with Negation

To exclude certain values, simply add an exclamation mark ! before the field name.

Example:

`filter=!statuscode:200`

This will return only snapshots where the status code was not 200.

You can chain filters like this for even more control.

Example with multiple exclusions:

`filter=!statuscode:200&filter=!mimetype:text/html`

This retrieves all captures that were neither successful (200) nor standard web pages (text/html).
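To make the effect of chained exclusions concrete, here is a small local simulation that applies the same filter expressions to a handful of hypothetical records. It does not call the API. The sample records are invented, and the choice to match the pattern against the whole field value is an assumption about how the server evaluates filters.

```python
import re

# Hypothetical capture records, reduced to the two fields being filtered.
captures = [
    {"statuscode": "200", "mimetype": "text/html"},
    {"statuscode": "200", "mimetype": "image/png"},
    {"statuscode": "404", "mimetype": "text/html"},
    {"statuscode": "301", "mimetype": "application/json"},
]


def passes(record, filter_expr):
    """Apply one expression such as 'statuscode:200' or '!mimetype:text/html'."""
    negated = filter_expr.startswith("!")
    body = filter_expr[1:] if negated else filter_expr
    field, pattern = body.split(":", 1)
    # Assumption: the pattern must match the whole field value.
    matched = re.fullmatch(pattern, record[field]) is not None
    return not matched if negated else matched


def apply_filters(records, filter_exprs):
    """Keep only the records that satisfy every filter (logical AND)."""
    return [r for r in records if all(passes(r, f) for f in filter_exprs)]


remaining = apply_filters(captures, ["!statuscode:200", "!mimetype:text/html"])
print(remaining)
```

Only the 301 `application/json` record survives: the two 200 captures fail the first exclusion, and the 404 HTML page fails the second.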

Matching by Digest

The digest field is a hash of the captured content, so two captures with the same digest contain identical content. If you’re looking for captures that match (or don’t match) a specific piece of content, filter on the digest field.

Example:

`filter=digest:<hash>`

This returns only captures with that exact content fingerprint.
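If you have the content itself rather than its hash, a digest filter can be derived from it. The sketch below assumes the digest is the base32-encoded SHA-1 of the response body, which is the form commonly seen in Wayback CDX output. Treat that as an assumption and compare against a digest returned by the API before relying on it. The helper names are hypothetical.

```python
import base64
import hashlib


def content_digest(payload: bytes) -> str:
    """Return the base32-encoded SHA-1 of `payload` (assumed digest format)."""
    return base64.b32encode(hashlib.sha1(payload).digest()).decode("ascii")


def digest_filter(payload: bytes, exclude: bool = False) -> str:
    """Build a filter= value that matches (or excludes) captures of `payload`."""
    prefix = "!" if exclude else ""
    return f"{prefix}digest:{content_digest(payload)}"


# Filter value for captures whose body equals these bytes.
print(digest_filter(b"hello"))
```

A locally computed digest may differ from the archived one if the stored body was encoded or transformed differently from your local copy, so checking one known capture first is a sensible sanity test.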

Filtering the Entire CDX Line (Advanced)

Advanced users can apply filters across the full CDX line. The CDX record is space-delimited and can be treated as a raw string for pattern matching using regular expressions.

While powerful, this method is harder to control and not commonly needed unless you have a specific non-field-based search in mind.
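The idea of treating a record as a raw string can be illustrated client-side. The sketch below runs a regular expression over hypothetical CDX lines after they have been downloaded. It is not the server-side syntax, and the sample lines (including the placeholder digests) are invented for illustration.

```python
import re

# Hypothetical CDX lines in the default space-delimited field order:
# urlkey timestamp original mimetype statuscode digest length
cdx_lines = [
    "org,archive)/ 20100101000000 http://archive.org/ text/html 200 AAAA 1024",
    "org,archive)/about 20100615120000 http://archive.org/about text/html 404 BBBB 512",
    "org,archive)/logo.png 20101231235959 http://archive.org/logo.png image/png 200 CCCC 2048",
]


def grep_cdx(lines, pattern):
    """Return the lines whose raw text matches `pattern` anywhere."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]


# HTML captures that returned 200, matched through adjacent fields in the raw line.
hits = grep_cdx(cdx_lines, r" text/html 200 ")
print(hits)
```

Because the pattern sees the whole line, it can accidentally match text inside the URL as well, which is one reason field-based filters are usually easier to control.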

Limiting the Number of Results

You can always add a limit= parameter to control how many results you get back.

Example:

`filter=!statuscode:200&limit=10`

This fetches the first 10 results where the status code is not 200.
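A minimal sketch of combining a filter with `limit=` follows. As before, the endpoint constant and the helper name `limited_url` are assumptions for illustration.

```python
from urllib.parse import urlencode

# Assumed public endpoint of the Wayback Machine CDX server.
CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def limited_url(target, filter_expr, limit):
    """Build a CDX query URL with a single filter and a cap on the result count."""
    if limit < 1:
        raise ValueError("limit must be a positive integer")
    params = [("url", target), ("filter", filter_expr), ("limit", str(limit))]
    # Keep ':', '!' and '/' unescaped so the filter stays readable in the URL.
    return f"{CDX_ENDPOINT}?{urlencode(params, safe=':!/')}"


# First 10 results where the status code is not 200.
print(limited_url("archive.org", "!statuscode:200", 10))
```

A small limit is also a cheap way to preview a query before running it against a site with a very large capture history.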

Practical Tip

When using multiple filters, results must match all of them. The filters are combined with a logical AND.

If you need to mix inclusion and exclusion logic carefully, test your queries in stages to make sure the filters are doing what you expect.
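One way to test in stages is to generate a series of queries, each adding one more filter than the last, and inspect a few results from each. The sketch below only builds the URLs. The endpoint and the helper `staged_urls` are assumptions for illustration.

```python
from urllib.parse import urlencode

# Assumed public endpoint of the Wayback Machine CDX server.
CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def staged_urls(target, filter_exprs, limit=5):
    """Yield one query URL per stage, adding one more filter each time."""
    for stage in range(len(filter_exprs) + 1):
        params = [("url", target)]
        params += [("filter", expr) for expr in filter_exprs[:stage]]
        params.append(("limit", str(limit)))
        yield f"{CDX_ENDPOINT}?{urlencode(params, safe=':!/')}"


# Stage 0 has no filters, stage 1 adds the inclusion, stage 2 adds the exclusion.
for url in staged_urls("archive.org", ["mimetype:text/html", "!statuscode:200"]):
    print(url)
```

Comparing the output of consecutive stages shows which filter removed which records, which is much harder to tell from the final query alone.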

Recap of Useful Filters

Purpose	Filter Syntax Example
Only successful pages	`filter=statuscode:200`
Exclude image files	`filter=!mimetype:image/.*`
Only HTML documents	`filter=mimetype:text/html`
Exclude captures matching a known digest	`filter=!digest:<hash>`
Snapshots up to a date (inclusive)	`to=201501`
Snapshots in a year	`from=2010&to=2010`

Filtering gives you fine-tuned control over large sets of archived web data. Whether you're cleaning up results, narrowing to specific formats, or hunting for unique changes over time, these tools are essential.
