How to Filter CDX API Results in the Internet Archive’s Wayback Machine

The CDX API behind the Wayback Machine offers powerful filtering tools to refine your results. Whether you’re looking for snapshots within a specific date range or only interested in certain response codes or content types, filters help you target exactly what you need.

Here’s how to use them effectively.

Filtering by Date Range

To narrow results to a specific time period, use the from= and to= parameters. These accept timestamps in the format used by the Wayback Machine:
yyyyMMddhhmmss

You can use as few digits as needed:

  • from=2010 will match captures from the start of 2010 onward.

  • to=2011 will include all captures up through the end of 2011.

Example:

 
http://web.archive.org/cdx/search/cdx?url=archive.org&from=201001&to=201012

This retrieves all captures of archive.org during 2010.

Both boundaries are inclusive.
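To make the partial-timestamp behavior concrete, here is a minimal local sketch in Python. It is not the server's implementation: the helper names pad_timestamp and in_range are hypothetical, and the padding rule (earliest possible value for from=, latest possible value for to=) is an assumption based on the inclusive behavior described above.

```python
# Illustrative only: a local model of how partial timestamps can be read as bounds.
LOWEST = "00000101000000"   # earliest value for each yyyyMMddhhmmss position
HIGHEST = "99991231235959"  # latest value for each position

def pad_timestamp(partial, upper=False):
    """Expand a 1-14 digit timestamp to 14 digits (hypothetical helper)."""
    template = HIGHEST if upper else LOWEST
    return partial + template[len(partial):]

def in_range(timestamp, start, end):
    """True if a full 14-digit timestamp lies inside the inclusive range."""
    # Fixed-width digit strings compare correctly as plain strings.
    return pad_timestamp(start) <= timestamp <= pad_timestamp(end, upper=True)

print(pad_timestamp("201001"))                          # 20100101000000
print(pad_timestamp("201012", upper=True))              # 20101231235959
print(in_range("20100615120000", "201001", "201012"))   # True
```

The upper bound is padded with day 31 even for shorter months. That is harmless here because the value is only used as a string comparison limit, not parsed as a date.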

Filtering by Field Value

You can also filter based on the values in specific CDX fields, such as:

  • statuscode (HTTP status code like 200, 404, etc.)

  • mimetype (e.g., text/html, image/png)

  • digest (a hash representing content uniqueness)

  • Any field listed in the CDX output schema

The basic syntax is:

 
filter=field:regex

You can specify multiple filters in a single query, and they will all be applied together.

Example:

 
http://web.archive.org/cdx/search/cdx?url=archive.org&filter=statuscode:200

This returns only results where the HTTP status code was 200 (OK).
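If you build these URLs in a script, repeating the filter parameter is the only slightly unusual part. The sketch below uses Python's standard urllib.parse.urlencode with a list of pairs so that filter= can appear more than once. The function name build_cdx_url and its arguments are illustrative, not part of any official client.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"

def build_cdx_url(url, filters=(), extra=None):
    """Assemble a CDX query URL (hypothetical helper, no request is sent)."""
    query = [("url", url)]
    # A list of pairs lets the same key ("filter") repeat in the query string.
    query += [("filter", expr) for expr in filters]
    query += list((extra or {}).items())
    # Keep ':', '!' and '/' readable instead of percent-encoding them.
    return CDX_ENDPOINT + "?" + urlencode(query, safe=":!/")

print(build_cdx_url("archive.org", ["statuscode:200"]))
# http://web.archive.org/cdx/search/cdx?url=archive.org&filter=statuscode:200
```

Other parameters such as from= and to= can be passed through the extra dictionary, for example extra={"from": "201001", "to": "201012"}.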

Excluding Matches with Negation

To exclude certain values, simply add an exclamation mark ! before the field name.

Example:

 
http://web.archive.org/cdx/search/cdx?url=archive.org&filter=!statuscode:200

This will return only snapshots where the status code was not 200.

You can chain filters like this for even more control.

Example with multiple exclusions:

 
http://web.archive.org/cdx/search/cdx?url=archive.org&filter=!statuscode:200&filter=!mimetype:text/html

This retrieves all captures that were neither successful (200) nor standard web pages (text/html).
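When filters are generated programmatically, it can help to keep the negation flag separate from the field and pattern. A small sketch, with field_filter and filter_query as made-up helper names; it does no URL-encoding, which is fine for simple patterns like the ones shown here.

```python
def field_filter(field, regex, negate=False):
    """Format one filter expression, adding '!' when it should exclude."""
    prefix = "!" if negate else ""
    return f"{prefix}{field}:{regex}"

def filter_query(expressions):
    """Join several expressions into repeated filter= parameters."""
    return "&".join(f"filter={expr}" for expr in expressions)

exclusions = [
    field_filter("statuscode", "200", negate=True),
    field_filter("mimetype", "text/html", negate=True),
]
print(filter_query(exclusions))
# filter=!statuscode:200&filter=!mimetype:text/html
```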

Matching by Digest

The digest field is a hash that helps identify identical content. If you’re looking for captures that match (or don’t match) a specific piece of content, use the digest: filter.

Example:

 
http://web.archive.org/cdx/search/cdx?url=archive.org&filter=digest:2WAXX5NUWNNCS2BDKCO5OVDQBJVNKIVV

This returns only captures with that exact content fingerprint.
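Once you have results in hand, the digest is also handy for grouping captures locally. The sketch below assumes the default plain-text output with space-separated fields in the order urlkey, timestamp, original, mimetype, statuscode, digest, length; check your own output before relying on that order. The sample lines and digest values are invented purely for illustration.

```python
from collections import defaultdict

# Assumed default field order of the plain-text CDX output.
FIELDS = ["urlkey", "timestamp", "original", "mimetype",
          "statuscode", "digest", "length"]

def parse_cdx_line(line):
    """Split one space-delimited CDX line into a field dictionary."""
    return dict(zip(FIELDS, line.split(" ")))

def timestamps_by_digest(lines):
    """Group capture timestamps by their content digest."""
    groups = defaultdict(list)
    for line in lines:
        record = parse_cdx_line(line)
        groups[record["digest"]].append(record["timestamp"])
    return dict(groups)

# Made-up sample records, not real archive data.
SAMPLE = [
    "org,example)/ 20100101000000 http://example.org/ text/html 200 AAAA 100",
    "org,example)/ 20100201000000 http://example.org/ text/html 200 AAAA 100",
    "org,example)/ 20100301000000 http://example.org/ text/html 200 BBBB 120",
]
print(timestamps_by_digest(SAMPLE))
# {'AAAA': ['20100101000000', '20100201000000'], 'BBBB': ['20100301000000']}
```

Captures that share a digest carry identical content, so a group with more than one timestamp marks a stretch where the page did not change.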

Filtering the Entire CDX Line (Advanced)

Advanced users can apply filters across the full CDX line. The CDX record is space-delimited and can be treated as a raw string for pattern matching using regular expressions.

While powerful, this method is harder to control and not commonly needed unless you have a specific non-field-based search in mind.
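To get a feel for whole-line matching without touching the server, you can experiment on lines you have already downloaded. This is only a local approximation using Python's re module: the server's regex dialect and matching rules may differ, and the sample lines below are invented.

```python
import re

def match_whole_line(lines, pattern):
    """Keep the CDX lines in which the pattern occurs anywhere."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]

# Made-up sample records, not real archive data.
SAMPLE = [
    "org,example)/ 20100101000000 http://example.org/ text/html 200 AAAA 100",
    "org,example)/a.png 20100102000000 http://example.org/a.png image/png 200 BBBB 50",
    "org,example)/old 20100103000000 http://example.org/old text/html 404 CCCC 30",
]

# The pattern spans two adjacent fields: mimetype followed by status code.
for line in match_whole_line(SAMPLE, r" text/html 200 "):
    print(line)
# org,example)/ 20100101000000 http://example.org/ text/html 200 AAAA 100
```

The example also shows why this approach is harder to control: the pattern depends on field order and spacing rather than on a named field.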

Limiting the Number of Results

You can always add a limit= parameter to control how many results you get back.

Example:

 
http://web.archive.org/cdx/search/cdx?url=archive.org&filter=!statuscode:200&limit=10

This fetches the first 10 results where the status code is not 200.
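A small limit is especially useful while testing a query from a script. The sketch below uses only the Python standard library; limited_url and fetch_lines are hypothetical names, and fetch_lines assumes the response is plain text with one capture per line. Nothing is requested until you call fetch_lines yourself.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"

def limited_url(url, limit, filters=()):
    """Build a query URL that caps the number of returned captures."""
    query = [("url", url)]
    query += [("filter", expr) for expr in filters]
    query.append(("limit", limit))
    return CDX_ENDPOINT + "?" + urlencode(query, safe=":!/")

def fetch_lines(query_url, timeout=30):
    """Download a query result and return its non-empty lines."""
    with urlopen(query_url, timeout=timeout) as response:
        text = response.read().decode("utf-8")
    return [line for line in text.splitlines() if line]

print(limited_url("archive.org", 10, ["!statuscode:200"]))
# http://web.archive.org/cdx/search/cdx?url=archive.org&filter=!statuscode:200&limit=10

# To actually run the query (requires network access):
# lines = fetch_lines(limited_url("archive.org", 10, ["!statuscode:200"]))
```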

Practical Tip

When using multiple filters, results must match all of them. The filters are combined with a logical AND.

If you need to mix inclusion and exclusion logic carefully, test your queries in stages to make sure the filters are doing what you expect.
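One way to test in stages is to model the AND logic locally and run it against a record you understand. The following is a rough sketch, not the server's code: it assumes each regex must match the entire field value (hence re.fullmatch), and the record is an invented example.

```python
import re

def matches_filter(record, expression):
    """Evaluate one '[!]field:regex' expression against a parsed record."""
    negate = expression.startswith("!")
    body = expression[1:] if negate else expression
    field, _, pattern = body.partition(":")
    # Assumption: the pattern has to match the whole field value.
    hit = re.fullmatch(pattern, record.get(field, "")) is not None
    return hit != negate

def matches_all(record, expressions):
    """A record passes only if every filter accepts it (logical AND)."""
    return all(matches_filter(record, expr) for expr in expressions)

# Made-up record for illustration.
record = {"statuscode": "404", "mimetype": "image/png"}

print(matches_all(record, ["!statuscode:200", "mimetype:image/.*"]))   # True
print(matches_all(record, ["!statuscode:200", "mimetype:text/html"]))  # False
```

Adding one expression at a time to the list makes it easy to see which filter is responsible when a record unexpectedly drops out.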

Recap of Useful Filters

Purpose                                 Filter Syntax Example
Only successful pages                   filter=statuscode:200
Exclude image files                     filter=!mimetype:image/.*
Only HTML documents                     filter=mimetype:text/html
Exclude captures with a known digest    filter=!digest:<hash>
Snapshots up to a date                  to=201501
Snapshots in a year                     from=2010&to=2010

Filtering gives you fine-tuned control over large sets of archived web data. Whether you're cleaning up results, narrowing to specific formats, or hunting for unique changes over time, these tools are essential.
