How to Filter CDX API Results in the Internet Archive’s Wayback Machine
The CDX API behind the Wayback Machine offers powerful filtering tools to refine your results. Whether you’re looking for snapshots within a specific date range or only interested in certain response codes or content types, filters help you target exactly what you need.
Here’s how to use them effectively.
Filtering by Date Range
To narrow results to a specific time period, use the from= and to= parameters. These accept timestamps in the format used by the Wayback Machine: yyyyMMddhhmmss
You can use as few digits as needed:
from=2010 matches captures from the start of 2010 onward.
to=2011 includes all captures up through the end of 2011.
Example:
url=archive.org&from=2010&to=2010
This retrieves all captures of archive.org during 2010.
Both boundaries are inclusive.
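To make the inclusive, partial-timestamp behavior concrete, here is a minimal local sketch in Python. It assumes a short from= value is padded down to the earliest matching instant and a short to= value is padded up to the latest one; the helper names are hypothetical and this is not the server's actual code.

```python
# Assumed padding templates for the 14-digit yyyyMMddhhmmss format.
LOWER_PAD = "00000101000000"  # earliest month, day and time for missing digits
UPPER_PAD = "99991231235959"  # latest month, day and time for missing digits


def expand_from(ts: str) -> str:
    """Pad a partial from= timestamp down to its earliest 14-digit value."""
    return ts + LOWER_PAD[len(ts):]


def expand_to(ts: str) -> str:
    """Pad a partial to= timestamp up to its latest 14-digit value."""
    return ts + UPPER_PAD[len(ts):]


def in_range(capture_ts: str, from_ts: str, to_ts: str) -> bool:
    """Return True if a 14-digit capture timestamp lies in the inclusive range."""
    # Fixed-width digit strings compare correctly as plain strings.
    return expand_from(from_ts) <= capture_ts <= expand_to(to_ts)


print(expand_from("2010"), expand_to("2010"))      # 20100101000000 20101231235959
print(in_range("20101231235959", "2010", "2010"))  # True: last second of 2010
print(in_range("20110101000000", "2010", "2010"))  # False: first second of 2011
```

Under these assumptions, from=2010&to=2010 spans the whole of 2010, which is why the same year can appear on both sides of the range.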
Filtering by Field Value
You can also filter based on the values in specific CDX fields, such as:
statuscode (HTTP status code like 200, 404, etc.)
mimetype (e.g., text/html, image/png)
digest (a hash representing content uniqueness)
Any field listed in the CDX output schema
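The basic syntax is:
filter=field:regex
Here, field is the name of a CDX field and regex is a regular expression matched against that field's value.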
You can specify multiple filters in a single query, and they will all be applied together.
Example:
filter=statuscode:200
This returns only results where the HTTP status code was 200 (OK).
Excluding Matches with Negation
To exclude certain values, add an exclamation mark (!) before the field name.
Example:
filter=!statuscode:200
This will return only snapshots where the status code was not 200.
You can chain filters like this for even more control.
Example with multiple exclusions:
filter=!statuscode:200&filter=!mimetype:text/html
This retrieves all captures that were neither successful (200) nor standard web pages (text/html).
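When queries like these are assembled in a script, it helps to emit one filter= parameter per condition. The Python sketch below shows one way to do that. The endpoint URL is an assumption (it is not stated in this article), and build_cdx_url is a hypothetical helper, not part of any official client.

```python
from urllib.parse import urlencode

# Assumed public CDX endpoint; adjust if your deployment differs.
CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def build_cdx_url(url, include=(), exclude=()):
    """Build a CDX query URL with one filter= parameter per condition.

    include and exclude hold 'field:regex' strings; excluded
    conditions get the '!' prefix described above.
    """
    params = [("url", url)]
    params += [("filter", cond) for cond in include]
    params += [("filter", "!" + cond) for cond in exclude]
    # Leave ':', '/' and '!' unescaped so the query stays readable.
    return CDX_ENDPOINT + "?" + urlencode(params, safe=":/!")


print(build_cdx_url("archive.org", include=["statuscode:200"]))
print(build_cdx_url("archive.org", exclude=["statuscode:200", "mimetype:text/html"]))
```

Passing a list of pairs to urlencode, rather than a dict, is what allows the filter key to repeat.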
Matching by Digest
The digest field is a hash of the captured content, so captures with identical content share the same digest. If you’re looking for captures that match (or don’t match) a specific piece of content, use the digest: filter.
Example:
filter=digest:<hash>
This returns only captures with that exact content fingerprint.
Filtering the Entire CDX Line (Advanced)
Advanced users can apply filters across the full CDX line. The CDX record is space-delimited and can be treated as a raw string for pattern matching using regular expressions.
While powerful, this method is harder to control and not commonly needed unless you have a specific non-field-based search in mind.
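The difference between field-based and whole-line matching can be tried locally. The Python sketch below assumes the default field order of the plain-text output and uses made-up sample records with placeholder digests. It illustrates why raw-line patterns are harder to control; it does not reproduce the server's matching rules.

```python
import re

# Assumed default field order of a space-delimited CDX record.
FIELDS = ["urlkey", "timestamp", "original", "mimetype",
          "statuscode", "digest", "length"]

# Made-up records for illustration only; the digests are placeholders.
SAMPLE_LINES = [
    "org,archive)/ 20100101000000 http://archive.org/ text/html 200 EXAMPLEDIGESTA 1000",
    "org,archive)/logo.png 20100102000000 http://archive.org/logo.png image/png 200 EXAMPLEDIGESTB 2000",
    "org,archive)/old 20100103000000 http://archive.org/old text/html 404 EXAMPLEDIGESTC 2004",
]


def parse_line(line):
    """Split one CDX record into a dict keyed by field name."""
    return dict(zip(FIELDS, line.split(" ")))


def line_matches(line, pattern):
    """Treat the whole record as a raw string and search it with a regex."""
    return re.search(pattern, line) is not None


# A loose pattern also hits "2000" and "2004" in the length column.
loose = [line for line in SAMPLE_LINES if line_matches(line, r"200")]
# Anchoring on the surrounding spaces restricts the match to a whole field.
strict = [line for line in SAMPLE_LINES if line_matches(line, r" 200 ")]

print(len(loose), len(strict))  # 3 2
print(parse_line(SAMPLE_LINES[2])["statuscode"])  # 404
```

The loose pattern matches the 404 record because its length happens to contain "200", which is the kind of accidental hit a field-based filter avoids.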
Limiting the Number of Results
You can add a limit= parameter to control how many results you get back.
Example:
filter=!statuscode:200&limit=10
This fetches the first 10 results where the status code is not 200.
Practical Tip
When using multiple filters, results must match all of them. The filters are combined with a logical AND.
If you need to mix inclusion and exclusion logic carefully, test your queries in stages to make sure the filters are doing what you expect.
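One way to test in stages is to rehearse the AND logic on a few records locally before querying the API. The Python sketch below is a hypothetical emulation: it assumes each regex must match the entire field value, and the sample records are made up.

```python
import re


def matches_filter(record, flt):
    """Evaluate one 'field:regex' or '!field:regex' filter against a record."""
    negate = flt.startswith("!")
    if negate:
        flt = flt[1:]
    field, _, pattern = flt.partition(":")
    # Assumption: the regex must match the whole field value.
    hit = re.fullmatch(pattern, record[field]) is not None
    return hit != negate


def apply_filters(records, filters):
    """Keep only the records that satisfy every filter (logical AND)."""
    return [r for r in records if all(matches_filter(r, f) for f in filters)]


# Made-up records for illustration only.
records = [
    {"statuscode": "200", "mimetype": "text/html"},
    {"statuscode": "404", "mimetype": "text/html"},
    {"statuscode": "200", "mimetype": "image/png"},
    {"statuscode": "302", "mimetype": "image/png"},
]

# Neither successful nor HTML: only the 302 image remains.
print(apply_filters(records, ["!statuscode:200", "!mimetype:text/html"]))
# Successful and not an image: only the 200 HTML page remains.
print(apply_filters(records, ["statuscode:200", "!mimetype:image/.*"]))
```

Adding filters one at a time and watching which records drop out makes it easier to spot a filter that is stricter or looser than intended.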
Recap of Useful Filters
Purpose | Filter Syntax Example |
---|---|
Only successful pages | filter=statuscode:200 |
Exclude image files | filter=!mimetype:image/.* |
Only HTML documents | filter=mimetype:text/html |
Exclude captures with a known digest | filter=!digest:<hash> |
Snapshots up to and including a date | to=201501 |
Snapshots in a year | from=2010&to=2010 |
Filtering gives you fine-tuned control over large sets of archived web data. Whether you're cleaning up results, narrowing to specific formats, or hunting for unique changes over time, these tools are essential.