How to use the Wayback Machine to search for archived non-text files, including files from long-expired, defunct domains?
I will not repeat the benefits of Smartial tools here; they are instrumental when searching for archived web pages and scraping their text content. Many of you actively use them and, after all, I use them myself daily too.
However, I would also like to draw your attention to other WM features that may not be so well known.
A few years ago, I started developing a set of tools to make it easier for myself to work with the Wayback Machine (I only recently decided to publish them as the free Smartial toolkit). Back then I was exclusively interested in archived texts (e.g. HTML pages), and I was convinced that the Wayback Machine focused solely on archiving textual content and nothing else.
Never underestimate THE ARCHIVE!
However, I later found out that I had underestimated the possibilities and capabilities of the Archive.
Non-text content is only a fraction of the huge amount of data in the Archive, but this does not change the fact that Wayback Machine crawlers try to detect and archive all comprehensible content they find, including non-text files, whenever possible.
Once upon a time, in 2010-2011, I regularly spent my spare time on an Internet forum called Sickmarketing. This forum revolved around their product called Sick Submitter, which was a very popular blackhat tool at the time.
Although I was never a rigorous blackhat marketer, I enjoyed Sick Submitter. It was a really good tool at the time, and the community around it was quite nice too.
Sick Submitter was later pushed out of the market by other, more advanced tools, and the site went down. But I found myself in a situation where I desperately needed to dig out some PDF documents that I was sure had been published on the Sickmarketing site.
But how to do it? Having no other option, I started thinking (my last resort :-)) and studying the Wayback Machine and its API, and I soon realized that even an awkward newbie programmer like me could build such a tool without any major complications.
This is how my file sniffer was created. At that time I called it just a PDF sniffer, as I didn’t need to search for the other types of files that the Wayback Machine archives.
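For the curious, a lookup of this kind can be sketched against the Wayback Machine's public CDX server API. This is only a minimal sketch under my own assumptions, not the actual Sniffer code: the function names build_cdx_query and fetch_captures are made up for illustration, and example.com stands in for whatever domain you are interested in.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(domain, mimetype_regex="application/pdf"):
    """Build a CDX query URL listing archived files of one MIME type."""
    params = [
        ("url", domain),
        ("matchType", "domain"),        # the domain and all of its subdomains
        ("output", "json"),
        ("fl", "timestamp,original,mimetype"),
        ("filter", "statuscode:200"),   # only captures that returned HTTP 200
        ("filter", f"mimetype:{mimetype_regex}"),
        ("collapse", "urlkey"),         # one row per distinct URL
    ]
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def fetch_captures(domain, mimetype_regex="application/pdf"):
    """Run the query; the first JSON row is a header, the rest are captures."""
    with urlopen(build_cdx_query(domain, mimetype_regex), timeout=60) as resp:
        body = resp.read().decode("utf-8")
    rows = json.loads(body) if body.strip() else []
    return rows[1:]


if __name__ == "__main__":
    # Prints the query URL only; call fetch_captures() to hit the network.
    print(build_cdx_query("example.com"))
```

Each returned row holds the capture timestamp, the original URL and the MIME type recorded at crawl time, which is all a simple PDF sniffer needs.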
Today I’m adding File Sniffer to the other free Smartial tools. Use it as you like :-).
Here is what it looks like when you unleash the Smartial Sniffer on the long-forgotten site of a once-popular blackhat legend.
How to use Smartial File Sniffer?
It is easy. Another Smartial tool, Domain Scanner, is very similar in logic.
Launch the Sniffer, enter the domain name or URL prefix in the search box, select the type of files you are interested in, press the “sniff” button, and the Sniffer does the rest.
There are several default file types (PDF, MS Word DOC files, XLS sheets, ZIP archives, image, video and audio files, etc.). If you do not know exactly what you are looking for, there is also the option to search through all non-text archived files from a given website, so that you get an overview of what files are hidden in the archives.
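One way to picture these file-type options is as presets that translate into filter expressions on the MIME type field of the Wayback Machine CDX API, where a leading "!" negates a filter. The preset table below is purely my illustration (the names FILE_TYPE_PRESETS and mimetype_filter are hypothetical), and it may not match how the Sniffer does it internally.

```python
# Hypothetical presets: label -> regex matched against the CDX "mimetype" field.
FILE_TYPE_PRESETS = {
    "pdf": r"application/pdf",
    "doc": r"application/msword",
    "xls": r"application/vnd\.ms-excel",
    "zip": r"application/zip",
    "image": r"image/.*",
    "video": r"video/.*",
    "audio": r"audio/.*",
}


def mimetype_filter(file_type):
    """Return a CDX 'filter' value for one preset, or for all non-text files."""
    if file_type == "all":
        # "!" negates the match: keep everything whose MIME type is not text/*
        return "!mimetype:text/.*"
    return f"mimetype:{FILE_TYPE_PRESETS[file_type]}"


print(mimetype_filter("audio"))
print(mimetype_filter("all"))
```

Keep in mind that the archived MIME type is whatever the original server reported at crawl time, so a few files may hide under an unexpected or generic type.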
How to find expired audio and video?
Let’s show off the Sniffer’s features on another expired site.
Most users of Smartial Tools are SEO specialists or assistants, so let’s say they might be interested in some expired documents related to SEO. And since we’re lazy and don’t want to read too much, let’s try to find some expired audio or video files related to search engine optimization.
Let’s start with the search. We choose “SEO podcasts” as the keyword. We will also choose the “check DNS records” option, because with it about 95% of the pages found are really expired. In this case it does not matter much, though, as we do not intend to copy or reuse the files; we just want to listen to them.
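A DNS check of this kind can be approximated in a few lines. The sketch below is my own assumption of what such a check might look like, not the code behind the “check DNS records” option: it simply asks whether a domain still resolves to any address. The resolver parameter is there only so the function can be tried without a network connection.

```python
import socket


def has_dns_records(domain, resolver=socket.getaddrinfo):
    """Return True if the domain still resolves to at least one address."""
    try:
        return len(resolver(domain, None)) > 0
    except socket.gaierror:
        # Resolution failed: no usable DNS records were found for this name.
        return False


if __name__ == "__main__":
    for name in ("example.com", "no-such-domain.invalid"):
        print(name, has_dns_records(name))
```

A domain that does not resolve is a strong hint that it has expired, but it is not proof, which fits the roughly 95% hit rate mentioned above.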
And this is what we get…
I have never heard of seo-podcasts.de. In fact, I have never listened to any German podcast, so let us try this one and copy the site URL. Then we will proceed to the Sniffer (do not click the link, just copy the URL and go to the Sniffer).
Not that many results but we can clearly see that there are still 11 podcasts available. And they still sound good, try it yourself :-).
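If you prefer to fetch such a file yourself, every capture in the Wayback Machine can be addressed by its 14-digit timestamp plus the original URL. Adding the id_ modifier after the timestamp asks for the file as it was originally stored, without the Wayback toolbar or rewritten links. The helper name replay_url and the sample URL below are hypothetical and serve only as an illustration.

```python
def replay_url(timestamp, original, raw=True):
    """Build the Wayback Machine address of one archived capture.

    timestamp: capture time as YYYYMMDDhhmmss, as listed by the archive
    original:  the URL the file had on the live site
    raw:       if True, add the "id_" modifier to get the unmodified file
    """
    modifier = "id_" if raw else ""
    return f"https://web.archive.org/web/{timestamp}{modifier}/{original}"


# Made-up capture, for illustration only.
print(replay_url("20110101000000", "http://example.com/episode1.mp3"))
```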
We can do the same with videos. Just enter the domain name of a site that you know hosted its own video files, even though such sites are quite rare thanks to YouTube.
And we can do the same with the other file types too.
Quick search for all non-text files
Of course, in most cases we do not know what treasures are hidden on the archived website. In such a case we can use the option to search the entire archive of the site and find all non-text files.
Only when we see that there actually are some PDFs, MP3s, TXTs or whatever we are really interested in can we filter the list more accurately.
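The same "overview first, filter later" approach can be sketched in code. The example assumes a result list in the shape of the Wayback Machine CDX API's JSON output, where the first row is a header naming the columns. The function names and the sample rows are made up for illustration.

```python
from collections import Counter


def summarize_mimetypes(rows):
    """Count captures per MIME type; rows[0] must be the header row."""
    if not rows:
        return Counter()
    header, *captures = rows
    idx = header.index("mimetype")
    return Counter(row[idx] for row in captures)


def only_type(rows, mimetype):
    """Keep the header plus the captures of one exact MIME type."""
    header, *captures = rows
    idx = header.index("mimetype")
    return [header] + [row for row in captures if row[idx] == mimetype]


# Made-up sample data in the CDX JSON layout.
sample = [
    ["timestamp", "original", "mimetype"],
    ["20100101000000", "http://example.com/a.pdf", "application/pdf"],
    ["20100201000000", "http://example.com/b.pdf", "application/pdf"],
    ["20100301000000", "http://example.com/c.mp3", "audio/mpeg"],
]
print(summarize_mimetypes(sample).most_common())
print(len(only_type(sample, "audio/mpeg")) - 1, "audio file(s)")
```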
By the way, did you know you can make your own mini-preps to isolate small plasmid DNA from bacteria from a beer bottle? Interesting isn’t it?
Be aware of the legal implications of what you do
When rummaging through old data and files on long-expired domains, always keep in mind that things you find may be subject to valid copyright.
Copyright infringement can lead to legal action by the author of the document or file.
Stay open-minded and tuned in…