Usage¶
Search for historical mementos (archived copies) of a URL. Download metadata about the mementos and/or the memento content itself.
Tutorial¶
What is the earliest memento of nasa.gov?¶
Instantiate a WaybackClient.
In [1]: from wayback import WaybackClient
In [2]: client = WaybackClient()
Search for all Wayback’s records for nasa.gov.
In [3]: results = client.search('nasa.gov')
This statement should execute fairly quickly because it doesn’t actually do much work. The object we get back, results, is a generator, a “lazy” object from which we can pull results one at a time. As we pull items out of it, it loads them as needed from the Wayback Machine in chronological order. We can see that results by itself is not informative:
In [4]: results
Out[4]: <generator object WaybackClient.search at 0x7fe4d20256d0>
There are a couple of ways to pull items out of a generator like results. One simple way is to use the built-in Python function next(), like so:
In [5]: record = next(results)
This takes a moment to run because, now that we’ve asked to see the first item in the generator, this lazy object goes to fetch a chunk of results from the Wayback Machine. Looking at the record in detail,
In [6]: record
Out[6]: CdxRecord(key='gov,nasa)/', timestamp=datetime.datetime(1996, 12, 31, 23, 58, 47, tzinfo=datetime.timezone.utc), url='http://www.nasa.gov/', mime_type='text/html', status_code=200, digest='MGIGF4GRGGF5GKV6VNCBAXOE3OR5BTZC', length=1811, raw_url='http://web.archive.org/web/19961231235847id_/http://www.nasa.gov/', view_url='http://web.archive.org/web/19961231235847/http://www.nasa.gov/')
we can find our answer: Wayback’s first memento of nasa.gov was in 1996. We can use dot access on record to access the timestamp specifically.
In [7]: record.timestamp
Out[7]: datetime.datetime(1996, 12, 31, 23, 58, 47, tzinfo=datetime.timezone.utc)
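The timestamp is a standard timezone-aware datetime.datetime, so the usual datetime methods apply. A quick illustration using the values from the output above:

```python
from datetime import datetime, timezone

# The timestamp from the record above, as a timezone-aware datetime.
ts = datetime(1996, 12, 31, 23, 58, 47, tzinfo=timezone.utc)
print(ts.year)         # 1996
print(ts.isoformat())  # 1996-12-31T23:58:47+00:00
```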
How many times does the word ‘mars’ appear on nasa.gov?¶
Above, we accessed the metadata for the oldest memento of nasa.gov, stored in the variable record. Starting from where we left off, we’ll access the content of the memento and do a very simple analysis.
The Wayback Machine provides two ways to look at the data it has captured. There is a copy edited for human viewers on the web, available at the record’s view_url, and there is the original copy of what was captured when the page was originally scraped, available at the record’s raw_url. For analysis purposes, we generally want the raw_url.
Let’s download the raw content using WaybackClient. (You could download the content directly with an HTTP library like requests, but WaybackClient adds extra tools for dealing with Wayback Machine servers.)
In [8]: response = client.get_memento(record.raw_url)
In [9]: content = response.content.decode()
We can use the built-in method count on strings to count the number of times that 'mars' appears in the content.
In [10]: content.count('mars')
Out[10]: 30
This is case-sensitive, so to be more accurate we should convert the content to lowercase first.
In [11]: content.lower().count('mars')
Out[11]: 39
We picked up several additional occurrences that the case-sensitive count missed.
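str.count() matches exact substrings, which is why normalizing case matters. A self-contained illustration with sample text (not real memento content):

```python
# Sample text standing in for page content (not real memento data).
content = "Mars missions: MARS, mars, and more mars."

# Case-sensitive: only lowercase 'mars' matches.
print(content.count('mars'))          # 2

# Normalize case first to catch 'Mars' and 'MARS' too.
print(content.lower().count('mars'))  # 4
```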
API Documentation¶
The Wayback Machine exposes its data through two different mechanisms, implementing two different standards for archival data, the CDX API and the Memento API. We implement a Python client that can speak both.
class wayback.WaybackClient(session=None)¶
A client for retrieving data from the Internet Archive’s Wayback Machine.
You can use a WaybackClient as a context manager. When exiting, it will close the session it’s using (if you’ve passed in a custom session, make sure not to use the context manager functionality unless you want to live dangerously).
- Parameters
  - session : requests.Session, optional
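The context-manager behavior described above follows Python’s standard protocol; a minimal sketch of the pattern with a hypothetical stand-in class (not the real implementation):

```python
class ClosingClient:
    """Sketch of the context-manager pattern WaybackClient follows."""
    def __init__(self, session=None):
        # Note the caveat above: a caller-supplied session would be
        # closed on exit too, so only use the context manager with
        # sessions you are happy to have closed.
        self.session = session
        self.closed = False

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        self.close()

    def close(self):
        self.closed = True

with ClosingClient() as client:
    pass  # ... make requests here ...
print(client.closed)  # True: the session was closed on exit
```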
search(self, url, *, matchType=None, limit=None, offset=None, fastLatest=None, gzip=None, from_date=None, to_date=None, filter_field=None, collapse=None, showResumeKey=True, resumeKey=None, page=None, pageSize=None, resolveRevisits=True, skip_malformed_results=True, previous_result=None, **kwargs)¶
Search archive.org’s CDX API for all captures of a given URL.
This will automatically page through all results for a given search.
Returns an iterator of CdxRecord objects. The StopIteration value is the total count of found captures.
Note that even URLs without wildcards may return results with different URLs. Search results are matched by url_key, which is a SURT-formatted, canonicalized URL that:
- Does not differentiate between HTTP and HTTPS
- Is not case-sensitive
- Treats www. and www*. subdomains the same as no subdomain at all
Note that not all CDX API parameters are supported. In particular, this does not support: output, fl, showDupeCount, showSkipCount, lastSkipTimestamp, showNumPages, showPagedIndex.
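SURT (Sort-friendly URI Reordering Transform) reverses the host so that related URLs sort together. A rough sketch of how a host becomes a url_key, ignoring most of the real canonicalization rules:

```python
def surt_key(host, path='/'):
    # Lowercase, drop a leading 'www.', reverse the dotted host
    # segments, then append ')' and the path. (Simplified sketch;
    # real SURT canonicalization handles much more than this.)
    parts = host.lower().split('.')
    if parts[0] == 'www':
        parts = parts[1:]
    return ','.join(reversed(parts)) + ')' + path

print(surt_key('www.nasa.gov'))  # gov,nasa)/ -- matches the tutorial's record key
print(surt_key('NASA.gov') == surt_key('nasa.gov'))  # True: not case-sensitive
```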
- Parameters
  - url : str
    The URL to query for captures of.
  - matchType : str, optional
    Must be one of ‘exact’, ‘prefix’, ‘host’, or ‘domain’. The default value is calculated based on the format of url.
  - limit : int, optional
    Maximum number of results per page (this iterator will continue to move through all pages unless showResumeKey=False, though).
  - offset : int, optional
    Skip the first N results.
  - fastLatest : bool, optional
    Get faster results when using a negative value for limit. It may return a variable number of results.
  - gzip : bool, optional
    Whether output should be gzipped.
  - from_date : datetime or date, optional
    Only include captures after this date. Equivalent to the from argument in the CDX API. If it does not have a time zone, it is assumed to be in UTC.
  - to_date : datetime or date, optional
    Only include captures before this date. Equivalent to the to argument in the CDX API. If it does not have a time zone, it is assumed to be in UTC.
  - filter_field : str, optional
    A filter for any field in the results. Equivalent to the filter argument in the CDX API. (Format: [!]field:regex)
  - collapse : str, optional
    Collapse consecutive results that match on a given field. (Format: fieldname or fieldname:N, where N is the number of characters to match.)
  - showResumeKey : bool, optional
    If False, don’t continue to iterate through all pages of results. Default: True.
  - resumeKey : str, optional
    Start returning results from a specified resumption point/offset. The value for this is supplied by the previous page of results when showResumeKey is True.
  - page : int, optional
    If using paging, start from this page number. (Note: paging, as opposed to using resumeKey, is somewhat complicated because of the interplay with indexes and index sizes.)
  - pageSize : int, optional
    The number of index blocks to examine for each page of results. Index blocks generally cover about 3,000 items, so setting pageSize=1 might return anywhere from 0 to 3,000 results per page.
  - resolveRevisits : bool, optional
    Attempt to resolve warc/revisit records to their actual content type and response code. Not supported on all CDX servers. Default: True.
  - skip_malformed_results : bool, optional
    If True, don’t yield records that look like they have no actual memento associated with them. Some crawlers will erroneously attempt to capture bad URLs like http://mailto:someone@domain.com or http://data:image/jpeg;base64,AF34… and so on. This filter is performed client-side and is not a CDX API argument. Default: True.
  - previous_result : str, optional
    For internal use. The CDX API sometimes returns repeated results. This is used to track the previous result so we can filter out the repeats.
  - **kwargs
    Any additional CDX API options.
- Yields
  - version : CdxRecord
    A CdxRecord encapsulating one capture or revisit.
- Raises
  - UnexpectedResponseFormat
    If the CDX response was not parseable.
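The total-count return value mentioned above is delivered the way any generator return value is: via StopIteration. A pure-Python sketch of retrieving it, with a made-up stand-in generator rather than a real search:

```python
def fake_search():
    # Stand-in for search(): yields records, then returns a total count.
    yield 'capture-1'
    yield 'capture-2'
    return 2

results = fake_search()
records = []
try:
    while True:
        records.append(next(results))
except StopIteration as stop:
    total = stop.value  # the generator's return value
print(records, total)  # ['capture-1', 'capture-2'] 2
```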
get_memento(self, url, exact=True, exact_redirects=None, target_window=86400)¶
Fetch a memento from the Wayback Machine. This retrieves the content that was ultimately returned from a memento, following any redirects that were present at the time the memento was captured. (That is, if http://example.com/a redirected to http://example.com/b, this returns the memento for /b when you request /a.)
- Parameters
  - url : string
    URL of the memento in Wayback (e.g. http://web.archive.org/web/20180816111911id_/http://www.nws.noaa.gov/sp/).
  - exact : boolean, optional
    If False and the requested memento either doesn’t exist or can’t be played back, this returns the closest-in-time memento to the requested one, so long as it is within target_window. Default: True.
  - exact_redirects : boolean, optional
    If False and the requested memento is a redirect whose target doesn’t exist or can’t be played back, this returns the closest-in-time memento to the intended target, so long as it is within target_window. If unset, this will be the same as exact.
  - target_window : int, optional
    If the memento is of a redirect, allow up to this many seconds between the capture of the redirect and the capture of the target URL. (Note this does NOT apply when the originally requested memento didn’t exist and Wayback redirects to the next-closest-in-time one. That will always raise a MementoPlaybackError.) Default: 86,400 (24 hours).
- Returns
  - requests.Response
    An HTTP response with the content of the memento, including a history of any redirects involved. (For a complete history of all HTTP requests needed to obtain the memento [rather than historic redirects], check debug_history instead of history.)
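The target_window check amounts to simple time arithmetic; a sketch of the idea with made-up capture times:

```python
from datetime import datetime, timedelta, timezone

target_window = 86400  # the default: 24 hours, in seconds

# Made-up example: a redirect captured at one time whose target
# was captured three hours later.
redirect_capture = datetime(2018, 8, 16, 11, 19, 11, tzinfo=timezone.utc)
target_capture = redirect_capture + timedelta(hours=3)

gap = (target_capture - redirect_capture).total_seconds()
print(gap <= target_window)  # True: 3 hours is within the 24-hour window
```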
class wayback.CdxRecord(key, timestamp, url, mime_type, status_code, digest, length, raw_url, view_url)¶
Item from the iterable of results returned by WaybackClient.search().
These attributes contain information provided directly by CDX.
digest¶
Hash of the content, as a base 32 encoded SHA-1.

key¶
SURT-formatted URL.
length¶
Size of the captured content in bytes, such as 2767. This may be inaccurate. If the record is a “revisit record”, indicated by the MIME type 'warc/revisit', the length seems to be the length of the reference, not the length of the content itself.
mime_type¶
MIME type of the record, such as 'text/html', 'warc/revisit', or 'unk' (“unknown”) if this information was not captured.
status_code¶
Status code returned by the server when the record was captured, such as 200. This may be None if the record is a revisit record.
timestamp¶
The capture time represented as a datetime.datetime, such as datetime.datetime(1996, 12, 31, 23, 58, 47).
url¶
The URL that was captured by this record, such as 'http://www.nasa.gov/'.
And these attributes are synthesized from the information provided by CDX.
raw_url¶
The URL to the raw captured content, such as 'http://web.archive.org/web/19961231235847id_/http://www.nasa.gov/'.
view_url¶
The URL to the public view on the Wayback Machine. In this view, the links and some subresources in the document are rewritten to point to Wayback URLs, and a navigation panel is added around the content. Example URL: 'http://web.archive.org/web/19961231235847/http://www.nasa.gov/'.
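A CdxRecord behaves like a named tuple over the fields above; a sketch constructing one by hand from the tutorial’s values, purely for illustration (search() normally builds these for you):

```python
from collections import namedtuple

# Stand-in definition mirroring the fields documented above.
CdxRecord = namedtuple('CdxRecord', [
    'key', 'timestamp', 'url', 'mime_type', 'status_code',
    'digest', 'length', 'raw_url', 'view_url',
])

record = CdxRecord(
    key='gov,nasa)/',
    timestamp=None,  # a datetime.datetime in real records
    url='http://www.nasa.gov/',
    mime_type='text/html',
    status_code=200,
    digest='MGIGF4GRGGF5GKV6VNCBAXOE3OR5BTZC',
    length=1811,
    raw_url='http://web.archive.org/web/19961231235847id_/http://www.nasa.gov/',
    view_url='http://web.archive.org/web/19961231235847/http://www.nasa.gov/',
)
print(record.url, record.status_code)  # dot access, as in the tutorial
```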
class wayback.WaybackSession(retries=6, backoff=2, timeout=None, user_agent=None)¶
A custom session object that pools network connections and resources for requests to the Wayback Machine.
- Parameters
  - retries : int, optional
    The maximum number of retries for requests.
  - backoff : int or float, optional
    Number of seconds from which to calculate how long to back off and wait when retrying requests. The first retry is always immediate, but subsequent retries increase by powers of 2:
      seconds = backoff * 2 ^ (retry number - 1)
    So if this was 4, retries would happen after the following delays: 0 seconds, 4 seconds, 8 seconds, 16 seconds, …
  - timeout : int or float or tuple of (int or float, int or float), optional
    A timeout to use for all requests. If not set, there will be no explicit timeout. See the Requests docs for more: http://docs.python-requests.org/en/master/user/advanced/#timeouts
  - user_agent : str, optional
    A custom user-agent string to use in all requests. Defaults to: wayback/{version} (+https://github.com/edgi-govdata-archiving/wayback)
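The retry schedule can be sketched in a few lines, assuming (per the example above) an immediate first retry and doubling waits starting at backoff seconds; the actual implementation may differ:

```python
def retry_delays(backoff, retries):
    # First retry is immediate; each subsequent wait doubles,
    # starting from `backoff` seconds.
    delays = [0]
    wait = backoff
    for _ in range(retries - 1):
        delays.append(wait)
        wait *= 2
    return delays

print(retry_delays(4, 5))  # [0, 4, 8, 16, 32]
```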
Utility Functions¶
wayback.memento_url_data(memento_url)¶
Get the original URL and date that a memento URL represents a capture of.
Examples

Extract original URL and date.

>>> url = ('http://web.archive.org/web/20170813195036/'
...        'https://arpa-e.energy.gov/?q=engage/events-workshops')
>>> memento_url_data(url)
('https://arpa-e.energy.gov/?q=engage/events-workshops', datetime.datetime(2017, 8, 13, 19, 50, 36))
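A memento URL embeds a 14-digit timestamp between the archive prefix and the original URL. A rough sketch of the parsing (simplified, not the library’s implementation):

```python
import re
from datetime import datetime

def parse_memento_url(memento_url):
    # Pull the 14-digit timestamp (with an optional 'id_' raw-mode
    # flag) and the original URL out of a Wayback memento URL.
    match = re.match(
        r'https?://web\.archive\.org/web/(\d{14})(?:id_)?/(.+)$',
        memento_url)
    if match is None:
        raise ValueError('Not a memento URL')
    timestamp, original = match.groups()
    return original, datetime.strptime(timestamp, '%Y%m%d%H%M%S')

url, date = parse_memento_url(
    'http://web.archive.org/web/20170813195036/'
    'https://arpa-e.energy.gov/?q=engage/events-workshops')
print(url)   # https://arpa-e.energy.gov/?q=engage/events-workshops
print(date)  # 2017-08-13 19:50:36
```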
Exception Classes¶
class wayback.exceptions.WaybackException¶
Base exception class for all Wayback-specific errors.
class wayback.exceptions.UnexpectedResponseFormat¶
Raised when data returned by the Wayback Machine is formatted in an unexpected or unparseable way.
class wayback.exceptions.BlockedByRobotsError¶
Raised when a URL can’t be queried in Wayback because it was blocked by a site’s robots.txt file.
class wayback.exceptions.BlockedSiteError¶
Raised when a URL has been blocked from access or querying in Wayback. This is often because of a takedown request. (URLs that are blocked because of robots.txt get a BlockedByRobotsError instead.)
class wayback.exceptions.MementoPlaybackError¶
Raised when a memento can’t be ‘played back’ (loaded) by the Wayback Machine for some reason. This is a server-side issue, not a problem in parsing data from Wayback.
class wayback.exceptions.RateLimitError(response)¶
Raised when the Wayback Machine responds with a 429 (too many requests) status code. In general, this package’s built-in limits should help you avoid ever hitting this, but if you are running multiple processes in parallel, you could go overboard.
- Attributes
  - retry_after : int, optional
    Recommended number of seconds to wait before retrying. If the Wayback Machine does not include it in the HTTP response, it will be set to None.
class wayback.exceptions.WaybackRetryError(retries, total_time, causal_error)¶
Raised when a request to the Wayback Machine has been retried and failed too many times. The number of tries before this exception is raised generally depends on your WaybackSession settings.
- Attributes
  - retries : int
    The number of retries that were attempted.
  - cause : Exception
    The actual, underlying error that would have caused a retry.
  - time : int
    The total time spent across all retried requests, in seconds.
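Because WaybackException is the base class for all of the errors above, one except clause can catch any Wayback-specific failure. A pure-Python sketch of the hierarchy (stand-in class definitions, not the real module):

```python
# Sketch of the hierarchy described above: every Wayback-specific
# error derives from WaybackException.
class WaybackException(Exception): ...
class UnexpectedResponseFormat(WaybackException): ...
class RateLimitError(WaybackException): ...

try:
    raise RateLimitError('429 Too Many Requests')
except WaybackException as error:
    caught = error  # the base class catches the subclass too
print(type(caught).__name__)  # RateLimitError
```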