Usage

Search for historical mementos (archived copies) of a URL. Download metadata about the mementos and/or the memento content itself.

Tutorial

What is the earliest memento of nasa.gov?

Instantiate a WaybackClient.

In [1]: from wayback import WaybackClient

In [2]: client = WaybackClient()

Search for all of Wayback’s records for nasa.gov.

In [3]: results = client.search('nasa.gov')

This statement should execute fairly quickly because it doesn’t actually do much work. The object we get back, results, is a generator, a “lazy” object from which we can pull results, one at a time. As we pull items out of it, it loads them as needed from the Wayback Machine in chronological order. We can see that results by itself is not informative:

In [4]: results
Out[4]: <generator object WaybackClient.search at 0x7fb97a4525d0>

There are a couple of ways to pull items out of a generator like results. One simple way is to use the built-in Python function next(), like so:

In [5]: record = next(results)

This takes a moment to run because, now that we’ve asked to see the first item in the generator, this lazy object goes to fetch a chunk of results from the Wayback Machine. Looking at the record in detail,

In [6]: record
Out[6]: CdxRecord(key='gov,nasa)/', timestamp=datetime.datetime(1996, 12, 31, 23, 58, 47, tzinfo=datetime.timezone.utc), url='http://www.nasa.gov/', mime_type='text/html', status_code=200, digest='MGIGF4GRGGF5GKV6VNCBAXOE3OR5BTZC', length=1811, raw_url='http://web.archive.org/web/19961231235847id_/http://www.nasa.gov/', view_url='http://web.archive.org/web/19961231235847/http://www.nasa.gov/')

we can find our answer: Wayback’s first memento of nasa.gov was in 1996. We can use dot access on record to access the timestamp specifically.

In [7]: record.timestamp
Out[7]: datetime.datetime(1996, 12, 31, 23, 58, 47, tzinfo=datetime.timezone.utc)
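
Besides next(), a for loop or itertools.islice also pulls items from a generator on demand. The sketch below uses a stand-in generator, fake_search (hypothetical, not part of wayback), so that it runs without network access; with a real client you would iterate over client.search('nasa.gov') instead.

```python
from itertools import islice


def fake_search():
    """Hypothetical stand-in for client.search(); yields capture years lazily."""
    for year in (1996, 1997, 1998, 1999, 2000):
        yield year


results = fake_search()

# next() pulls exactly one item.
first = next(results)

# islice() pulls a bounded number of further items without exhausting the generator.
next_two = list(islice(results, 2))

# A for loop (here, a comprehension) consumes whatever is left.
remaining = [year for year in results]
```

Because a generator remembers its position, each step continues where the previous one stopped.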

How many times does the word ‘mars’ appear on nasa.gov?

Above, we accessed the metadata for the oldest memento of nasa.gov, stored in the variable record. Starting from where we left off, we’ll access the content of the memento and do a very simple analysis.

The Wayback Machine provides multiple playback modes to view the data it has captured. The wayback.Mode.view mode is a copy edited for human viewers on the web, and the wayback.Mode.original mode is the original copy of what was captured when the page was scraped. For analysis purposes, we generally want original. (Check the documentation of wayback.Mode for a few other, less commonly used modes.)

Let’s download the original content using WaybackClient. (You could download the content directly with an HTTP library like requests, but WaybackClient adds extra tools for dealing with Wayback Machine servers.)

In [8]: from wayback import Mode

# `Mode.original` is the default and doesn't need to be explicitly set;
# we've set it here to show how you might choose other modes.
In [9]: response = client.get_memento(record, mode=Mode.original)

In [10]: content = response.content.decode()

We can use the built-in method count on strings to count the number of times that 'mars' appears in the content.

In [11]: content.count('mars')
Out[11]: 30

This is case-sensitive, so to be more accurate we should convert the content to lowercase first.

In [12]: content.lower().count('mars')
Out[12]: 39

We picked up nine additional occurrences that the case-sensitive count missed.
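
Note that str.count counts substrings, so a word such as 'marshal' would also be counted. If whole-word matches are wanted, a regular expression with word boundaries is one option. This is a minimal sketch on a made-up sample string; count_word is a hypothetical helper, not part of wayback.

```python
import re


def count_word(text, word):
    """Count case-insensitive, whole-word occurrences of `word` in `text`."""
    pattern = r'\b' + re.escape(word) + r'\b'
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


# Made-up sample text; with a real memento you would pass `content` instead.
sample = 'Mars rover lands on mars; the marshal watched MARS rise.'

# Substring counting also matches the 'mars' inside 'marshal'.
substring_count = sample.lower().count('mars')

# Whole-word counting skips 'marshal'.
whole_word_count = count_word(sample, 'mars')
```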

API Documentation

The Wayback Machine exposes its data through two different mechanisms, implementing two different standards for archival data, the CDX API and the Memento API. We implement a Python client that can speak both.

class wayback.WaybackClient(session=None)

A client for retrieving data from the Internet Archive’s Wayback Machine.

You can use a WaybackClient as a context manager. When exiting, it will close the session it’s using (if you’ve passed in a custom session, make sure not to use the context manager functionality unless you want to live dangerously).

Parameters

session (requests.Session, optional)

search(self, url, *, matchType=None, limit=None, offset=None, fastLatest=None, gzip=None, from_date=None, to_date=None, filter_field=None, collapse=None, showResumeKey=True, resumeKey=None, page=None, pageSize=None, resolveRevisits=True, skip_malformed_results=True, previous_result=None, **kwargs)

Search archive.org’s CDX API for all captures of a given URL.

This will automatically page through all results for a given search.

Returns an iterator of CdxRecord objects. The StopIteration value is the total count of found captures.
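
A generator’s return value is carried on the StopIteration exception, so a plain for loop discards it. The sketch below shows one way to capture it, using a stand-in generator (fake_search and drain are hypothetical names) rather than a live search.

```python
def fake_search():
    """Hypothetical stand-in for WaybackClient.search(): yields records, returns a count."""
    records = ['capture-1', 'capture-2', 'capture-3']
    for record in records:
        yield record
    return len(records)


def drain(generator):
    """Collect every yielded item plus the generator's return value."""
    items = []
    while True:
        try:
            items.append(next(generator))
        except StopIteration as stop:
            # `stop.value` holds whatever the generator returned.
            return items, stop.value


items, total = drain(fake_search())
```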

Note that even URLs without wildcards may return results with different URLs. Search results are matched by url_key, which is a SURT-formatted, canonicalized URL:

Does not differentiate between HTTP and HTTPS
Is not case-sensitive
Treats www. and www*. subdomains the same as no subdomain at all
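
The following is a deliberately simplified illustration of the three rules above, not the canonicalization the CDX server actually performs (which covers many more cases). simplified_url_key is a hypothetical helper written only for this example.

```python
import re
from urllib.parse import urlsplit


def simplified_url_key(url):
    """Rough approximation of a SURT-style key, for illustration only."""
    # Lowercasing the whole URL makes the comparison case-insensitive.
    parts = urlsplit(url.lower())
    # Drop a leading 'www.' or 'www<digits>.' label from the host.
    host = re.sub(r'^www\d*\.', '', parts.hostname or '')
    # SURT keys list host labels in reverse order, separated by commas.
    reversed_host = ','.join(reversed(host.split('.')))
    path = parts.path or '/'
    query = f'?{parts.query}' if parts.query else ''
    # The scheme is omitted, so HTTP and HTTPS produce the same key.
    return f'{reversed_host}){path}{query}'


keys = {
    simplified_url_key('http://www.nasa.gov/'),
    simplified_url_key('https://NASA.gov/'),
    simplified_url_key('http://www2.nasa.gov'),
}
```

All three inputs collapse to the single key 'gov,nasa)/', which is the key shown for the nasa.gov record in the tutorial.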

Note that not all CDX API parameters are supported. In particular, this does not support: output, fl, showDupeCount, showSkipCount, lastSkipTimestamp, showNumPages, showPagedIndex.

Parameters

url (str): The URL to query for captures of.
matchType (str, optional): Must be one of ‘exact’, ‘prefix’, ‘host’, or ‘domain’. The default value is calculated based on the format of url.
limit (int, optional): Maximum number of results per page (this iterator will continue to move through all pages unless showResumeKey=False, though).
offset (int, optional): Skip the first N results.
fastLatest (bool, optional): Get faster results when using a negative value for limit. It may return a variable number of results.
gzip (bool, optional): Whether output should be gzipped.
from_date (datetime or date, optional): Only include captures after this date. Equivalent to the from argument in the CDX API. If it does not have a time zone, it is assumed to be in UTC.
to_date (datetime or date, optional): Only include captures before this date. Equivalent to the to argument in the CDX API. If it does not have a time zone, it is assumed to be in UTC.
filter_field (str, optional): A filter for any field in the results. Equivalent to the filter argument in the CDX API. (format: [!]field:regex)
collapse (str, optional): Collapse consecutive results that match on a given field. (format: fieldname or fieldname:N – N is the number of chars to match.)
showResumeKey (bool, optional): If False, don’t continue to iterate through all pages of results. The default value is True.
resumeKey (str, optional): Start returning results from a specified resumption point/offset. The value for this is supplied by the previous page of results when showResumeKey is True.
page (int, optional): If using paging, start from this page number (note: paging, as opposed to using resumeKey, is somewhat complicated because of the interplay with indexes and index sizes).
pageSize (int, optional): The number of index blocks to examine for each page of results. Index blocks generally cover about 3,000 items, so setting pageSize=1 might return anywhere from 0 to 3,000 results per page.
resolveRevisits (bool, optional): Attempt to resolve warc/revisit records to their actual content type and response code. Not supported on all CDX servers. Defaults to True.
skip_malformed_results (bool, optional): If true, don’t yield records that look like they have no actual memento associated with them. Some crawlers will erroneously attempt to capture bad URLs like http://mailto:someone@domain.com or http://data:image/jpeg;base64,AF34… and so on. This is a filter performed client side and is not a CDX API argument. (Default: True)
previous_result (str, optional): For internal use. The CDX API sometimes returns repeated results. This is used to track the previous result so we can filter out the repeats.
**kwargs: Any additional CDX API options.
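
To make the collapse parameter more concrete, here is a local sketch of what “collapse consecutive results” means. The real collapsing is done by the CDX server; collapse_consecutive is a hypothetical helper, the records are made up, and the sketch assumes the first record of each run is the one that is kept.

```python
from itertools import groupby


def collapse_consecutive(records, field, chars=None):
    """Keep the first record of each run of consecutive records whose `field`
    (or its first `chars` characters) is the same."""
    def key(record):
        value = str(record[field])
        return value if chars is None else value[:chars]

    # groupby() only groups *adjacent* items with equal keys.
    return [next(group) for _, group in groupby(records, key=key)]


# Made-up records with 14-digit, CDX-style timestamps.
records = [
    {'timestamp': '19961231235847', 'digest': 'AAA'},
    {'timestamp': '19961231235959', 'digest': 'AAA'},
    {'timestamp': '19970101000500', 'digest': 'BBB'},
    {'timestamp': '19970102120000', 'digest': 'AAA'},
]

# Comparable to collapse='timestamp:8': at most one record per calendar day.
one_per_day = collapse_consecutive(records, 'timestamp', 8)

# Comparable to collapse='digest': only adjacent duplicates are merged,
# so the last 'AAA' record survives.
by_digest = collapse_consecutive(records, 'digest')
```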

Yields

version: CdxRecord: A CdxRecord encapsulating one capture or revisit

Raises

UnexpectedResponseFormat: If the CDX response was not parseable.

References

https://github.com/internetarchive/wayback/tree/master/wayback-cdx-server

get_memento(self, url, datetime=None, mode=<Mode.original: 'id_'>, *, exact=True, exact_redirects=None, target_window=86400, follow_redirects=True)

Fetch a memento (an archived HTTP response) from the Wayback Machine.

Not all mementos can be successfully fetched (or “played back” in Wayback terms). In this case, get_memento can load the next-closest-in-time memento or it will raise wayback.exceptions.MementoPlaybackError depending on the value of the exact and exact_redirects parameters (see more details below).

Parameters

url (string or CdxRecord)

URL to retrieve a memento of. This can be any of:

A normal URL (e.g. http://www.noaa.gov/). When using this form, you must also specify datetime.
A CdxRecord retrieved from wayback.WaybackClient.search().
A URL of the memento in Wayback, e.g. http://web.archive.org/web/20180816111911id_/http://www.noaa.gov/

datetime (datetime.datetime or datetime.date or str, optional)

The time at which to retrieve a memento of url. If url is a wayback.CdxRecord or full memento URL, this parameter can be omitted.

mode (wayback.Mode or str, optional)

The playback mode of the memento. This determines whether the content of the returned memento is exactly as originally captured (the default) or modified in some way. See wayback.Mode for a description of possible values.

For more details, see: http://archive-access.sourceforge.net/projects/wayback/administrator_manual.html#Archival_URL_Replay_Mode

Default: wayback.Mode.original

exact (boolean, optional)

If false and the requested memento either doesn’t exist or can’t be played back, this returns the closest-in-time memento to the requested one, so long as it is within target_window. If there was no memento in the target window or if exact=True, then this will raise wayback.exceptions.MementoPlaybackError. Default: True

exact_redirects (boolean, optional)

If false and the requested memento is a redirect whose target doesn’t exist or can’t be played back, this returns the closest-in-time memento to the intended target, so long as it is within target_window. If unset, this will be the same as exact.

target_window (int, optional)

If the memento is of a redirect, allow up to this many seconds between the capture of the redirect and the capture of the redirect’s target URL. This window also applies to the first memento if exact=False and the originally requested memento was not available. Defaults to 86,400 (24 hours).

follow_redirects (boolean, optional)

If true (the default), get_memento will follow historical redirects to return the content that a web browser would have ultimately displayed at the requested URL and time, rather than the memento of an HTTP redirect response (i.e. a 3xx status code). That is, if http://example.com/a redirected to http://example.com/b, then this method returns the memento for /a when follow_redirects=False and the memento for /b when follow_redirects=True. Default: True

Returns

dict (requests.Response): An HTTP response with the content of the memento, including a history of any redirects involved. (For a complete history of all HTTP requests needed to obtain the memento [rather than historic redirects], check debug_history instead of history.)
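
The interaction of exact=False and target_window can be pictured as a nearest-neighbor lookup with a cutoff. The sketch below only illustrates the documented behavior using made-up capture times; closest_within_window is a hypothetical helper, and unlike this sketch, get_memento raises wayback.exceptions.MementoPlaybackError instead of returning None.

```python
from datetime import datetime, timedelta, timezone


def closest_within_window(requested, available, target_window=86400):
    """Return the capture time nearest to `requested`, or None if the nearest
    one is more than `target_window` seconds away."""
    if not available:
        return None
    nearest = min(available, key=lambda t: abs(t - requested))
    if abs(nearest - requested) <= timedelta(seconds=target_window):
        return nearest
    return None


requested = datetime(2018, 8, 16, 12, 0, tzinfo=timezone.utc)
captures = [
    datetime(2018, 8, 10, 0, 0, tzinfo=timezone.utc),
    datetime(2018, 8, 16, 20, 30, tzinfo=timezone.utc),
    datetime(2018, 9, 1, 0, 0, tzinfo=timezone.utc),
]

# The nearest capture is 8.5 hours away, inside the default 24-hour window.
match = closest_within_window(requested, captures)

# With a 1-hour window, nothing qualifies.
no_match = closest_within_window(requested, captures, target_window=3600)
```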

class wayback.CdxRecord(key, timestamp, url, mime_type, status_code, digest, length, raw_url, view_url)

Item from iterable of results returned by WaybackClient.search()

These attributes contain information provided directly by CDX.

digest: Content hashed as a base 32 encoded SHA-1.

key: SURT-formatted URL

length: Size of captured content in bytes, such as 2767. This may be inaccurate. If the record is a “revisit record”, indicated by MIME type 'warc/revisit', the length seems to be the length of the reference, not the length of the content itself.

mime_type: MIME type of record, such as 'text/html', 'warc/revisit' or 'unk' (“unknown”) if this information was not captured.

status_code: Status code returned by the server when the record was captured, such as 200. This may be None if the record is a revisit record.

timestamp: The capture time represented as a datetime.datetime, such as datetime.datetime(1996, 12, 31, 23, 58, 47, tzinfo=timezone.utc).

url: The URL that was captured by this record, such as 'http://www.nasa.gov/'.

And these attributes are synthesized from the information provided by CDX.

raw_url: The URL to the raw captured content, such as 'http://web.archive.org/web/19961231235847id_/http://www.nasa.gov/'.

view_url: The URL to the public view on Wayback Machine. In this view, the links and some subresources in the document are rewritten to point to Wayback URLs. There is also a navigation panel around the content. Example URL: 'http://web.archive.org/web/19961231235847/http://www.nasa.gov/'.
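
The two synthesized URLs differ only in the mode suffix that follows the 14-digit timestamp. The sketch below reconstructs both example URLs from a timestamp and a captured URL; build_wayback_url is a hypothetical helper based on the URL shapes shown above, not the library’s own code.

```python
from datetime import datetime, timezone

ARCHIVE_PREFIX = 'http://web.archive.org/web/'


def build_wayback_url(timestamp, url, mode=''):
    """Build a Wayback URL; `mode` is a suffix such as 'id_' ('' for the view mode)."""
    # datetime supports strftime codes directly in a format spec.
    return f'{ARCHIVE_PREFIX}{timestamp:%Y%m%d%H%M%S}{mode}/{url}'


timestamp = datetime(1996, 12, 31, 23, 58, 47, tzinfo=timezone.utc)
raw_url = build_wayback_url(timestamp, 'http://www.nasa.gov/', 'id_')
view_url = build_wayback_url(timestamp, 'http://www.nasa.gov/')
```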

class wayback.Memento(*, url, timestamp, mode, memento_url, status_code, headers, encoding, raw, raw_headers, history, debug_history)

Represents a memento (an archived HTTP response). This object is similar to a response object from the popular “Requests” package, although it has some differences designed to differentiate historical information vs. current metadata about the stored memento (for example, the headers attribute lists the headers recorded in the memento, and does not include additional headers that provide metadata about the Wayback Machine).

Note that, like an HTTP response, this object represents a potentially open network connection to the Wayback Machine. Reading the content or text attributes will read all the data being received and close the connection automatically, but if you do not read those properties, you must make sure to call close() to close the connection. Alternatively, you can use a Memento as a context manager. The connection will be closed for you when the context ends:

>>> with a_memento:
...     do_something()
>>> # Connection is automatically closed here.

Fields

encoding (str): The text encoding of the response, e.g. 'utf-8'.

headers (dict): A dict representing the headers of the archived HTTP response. The keys are case-sensitive.

history (tuple[wayback.Memento]): A list of wayback.Memento objects that were redirects and were followed to produce this memento.

debug_history (tuple[str]): List of all URL redirects followed in order to produce this memento. These are “memento URLs” – that is, they are absolute URLs to the Wayback Machine like http://web.archive.org/web/20180816111911id_/http://www.noaa.gov/, rather than URLs of captured redirects, like http://www.noaa.gov. Many of the URLs in this list do not represent actual mementos.

status_code (int): The HTTP status code of the archived HTTP response.

mode (str): The playback mode used to produce the Memento.

timestamp (datetime.datetime): The time the memento was originally captured. This includes tzinfo, and will always be in UTC.

url (str): The URL that the memento represents, e.g. http://www.noaa.gov.

memento_url (str): The URL at which the memento was fetched from the Wayback Machine, e.g. http://web.archive.org/web/20180816111911id_/http://www.noaa.gov/.

ok (bool): Whether the response had a non-error status (i.e. < 400).

is_redirect (bool): Whether the response was a redirect (i.e. had a 3xx status).

content (bytes): The body of the archived HTTP response in bytes.

text (str): The body of the archived HTTP response decoded as a string.

close(self): Close the HTTP response for this Memento. This happens automatically if you read content or text, and if you use the memento as a context manager. This method is always safe to call – it does nothing if the response has already been closed.

classmethod parse_memento_headers(raw_headers, url='http://web.archive.org/')

Extract historical headers from the Memento HTTP response’s headers.

Parameters

raw_headers (dict): A dict of HTTP headers from the Memento’s HTTP response.
url (str, optional): The URL of the resource the headers are being parsed for. It’s used when header data contains relative/incomplete URL information.

Returns

dict

class wayback.WaybackSession(retries=6, backoff=2, timeout=60, user_agent=None)

A custom session object that pools network connections and resources for requests to the Wayback Machine.

Parameters

retries (int, optional)

The maximum number of retries for requests.

backoff (int or float, optional)

Number of seconds from which to calculate how long to back off and wait when retrying requests. The first retry is always immediate, but subsequent retries increase by powers of 2:

seconds = backoff * 2 ^ (retry number - 1)

So if this was 4, retries would happen after the following delays: 0 seconds, 4 seconds, 8 seconds, 16 seconds, …
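
The schedule can be reproduced with a few lines of Python. This sketch assumes that “retry number” starts counting at 0 for the immediate first retry, an interpretation chosen because it matches the delays listed above; retry_delays is a hypothetical helper, not part of wayback.

```python
def retry_delays(backoff, retries):
    """Delay in seconds before each retry: the first is immediate, then
    backoff * 2 ** n for n = 0, 1, 2, ..."""
    delays = []
    for attempt in range(retries):
        if attempt == 0:
            # The first retry happens without waiting.
            delays.append(0)
        else:
            delays.append(backoff * 2 ** (attempt - 1))
    return delays


example = retry_delays(backoff=4, retries=4)

# Under the same assumption, the default settings (retries=6, backoff=2)
# give this schedule.
defaults = retry_delays(backoff=2, retries=6)
```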

timeout (int or float or tuple of (int or float, int or float), optional)

A timeout to use for all requests. (Default: 60) See the Requests docs for more: http://docs.python-requests.org/en/master/user/advanced/#timeouts

user_agent (str, optional)

A custom user-agent string to use in all requests. Defaults to: wayback/{version} (+https://github.com/edgi-govdata-archiving/wayback)

reset(self): Reset any network connections the session is using.

Utilities

wayback.memento_url_data(memento_url)

Get the original URL, time, and mode that a memento URL represents a capture of.

Returns

url (str): The URL that the memento is a capture of.
time (datetime.datetime): The time the memento was captured in the UTC timezone.
mode (str): The playback mode.

Examples

Extract original URL, time and mode.

>>> url = ('http://web.archive.org/web/20170813195036id_/'
...        'https://arpa-e.energy.gov/?q=engage/events-workshops')
>>> memento_url_data(url)
('https://arpa-e.energy.gov/?q=engage/events-workshops',
 datetime.datetime(2017, 8, 13, 19, 50, 36, tzinfo=timezone.utc),
 'id_')
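
To show how a memento URL is structured, here is a simplified stand-in that produces the same tuple for the example above. parse_memento_url and its regular expression are assumptions made for illustration and handle only the common URL shape; in practice, use wayback.memento_url_data.

```python
import re
from datetime import datetime, timezone

# Prefix, 14-digit timestamp, optional three-character mode suffix, original URL.
MEMENTO_URL_PATTERN = re.compile(
    r'^https?://web\.archive\.org/web/(\d{14})(\w\w_)?/(.+)$'
)


def parse_memento_url(memento_url):
    """Split a memento URL into (original URL, capture time, mode suffix)."""
    match = MEMENTO_URL_PATTERN.match(memento_url)
    if match is None:
        raise ValueError(f'Not a memento URL: {memento_url!r}')
    raw_time, mode, url = match.groups()
    time = datetime.strptime(raw_time, '%Y%m%d%H%M%S').replace(tzinfo=timezone.utc)
    # A missing suffix is reported here as an empty string.
    return url, time, mode or ''


parsed = parse_memento_url(
    'http://web.archive.org/web/20170813195036id_/'
    'https://arpa-e.energy.gov/?q=engage/events-workshops'
)
```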

class wayback.Mode(value)

An enum describing the playback mode of a memento. When requesting a memento (e.g. with wayback.WaybackClient.get_memento()), you can use these values to determine how the response body should be formatted.

For more details, see: http://archive-access.sourceforge.net/projects/wayback/administrator_manual.html#Archival_URL_Replay_Mode

Examples

>>> waybackClient.get_memento('https://noaa.gov/',
...                           datetime=datetime.datetime(2018, 1, 2),
...                           mode=wayback.Mode.view)

Values

original: Returns the HTTP response body as originally captured.

view: Formats the response body so it can be viewed with a web browser. URLs for links and subresources like scripts, stylesheets, images, etc. will be modified to point to the equivalent memento in the Wayback Machine so that the resulting page looks as similar as possible to how it would have appeared when originally captured. It’s mainly meant for use with HTML pages. This is the playback mode you typically use when browsing the Wayback Machine with a web browser.

javascript: Formats the response body by updating URLs, similar to Mode.view, but designed for JavaScript instead of HTML.

css: Formats the response body by updating URLs, similar to Mode.view, but designed for CSS instead of HTML.

image: Formats the response body similar to Mode.view, but designed for image files instead of HTML.

Exception Classes

class wayback.exceptions.WaybackException: Base exception class for all Wayback-specific errors.

class wayback.exceptions.UnexpectedResponseFormat: Raised when data returned by the Wayback Machine is formatted in an unexpected or unparseable way.

class wayback.exceptions.BlockedByRobotsError: Raised when a URL can’t be queried in Wayback because it was blocked by a site’s robots.txt file.

class wayback.exceptions.BlockedSiteError: Raised when a URL has been blocked from access or querying in Wayback. This is often because of a takedown request. (URLs that are blocked because of robots.txt get a BlockedByRobotsError instead.)

class wayback.exceptions.MementoPlaybackError: Raised when a Memento can’t be ‘played back’ (loaded) by the Wayback Machine for some reason. This is a server-side issue, not a problem in parsing data from Wayback.

class wayback.exceptions.NoMementoError

Raised when there was no memento available for a given URL. This might mean the given URL has no mementos at all or that none are available for playback.

This also means you should not try to request a memento of the same URL in a different timeframe. If there may be other mementos of the URL available, you’ll get a different error.

class wayback.exceptions.RateLimitError(response)

Raised when the Wayback Machine responds with a 429 (too many requests) status code. In general, this package’s built-in limits should help you avoid ever hitting this, but if you are running multiple processes in parallel, you could go overboard.

Attributes

retry_after (int, optional): Recommended number of seconds to wait before retrying. If the Wayback Machine does not include it in the HTTP response, it will be set to None.
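
One way to honor retry_after is to pause for the recommended time before trying again, falling back to a default when it is None. The sketch below uses a stand-in exception so it runs without wayback installed; FakeRateLimitError, call_with_rate_limit_retry, and the 60-second default are all assumptions for illustration. With the real library you would pass error_type=wayback.exceptions.RateLimitError.

```python
import time


class FakeRateLimitError(Exception):
    """Hypothetical stand-in for wayback.exceptions.RateLimitError."""

    def __init__(self, retry_after=None):
        super().__init__('429 Too Many Requests')
        self.retry_after = retry_after


def call_with_rate_limit_retry(func, max_attempts=3, default_wait=60,
                               sleep=time.sleep, error_type=FakeRateLimitError):
    """Call `func`, pausing for `retry_after` seconds (or `default_wait` when it
    is None) each time `error_type` is raised, up to `max_attempts` calls."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except error_type as error:
            if attempt == max_attempts:
                raise
            wait = default_wait if error.retry_after is None else error.retry_after
            sleep(wait)


# Demo: fail twice, then succeed, recording waits instead of really sleeping.
outcomes = [FakeRateLimitError(retry_after=5), FakeRateLimitError(), 'ok']
waits = []


def flaky():
    outcome = outcomes.pop(0)
    if isinstance(outcome, Exception):
        raise outcome
    return outcome


result = call_with_rate_limit_retry(flaky, sleep=waits.append)
```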

class wayback.exceptions.WaybackRetryError(retries, total_time, causal_error)

Raised when a request to the Wayback Machine has been retried and failed too many times. The number of tries before this exception is raised generally depends on your WaybackSession settings.

Attributes

retries (int): The number of retries that were attempted.
cause (Exception): The actual, underlying error that would have caused a retry.
time (int): The total time spent across all retried requests, in seconds.

class wayback.exceptions.SessionClosedError: Raised when a Wayback session is used to make a request after it has been closed and disabled.