Usage
Search for historical mementos (archived copies) of a URL. Download metadata about the mementos and/or the memento content itself.
Tutorial
What is the earliest memento of nasa.gov?
Instantiate a WaybackClient.
In [1]: from wayback import WaybackClient
In [2]: client = WaybackClient()
Search for all of Wayback’s records for nasa.gov.
In [3]: results = client.search('nasa.gov')
This statement should execute fairly quickly because it doesn’t actually do
much work. The object we get back, results, is a generator, a “lazy”
object from which we can pull results, one at a time. As we pull items
out of it, it loads them as needed from the Wayback Machine in chronological
order. We can see that results by itself is not informative:
In [4]: results
Out[4]: <generator object WaybackClient.search at 0x55853ba9a8a0>
There are a couple of ways to pull items out of a generator like results. One
simple way is to use the built-in Python function next(), like so:
In [5]: record = next(results)
This takes a moment to run because, now that we’ve asked to see the first item in the generator, this lazy object goes to fetch a chunk of results from the Wayback Machine. Looking at the record in detail,
In [6]: record
Out[6]: CdxRecord(urlkey='gov,nasa)/', timestamp=datetime.datetime(1996, 12, 31, 23, 58, 47, tzinfo=datetime.timezone.utc), original='http://www.nasa.gov/', mimetype='text/html', statuscode=200, digest='MGIGF4GRGGF5GKV6VNCBAXOE3OR5BTZC', length=1811)
we can find our answer: Wayback’s first memento of nasa.gov was in 1996. We
can use dot access on record to access the timestamp specifically.
In [7]: record.timestamp
Out[7]: datetime.datetime(1996, 12, 31, 23, 58, 47, tzinfo=datetime.timezone.utc)
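The lazy, chunked loading described above can be imitated without any network access. The sketch below is only an illustration: fake_search, fetch_page and the PAGES data are hypothetical stand-ins invented for this example and are not part of wayback. It also shows the second common way to pull items out of a generator, a for loop (cut short here with itertools.islice).

```python
from itertools import islice

PAGES = [["1996", "1997"], ["1998", "1999"], ["2000"]]  # made-up data
fetched = []  # records which pages were "downloaded"


def fetch_page(index):
    """Hypothetical stand-in for one request to a remote server."""
    fetched.append(index)
    return PAGES[index]


def fake_search():
    """Yield items one at a time, loading each page only when it is needed."""
    for index in range(len(PAGES)):
        for item in fetch_page(index):
            yield item


results = fake_search()           # nothing has been fetched yet
first = next(results)             # triggers loading of the first page
for item in islice(results, 2):   # pulls two more items, then stops
    print(item)
print(first, fetched)             # the third page was never requested
```

Stopping early means the remaining pages are never requested, which is why creating the generator is cheap and the cost is paid as items are pulled.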
How many times does the word ‘mars’ appear on nasa.gov?
Above, we accessed the metadata for the oldest memento of nasa.gov, stored in
the variable record. Starting from where we left off, we’ll access the
content of the memento and do a very simple analysis.
The Wayback Machine provides multiple playback modes to view the data it has
captured. The wayback.Mode.view mode is a copy edited for human viewers
on the web, and the wayback.Mode.original mode is the original copy of
what was captured when the page was scraped. For analysis purposes, we
generally want original. (Check the documentation of wayback.Mode
for a few other, less commonly used modes.)
Let’s download the original content using WaybackClient. (You could
download the content directly with an HTTP library like requests, but
WaybackClient adds extra tools for dealing with Wayback Machine servers.)
In [8]: from wayback import Mode
# `Mode.original` is the default and doesn't need to be explicitly set;
# we've set it here to show how you might choose other modes.
In [9]: response = client.get_memento(record, mode=Mode.original)
In [10]: content = response.content.decode()
We can use the built-in method count on strings to count the number of
times that 'mars' appears in the content.
In [11]: content.count('mars')
Out[11]: 30
This is case-sensitive, so to be more accurate we should convert the content to lowercase first.
In [12]: content.lower().count('mars')
Out[12]: 39
We picked up nine additional occurrences that the original, case-sensitive count missed.
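Note that counting substrings also counts 'mars' inside longer words. If whole words are what you want, a regular expression with word boundaries is one option. This is a minimal sketch using only the standard library; count_word is a name chosen for this example and the sample text is made up.

```python
import re


def count_word(text, word):
    """Count case-insensitive, whole-word occurrences of `word` in `text`."""
    pattern = r"\b" + re.escape(word) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


sample = "Mars rover lands on MARS; marshmallows are unrelated to mars."
substring_count = sample.lower().count("mars")  # also counts "marshmallows"
word_count = count_word(sample, "mars")         # whole words only
print(substring_count, word_count)
```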
API Documentation
The Wayback Machine exposes its data through two different mechanisms, implementing two different standards for archival data, the CDX API and the Memento API. We implement a Python client that can speak both.
- class wayback.WaybackClient(session=None)
A client for retrieving data from the Internet Archive’s Wayback Machine.
You can use a WaybackClient as a context manager. When exiting, it will close the session it’s using (if you’ve passed in a custom session, make sure not to use the context manager functionality unless you want to live dangerously).
- Parameters:
- session
WaybackSession, optional
- search(url, *, match_type=None, limit=1000, offset=None, fast_latest=None, from_date=None, to_date=None, filter_field=None, collapse=None, resolve_revisits=True, skip_malformed_results=True, matchType=None, fastLatest=None, resolveRevisits=None)
Search archive.org’s CDX API for all captures of a given URL. This returns an iterator of CdxRecord objects. The StopIteration value is the total count of found captures.
Results include captures with similar, but not exactly matching URLs. They are matched by a SURT-formatted, canonicalized URL that:
Does not differentiate between HTTP and HTTPS,
Is not case-sensitive, and
Treats www. and www*. subdomains the same as no subdomain at all.
This will automatically page through all results for a given search. If you want fewer results, you can stop iterating early:
from itertools import islice
first10 = list(islice(client.search(...), 10))
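To get a feel for the URL matching rules above, here is a rough sketch of a SURT-like key. It is an approximation written for illustration only: approximate_urlkey is a hypothetical helper, it assumes the URL includes a scheme, and the canonicalization Wayback actually performs handles many more cases.

```python
import re
from urllib.parse import urlsplit


def approximate_urlkey(url):
    """Rough, simplified imitation of a SURT-style key (illustration only)."""
    parts = urlsplit(url.lower())            # matching is not case-sensitive
    host = parts.hostname or ""
    host = re.sub(r"^www\d*\.", "", host)    # drop www., www2., and so on
    key = ",".join(reversed(host.split("."))) + ")" + (parts.path or "/")
    if parts.query:
        key += "?" + parts.query
    return key                               # the scheme is ignored entirely


keys = [
    approximate_urlkey(candidate)
    for candidate in ("http://www.nasa.gov/", "https://NASA.gov/", "http://www2.nasa.gov")
]
print(keys)
```

All three URLs collapse to the same key, which is why a search for one of them also returns captures of the others.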
- Parameters:
- url
str The URL to search for captures of.
Special patterns in url imply a value for the match_type parameter and match multiple URLs:
If the URL starts with *. (e.g. *.epa.gov) OR match_type='domain', the search will include all URLs at the given domain and its subdomains.
If the URL ends with /* (e.g. https://epa.gov/*) OR match_type='prefix', the search will include all URLs that start with the text up to the *.
Otherwise, this returns matches just for the requested URL.
- match_type
str, optional Determines how to interpret the url parameter. It must be one of the following:
exact (default) returns results matching the requested URL (see notes about SURT above; this is not an exact string match of the URL you pass in).
prefix returns results that start with the requested URL.
host returns results from all URLs at the host in the requested URL.
domain returns results from all URLs at the domain or any subdomain of the requested URL.
The default value is calculated based on the format of url.
- limit
int, default: 1000 Maximum number of results per request to the API (not the maximum number of results this function yields).
Negative values return the most recent N results.
Positive values are complicated! The search server will only scan so much data on each query, and if it finds fewer than limit results before hitting its own internal limits, it will behave as if there are no more results, even though there may be.
Unfortunately, ideal values for limit aren’t very predictable because the search server combines data from different sources, and they do not all behave the same. Their parameters may also be changed over time.
In general…
The default value should work well in typical cases.
For frequently captured URLs, you may want to set a higher value (e.g. 12,000) for more efficient querying.
For infrequently captured URLs, you may want to set a lower value (e.g. 100 or even 10) to ensure that your query does not hit internal limits before returning.
For extremely infrequently captured URLs, you may simply want to call search() multiple times with different, close together from_date and to_date values.
- offset
int, optional Skip the first N results.
- fast_latest
bool, optional Get faster results when using a negative value for limit. It may return a variable number of results that doesn’t match the value of limit. For example, search('http://epa.gov', limit=-10, fast_latest=True) may return any number of results between 1 and 10.
- from_date
datetime or date, optional Only include captures after this date. Equivalent to the from argument in the CDX API. If it does not have a time zone, it is assumed to be in UTC.
- to_date
datetime or date, optional Only include captures before this date. Equivalent to the to argument in the CDX API. If it does not have a time zone, it is assumed to be in UTC.
- filter_field
str or list of str or tuple of str, optional A filter or list of filters for any field in the results. Equivalent to the filter argument in the CDX HTTP API. Format: [!]field:regex or ~[!]field:substring, e.g. '!statuscode:200' to select only captures whose status code is not 200, or '~urlkey:feature' to select only captures where the SURT-formatted URL key has “feature” somewhere in it.
To apply multiple filters, use a list or tuple of strings instead of a single string, e.g. ['statuscode:200', 'urlkey:.*feature.*'].
Regexes are matched against the entire field value. For example, 'statuscode:2' will never match, because all statuscode values are three characters. Instead, to match all status codes with a “2” in them, use 'statuscode:.*2.*'. Add a ! before the field name to negate the match.
Valid field names are: urlkey, timestamp, original, mimetype, statuscode, digest, length.
- collapse
str, optional Collapse consecutive results that match on a given field. (format: fieldname or fieldname:N – N is the number of chars to match.)
- resolve_revisits
bool, default: True Attempt to resolve warc/revisit records to their actual content type and response code. Not supported on all CDX servers.
- skip_malformed_results
bool, default: True If true, don’t yield records that look like they have no actual memento associated with them. Some crawlers will erroneously attempt to capture bad URLs like http://mailto:someone@domain.com or http://data:image/jpeg;base64,AF34... and so on. This is a filter performed client side and is not a CDX API argument.
- Yields:
- Raises:
UnexpectedResponseFormat If the CDX response was not parseable.
Notes
Several CDX API parameters are not relevant or handled automatically by this function. This does not support: output, fl, showDupeCount, showSkipCount, lastSkipTimestamp, showNumPages, showPagedIndex.
It also does not support page and pageSize for pagination because they work differently from the resumeKey method this uses, and results do not include recent captures when using them.
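For the advice under limit about calling search() several times with close together from_date and to_date values, the date ranges themselves can be generated with the standard library. This is a minimal sketch: date_windows is a hypothetical helper, and the call to client.search() is shown only in a comment because it needs network access. Adjacent windows share a boundary here, so a capture that falls exactly on a boundary might be returned twice; deduplicate if that matters.

```python
from datetime import datetime, timedelta, timezone


def date_windows(start, end, step):
    """Yield consecutive (from_date, to_date) pairs covering start..end."""
    current = start
    while current < end:
        upper = min(current + step, end)
        yield current, upper
        current = upper


windows = list(date_windows(
    datetime(2020, 1, 1, tzinfo=timezone.utc),
    datetime(2020, 1, 31, tzinfo=timezone.utc),
    timedelta(days=10),
))
for from_date, to_date in windows:
    # With a real client, each pair could be passed along as, for example:
    #   client.search(url, from_date=from_date, to_date=to_date)
    print(from_date.date(), "to", to_date.date())
```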
References
HTTP API Docs: https://github.com/internetarchive/wayback/tree/master/wayback-cdx-server
SURT formatting: http://crawler.archive.org/articles/user_manual/glossary.html#surt
SURT implementation: https://github.com/internetarchive/surt
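The whole-value matching rule for filter_field is easy to get wrong, so here is a local approximation of how a single filter string is interpreted. It is a sketch under stated assumptions: matches_filter and the sample record are made up for this example, and it uses Python’s re module, whereas the CDX server applies its own regex engine on the server side.

```python
import re


def matches_filter(record, filter_text):
    """Approximate one `[!]field:regex` or `~[!]field:substring` filter locally."""
    substring = filter_text.startswith("~")
    if substring:
        filter_text = filter_text[1:]
    negate = filter_text.startswith("!")
    if negate:
        filter_text = filter_text[1:]
    field, _, expression = filter_text.partition(":")
    value = str(record[field])
    if substring:
        found = expression in value
    else:
        # The regex must match the whole field value, not just part of it.
        found = re.fullmatch(expression, value) is not None
    return found != negate


record = {"urlkey": "gov,nasa)/feature/mars", "statuscode": 200}  # made-up record
print(matches_filter(record, "statuscode:2"))      # False: "2" is not the whole value
print(matches_filter(record, "statuscode:.*2.*"))  # True
print(matches_filter(record, "!statuscode:200"))   # False: the match is negated
print(matches_filter(record, "~urlkey:feature"))   # True: substring match
```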
- get_memento(url, timestamp=None, mode=Mode.original, *, exact=True, exact_redirects=None, target_window=datetime.timedelta(days=1), follow_redirects=True, datetime=None)
Fetch a memento (an archived HTTP response) from the Wayback Machine.
Not all mementos can be successfully fetched (or “played back” in Wayback terms). In this case, get_memento can load the next-closest-in-time memento or it will raise wayback.exceptions.MementoPlaybackError depending on the value of the exact and exact_redirects parameters (see more details below).
- Parameters:
- url
str or CdxRecord URL to retrieve a memento of. This can be any of:
A normal URL (e.g. http://www.noaa.gov/). When using this form, you must also specify timestamp.
A CdxRecord retrieved from wayback.WaybackClient.search().
A URL of the memento in Wayback, e.g. https://web.archive.org/web/20180816111911id_/http://www.noaa.gov/
- timestamp
datetime.datetime or datetime.date or str, optional The time at which to retrieve a memento of url. If url is a wayback.CdxRecord or full memento URL, this parameter can be omitted.
- mode
wayback.Mode or str, default: wayback.Mode.original The playback mode of the memento. This determines whether the content of the returned memento is exactly as originally captured (the default) or modified in some way. See wayback.Mode for a description of possible values.
For more details, see: https://archive-access.sourceforge.net/projects/wayback/administrator_manual.html#Archival_URL_Replay_Mode
- exact
bool, default: True If false and the requested memento either doesn’t exist or can’t be played back, this returns the closest-in-time memento to the requested one, so long as it is within target_window. If there was no memento in the target window or if exact=True, then this will raise wayback.exceptions.MementoPlaybackError.
- exact_redirects
bool, optional If false and the requested memento is a redirect whose target doesn’t exist or can’t be played back, this returns the closest-in-time memento to the intended target, so long as it is within target_window. If unset, this will be the same as exact.
- target_window
int or datetime.timedelta, default: 86400 If the memento is of a redirect, allow up to this amount of time (in seconds if an integer) between the capture of the redirect and the capture of the redirect’s target URL. This window also applies to the first memento if exact=False and the originally requested memento was not available. Defaults to 86,400 (24 hours).
- follow_redirects
bool, default: True If true (the default), get_memento will follow historical redirects to return the content that a web browser would have ultimately displayed at the requested URL and time, rather than the memento of an HTTP redirect response (i.e. a 3xx status code). That is, if http://example.com/a redirected to http://example.com/b, then this method returns the memento for /a when follow_redirects=False and the memento for /b when follow_redirects=True.
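The interplay of exact=False and target_window can be pictured as a “nearest capture within a tolerance” rule. The following is a simplified model, not how get_memento is implemented: closest_within_window and the capture times are invented for this example, and LookupError merely stands in for wayback.exceptions.MementoPlaybackError.

```python
from datetime import datetime, timedelta, timezone


def closest_within_window(requested, available, target_window=timedelta(days=1)):
    """Return the timestamp in `available` nearest to `requested`, or raise
    LookupError if none lies within `target_window`."""
    if isinstance(target_window, (int, float)):
        target_window = timedelta(seconds=target_window)  # integers mean seconds
    candidates = [t for t in available if abs(t - requested) <= target_window]
    if not candidates:
        raise LookupError("no capture within the target window")
    return min(candidates, key=lambda t: abs(t - requested))


utc = timezone.utc
captures = [
    datetime(2018, 8, 15, 9, 0, tzinfo=utc),   # 26 hours before the request
    datetime(2018, 8, 16, 20, 0, tzinfo=utc),  # 9 hours after the request
]
wanted = datetime(2018, 8, 16, 11, 0, tzinfo=utc)
chosen = closest_within_window(wanted, captures)  # picks the 20:00 capture
print(chosen)
```

With a narrower window, for example target_window=3600, neither capture qualifies and the helper raises instead of returning something far from the requested time.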
- Returns:
- class wayback.CdxRecord(urlkey: str, timestamp: datetime, original: str, mimetype: str, statuscode: int | None, digest: str, length: int | None)
Represents an entry from Wayback’s “CDX” index of mementos (archived HTTP responses). These entries contain some metadata about the memento. You can also pass a CdxRecord to WaybackClient.get_memento() to retrieve the corresponding memento.
In general, you should not create new instances of CdxRecord yourself, but should get them by calling WaybackClient.search().
Attributes
- timestamp: datetime
The capture time represented as a datetime.datetime, such as datetime.datetime(1996, 12, 31, 23, 58, 47, tzinfo=timezone.utc).
- mimetype: str
MIME type of record, such as 'text/html', 'warc/revisit' or 'unk' (“unknown”) if this information was not captured.
- statuscode: int | None
Status code returned by the server when the record was captured, such as 200. This may be None if the record is a revisit record.
- digest: str
The base 32-encoded SHA-1 hash of the archived HTTP response body. This can be useful for comparing to other CdxRecord instances or avoiding duplicate requests for mementos.
Please keep in mind that this digest is generally computed based on the response body as stored on disk (usually the exact bytes originally received when saving the response), so is not useful for validation or fixity checks against a memento loaded with WaybackClient.get_memento(). For example, if the response body was stored in brotli-compressed form but transferred to you in gzip-compressed form, your bytes will not match this digest.
For revisit records, this is the digest of the originally received HTTP response body as it would have been stored, so you can use it to match a non-revisit record containing the same response body. (But keep in mind this is just the body. It does not include HTTP headers, which may have been different for two records with the same digest.)
- length: int | None
Size (in bytes) of the archived data as stored on disk. Like digest, this usually will not be useful for external users, since it does not reflect the actual archived HTTP response body size. For example, revisit records will generally be small because the archived data on disk is just a pointer to a different record that was saved previously.
- raw_url: str
The URL to the raw captured content, such as 'https://web.archive.org/web/19961231235847id_/http://www.nasa.gov/'.
- view_url: str
The URL to the public view on Wayback Machine. In this view, the links and some subresources in the document are rewritten to point to Wayback URLs. There is also a navigation panel around the content. Example URL: 'https://web.archive.org/web/19961231235847/http://www.nasa.gov/'.
- key: str
Deprecated since version 0.5.0: This attribute was renamed to urlkey. This name will be removed in a future release.
- url: str
Deprecated since version 0.5.0: This attribute was renamed to original. This name will be removed in a future release.
- mime_type: str
Deprecated since version 0.5.0: This attribute was renamed to mimetype. This name will be removed in a future release.
- status_code: str
Deprecated since version 0.5.0: This attribute was renamed to statuscode. This name will be removed in a future release.
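The raw_url and view_url examples above follow a visible pattern: a fixed prefix, the capture timestamp as 14 digits, an optional mode marker, and the original URL. The sketch below reproduces that pattern for illustration; build_memento_url is a hypothetical helper, and in practice you would read these attributes from a CdxRecord rather than build them yourself.

```python
from datetime import datetime, timezone

ARCHIVE_PREFIX = "https://web.archive.org/web/"


def build_memento_url(original, timestamp, mode=""):
    """Compose a Wayback-style URL. `mode` is "id_" for the original content
    and an empty string for the browser-oriented view."""
    return f"{ARCHIVE_PREFIX}{timestamp:%Y%m%d%H%M%S}{mode}/{original}"


captured = datetime(1996, 12, 31, 23, 58, 47, tzinfo=timezone.utc)
raw = build_memento_url("http://www.nasa.gov/", captured, "id_")
view = build_memento_url("http://www.nasa.gov/", captured)
print(raw)
print(view)
```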
- class wayback.Memento(*, url, timestamp, mode, memento_url, status_code, headers, encoding, raw, raw_headers, links, history, debug_history)
Represents a memento (an archived HTTP response). This object is similar to a response object from the popular “Requests” package, although it has some differences designed to differentiate historical information vs. current metadata about the stored memento (for example, the headers attribute lists the headers recorded in the memento, and does not include additional headers that provide metadata about the Wayback Machine).
Note that, like an HTTP response, this object represents a potentially open network connection to the Wayback Machine. Reading the content or text attributes will read all the data being received and close the connection automatically, but if you do not read those properties, you must make sure to call close() to close the connection. Alternatively, you can use a Memento as a context manager. The connection will be closed for you when the context ends:
>>> with a_memento:
...     do_something()
>>> # Connection is automatically closed here.
Fields
- headers: dict
A dict representing the headers of the archived HTTP response. The keys are case-insensitive. If you iterate over it, you will receive the header names as they were originally sent. However, you can look them up via strings that vary in upper/lower-case. For example:
list(memento.headers) == ['Content-Type', 'Date']
memento.headers['Content-Type'] == memento.headers['content-type']
- history: tuple[wayback.Memento]
A list of wayback.Memento objects that were redirects and were followed to produce this memento.
- debug_history: tuple[str]
List of all redirect URLs followed in order to produce this memento. These are “memento URLs” – that is, they are absolute URLs to the Wayback Machine like https://web.archive.org/web/20180816111911id_/http://www.noaa.gov/, rather than URLs of captured redirects, like http://www.noaa.gov. Many of the URLs in this list do not represent actual mementos.
- timestamp: datetime.datetime
The time the memento was originally captured. This includes tzinfo, and will always be in UTC.
- memento_url: str
The URL at which the memento was fetched from the Wayback Machine, e.g. https://web.archive.org/web/20180816111911id_/http://www.noaa.gov/.
- links: dict of (str, dict of (str, str))
Related links to this Memento (e.g. the previous and/or next Memento in time). The keys are the relationship (e.g. 'prev memento') as a string and the values are dicts where the keys and values are strings.
In each entry, the 'url' key is the URL of the related link, the 'rel' key is the relationship (the same as the key in the top-level dict), and the rest of the keys will be any other attributes that are relevant for that link (e.g. 'datetime' or 'type').
For example:
{
    'original': {
        'url': 'https://www.fws.gov/birds/',
        'rel': 'original'
    },
    'first memento': {
        'url': 'https://web.archive.org/web/20050323155300/http://www.fws.gov:80/birds',
        'rel': 'first memento',
        'datetime': 'Wed, 23 Mar 2005 15:53:00 GMT'
    },
    'prev memento': {
        'url': 'https://web.archive.org/web/20210125125216/https://www.fws.gov/birds/',
        'rel': 'prev memento',
        'datetime': 'Mon, 25 Jan 2021 12:52:16 GMT'
    },
    'next memento': {
        'url': 'https://web.archive.org/web/20210321180831/https://www.fws.gov/birds',
        'rel': 'next memento',
        'datetime': 'Sun, 21 Mar 2021 18:08:31 GMT'
    },
    'last memento': {
        'url': 'https://web.archive.org/web/20221006031005/https://fws.gov/birds',
        'rel': 'last memento',
        'datetime': 'Thu, 06 Oct 2022 03:10:05 GMT'
    }
}
Links to other mementos use the same mode as the memento object this links attribute belongs to. For example:
raw_memento = client.get_memento('https://fws.gov/birds', '20210318004901')
raw_memento.links['next memento']['url'] == 'https://web.archive.org/web/20210321180831id_/https://fws.gov/birds'
# The "id_" after the timestamp means "original" mode ---------------------------------^^^
view_memento = client.get_memento('https://fws.gov/birds', '20210318004901', mode=Mode.view)
view_memento.links['next memento']['url'] == 'https://web.archive.org/web/20210321180831/https://fws.gov/birds'
# Nothing after the timestamp for "view" mode -----------------------------------------^
- close()
Close the HTTP response for this Memento. This happens automatically if you read content or text, and if you use the memento as a context manager. This method is always safe to call – it does nothing if the response has already been closed.
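The case-insensitive behaviour described for the headers field can be pictured with a small mapping that looks names up in lowercase but remembers how they were originally written. This is an illustrative sketch only: CaseInsensitiveHeaders is a made-up class and the header values are invented, and it is not the type that Memento.headers actually uses.

```python
class CaseInsensitiveHeaders:
    """Tiny illustration of case-insensitive lookup that preserves the
    original header names when iterating."""

    def __init__(self, pairs):
        # Map lower-cased name -> (original name, value).
        self._store = {name.lower(): (name, value) for name, value in pairs}

    def __getitem__(self, name):
        return self._store[name.lower()][1]

    def __iter__(self):
        return (original for original, _ in self._store.values())


headers = CaseInsensitiveHeaders([
    ("Content-Type", "text/html"),
    ("Date", "Tue, 31 Dec 1996 23:58:47 GMT"),
])
print(list(headers))
print(headers["content-type"] == headers["Content-Type"])
```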
- class wayback.WaybackSession(retries=6, backoff=2, timeout=60, user_agent=None, search_calls_per_second=<wayback._utils.RateLimit object>, memento_calls_per_second=<wayback._utils.RateLimit object>, timemap_calls_per_second=<wayback._utils.RateLimit object>)
A custom session object that pools network connections and resources for requests to the Wayback Machine.
- Parameters:
- retries
int, default: 6 The maximum number of retries for requests.
- backoff
int or float, default: 2 Number of seconds from which to calculate how long to back off and wait when retrying requests. The first retry is always immediate, but subsequent retries increase by powers of 2:
seconds = backoff * 2 ^ (retry number - 1)
So if this was 4, retries would happen after the following delays: 0 seconds, 4 seconds, 8 seconds, 16 seconds, …
- timeout
int or float or tuple of (int or float, int or float), default: 60 A timeout to use for all requests. See the Requests docs for more: https://docs.python-requests.org/en/master/user/advanced/#timeouts
- user_agent
str, optional A custom user-agent string to use in all requests. Defaults to: wayback/{version} (+https://github.com/edgi-govdata-archiving/wayback)
- search_calls_per_second
wayback.RateLimit or int or float, default: 0.8 The maximum number of calls per second made to the CDX search API. To disable the rate limit, set this to 0.
To have multiple sessions share a rate limit (so requests made by one session count towards the limit of the other session), use a single wayback.RateLimit instance and pass it to each WaybackSession instance. If you do not set a limit, the default limit is shared globally across all sessions.
- memento_calls_per_second
wayback.RateLimit or int or float, default: 8 The maximum number of calls per second made to the memento API. To disable the rate limit, set this to 0.
To have multiple sessions share a rate limit (so requests made by one session count towards the limit of the other session), use a single wayback.RateLimit instance and pass it to each WaybackSession instance. If you do not set a limit, the default limit is shared globally across all sessions.
- timemap_calls_per_second
wayback.RateLimit or int or float, default: 1.33 The maximum number of calls per second made to the timemap API (the Wayback Machine’s new, beta CDX search is part of the timemap API). To disable the rate limit, set this to 0.
To have multiple sessions share a rate limit (so requests made by one session count towards the limit of the other session), use a single wayback.RateLimit instance and pass it to each WaybackSession instance. If you do not set a limit, the default limit is shared globally across all sessions.
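To make the backoff schedule concrete, the sketch below reproduces the example sequence given for backoff (0, 4, 8, 16 seconds when backoff is 4). It assumes the retry counter starts at zero so that the first delayed retry waits exactly backoff seconds; that indexing is an assumption chosen to match the documented sequence, and retry_delays is a hypothetical helper rather than part of wayback.

```python
def retry_delays(backoff, retries):
    """Delays in seconds before each retry: the first retry is immediate,
    later ones double each time."""
    delays = []
    for attempt in range(retries):  # attempt 0 is the first retry
        delays.append(0 if attempt == 0 else backoff * 2 ** (attempt - 1))
    return delays


example = retry_delays(4, 4)
default_total = sum(retry_delays(2, 6))  # defaults: backoff=2, retries=6
print(example, default_total)
```

Under the same assumption, the default settings would spend about a minute waiting in total before giving up, not counting the time the requests themselves take.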
Utilities
- wayback.memento_url_data(memento_url)
Get the original URL, time, and mode that a memento URL represents a capture of.
- Returns:
- url
str The URL that the memento is a capture of.
- time
datetime.datetime The time the memento was captured in the UTC timezone.
- mode
str The playback mode.
Examples
Extract original URL, time and mode.
>>> url = ('https://web.archive.org/web/20170813195036id_/'
...        'https://arpa-e.energy.gov/?q=engage/events-workshops')
>>> memento_url_data(url)
('https://arpa-e.energy.gov/?q=engage/events-workshops',
 datetime.datetime(2017, 8, 13, 19, 50, 36, tzinfo=timezone.utc),
 'id_')
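If you are curious how such a URL breaks down, the following sketch pulls the same three pieces apart with a regular expression. It is not the implementation of memento_url_data: parse_memento_url is a hypothetical function that only handles web.archive.org URLs with 14-digit timestamps and an optional two-letter mode marker.

```python
import re
from datetime import datetime, timezone

MEMENTO_URL = re.compile(r"^https?://web\.archive\.org/web/(\d{14})([a-z]{2}_)?/(.+)$")


def parse_memento_url(memento_url):
    """Split a memento URL into (original URL, capture time, mode)."""
    match = MEMENTO_URL.match(memento_url)
    if match is None:
        raise ValueError(f"not a memento URL: {memento_url!r}")
    stamp, mode, original = match.groups()
    time = datetime.strptime(stamp, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    return original, time, mode or ""


parsed = parse_memento_url(
    "https://web.archive.org/web/20170813195036id_/"
    "https://arpa-e.energy.gov/?q=engage/events-workshops"
)
print(parsed)
```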
- class wayback.Mode(*values)
An enum describing the playback mode of a memento. When requesting a memento (e.g. with wayback.WaybackClient.get_memento()), you can use these values to determine how the response body should be formatted.
For more details, see: https://archive-access.sourceforge.net/projects/wayback/administrator_manual.html#Archival_URL_Replay_Mode
Examples
>>> waybackClient.get_memento('https://noaa.gov/',
...                           timestamp=datetime.datetime(2018, 1, 2),
...                           mode=wayback.Mode.view)
Values
- original
Returns the HTTP response body as originally captured.
- view
Formats the response body so it can be viewed with a web browser. URLs for links and subresources like scripts, stylesheets, images, etc. will be modified to point to the equivalent memento in the Wayback Machine so that the resulting page looks as similar as possible to how it would have appeared when originally captured. It’s mainly meant for use with HTML pages. This is the playback mode you typically use when browsing the Wayback Machine with a web browser.
- javascript
Formats the response body by updating URLs, similar to Mode.view, but designed for JavaScript instead of HTML.
- css
Formats the response body by updating URLs, similar to Mode.view, but designed for CSS instead of HTML.
- image
Formats the response body similar to Mode.view, but designed for image files instead of HTML.
- class wayback.RateLimit(per_second: int | float)
RateLimit is a simple locking mechanism that can be used to enforce rate limits and is safe to use across multiple threads. It can also be used as a context manager.
Calling rate_limit_instance.wait() blocks until a minimum time has passed since the last call. Using with rate_limit_instance: blocks entries to the context until a minimum time since the last context entry.
- Parameters:
Examples
Slow down a tight loop to only occur twice per second:
>>> limit = RateLimit(per_second=2)
>>> for x in range(10):
...     with limit:
...         print(x)
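For readers who want to see the idea behind such a limiter, here is a minimal, thread-safe sketch built from the standard library. It is not wayback’s implementation: SimpleRateLimit is a hypothetical class, and it does not handle a per_second value of 0 (which the session parameters above treat as “no limit”).

```python
import threading
import time


class SimpleRateLimit:
    """Hypothetical, simplified limiter: at most `per_second` entries per second."""

    def __init__(self, per_second):
        self._interval = 1.0 / per_second
        self._lock = threading.Lock()
        self._last = None

    def wait(self):
        with self._lock:  # only one thread at a time may check and sleep
            now = time.monotonic()
            if self._last is not None:
                remaining = self._interval - (now - self._last)
                if remaining > 0:
                    time.sleep(remaining)
            self._last = time.monotonic()

    def __enter__(self):
        self.wait()
        return self

    def __exit__(self, *exc_info):
        return False  # never swallow exceptions


limit = SimpleRateLimit(per_second=20)
start = time.monotonic()
for x in range(3):
    with limit:
        pass
elapsed = time.monotonic() - start
print(f"{elapsed:.2f} seconds for 3 entries")
```

Holding the lock while sleeping is what makes concurrent callers queue up behind each other instead of all proceeding at once.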
Exception Classes
- class wayback.exceptions.WaybackException
Base exception class for all Wayback-specific errors.
- class wayback.exceptions.UnexpectedResponseFormat
Raised when data returned by the Wayback Machine is formatted in an unexpected or unparseable way.
- class wayback.exceptions.BlockedByRobotsError
Raised when a URL can’t be queried in Wayback because it was blocked by a site’s robots.txt file.
- class wayback.exceptions.BlockedSiteError
Raised when a URL has been blocked from access or querying in Wayback. This is often because of a takedown request. (URLs that are blocked because of robots.txt get a BlockedByRobotsError instead.)
- class wayback.exceptions.MementoPlaybackError
Raised when a Memento can’t be ‘played back’ (loaded) by the Wayback Machine for some reason. This is a server-side issue, not a problem in parsing data from Wayback.
- class wayback.exceptions.NoMementoError
Raised when there was no memento available for a given URL. This might mean the given URL has no mementos at all or that none are available for playback.
This also means you should not try to request a memento of the same URL in a different timeframe. If there may be other mementos of the URL available, you’ll get a different error.
- class wayback.exceptions.RateLimitError(response, retry_after)
Raised when the Wayback Machine responds with a 429 (too many requests) status code. In general, this package’s built-in limits should help you avoid ever hitting this, but if you are running multiple processes in parallel, you could go overboard.
- Attributes:
- retry_after
int, optional Recommended number of seconds to wait before retrying. If the Wayback Machine does not include it in the HTTP response, it will be set to None.