Quality Assurance
Overview
Browsertrix Crawler can analyze an existing crawl by comparing what the browser encountered on a website during crawling with what it encounters when the crawl WACZ is replayed in ReplayWeb.page. The WACZ produced by this analysis run includes additional comparison data (stored as WARC resource records) for each page found during crawling. The comparison works along several dimensions, including screenshot, extracted text, and page resource comparisons.
Note
QA features described on this page are available in Browsertrix Crawler releases 1.1.0 and later.
Getting started
To run QA on a crawl, you must first have an existing crawl, for example:
docker run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl --url https://webrecorder.net/ --collection example-crawl --text to-warc --screenshot view --generateWACZ
Note that this crawl must be run with the --generateWACZ flag, as QA requires a WACZ to work with, and ideally with the --text to-warc and --screenshot view flags as well (see below for more details on comparison dimensions).
To analyze this crawl, call Browsertrix Crawler with the qa entrypoint, passing the original crawl WACZ as the qaSource:
docker run -v $PWD/crawls/:/crawls/ -it webrecorder/browsertrix-crawler qa --qaSource /crawls/collections/example-crawl/example-crawl.wacz --collection example-qa --generateWACZ
The qaSource can be:
- A local WACZ file path or a URL
- A single WACZ, or a JSON file that lists multiple WACZ files under its resources key (Multi-WACZ)
The qa command above assumes an existing crawl that was created in the example-crawl collection. A new WACZ for the analysis run is created in the resulting example-qa collection.
By default, the analysis crawl visits all of the pages read from the source WACZ file(s). The set of pages can be narrowed by adding --include and --exclude regexes, and the --limit flag caps how many pages are tested.
The analysis crawl skips any non-HTML pages, such as PDFs, which can be relied upon to be bit-for-bit identical as long as the resource was fully fetched.
Comparison Dimensions
Screenshot Match
One way to compare crawl and replay is to compare the screenshot of a page taken during crawling with the screenshot taken during replay. The initial viewport screenshots of each page from the crawl and replay are compared on the basis of pixel value similarity. This results in a score between 0 and 1.0 for each page, where 1.0 means a 100% match between the crawl and replay screenshots. The screenshots are stored in urn:view:<url> WARC resource records.
To enable comparison on this dimension, the original crawl must be run with at least the --screenshot view option. (Additional screenshot options can be added as well.)
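As a rough illustration of what a pixel-based score in the 0 to 1.0 range can look like, the following Python sketch counts the fraction of pixels that match between two screenshots of the same size. This is not the crawler's actual comparison code: the function name, the flat list-of-tuples image representation, and the tolerance parameter are all assumptions made for this example.

```python
def pixel_match_score(crawl_pixels, replay_pixels, tolerance=0):
    """Return the fraction of matching pixels between two same-sized images.

    Each image is a flat list of (r, g, b) tuples. `tolerance` is a
    hypothetical per-channel threshold, not a Browsertrix Crawler option.
    """
    if len(crawl_pixels) != len(replay_pixels):
        raise ValueError("screenshots must have the same dimensions")
    if not crawl_pixels:
        # Two empty images are treated as identical.
        return 1.0
    matching = sum(
        1
        for crawl_px, replay_px in zip(crawl_pixels, replay_pixels)
        if all(abs(c - r) <= tolerance for c, r in zip(crawl_px, replay_px))
    )
    return matching / len(crawl_pixels)


WHITE = (255, 255, 255)
BLACK = (0, 0, 0)

# Three of the four pixels are identical, so the score is 0.75.
print(pixel_match_score([WHITE] * 4, [WHITE] * 3 + [BLACK]))
```

A score computed this way drops as more of the viewport differs, which is why a page with a missing image or unstyled content on replay would be expected to score noticeably below 1.0.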
Text Match
Another way to compare the crawl and replay results is to use the text extracted from the HTML. The extracted text from crawl and replay is compared on the basis of Levenshtein distance. This results in a score between 0 and 1.0 for each page, where 1.0 means a 100% match between the crawl and replay text. The extracted text is stored in urn:text:<url> WARC resource records.
To enable comparison on this dimension, the original crawl must be run with at least the --text to-warc option. (Additional text options can be added as well.)
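To make the Levenshtein-based score more concrete, here is a minimal Python sketch. It assumes the edit distance is normalized by the length of the longer text to produce a value between 0 and 1.0; that normalization, along with the function names, is an assumption for illustration and may differ from what the crawler does internally.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, keeping only two rows."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string as the row to save memory
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution
                )
            )
        previous = current
    return previous[-1]


def text_match_score(crawl_text: str, replay_text: str) -> float:
    """Assumed normalization: 1.0 for identical text, 0.0 for nothing shared."""
    longest = max(len(crawl_text), len(replay_text))
    if longest == 0:
        return 1.0
    return (longest - levenshtein(crawl_text, replay_text)) / longest


# One substitution in a four-character string gives a score of 0.75.
print(text_match_score("abcd", "abcf"))
```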
Resources and Page Info
The pageinfo records produced by the crawl and analysis runs include a JSON document containing information about the resources loaded on each page, such as CSS stylesheets, JavaScript scripts, fonts, images, and videos. The URL, status code, MIME type, and resource type of each resource are saved in the pageinfo record for each page.
Since pageinfo records are produced for all crawls, this data is always available.
Comparison Data
Comparison data is also added to the QA crawl's pageinfo records. The comparison data may look as follows:
"comparison": {
  "screenshotMatch": 0.95,
  "textMatch": 0.9,
  "resourceCounts": {
    "crawlGood": 10,
    "crawlBad": 0,
    "replayGood": 9,
    "replayBad": 1
  }
}
This data indicates that:
- When comparing urn:view:<url> records for crawl and replay, the screenshots are 95% similar.
- When comparing urn:text:<url> records from crawl and replay WACZs, the text is 90% similar.
- When comparing urn:pageinfo:<url> resource entries from crawl and replay, the crawl record had 10 good responses (2xx/3xx status code) and 0 bad responses (4xx/5xx status code), while replay had 9 good and 1 bad.
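The reading of the comparison data above can be sketched in Python. The example below tallies status codes into good (2xx/3xx) and bad (4xx/5xx) counts as described, and then flags a page for manual review. The helper names, the 0.9 threshold, and the rule of flagging a page when replay has more bad responses than the crawl are hypothetical choices for this sketch, not behavior defined by Browsertrix Crawler.

```python
def count_resources(status_codes):
    """Tally responses: 2xx/3xx count as good, 4xx/5xx count as bad."""
    good = sum(1 for status in status_codes if 200 <= status < 400)
    bad = sum(1 for status in status_codes if 400 <= status < 600)
    return good, bad


def needs_review(comparison, min_score=0.9):
    """Flag a page whose replay looks worse than the crawl.

    `min_score` is an arbitrary threshold chosen for this example.
    """
    counts = comparison["resourceCounts"]
    return (
        comparison["screenshotMatch"] < min_score
        or comparison["textMatch"] < min_score
        or counts["replayBad"] > counts["crawlBad"]
    )


# The comparison object shown above, as a Python dict.
comparison = {
    "screenshotMatch": 0.95,
    "textMatch": 0.9,
    "resourceCounts": {
        "crawlGood": 10,
        "crawlBad": 0,
        "replayGood": 9,
        "replayBad": 1,
    },
}

# Replay had one bad response that the crawl did not, so the page is flagged.
print(needs_review(comparison))
```

Comparing the bad counts between crawl and replay, rather than looking at the replay count alone, avoids flagging pages whose resources were already failing on the live site at crawl time.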