Reports
Browsertrix can generate optional reports with each crawl. All reports are in the JSONL format, with one JSON entry per line. The following reports are currently available.
Skipped Pages Report
Written to reports/skippedPages.jsonl and enabled with --reportSkipped, this report is in the same format as the pages/pages.jsonl file, but lists pages that were never loaded, or whose loading was aborted immediately, and from which no content was archived.
Each line in the report contains the following:
- url: Page URL
- ts: The ISO Date of the time the page was encountered
- seedUrl: The seed URL that this page was discovered from
- depth: The depth of the page if it were to be crawled
- seed: true|false if the page is a seed
- reason: Reason for skipping this page
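The fields above can be read with any JSONL parser. The following is a minimal Python sketch, not part of Browsertrix itself: the function name parse_skipped_pages and the sample records are hypothetical, and skipping lines that lack a url field is an assumption made so that any non-page record in the file is ignored.

```python
import json
from typing import Iterable, Iterator


def parse_skipped_pages(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per JSONL line that describes a skipped page."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        entry = json.loads(line)
        # Assumption: a line without a "url" key is not a page entry.
        if "url" in entry:
            yield entry


# Illustrative sample data only, not real crawler output.
SAMPLE = """\
{"url": "https://example.com/private", "ts": "2025-01-01T00:00:00.000Z", "seedUrl": "https://example.com/", "depth": 1, "seed": false, "reason": "robotsTxt"}
{"url": "https://other.example/", "ts": "2025-01-01T00:00:01.000Z", "seedUrl": "https://example.com/", "depth": 1, "seed": false, "reason": "outOfScope"}
"""

entries = list(parse_skipped_pages(SAMPLE.splitlines()))
for entry in entries:
    print(entry["url"], entry["reason"])
```

To read a real report, an open file handle for reports/skippedPages.jsonl inside the crawl output directory can be passed in place of SAMPLE.splitlines().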
Skip Reasons
The reason may be one of the following:
- outOfScope: Page URL out of scope according to scoping rules
- pageLimit: The --pageLimit limit was reached before the page could be crawled
- robotsTxt: The page URL has been excluded via robots.txt rules
- redirectToExcluded: A special case of outOfScope where the page URL itself is in scope but loading it resulted in an HTTP redirect to a page that was not in scope, so page loading was aborted
- duplicate: The page content is a duplicate and loading was aborted (see Page Deduplication for more information)
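One way to use the reason field is to tally how many pages were skipped for each reason. The sketch below is a hedged illustration in Python: count_skip_reasons and the sample lines are hypothetical, and KNOWN_REASONS simply mirrors the list above so that any value outside it can be flagged.

```python
import json
from collections import Counter

# Reasons listed in this section; other values are reported as unknown.
KNOWN_REASONS = {
    "outOfScope",
    "pageLimit",
    "robotsTxt",
    "redirectToExcluded",
    "duplicate",
}


def count_skip_reasons(lines):
    """Count skipped-page entries per reason from JSONL lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        reason = json.loads(line).get("reason")
        if reason is not None:  # ignore lines that carry no reason
            counts[reason] += 1
    return counts


# Illustrative sample data only, not real crawler output.
sample_lines = [
    '{"url": "https://example.com/a", "reason": "outOfScope"}',
    '{"url": "https://example.com/b", "reason": "pageLimit"}',
    '{"url": "https://example.com/c", "reason": "outOfScope"}',
]

counts = count_skip_reasons(sample_lines)
unknown = set(counts) - KNOWN_REASONS

for reason, total in counts.most_common():
    print(f"{reason}: {total}")
if unknown:
    print("Unrecognized reasons:", sorted(unknown))
```

A large pageLimit count suggests the crawl stopped before the page queue was exhausted, while a large outOfScope count may point at scoping rules that are narrower than intended.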