# Outputs
This page covers the outputs created by Browsertrix Crawler for both crawls and browser profiles.
## Crawl Outputs
Browsertrix Crawler crawl outputs are organized into collections, which can be found in the `/crawls/collections` directory. Each crawl creates a new collection by default, which can be named with the `-c` or `--collection` argument. If a collection name is not provided, Browsertrix Crawler will generate a unique collection name consisting of the `crawl-` prefix followed by a timestamp of when the collection was created. Collections can be overwritten by specifying an existing collection name.
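As an illustration of the default naming scheme, a `crawl-` prefixed name with a creation timestamp might look like the following Python sketch (the exact timestamp format the crawler uses is an assumption here):

```python
from datetime import datetime, timezone

# Illustrative only: approximates a default "crawl-<timestamp>" collection
# name; the crawler's actual timestamp format may differ.
def default_collection_name(now=None):
    """Build a crawl-<timestamp> style name from a UTC datetime."""
    now = now or datetime.now(timezone.utc)
    return "crawl-" + now.strftime("%Y%m%d%H%M%S")

print(default_collection_name(datetime(2024, 1, 1, 12, 30, 0)))
# crawl-20240101123000
```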
Each collection is a directory which contains at minimum:

`archive/`
: A directory containing gzipped WARC files containing the web traffic recorded during crawling.

`logs/`
: A directory containing one or more crawler log files in JSON-Lines format.

`pages/`
: A directory containing one or more "Page" files in JSON-Lines format. At minimum, this directory will contain a `pages.jsonl` file with information about the seed URLs provided to the crawler. If additional pages were discovered and in scope during crawling, information about those non-seed pages is written to `extraPages.jsonl`. For more information about the contents of Page files, see the WACZ specification.

`warc-cdx/`
: A directory containing one or more CDXJ index files created while recording traffic to WARC files. These index files are merged into the final crawl index when the `--generateCDX` or `--generateWACZ` arguments are provided.
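Because Page files are JSON-Lines, they can be read line by line with standard tooling. A minimal Python sketch, assuming records with a `url` field as described in the WACZ specification (the sample data below is invented for illustration):

```python
import json

# Sample pages.jsonl content, invented for illustration; a real file is
# written by the crawler, one JSON object per line. Per the WACZ spec,
# the first line may be a header record describing the file format.
sample = """\
{"format": "json-pages-1.0", "id": "pages", "title": "All Pages"}
{"id": "0001", "url": "https://example.com/", "title": "Example Domain", "ts": "2024-01-01T00:00:00Z"}
"""

def read_pages(text):
    """Parse JSON-Lines text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

records = read_pages(sample)
# Page records (as opposed to the header record) carry a "url" field.
page_urls = [r["url"] for r in records if "url" in r]
print(page_urls)  # ['https://example.com/']
```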
Additionally, the collection may include:

- A WACZ file named after the collection, if the `--generateWACZ` argument is provided.
- An `indexes/` directory containing merged CDXJ index files for the crawl, if the `--generateCDX` or `--generateWACZ` arguments are provided. If the combined size of the CDXJ files in the `warc-cdx/` directory is over 50 KB, the resulting final CDXJ file will be gzipped.
- A single combined gzipped WARC file for the crawl, if the `--combineWARC` argument is provided.
- A `crawls/` directory including YAML files describing the crawl state, if the `--saveState` argument is provided with a value of "always", or if the crawl is interrupted and `--saveState` is not set to "never". These files can be used to restart a crawl from its saved state.
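A CDXJ index line pairs a sortable URL key and a timestamp with a JSON block of capture metadata. A rough Python sketch of splitting such a line (the sample line and its metadata fields are illustrative, not copied from a real crawl):

```python
import json

# Illustrative CDXJ line: a SURT-style URL key, a 14-digit timestamp, then
# a JSON object with capture metadata (the fields shown are examples).
line = 'com,example)/ 20240101000000 {"url": "https://example.com/", "status": "200", "mime": "text/html"}'

def parse_cdxj(line):
    """Split a CDXJ line into (url key, timestamp, metadata dict)."""
    key, timestamp, json_block = line.split(" ", 2)
    return key, timestamp, json.loads(json_block)

key, ts, meta = parse_cdxj(line)
print(key, ts, meta["status"])  # com,example)/ 20240101000000 200
```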
## Profile Outputs
Browser profiles that are saved by Browsertrix Crawler are written into the `crawls/profiles` directory.