# Commonly-Used Options
## Waiting for Page Load
One of the key nuances of browser-based crawling is determining when a page is finished loading. This can be configured with the `--waitUntil` flag.

The default is `load,networkidle2`, which waits until the page load event fires and no more than two network requests remain outstanding. For static sites, `--waitUntil domcontentloaded` may be used to speed up the crawl (to avoid waiting for ads to load, for example). `--waitUntil networkidle0` may make sense for sites where absolutely all requests must complete before proceeding.
See the Puppeteer documentation for `page.goto` for more info on the `waitUntil` values that can be used with this flag.
The `--pageLoadTimeout`/`--timeout` option sets the timeout in seconds for page load, defaulting to 90 seconds. Behaviors will run on the page once either the page load condition or the page load timeout is met, whichever happens first.
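A minimal sketch combining these flags for a static site (the URL and timeout value are illustrative):

```sh
# Wait only for DOMContentLoaded and cap page load time at 60 seconds
docker run -v $PWD/crawls:/crawls/ webrecorder/browsertrix-crawler crawl \
  --url https://example.com/ \
  --waitUntil domcontentloaded \
  --pageLoadTimeout 60
```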
## Additional Wait
Occasionally, a page may seem to have loaded but still performs dynamic initialization or additional loading. This can be hard to detect, so the `--postLoadDelay` flag can be used to specify additional seconds to wait after the page appears to have loaded, before moving on to post-processing actions such as link extraction, screenshotting, and text extraction (see below).

(On the other hand, `--pageExtraDelay`/`--delay` adds an extra delay after all post-load actions have taken place, and can be useful for rate-limiting.)
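A minimal sketch using both delays (the values are illustrative):

```sh
# Wait 10 extra seconds after load for late-initializing scripts,
# and pause 5 seconds after each page's post-load actions as a simple rate limit
docker run -v $PWD/crawls:/crawls/ webrecorder/browsertrix-crawler crawl \
  --url https://example.com/ \
  --postLoadDelay 10 \
  --pageExtraDelay 5
```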
## Link Extraction
By default, the crawler will extract the `href` property from all `<a>` tags that have an `href`.

This can be customized with the `--selectLinks` option, which accepts alternative selectors of the form `[css selector]->[property to use]` or `[css selector]->@[attribute to use]`. The default value is `a[href]->href`.

For example, to keep the default but also include all `div`s that have class `mylink`, using their `custom-href` attribute as the link, use `--selectLinks 'a[href]->href' --selectLinks 'div.mylink->@custom-href'`.

Any number of selectors can be specified in this way, and each will be applied in sequence on each page.
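As a full command, the example above might look like this (the URL is a placeholder; note the single quotes keep the shell from interpreting `->`):

```sh
docker run -v $PWD/crawls:/crawls/ webrecorder/browsertrix-crawler crawl \
  --url https://example.com/ \
  --selectLinks 'a[href]->href' \
  --selectLinks 'div.mylink->@custom-href'
```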
## Ad Blocking
Brave Browser, the browser used by Browsertrix Crawler for crawling, has some ad and tracker blocking features enabled by default. These Shields can be disabled or customized using Browser Profiles.

Browsertrix Crawler also supports blocking ads from being loaded during capture, based on Stephen Black's list of known ad hosts. To enable ad blocking based on this list, use the `--blockAds` option. If `--adBlockMessage` is set, a record with the specified error message will be added in the ad's place.
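A minimal sketch enabling ad blocking with a custom replacement message (the message text is illustrative):

```sh
docker run -v $PWD/crawls:/crawls/ webrecorder/browsertrix-crawler crawl \
  --url https://example.com/ \
  --blockAds \
  --adBlockMessage 'Ad resource blocked during crawl'
```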
## Sitemap Parsing
The `--sitemap` option can be used to have the crawler parse a sitemap and queue any found URLs, while respecting the crawl's scoping rules and limits. Browsertrix Crawler is able to parse regular sitemaps as well as sitemap indices that point to nested sitemaps.

By default, `--sitemap` will look for a sitemap at `<your-seed>/sitemap.xml`. If a website's sitemap is hosted at a different URL, pass that URL with the flag, e.g. `--sitemap <sitemap url>`.

The `--sitemapFrom`/`--sitemapFromDate` and `--sitemapTo`/`--sitemapToDate` options allow extracting only pages within a specific date range. If set, these options filter URLs from sitemaps to those greater than or equal to (>=) or less than or equal to (<=) the provided ISO date string (`YYYY-MM-DD`, `YYYY-MM-DDTHH:MM:SS`, or a partial date), respectively.
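A minimal sketch restricting sitemap URLs to a date range (the URL and dates are illustrative):

```sh
# Parse the seed's sitemap and only queue pages dated within 2023
docker run -v $PWD/crawls:/crawls/ webrecorder/browsertrix-crawler crawl \
  --url https://example.com/ \
  --sitemap \
  --sitemapFromDate 2023-01-01 \
  --sitemapToDate 2023-12-31
```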
## Custom Warcinfo Fields
Custom fields can be added to the `warcinfo` WARC record generated for each combined WARC. The fields can be specified in the YAML config under the `warcinfo` section, or individually via the command line.

For example, the following are equivalent ways to add additional warcinfo fields (the field names `operator` and `description` here are arbitrary examples):
Via YAML config:
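```yaml
# Sketch: arbitrary example fields added under the warcinfo section
warcinfo:
  operator: Archiving Team
  description: Example crawl with custom warcinfo fields
```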
Via command line:
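```sh
# Sketch: the same example fields passed as command-line options
--warcinfo.operator "Archiving Team" --warcinfo.description "Example crawl with custom warcinfo fields"
```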
## Screenshots
Browsertrix Crawler includes the ability to take screenshots of each page crawled via the `--screenshot` option.

Three screenshot options are available:

- `--screenshot view`: Takes a PNG screenshot of the initially visible viewport (1920x1080)
- `--screenshot fullPage`: Takes a PNG screenshot of the full page
- `--screenshot thumbnail`: Takes a JPEG thumbnail of the initially visible viewport (1920x1080)

These can be combined into a comma-separated list passed via the `--screenshot` option, e.g. `--screenshot thumbnail,view,fullPage`, or passed separately, e.g. `--screenshot thumbnail --screenshot view --screenshot fullPage`.
Screenshots are written into a `screenshots.warc.gz` WARC file in the `archives/` directory. If the `--generateWACZ` command-line option is used, the screenshots WARC is written into the `archive` directory of the WACZ file and indexed alongside the other WARCs.
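A minimal sketch combining screenshot capture with WACZ generation (the URL is illustrative):

```sh
docker run -v $PWD/crawls:/crawls/ webrecorder/browsertrix-crawler crawl \
  --url https://example.com/ \
  --screenshot view,thumbnail \
  --generateWACZ
```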
## Screencasting
Browsertrix Crawler includes a screencasting option which allows watching the crawl in real time via screencast (connected via a websocket).

To enable it, add the `--screencastPort` command-line option and also map that port on the Docker container. An example command might be:

```sh
docker run -p 9037:9037 -v $PWD/crawls:/crawls/ webrecorder/browsertrix-crawler crawl --url https://www.example.com --screencastPort 9037
```

Then, open http://localhost:9037/ and watch the crawl!
## Text Extraction
Browsertrix Crawler supports text extraction via the `--text` flag, which accepts one or more of the following extraction options:

- `--text to-pages`: Extract initial page text and add it to the `text` field in `pages.jsonl`
- `--text to-warc`: Extract initial page text and add it to a `urn:text:<url>` WARC resource record
- `--text final-to-warc`: Extract the final page text after all behaviors have run and add it to a `urn:textFinal:<url>` WARC resource record

The options can be passed separately or combined into a comma-separated list; e.g. `--text to-warc,final-to-warc` and `--text to-warc --text final-to-warc` are equivalent. For backwards compatibility, `--text` alone is equivalent to `--text to-pages`.
## Uploading Crawl Outputs to S3-Compatible Storage
Browsertrix Crawler includes support for uploading WACZ files to S3-compatible storage, and notifying a webhook when the upload succeeds.

S3 upload is only supported when WACZ output is enabled and will not work for WARC-only output.

This feature can currently be enabled by setting environment variables (for security reasons, these settings are not passed in as part of the command line or YAML config at this time). A sketch of a full invocation follows the list below.

Environment variables for S3 uploads include:
- `STORE_ACCESS_KEY` / `STORE_SECRET_KEY`: S3 credentials
- `STORE_ENDPOINT_URL`: S3 endpoint URL
- `STORE_PATH`: optional path appended to the endpoint, if provided
- `STORE_FILENAME`: filename or template for the filename to put on S3
- `STORE_USER`: optional username to pass back as part of the webhook callback
- `STORE_REGION`: optional region to pass to the S3 endpoint; defaults to `us-east-1` if unspecified
- `CRAWL_ID`: unique crawl id (defaults to container hostname)
- `WEBHOOK_URL`: the URL of the webhook (can be http://, https://, or redis://)
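A minimal sketch of enabling S3 upload via environment variables; the endpoint, credentials, filename, and webhook URL shown are placeholders:

```sh
docker run -v $PWD/crawls:/crawls/ \
  -e STORE_ACCESS_KEY=EXAMPLEACCESSKEY \
  -e STORE_SECRET_KEY=EXAMPLESECRETKEY \
  -e STORE_ENDPOINT_URL=https://s3.example.com/my-bucket/ \
  -e STORE_FILENAME=example-crawl.wacz \
  -e WEBHOOK_URL=https://hooks.example.com/crawl-done \
  webrecorder/browsertrix-crawler crawl --url https://example.com/ --generateWACZ
```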
## Webhook Notification
The webhook URL can be an HTTP URL, which receives a JSON POST request, or a Redis URL, which specifies a Redis list key to which the JSON data is pushed as a string.

The webhook notification JSON includes:

- `id`: crawl id (value of `CRAWL_ID`)
- `userId`: user id (value of `STORE_USER`)
- `filename`: bucket path + filename of the file
- `size`: size of the WACZ file
- `hash`: SHA-256 hash of the WACZ file
- `completed`: boolean indicating whether the crawl fully completed or was interrupted (due to an interrupt signal or other error)
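A hypothetical example of the notification body, assembled from the fields above (all values are placeholders):

```json
{
  "id": "crawl-1234",
  "userId": "archive-user",
  "filename": "my-bucket/example-crawl.wacz",
  "size": 104857600,
  "hash": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
  "completed": true
}
```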
## Saving Crawl State: Interrupting and Restarting the Crawl
A crawl can be gracefully interrupted with Ctrl+C (SIGINT) or a SIGTERM (see below for more details).

When a crawl is interrupted, the current crawl state is written to the `crawls` subdirectory inside the collection directory. The crawl state includes the current YAML config, if any, plus the current state of the crawl.

This crawl state YAML file can then be passed to the `--config` option to restart the crawl from where it previously left off. When restarting a crawl, you will need to include any command-line options you used to start the original crawl (e.g. `--url`), since these are not persisted to the crawl state.
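A sketch of restarting from a saved state (the state file path is a placeholder for the file written to the `crawls` subdirectory):

```sh
docker run -v $PWD/crawls:/crawls/ webrecorder/browsertrix-crawler crawl \
  --url https://example.com/ \
  --config /crawls/collections/my-crawl/crawls/crawl-state.yaml
```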
By default, the crawl interruption waits for current pages to finish. A subsequent SIGINT will cause the crawl to stop immediately. Any unfinished pages are recorded in the `pending` section of the crawl state (if gracefully finished, this section will be empty).
By default, the crawl state is only written when a crawl is interrupted before completing. The `--saveState` CLI option can be set to `always` or `never` to control when the crawl state file should be written.
## Periodic State Saving
When `--saveState` is set to `always`, Browsertrix Crawler will also save the state automatically during the crawl, at the interval set by the `--saveStateInterval` setting. The crawler will keep the last `--saveStateHistory` save states and delete older ones. This provides an extra backup: in the event that the crawl fails unexpectedly or is not terminated via Ctrl+C, several previous crawl states are still available.
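A sketch combining these options (the interval and history values are illustrative):

```sh
# Always save crawl state periodically, keeping the last 5 saved states
docker run -v $PWD/crawls:/crawls/ webrecorder/browsertrix-crawler crawl \
  --url https://example.com/ \
  --saveState always \
  --saveStateInterval 300 \
  --saveStateHistory 5
```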
## Crawl Interruption Options
Browsertrix Crawler has different crawl interruption modes, and does everything it can to ensure the WARC data written is always valid when a crawl is interrupted. The following are three interruption scenarios:
### 1. Graceful Shutdown
Initiated when a single SIGINT (Ctrl+C) or SIGTERM (`docker kill -s SIGINT`, `docker kill -s SIGTERM`, `kill`) signal is received.

The crawler will attempt to finish current pages, finish any pending async requests, write all WARCs, generate WACZ files and finish other post-processing, save state from Redis, and then exit.
### 2. Less-Graceful, Quick Shutdown
If a second SIGINT / SIGTERM is received, the crawler will close the browser immediately, interrupting any ongoing network requests. Any asynchronous fetching will not be finished. However, anything in the WARC queue will be written and WARC files will be flushed. WACZ files and other post-processing will not be generated, but the current state from Redis will still be saved if enabled (see above). WARC records should be fully finished and WARC files should be valid, though they may not contain all the data for the pages being processed during the interruption.
### 3. Violent / Immediate Shutdown
If the crawler is killed, e.g. with a SIGKILL signal (`docker kill`, `kill -9`), the crawler container / process will be immediately shut down. It will not have a chance to finish any WARC files, and there is no guarantee that WARC files will be valid, but the crawler will of course exit right away.
### Recommendations
It is recommended to gracefully stop the crawler by sending a SIGINT or SIGTERM signal, which can be done via Ctrl+C or `docker kill -s SIGINT <containerid>`. Repeating the command will result in a faster, slightly less-graceful shutdown.

Using SIGKILL is not recommended except as a last resort, and only when the data is to be discarded.

Note: When using the crawler in the Browsertrix app or in Kubernetes in general, stopping a crawl / stopping a pod always results in option #1 (a single SIGTERM signal) being sent to the crawler pod(s).