YAML Crawl Config

Browsertrix Crawler supports the use of a YAML file to set parameters for a crawl. To use one, pass a valid YAML file to the --config option.

The YAML file can contain the same parameters as the command-line arguments. If a parameter is set both on the command line and in the YAML file, the command-line value is used. For example, the following command starts a crawl using the config in crawl-config.yaml:

docker run -v $PWD/crawl-config.yaml:/app/crawl-config.yaml -v $PWD/crawls:/crawls/ webrecorder/browsertrix-crawler crawl --config /app/crawl-config.yaml
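To illustrate the precedence rule described above, here is a minimal sketch. It assumes that `workers` is accepted both as a YAML key and as a `--workers` flag, and it uses a hypothetical `RUN_CRAWL` environment variable as a guard so that the snippet can be run without Docker; neither name comes from this page.

```shell
# Write a small config file. `workers` is an illustrative option
# (assumption: it is valid in YAML and as --workers on the command line).
cat > crawl-config.yaml <<'EOF'
seeds:
  - https://example.com/

workers: 2
EOF

# The same option given on the command line takes precedence over the file,
# so this crawl would be expected to use 4 workers rather than 2.
# RUN_CRAWL is a hypothetical guard; set RUN_CRAWL=1 to actually start the crawl.
if [ "${RUN_CRAWL:-0}" = "1" ]; then
  docker run -v "$PWD/crawl-config.yaml:/app/crawl-config.yaml" -v "$PWD/crawls:/crawls/" \
    webrecorder/browsertrix-crawler crawl --config /app/crawl-config.yaml --workers 4
fi
```

Keeping stable settings in the file and passing only one-off overrides on the command line makes it easier to reuse the same config across crawls.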

The config can also be passed via stdin, which can simplify the command. Note that this requires running docker run with the -i flag. To read the config from stdin, pass --config stdin:

cat ./crawl-config.yaml | docker run -i -v $PWD/crawls:/crawls/ webrecorder/browsertrix-crawler crawl --config stdin

An example config file (e.g. crawl-config.yaml) might contain:

seeds:
  - https://example.com/
  - https://www.iana.org/

combineWARC: true

The list of seeds can be loaded via an external file by specifying the filename via the seedFile config or command-line option.

Seed File

The URL seed file should be a plain-text file with one URL per line. An example file is available in the GitHub repository's fixture folder as urlSeedFile.txt.

The seed file must be mounted as a volume in the Docker container. Your Docker command should be formatted similarly to the following:

docker run -v $PWD/seedFile.txt:/app/seedFile.txt -v $PWD/crawls:/crawls/ webrecorder/browsertrix-crawler crawl --seedFile /app/seedFile.txt
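As a rough sketch of the full workflow, the snippet below writes a seed file with one URL per line, runs an optional sanity check on it, and then references the file from a YAML config through the seedFile option. The sanity check and the `RUN_CRAWL` guard variable are illustrative additions and not part of Browsertrix Crawler.

```shell
# Write the seed file: one URL per line, nothing else.
printf '%s\n' \
  'https://example.com/' \
  'https://www.iana.org/' > seedFile.txt

# Optional sanity check: count lines that do not start with http:// or https://.
# grep -c exits non-zero when the count is 0, hence the `|| true`.
bad_lines=$(grep -cvE '^https?://' seedFile.txt || true)
echo "non-URL lines: $bad_lines"

# Reference the seed file from the config instead of the command line.
# The path is the location inside the container, so the file must be mounted there.
cat > crawl-config.yaml <<'EOF'
seedFile: /app/seedFile.txt
combineWARC: true
EOF

# RUN_CRAWL is a hypothetical guard; set RUN_CRAWL=1 to actually start the crawl.
if [ "${RUN_CRAWL:-0}" = "1" ]; then
  docker run -v "$PWD/seedFile.txt:/app/seedFile.txt" \
    -v "$PWD/crawl-config.yaml:/app/crawl-config.yaml" \
    -v "$PWD/crawls:/crawls/" \
    webrecorder/browsertrix-crawler crawl --config /app/crawl-config.yaml
fi
```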

Per-Seed Settings

Certain settings such as scope type, scope includes and excludes, and depth can also be configured per-seed directly in the YAML file, for example:

seeds:
  - url: https://webrecorder.net/
    depth: 1
    scopeType: "prefix"

HTTP Auth

HTTP basic auth credentials are written to the archive

We recommend exercising caution: archive only with dedicated archival accounts, and change the password or delete the account when finished.

Browsertrix Crawler supports HTTP Basic Auth, which can be provided on a per-seed basis as part of the URL, for example: --url https://username:password@example.com/.

Alternatively, credentials can be added to the auth field for each seed:

seeds:
  - url: https://example.com/
    auth: username:password
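One way to avoid saving credentials in a config file on disk is to build the config in memory and pass it via stdin, as described earlier on this page. The following is a sketch only: `CRAWL_USER`, `CRAWL_PASS`, and `RUN_CRAWL` are hypothetical variable names, and the default values are placeholders. The credentials are still written to the archive, so the caution above applies unchanged.

```shell
# Hypothetical environment variables holding the credentials;
# the defaults are placeholders for illustration only.
CRAWL_USER="${CRAWL_USER:-username}"
CRAWL_PASS="${CRAWL_PASS:-password}"

# Build the config in memory. The unquoted heredoc expands the variables.
# Credentials containing YAML-special characters may need quoting (not handled here).
config=$(cat <<EOF
seeds:
  - url: https://example.com/
    auth: ${CRAWL_USER}:${CRAWL_PASS}
EOF
)

# RUN_CRAWL is a hypothetical guard; set RUN_CRAWL=1 to actually start the crawl.
if [ "${RUN_CRAWL:-0}" = "1" ]; then
  printf '%s\n' "$config" | docker run -i -v "$PWD/crawls:/crawls/" \
    webrecorder/browsertrix-crawler crawl --config stdin
fi
```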