YAML Crawl Config
Browsertrix Crawler supports the use of a YAML file to set parameters for a crawl. To use one, pass a valid YAML file to the --config option.
The YAML file can contain the same parameters as the command-line arguments. If a parameter is set both on the command line and in the YAML file, the value from the command line is used. For example, the following command starts a crawl using the config in crawl-config.yaml:
docker run -v $PWD/crawl-config.yaml:/app/crawl-config.yaml -v $PWD/crawls:/crawls/ webrecorder/browsertrix-crawler crawl --config /app/crawl-config.yaml
The config can also be passed via stdin, which can simplify the command. Note that this requires running docker run with the -i flag. To read the config from stdin, pass --config stdin:
cat ./crawl-config.yaml | docker run -i -v $PWD/crawls:/crawls/ webrecorder/browsertrix-crawler crawl --config stdin
An example config file (e.g., crawl-config.yaml) might contain:
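As a minimal sketch, a config along these lines lists two seeds and sets one crawl option. The URLs are placeholders, and combineWARC is assumed here to mirror the --combineWARC command-line flag:

```yaml
# crawl-config.yaml (illustrative sketch; URLs are placeholders)
seeds:
  - https://example.com/
  - https://example.org/

# Assumed to correspond to the --combineWARC command-line flag
combineWARC: true
```

Any option given in the file this way can still be overridden by passing the same option on the command line.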
The list of seeds can be loaded via an external file by specifying the filename via the seedFile config or command-line option.
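In a config file, this might look like the following sketch. The path is a placeholder and assumes the seed file has been mounted into the container at that location:

```yaml
# Load seeds from an external text file instead of listing them inline.
# /app/seedFile.txt is a placeholder path inside the container.
seedFile: /app/seedFile.txt
```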
Seed File
The URL seed file should be a text file with one URL per line. An example file is available in the GitHub repository's fixture folder as urlSeedFile.txt.
The seed file must be passed as a volume to the Docker container. Your Docker command should be formatted similarly to the following:
docker run -v $PWD/seedFile.txt:/app/seedFile.txt -v $PWD/crawls:/crawls/ webrecorder/browsertrix-crawler crawl --seedFile /app/seedFile.txt
Per-Seed Settings
Certain settings such as scope type, scope includes and excludes, and depth can also be configured per-seed directly in the YAML file, for example:
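A hedged sketch of what such a config might look like follows. The key names (url, scopeType, depth, exclude) are assumed to mirror the corresponding command-line options, and the URLs and exclude pattern are placeholders:

```yaml
# Per-seed settings (illustrative sketch; key names assumed to match
# the --scopeType, --depth and --exclude command-line options)
seeds:
  - url: https://example.com/
    scopeType: prefix   # stay under this URL prefix
    depth: 1            # follow links at most one hop from the seed

  - url: https://example.org/blog/
    scopeType: host     # any page on the same host
    exclude: example.org/blog/private/   # placeholder exclusion pattern
```

Settings given on an individual seed apply only to that seed, so the two seeds above would be crawled with different scopes.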