Crawl Scope¶
Configuring Pages Included or Excluded from a Crawl¶
The crawl scope can be configured globally for all seeds, or customized per seed, by specifying the `--scopeType` command-line option or setting the `type` property for each seed.

The `depth` option also limits how many pages will be crawled for that seed, while the `limit` option sets the total number of pages crawled from any seed.

The scope controls which linked pages are included and which pages are excluded from the crawl.
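As a rough sketch, a YAML config might set a scope type globally and override it for a single seed. The seed URLs below are placeholders, the per-seed `scopeType` key is an assumption, and the scope type values are described in the list that follows:

```yaml
scopeType: prefix            # global default applied to all seeds

seeds:
  - url: https://example.com/path/page.html
  - url: https://example.org/
    scopeType: host          # assumed per-seed override for this seed only
```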
To make this configuration as simple as possible, there are several predefined scope types. The available types are:
- `page` — crawl only this page and no additional links.
- `page-spa` — crawl only this page, but load any links that include different hashtags. Useful for single-page apps that may load different content based on hashtag.
- `prefix` — crawl any pages in the same directory, eg. starting from `https://example.com/path/page.html`, crawl anything under `https://example.com/path/` (default)
- `host` — crawl pages that share the same host.
- `domain` — crawl pages that share the same domain and subdomains, eg. given `https://example.com/`, will also crawl `https://anysubdomain.example.com/`
- `any` — crawl any and all pages linked from this page.
- `custom` — crawl based on the `--include` regular expression rules.
The scope settings for multi-page crawls (`page-spa`, `prefix`, `host`, `domain`) also include http/https versions, eg. given a prefix of `http://example.com/path/`, `https://example.com/path/` is also included.
Custom Scope Inclusion Rules¶
Instead of setting a scope type, it is possible to configure a custom scope regular expression (regex) by setting `--include` to one or more regular expressions. If using the YAML config, the `include` field can contain a list of regexes.

Extracted links that match the regular expression will be considered 'in scope' and included.
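For instance, a seed with a list of include regexes might be sketched as follows in the YAML config; the seed URL and patterns are placeholders, and `include` could equally be set globally:

```yaml
seeds:
  - url: https://example.com/
    include:
      - example\.com/articles/.*
      - example\.com/tags/.*
```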
Custom Scope Exclusion Rules¶
In addition to the inclusion rules, Browsertrix Crawler supports a separate list of exclusion regexes that, if matched, override and exclude a URL from the crawl.

The exclusion regexes are often used with a custom scope, but can be used with a predefined `scopeType` as well.
Extra 'Hops' Beyond Current Scope¶
Occasionally, it may be useful to augment the scope by allowing the crawler to follow links N 'hops' beyond the current scope.

This is most useful when crawling with a `host` or `prefix` scope while also wanting to include 'one extra hop' — any link to external pages beyond the current host — without following any of the links on those pages. This is possible with the `extraHops` setting, which defaults to 0, but can be set to a higher value N (usually 1) to go beyond the current scope.

The `--extraHops` setting can be set globally or per seed to expand the current inclusion scope N 'hops' beyond the configured scope. Note that this mechanism only expands the inclusion scope, and any exclusion rules are still applied. If a URL is to be excluded via the exclusion rules, that takes precedence over `--extraHops`.
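As a sketch, enabling one extra hop for a seed might look like this in the YAML config; the seed URL is a placeholder, and `extraHops` could also be set globally or via the command-line option:

```yaml
seeds:
  - url: https://example.com/
    scopeType: host
    extraHops: 1   # include pages one link beyond the host, but don't follow their links
```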
Scope Rule Examples¶
Regular expression exclude rules
A crawl started with this config will start on `https://example.com/startpage.html` and crawl all pages on the `https://example.com/` domain, except pages that match the exclusion rules — URLs that contain the strings `example.com/skip` or `example.com/search` followed by any number of characters, and URLs that contain the string `postfeed`.
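A YAML sketch of such a config might look like the following; the seed URL is taken from the description, while the scope type and the exact regexes are assumptions inferred from it:

```yaml
seeds:
  - url: https://example.com/startpage.html
    scopeType: prefix        # assumed; covers everything under https://example.com/
    exclude:
      - example.com/skip.*
      - example.com/search.*
      - postfeed
```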
`https://example.com/page.html` will be crawled, but `https://example.com/skip/postfeed`, `https://example.com/skip/this-page.html`, and `https://example.com/search?q=searchstring` will not.
Regular expression include and exclude rules
In this example config, the include regular expression limits the crawl to page URLs that match `example.com/(crawl-this|crawl-that)`, and the exclude rule drops any URLs that end with exactly `skip`.
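A YAML sketch of this config (the seed URL is a placeholder):

```yaml
seeds:
  - url: https://example.com/
    include: example.com/(crawl-this|crawl-that)
    exclude: skip$
```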
`https://example.com/crawl-this/page.html` and `https://example.com/crawl-this/page/skipme/not` will be crawled, but `https://example.com/crawl-this/page/skip` will not.
More complicated regular expressions
This example exclusion rule targets characters and numbers after `search` up until the string `ID=`, followed by any number of digits.
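A sketch of this exclusion rule in YAML; the exact pattern is an assumption reconstructed to match the URLs described below (note the end-of-URL anchor):

```yaml
seeds:
  - url: https://example.com/
    exclude: search\/\w+ID=\d+$
```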
`https://example.com/search/ID=5819`, `https://example.com/search/6vH8R4Tm`, and `https://example.com/search/2o3Jq89cID=5ag8h19` will be crawled, but `https://example.com/search/6vH8R4TmID=5819` will not.
The `include`, `exclude`, `scopeType`, and `depth` settings can be configured per seed or globally for the entire crawl. The per-seed settings override the per-crawl settings, if any.
See the test suite tests/scopes.test.js for additional examples of configuring scope inclusion and exclusion rules.
Note
Include and exclude rules are always regular expressions. For rules to match, you may have to escape special characters that commonly appear in URLs, such as `?`, `+`, or `.`, by placing a `\` before the character. For example: `youtube.com/watch\?rdwz7QiG0lk`.
Browsertrix Crawler does not log excluded URLs.
Page Resource Block Rules¶
While scope rules define which pages are to be crawled, it is also possible to block page resources: URLs loaded within a page or within an iframe on a page.

For example, this is useful for blocking ads or other unwanted content that is loaded within multiple pages.

The page resource block rules can be specified as a list in the `blockRules` field. Each rule can contain the following fields:
- `url`: regex for URL to match (required).
- `type`: can be `block` or `allowOnly`. The block rule blocks the specified match, while `allowOnly` inverts the match, allowing only the matched URLs and blocking all others.
- `inFrameUrl`: if specified, indicates that the rule only applies when `url` is loaded in a specific iframe or top-level frame.
- `frameTextMatch`: if specified, the text of the specified URL is checked for the regex, and the rule applies only if there is an additional match. When specified, this field makes the block rule apply only to frame-level resources, eg. URLs loaded directly in an iframe or top-level frame.
For example, a very simple block rule that blocks all URLs from 'googleanalytics.com' on any page can be added with:
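```yaml
# A sketch of such a rule; the field names follow the list above and the values are illustrative
blockRules:
  - url: googleanalytics.com
```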
To instead block 'googleanalytics.com' only if loaded within pages or iframes that match the regex 'example.com/no-analytics', add:
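```yaml
# A sketch of the frame-scoped variant, using the inFrameUrl field described above
blockRules:
  - url: googleanalytics.com
    inFrameUrl: example.com/no-analytics
```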
For additional examples of block rules, see the tests/blockrules.test.js file in the test suite.
If the `--blockMessage` option is also specified, a blocked URL is replaced with the specified message (added as a WARC resource record).
Page Resource Block Rules vs Scope Rules¶
If it seems confusing which rules should be used, here is a quick way to decide:

- If you'd like to restrict the pages that are being crawled, use the crawl scope rules (defined above).
- If you'd like to restrict parts of a page that are being loaded, use the page resource block rules described in this section.
The `blockRules` add a filter to each URL loaded on a page and incur extra overhead. They should only be used in advanced use cases where part of a page needs to be blocked.

These rules cannot be used to prevent entire pages from loading — use the scope exclusion rules for that (a warning will be printed if a page resource block rule matches a top-level page).