Skip to content

Crawling with Proxies

Browser Crawler supports crawling through HTTP and SOCKS5 proxies, including through a SOCKS5 proxy over an SSH tunnel.

To specify a proxy, the PROXY_SERVER environment variable or --proxyServer CLI flag can be passed in. If both are provided, the --proxyServer CLI flag will take precedence.

The proxy server can be specified as a http://, socks5://, or ssh:// URL.

HTTP Proxies

To crawl through an HTTP proxy running at http://path-to-proxy-host.example.com:9000, run the crawler with:

docker run -v $PWD/crawls/:/crawls/ -e PROXY_SERVER=http://path-to-proxy-host.example.com:9000 webrecorder/browsertrix-crawler crawl --url https://example.com/

or

docker run -v $PWD/crawls/:/crawls/ webrecorder/browsertrix-crawler crawl --url https://example.com/ --proxyServer http://path-to-proxy-host.example.com:9000 

The crawler does not support authentication for HTTP proxies, as that is not supported by the browser.

(For backwards compatibility with crawler 0.x, PROXY_HOST and PROXY_PORT environment variables can be used to specify an HTTP proxy instead of PROXY_SERVER which takes precedence if provided).

SOCKS5 Proxies

To use a SOCKS5 proxy running at path-to-proxy-host.example.com:9001, run the crawler with:

docker run -v $PWD/crawls/:/crawls/ -e PROXY_SERVER=socks5://path-to-proxy-host.example.com:9001 webrecorder/browsertrix-crawler crawl --url https://example.com/

The crawler does support password authentication for SOCKS5 proxies, which can be provided as user:password in the proxy URL:

docker run-v $PWD/crawls/:/crawls/ -e PROXY_SERVER=socks5://user:password@path-to-proxy-host.example.com:9001 webrecorder/browsertrix-crawler crawl --url https://example.com/

SSH Proxies

Starting with 1.3.0, the crawler also supports crawling through an SOCKS5 that is established over an SSH tunnel, via ssh -D. With this option, the crawler can SSH into a remote machine that has SSH and port forwarding enabled and crawl through that machine's network.

To use this proxy, the private SSH key file must be provided via --sshProxyPrivateKeyFile CLI flag.

The private key and public host key should be mounted as volumes into a path in the container, as shown below.

For example, to connect via SSH to host path-to-ssh-host.example.com as user user with private key stored in ./my-proxy-private-key, run:

docker run -v $PWD/crawls/:/crawls/ -v $PWD/my-proxy-private-key:/tmp/private-key webrecorder/browsertrix-crawler crawl --url https://httpbin.org/ip --proxyServer ssh://user@path-to-ssh-host.example.com --sshProxyPrivateKeyFile /tmp/private-key

To also provide the host public key (eg. ./known_hosts file) for additional verification, run:

docker run -v $PWD/crawls/:/crawls/ -v $PWD/my-proxy-private-key:/tmp/private-key -v $PWD/known_hosts:/tmp/known_hosts webrecorder/browsertrix-crawler crawl --url https://httpbin.org/ip --proxyServer ssh://user@path-to-ssh-host.example.com --sshProxyPrivateKeyFile /tmp/private-key --sshProxyKnownHostsFile /tmp/known_hosts

The host key will only be checked if provided in a file via: --sshProxyKnownHostsFile.

A custom SSH port can be provided with --proxyServer ssh://user@path-to-ssh-host.example.com:2222, otherwise the connection will be attempted via the default SSH port (port 22).

The SSH connection establishes a tunnel on a local port in the container (9722) which will forward inbound/outbound traffic through the remote proxy. The autossh utility is used to automatically restart the SSH connection, if needed.

Only key-based authentication is supposed for SSH proxies for now.

Browser Profiles

The above proxy settings also apply to Browser Profile Creation, and browser profiles can also be created using proxies, for example:

docker run -p 6080:6080 -p 9223:9223 -v $PWD/crawls/profiles:/crawls/profiles -v $PWD/my-proxy-private-key:/tmp/private-key -v $PWD/known_hosts:/tmp/known_hosts webrecorder/browsertrix-crawler create-login-profile --url https://example.com/ --proxyServer ssh://user@path-to-ssh-host.example.com --sshProxyPrivateKeyFile /tmp/private-key --sshProxyKnownHostsFile /tmp/known_hosts