Command Line Download

The download mode of the cmon command-line tool queries Common Crawl indexes and downloads the matching data. The following arguments are required, in this order:

Positional arguments

  1. output - Path to output directory.

  2. {record,html} - Download mode:

    • record: Download record files from Common Crawl.

    • html: Download HTML files from Common Crawl.

  3. urls - URLs to download, e.g. www.bcc.cz.

In html mode, the output directory will contain .html files, one for each URL found. In record mode, the output directory will contain .jsonl files, each holding multiple domain records in JSON format.

Options

--limit LIMIT

Max number of URLs to download.

--since SINCE

Start date in ISO format (e.g., 2020-01-01).

--to TO

End date in ISO format (e.g., 2020-01-01).
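
For example, the two date options can be combined to restrict the query to a time window; the dates, output directory, limit, and URL below are placeholders:

# Only records crawled between 2021-01-01 and 2022-01-01
cmon download dr_output record --since=2021-01-01 --to=2022-01-01 --limit=1000 example.com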

--cc_server CC_SERVER

Common Crawl indexes to query. Must provide the whole URL (e.g., https://index.commoncrawl.org/CC-MAIN-2023-14-index).
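
For instance, to restrict the query to the CC-MAIN-2023-14 index mentioned above (output directory, limit, and URL are placeholders):

# Query only the CC-MAIN-2023-14 index
cmon download dr_output record --cc_server=https://index.commoncrawl.org/CC-MAIN-2023-14-index --limit=100 example.com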

--max_retry MAX_RETRY

Maximum number of retries per request. Increase this value if requests keep failing.

--sleep_base SLEEP_BASE

Base sleep time for exponential backoff in case of request failure.

--max_requests_per_second MAX_REQUESTS_PER_SECOND

Max number of requests per second.
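
A minimal sketch combining the three options above; the values are illustrative, not recommendations:

# More retries, longer backoff, and a gentler request rate for a flaky connection
cmon download dr_output record --max_retry=10 --sleep_base=5 --max_requests_per_second=2 --limit=1000 example.com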

--match_type MATCH_TYPE

Match type for the URL; one of exact, prefix, host, or domain. Refer to the cdx-api documentation and to cmoncrawl.common.types.MatchType for more information.
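
As an illustration (URLs, output directory, and limits are placeholders), under the usual cdx-api semantics an exact match restricts results to the given URL, while a domain match also includes subdomains:

# Only records whose URL is exactly example.com/about
cmon download dr_output record --match_type=exact --limit=10 example.com/about

# Records from example.com and all of its subdomains
cmon download dr_output record --match_type=domain --limit=10 example.com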

--max_directory_size MAX_DIRECTORY_SIZE

Max number of files per directory.

--filter_non_200

Filter out responses with non-200 status codes.

--aggregator AGGREGATOR

Aggregator to use for the query (see the sketch after the note below).

  • athena: Athena aggregator. Fastest, but requires AWS credentials with correct permissions. See Athena for more information.

  • gateway: Gateway aggregator (default). Very slow, but no need for AWS config.

--s3_bucket S3_BUCKET

S3 bucket to use for Athena aggregator. Only needed if using Athena aggregator.

  • If set, the bucket will not be deleted after the query is done, allowing it to be reused for future queries.

  • If not set, a temporary bucket will be created and deleted after the query is done.

Note

If you specify an S3 bucket, remember to delete it manually after you’re done to avoid incurring unnecessary costs.
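
A sketch of an Athena-backed query; the bucket name is a placeholder, and the AWS credentials in your environment must be allowed to create and write to it:

# Use the Athena aggregator with a reusable S3 bucket
cmon download dr_output record --aggregator=athena --s3_bucket=my-cmoncrawl-results --limit=1000 example.com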

Record mode options

--max_crawls_per_file MAX_CRAWLS_PER_FILE

Max number of domain records per output file.
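
For example, to cap each output .jsonl file at 1000 domain records (numbers and names are placeholders):

# At most 1000 domain records per output .jsonl file
cmon download dr_output record --max_crawls_per_file=1000 --limit=5000 example.com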

HTML mode options

--encoding ENCODING

Force use of the specified encoding, if possible.
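
If downloaded pages come back garbled, you can try forcing an encoding; the value shown is only an example:

# Try to decode every page as UTF-8
cmon download html_output html --encoding=utf-8 --limit=100 example.com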

--download_method DOWNLOAD_METHOD

Method for downloading WARC files from Common Crawl; this only applies to HTML download (see the sketch after this list).

  • api: Download from Common Crawl API Gateway. This is the default option.

  • s3: Download from Common Crawl S3 bucket. This is the fastest option, but requires AWS credentials with correct permissions.
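
A sketch of the faster S3 path, assuming AWS credentials are already configured in your environment (e.g. via the standard AWS environment variables or config files):

# Fetch the underlying WARC data directly from the Common Crawl S3 bucket
cmon download html_output html --download_method=s3 --limit=100 example.com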

Examples

# Download first 1000 domain records for example.com
cmon download dr_output record --match_type=domain --limit=1000 example.com

# Download first 100 HTML files for example.com
cmon download html_output html --match_type=domain --limit=100 example.com