Command Line Download
The download mode of the cmon command line tool is used to query Common Crawl indexes and download the matching results.
The following arguments are needed in this order:
Positional arguments
output - Path to output directory.
{record,html} - Download mode:
record: Download record files from Common Crawl.
html: Download HTML files from Common Crawl.
urls - URLs to download, e.g. www.bcc.cz.
In html mode, the output directory will contain .html files, one for each found URL. In record mode, the output directory will contain .jsonl files, each containing multiple domain records in JSON format.
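As a minimal sketch of consuming record mode output, the snippet below walks an output directory and parses each line of every .jsonl file as one JSON object. The helper name iter_domain_records is hypothetical and not part of cmoncrawl. The sketch assumes one JSON object per non-empty line and makes no assumption about which fields a record carries.

```python
import json
from pathlib import Path
from typing import Any, Dict, Iterator


def iter_domain_records(output_dir: str) -> Iterator[Dict[str, Any]]:
    """Yield every JSON object found in the .jsonl files under output_dir.

    Hypothetical helper, not part of cmoncrawl. Assumes one JSON object
    per non-empty line.
    """
    # rglob also covers subdirectories, in case the output is split up.
    for path in sorted(Path(output_dir).rglob("*.jsonl")):
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:  # skip blank lines
                    yield json.loads(line)
```

For example, `records = list(iter_domain_records("dr_output"))` would collect all records written by a record mode run into memory; for large downloads it is probably better to iterate lazily.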
Options
- --limit LIMIT
Max number of URLs to download.
- --since SINCE
Start date in ISO format (e.g., 2020-01-01).
- --to TO
End date in ISO format (e.g., 2020-01-01).
- --cc_server CC_SERVER
Common Crawl indexes to query. Must provide the whole URL (e.g., https://index.commoncrawl.org/CC-MAIN-2023-14-index).
- --max_retry MAX_RETRY
Max number of retries for a request. Increase this number when requests are failing.
- --sleep_base SLEEP_BASE
Base sleep time for exponential backoff in case of request failure.
- --max_requests_per_second MAX_REQUESTS_PER_SECOND
Max number of requests per second.
- --match_type MATCH_TYPE
Match type for the URL. One of exact, prefix, host, domain. Refer to cdx-api and cmoncrawl.common.types.MatchType for more information.
- --max_directory_size MAX_DIRECTORY_SIZE
Max number of files per directory.
- --filter_non_200
Filter out records with a non-200 status code.
- --aggregator AGGREGATOR
Aggregator to use for the query.
athena: Athena aggregator. Fastest, but requires AWS credentials with correct permissions. See Athena for more information.
gateway: Gateway aggregator (default). Very slow, but no need for AWS config.
- --s3_bucket S3_BUCKET
S3 bucket to use for Athena aggregator. Only needed if using Athena aggregator.
If set, the bucket will not be deleted after the query is done, so it can be reused for future queries.
If not set, a temporary bucket will be created and deleted after the query is done.
Note
If you specify an S3 bucket, remember to delete it manually after you’re done to avoid incurring unnecessary costs.
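The interaction between --max_retry and --sleep_base can be pictured with a generic exponential backoff loop. The following is a sketch only: the doubling formula sleep_base * 2**attempt is an assumption, the names backoff_delays and call_with_retry are hypothetical, and the schedule cmon actually uses may differ (for example by adding jitter).

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def backoff_delays(sleep_base: float, max_retry: int) -> List[float]:
    """Pause before each retry, assuming the delay doubles every attempt."""
    return [sleep_base * (2 ** attempt) for attempt in range(max_retry)]


def call_with_retry(
    request: Callable[[], T],
    max_retry: int,
    sleep_base: float,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call request(), retrying up to max_retry times with growing pauses."""
    if max_retry < 0:
        raise ValueError("max_retry must be non-negative")
    delays = backoff_delays(sleep_base, max_retry)
    for attempt in range(max_retry + 1):
        try:
            return request()
        except Exception:
            if attempt == max_retry:
                raise  # retries exhausted, surface the last error
            sleep(delays[attempt])
    raise RuntimeError("unreachable")
```

Under this assumed schedule, raising --max_retry lengthens the total waiting time quickly, since each extra retry waits twice as long as the one before it.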
Record mode options
- --max_crawls_per_file MAX_CRAWLS_PER_FILE
Max number of domain records per output file.
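To illustrate how --max_crawls_per_file relates to --limit, the sketch below splits a stream of records into consecutive groups of at most a given size. chunk_records is a hypothetical helper, and the sketch assumes records are written sequentially with a new file started once the cap is reached, which may not match how cmon distributes records in practice.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def chunk_records(
    records: Iterable[T], max_crawls_per_file: int
) -> Iterator[List[T]]:
    """Yield consecutive lists holding at most max_crawls_per_file records."""
    if max_crawls_per_file < 1:
        raise ValueError("max_crawls_per_file must be at least 1")
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, max_crawls_per_file))
        if not chunk:
            return  # input exhausted
        yield chunk
```

Under that assumption, 1000 records with a cap of 300 would end up in four groups of 300, 300, 300 and 100 records.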
HTML mode options
- --encoding ENCODING
Force usage of specified encoding if possible.
- --download_method DOWNLOAD_METHOD
Method for downloading WARC files from Common Crawl. Applies only to HTML download.
api: Download from Common Crawl API Gateway. This is the default option.
s3: Download from Common Crawl S3 bucket. This is the fastest option, but requires AWS credentials with correct permissions.
Examples
# Download first 1000 domain records for example.com
cmon download dr_output record --match_type=domain --limit=1000 example.com
# Download first 100 HTML files for example.com
cmon download html_output html --match_type=domain --limit=100 example.com
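When the same downloads are launched from a script, the command can be assembled as an argument list and handed to subprocess. The sketch below uses only the positional arguments and options documented on this page and mirrors the argument order of the examples above. build_download_command and run_download are hypothetical helpers, and the sketch assumes the cmon executable is available on PATH.

```python
import subprocess
from typing import List, Optional


def build_download_command(
    output: str,
    mode: str,
    urls: List[str],
    match_type: Optional[str] = None,
    limit: Optional[int] = None,
) -> List[str]:
    """Build an argument list shaped like the documented cmon examples."""
    if mode not in ("record", "html"):
        raise ValueError("mode must be 'record' or 'html'")
    # Same order as the examples: output, mode, options, then URLs.
    command = ["cmon", "download", output, mode]
    if match_type is not None:
        command.append(f"--match_type={match_type}")
    if limit is not None:
        command.append(f"--limit={limit}")
    command.extend(urls)
    return command


def run_download(command: List[str]) -> None:
    """Run the command; raises CalledProcessError on a non-zero exit."""
    subprocess.run(command, check=True)
```

Calling `build_download_command("dr_output", "record", ["example.com"], match_type="domain", limit=1000)` produces the argument list for the first example, which can then be passed to `run_download`.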