Command Line Download ===================== The download mode of the ``cmon`` command line tool serves to query and download from CommonCrawl indexes. The following arguments are needed in this order: Positional arguments -------------------- 1. output - Path to output directory. 2. {record,html} - Download mode: - record: Download record files from Common Crawl. - html: Download HTML files from Common Crawl. 3. urls - URLs to download, e.g. www.bcc.cz. In html mode, the output directory will contain .html files, one for each found URL. In record mode, the output directory will contain ``.jsonl`` files, each containing multiple domain records in JSON format. Options ------- --limit LIMIT Max number of URLs to download. --since SINCE Start date in ISO format (e.g., 2020-01-01). --to TO End date in ISO format (e.g., 2020-01-01). --cc_server CC_SERVER Common Crawl indexes to query. Must provide the whole URL (e.g., https://index.commoncrawl.org/CC-MAIN-2023-14-index). --max_retry MAX_RETRY Max number of retries for a request. Increase this number when requests are failing. --sleep_base SLEEP_BASE Base sleep time for exponential backoff in case of request failure. --max_requests_per_second MAX_REQUESTS_PER_SECOND Max number of requests per second. --match_type MATCH_TYPE One of exact, prefix, host, domain Match type for the URL. Refer to cdx-api for more information. See :py:class:`cmoncrawl.common.types.MatchType` for more information. --max_directory_size MAX_DIRECTORY_SIZE Max number of files per directory. --filter_non_200 Filter out non-200 status code. --aggregator AGGREGATOR Aggregator to use for the query. - athena: Athena aggregator. Fastest, but requires AWS credentials with correct permissions. See :ref:`misc/athena:Athena` for more information. - gateway: Gateway aggregator (default). Very slow, but no need for AWS config. --s3_bucket S3_BUCKET S3 bucket to use for Athena aggregator. Only needed if using Athena aggregator. - If set the bucket will not be deleted after the query is done, allowing to reuse it for future queries. - If not set, a temporary bucket will be created and deleted after the query is done. .. note:: If you specify an S3 bucket, remember to delete it manually after you're done to avoid incurring unnecessary costs. Record mode options ------------------- --max_crawls_per_file MAX_CRAWLS_PER_FILE Max number of domain records per file output HTML mode options ----------------- --encoding ENCODING Force usage of specified encoding if possible. --download_method DOWNLOAD_METHOD Method for downloading warc files from Common Crawl, it only applies to HTML download. - api: Download from Common Crawl API Gateway. This is the default option. - s3: Download from Common Crawl S3 bucket. This is the fastest option, but requires AWS credentials with correct permissions. Examples -------- .. code-block:: bash # Download first 1000 domain records for example.com cmon download dr_output record --match_type=domain --limit=1000 example.com # Download first 100 htmls for example.com cmon download html_output html --match_type=domain --limit=100 example.com