Command Line Interface

The command line interface is a simple wrapper around the library.

It provides two main functionalities:

  • download - Downloads samples of either domain records or HTML files from the Common Crawl indexes.

  • extract - Downloads the HTML referenced by a domain record and extracts its content. It can also take HTML files directly and extract the data from them.

Both functionalities are invoked with cmon followed by the functionality name and its required arguments. The cmon command also takes a few optional arguments:

--verbosity

Verbosity level. Choices are [0, 1, 2], with 0 being the least verbose and 2 being the most verbose. Default is 1.

--aws_profile

AWS profile to use for AWS calls (Athena, S3). If not provided, the default AWS profile will be used.
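
As a minimal sketch of how these options combine with a functionality, the Python helper below assembles an argument list for cmon. The helper name build_cmon_args and the profile name my-profile are hypothetical and not part of the library. Placing the optional arguments before the functionality name is an assumption based on the description above; check the CLI's help output for the exact order.

```python
from typing import List, Optional

VERBOSITY_CHOICES = (0, 1, 2)  # documented choices; the documented default is 1


def build_cmon_args(
    subcommand_args: List[str],
    verbosity: Optional[int] = None,
    aws_profile: Optional[str] = None,
) -> List[str]:
    """Return an argv list for cmon (hypothetical helper, not part of the library)."""
    if verbosity is not None and verbosity not in VERBOSITY_CHOICES:
        raise ValueError(f"verbosity must be one of {VERBOSITY_CHOICES}")
    args = ["cmon"]
    # Assumption: the optional arguments precede the functionality name.
    if verbosity is not None:
        args.append(f"--verbosity={verbosity}")
    if aws_profile is not None:
        args.append(f"--aws_profile={aws_profile}")
    return args + list(subcommand_args)


args = build_cmon_args(
    ["download", "--match_type=domain", "--limit=1000",
     "dr_output", "record", "example.com"],
    verbosity=2,
    aws_profile="my-profile",  # hypothetical profile name
)
print(" ".join(args))
```

Omitting both keyword arguments leaves the defaults in place: verbosity 1 and the default AWS profile.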

Examples

# Download the first 1000 domain records for example.com
cmon download --match_type=domain --limit=1000 dr_output record example.com

# Download the first 100 HTML files for example.com
cmon download --match_type=domain --limit=100 html_output html example.com

# Take the domain records downloaded by the first command and extract them using your extractors
cmon extract config.json extracted_output dr_output/*.jsonl record

# Take the HTML files downloaded by the second command and extract them using your extractors
cmon extract config.json extracted_output html_output/*.html html
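
When these commands are run from a script instead of an interactive shell, keep in mind that unquoted patterns such as dr_output/*.jsonl are expanded by the shell before cmon sees them. The following is a minimal sketch, assuming the command is launched from Python without a shell: the hypothetical helper extract_args expands the pattern with glob and builds the argument list in the same order as the extract examples above. It only builds the list; actually running cmon is left to the commented-out subprocess.run call.

```python
import glob
from typing import List


def extract_args(config: str, output_dir: str, pattern: str, mode: str) -> List[str]:
    """Build the argv for `cmon extract` (hypothetical helper, not part of the library)."""
    # Do the expansion that the shell would otherwise perform.
    files = sorted(glob.glob(pattern))
    if not files:
        raise FileNotFoundError(f"no files match {pattern!r}")
    # Same argument order as the examples: config, output directory, files, mode.
    return ["cmon", "extract", config, output_dir, *files, mode]


# Usage, mirroring the domain record example:
# import subprocess
# subprocess.run(
#     extract_args("config.json", "extracted_output", "dr_output/*.jsonl", "record"),
#     check=True,
# )
```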