cmoncrawl.aggregator.athena_query

Classes

class cmoncrawl.aggregator.athena_query.AthenaAggregator(urls: List[str], match_type: MatchType = MatchType.EXACT, cc_servers: List[str] | None = None, since: datetime = datetime.datetime(1, 1, 1, 0, 0), to: datetime = datetime.datetime(9999, 12, 31, 23, 59, 59, 999999), limit: int | None = None, prefetch_size: int = 2, sleep_base: float = 1.3, max_retry: int = 5, extra_sql_where_clause: str | None = None, batch_size: int = 1, aws_profile: str | None = None, bucket_name: str | None = None, catalog_name: str = 'AwsDataCatalog', database_name: str = 'commoncrawl', table_name: str = 'ccindex')

Aggregates records from the Common Crawl index using AWS Athena, which queries the index files stored on S3. The class is an async context manager; once entered, it can be used as an async iterator that yields the DomainRecord objects found in the index.

Parameters:
  • urls (List[str]) – A list of urls to search for.

  • cc_indexes_server (str, optional) – The commoncrawl index server to use. Defaults to “http://index.commoncrawl.org/collinfo.json”. Note that this parameter does not appear in the constructor signature above.

  • match_type (MatchType, optional) – Match type for cdx-api. Defaults to MatchType.EXACT.

  • cc_servers (List[str], optional) – A list of commoncrawl servers to use. If None, then indexes will be retrieved from the cc_indexes_server. Defaults to None.

  • since (datetime, optional) – The start date for the search. Defaults to datetime.min.

  • to (datetime, optional) – The end date for the search. Defaults to datetime.max.

  • limit (int, optional) – The maximum number of results to return. Defaults to None.

  • prefetch_size (int, optional) – The number of indexes to fetch concurrently. Defaults to 2.

  • max_retry (int, optional) – The maximum number of retries for a single request. Defaults to 5.

  • extra_sql_where_clause (str, optional) – Additional SQL WHERE clause to append to the Athena query. Defaults to None.

  • batch_size (int) – How many crawls to query at once. Defaults to 1. If <= 0, all crawls will be queried at once.

  • aws_profile (str, optional) – The AWS profile to use for Athena and S3. Defaults to None.

  • bucket_name (str, optional) – The S3 bucket to use for Athena query results. If None, a new bucket will be created. Defaults to None.

  • catalog_name (str, optional) – The Athena catalog to use. Defaults to “AwsDataCatalog”.

  • database_name (str, optional) – The Athena database to use. Defaults to “commoncrawl”.

  • table_name (str, optional) – The Athena table to use. Defaults to “ccindex”.
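
The batch_size rule described above (a positive value groups that many crawls per query, while a value <= 0 puts all crawls in one query) can be illustrated with a small sketch. This is not the library's implementation; the helper name split_into_batches is hypothetical, and the crawl identifiers are only sample values.

```python
from typing import List, Sequence


def split_into_batches(crawls: Sequence[str], batch_size: int = 1) -> List[List[str]]:
    """Group crawl identifiers the way the batch_size parameter is documented.

    Hypothetical helper for illustration only, not part of cmoncrawl.
    """
    if not crawls:
        return []
    if batch_size <= 0:
        # Documented behaviour: a non-positive value means "all crawls at once".
        return [list(crawls)]
    # Otherwise take consecutive slices of at most batch_size crawls.
    return [
        list(crawls[start:start + batch_size])
        for start in range(0, len(crawls), batch_size)
    ]


# Sample crawl identifiers, used only to show the grouping.
sample_crawls = ["CC-MAIN-2023-06", "CC-MAIN-2023-14", "CC-MAIN-2023-23"]
batches_of_two = split_into_batches(sample_crawls, batch_size=2)
single_batch = split_into_batches(sample_crawls, batch_size=0)
```

With the default of 1, each crawl would get its own query. Larger batches mean fewer queries, each covering more crawls.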

Examples

>>> async with AthenaAggregator(["example.com"]) as aggregator:
...     async for domain_record in aggregator:
...         print(domain_record)
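
Because async with is only valid inside a coroutine, a standalone script needs an event loop around the snippet. The sketch below shows one way to do that while passing a date range and a result limit. The helper names build_aggregator_kwargs and collect_records are hypothetical. Actually running the query requires cmoncrawl, valid AWS credentials and Athena access, so the library import is kept inside the coroutine.

```python
import asyncio
from datetime import datetime
from typing import Any, Dict, List, Optional


def build_aggregator_kwargs(
    since: datetime,
    to: datetime,
    limit: Optional[int] = None,
    bucket_name: Optional[str] = None,
) -> Dict[str, Any]:
    """Collect keyword arguments named after the documented parameters.

    Hypothetical helper; it only validates and packs values.
    """
    if since > to:
        raise ValueError("since must not be later than to")
    kwargs: Dict[str, Any] = {"since": since, "to": to}
    # Optional values are only passed when set, so the class defaults apply.
    if limit is not None:
        kwargs["limit"] = limit
    if bucket_name is not None:
        kwargs["bucket_name"] = bucket_name
    return kwargs


async def collect_records(urls: List[str], **kwargs: Any) -> List[Any]:
    """Enter the aggregator and gather every yielded record into a list."""
    # Imported lazily so the helper above works without cmoncrawl installed.
    from cmoncrawl.aggregator.athena_query import AthenaAggregator

    records: List[Any] = []
    async with AthenaAggregator(urls, **kwargs) as aggregator:
        async for domain_record in aggregator:
            records.append(domain_record)
    return records


# Possible usage (needs AWS access, so it is not executed here):
#   kwargs = build_aggregator_kwargs(datetime(2023, 1, 1), datetime(2023, 6, 30), limit=10)
#   records = asyncio.run(collect_records(["example.com"], **kwargs))
```

Leaving bucket_name unset means, per the parameter description, that a new bucket is created for the query results. Passing an existing bucket avoids that.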

class AthenaAggregatorIterator(aws_client: Session, urls: List[str], cc_servers: List[str], match_type: MatchType, since: datetime | None, to: datetime | None, limit: int | None, prefetch_size: int, sleep_base: float, max_retry: int, batch_size: int, extra_sql_where_clause: str | None, bucket_name: str, database_name: str, table_name: str)