cmoncrawl.aggregator.gateway_query

Classes

class cmoncrawl.aggregator.gateway_query.GatewayAggregator(urls: List[str], match_type: MatchType = MatchType.EXACT, cc_servers: List[str] | None = None, since: datetime = datetime.datetime(1, 1, 1, 0, 0), to: datetime = datetime.datetime(9999, 12, 31, 23, 59, 59, 999999), limit: int | None = None, max_retry: int = 5, prefetch_size: int = 3, sleep_base: float = 1.3, max_requests_per_second: int = 20)

Aggregates records from the Common Crawl index files. The class is an async context manager; once entered, it can be used as an async iterator that yields the DomainRecord objects found in those index files.

It uses the commoncrawl index server to find the index files.
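
The combination of async context manager and async iterator described above can be illustrated with a small, self-contained sketch. ToyAggregator and collect are hypothetical names invented for this illustration; the sketch does not reproduce the cmoncrawl implementation and performs no network access, it only shows the protocol a caller relies on.

```python
import asyncio
from typing import AsyncIterator, List


class ToyAggregator:
    """Hypothetical stand-in that mimics the usage protocol only."""

    def __init__(self, urls: List[str]) -> None:
        self.urls = urls
        self.opened = False

    async def __aenter__(self) -> "ToyAggregator":
        # A real aggregator would typically open its HTTP session here.
        self.opened = True
        return self

    async def __aexit__(self, exc_type, exc, tb) -> None:
        # Runs on exit from the `async with` block, even after an error.
        self.opened = False

    def __aiter__(self) -> AsyncIterator[str]:
        # Hand back an async generator that produces the records.
        return self._records()

    async def _records(self) -> AsyncIterator[str]:
        for url in self.urls:
            await asyncio.sleep(0)  # placeholder for a network request
            yield f"record for {url}"


async def collect(urls: List[str]) -> List[str]:
    # Enter the context first, then iterate inside it.
    async with ToyAggregator(urls) as aggregator:
        return [record async for record in aggregator]


records = asyncio.run(collect(["example.com", "example.org"]))
```

Iterating inside the `async with` block matters: resources acquired on entry are only guaranteed to be available until the block exits.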

Parameters:
  • urls (List[str]) – A list of URLs to search for.

  • cc_indexes_server (str, optional) – The commoncrawl index server to use. Defaults to “http://index.commoncrawl.org/collinfo.json”. Note that this parameter does not appear in the constructor signature above.

  • match_type (MatchType, optional) – Match type for cdx-api. Defaults to MatchType.EXACT.

  • cc_servers (List[str], optional) – A list of commoncrawl servers to use. If None, then indexes will be retrieved from the cc_indexes_server. Defaults to None.

  • since (datetime, optional) – The start date for the search. Defaults to datetime.min.

  • to (datetime, optional) – The end date for the search. Defaults to datetime.max.

  • limit (int, optional) – The maximum number of results to return. Defaults to None.

  • max_retry (int, optional) – The maximum number of retries for a single request. Defaults to 5.

  • prefetch_size (int, optional) – The number of indexes to fetch concurrently. Defaults to 3.

  • sleep_base (float, optional) – The base for the exponential backoff time calculation between retries. Defaults to 1.3.

  • max_requests_per_second (int, optional) – The maximum number of requests per second. Defaults to 20.
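
To give a feel for how max_retry and sleep_base might interact, here is a minimal sketch of an exponential backoff schedule. The formula delay = sleep_base ** attempt and the helper name backoff_delays are assumptions made for illustration; the documentation above only states that sleep_base is the base of an exponential backoff, so the exact calculation used by the library may differ.

```python
from typing import List


def backoff_delays(max_retry: int = 5, sleep_base: float = 1.3) -> List[float]:
    """Return one delay per retry attempt.

    ASSUMPTION: delay = sleep_base ** attempt, with attempts counted
    from 1. This is an illustrative formula, not the library's code.
    """
    return [sleep_base ** attempt for attempt in range(1, max_retry + 1)]


# Schedule for the default values shown in the signature.
delays = backoff_delays()
total_wait = sum(delays)
```

Under that assumption the default values produce five delays that grow from 1.3 to roughly 3.7, so a larger sleep_base spreads retries out much faster than a larger max_retry does.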

Examples

>>> import asyncio
>>> async def main():
...     async with GatewayAggregator(["example.com"]) as aggregator:
...         async for domain_record in aggregator:
...             print(domain_record)
>>> asyncio.run(main())

class GatewayAggregatorIterator(client: ClientSession, urls: List[str], CC_files: List[str], match_type: MatchType | None, since: datetime, to: datetime, limit: int | None, max_retry: int, prefetch_size: int, sleep_base: float)