cmoncrawl.processor.pipeline.downloader

Functions

log_after_retry(retry_state)

Classes

class cmoncrawl.processor.pipeline.downloader.AsyncDownloader(dao: ICC_Dao, digest_verification: bool = True, max_retry: int = 5, sleep_base: float = 1.3, max_requests_per_second: int = 20, encoding: str = 'latin-1')

Downloader that asynchronously downloads the data for the domain_record

Parameters:
  • dao (ICC_Dao) – Data access object to use for downloading

  • digest_verification (bool, optional) – Whether to verify the digest of the downloaded data. Defaults to True.

  • max_retry (int, optional) – Maximum number of retries. Defaults to 5.

  • sleep_base (float, optional) – Base sleep time for exponential backoff in retries. Defaults to 1.3.

  • max_requests_per_second (int, optional) – Maximum number of requests per second. Defaults to 20.

  • encoding (str, optional) – Default encoding to be used. Defaults to 'latin-1'.
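
To make the interaction between max_retry and sleep_base concrete, here is a minimal sketch of an exponential backoff schedule. It assumes the wait before retry n is sleep_base ** n seconds; the exact formula used by AsyncDownloader is not stated in this reference, and backoff_delays is a hypothetical helper, not part of cmoncrawl.

```python
from typing import List


def backoff_delays(max_retry: int = 5, sleep_base: float = 1.3) -> List[float]:
    """Return the assumed wait in seconds before each retry.

    Assumption: the delay before retry number n is sleep_base ** n.
    The library may compute its delays differently.
    """
    return [sleep_base ** attempt for attempt in range(1, max_retry + 1)]


# Print the schedule for the documented defaults, rounded for readability.
for attempt, delay in enumerate(backoff_delays(), start=1):
    print(f"retry {attempt}: wait about {delay:.2f} s")
```

Under this assumption the waits grow geometrically, so a larger sleep_base spreads retries further apart, while max_retry bounds the total number of attempts.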

class cmoncrawl.processor.pipeline.downloader.DownloaderLocalFiles(files: List[Path], url: str | None = None, date: datetime | None = None)

Local file downloader and metadata extractor for testing. It doesn’t download anything; it passes local files further into the pipeline and extracts metadata from each file.

Parameters:
  • files (List[Path]) – List of local files to pass

  • url (str, optional) – Url to use for metadata. Defaults to None.

  • date (datetime, optional) – Date to add to metadata. Defaults to None.

class cmoncrawl.processor.pipeline.downloader.DummyDownloader

A dummy downloader class that does not perform any actual downloading. It simply adds an empty string as the content and passes the domain record further into the pipeline.

async download(domain_record: DomainRecord | None)

Downloads the content for the given domain record.

Parameters:

domain_record (DomainRecord | None) – The domain record to download.

Returns:

A list containing a single tuple with an empty string as the first element and the pipe metadata as the second element.

Return type:

List[Tuple[str, PipeMetadata]]
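
The return contract described above can be sketched without the library itself. In the following, FakeDomainRecord, FakePipeMetadata, SketchDownloader and SketchDummyDownloader are hypothetical stand-ins for DomainRecord, PipeMetadata, IDownloader and DummyDownloader; their fields are assumptions made for illustration and do not mirror the real classes.

```python
import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class FakeDomainRecord:
    """Hypothetical stand-in for cmoncrawl's DomainRecord."""
    url: str


@dataclass
class FakePipeMetadata:
    """Hypothetical stand-in for cmoncrawl's PipeMetadata."""
    domain_record: Optional[FakeDomainRecord]


class SketchDownloader(ABC):
    """Hypothetical stand-in for the IDownloader base class."""

    @abstractmethod
    async def download(
        self, domain_record: Optional[FakeDomainRecord]
    ) -> List[Tuple[str, FakePipeMetadata]]:
        ...


class SketchDummyDownloader(SketchDownloader):
    """Performs no I/O: pairs empty content with metadata for the record."""

    async def download(
        self, domain_record: Optional[FakeDomainRecord]
    ) -> List[Tuple[str, FakePipeMetadata]]:
        return [("", FakePipeMetadata(domain_record=domain_record))]


# Run the coroutine once and inspect the single (content, metadata) pair.
result = asyncio.run(
    SketchDummyDownloader().download(FakeDomainRecord(url="https://example.com"))
)
content, metadata = result[0]
print(repr(content), metadata.domain_record.url)
```

A downloader of this shape is useful when later pipeline stages only need the metadata, since no network or file access takes place.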

class cmoncrawl.processor.pipeline.downloader.IDownloader

Base class for all downloaders

class cmoncrawl.processor.pipeline.downloader.WarcIterator(file: Path, encoding: str = 'latin-1', show_progress: bool = False)

WarcIterator is a local downloader that iterates over the specified WARC file

Parameters:
  • file (Path) – Path to the warc file

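  • encoding (str, optional) – Encoding to be used. Defaults to “latin-1”.

  • show_progress (bool, optional) – Whether to show progress while iterating over the file. Defaults to False.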