cmoncrawl.processor.pipeline.downloader
Classes
- class cmoncrawl.processor.pipeline.downloader.AsyncDownloader(dao: ICC_Dao, digest_verification: bool = True, max_retry: int = 5, sleep_base: float = 1.3, max_requests_per_second: int = 20, encoding: str = 'latin-1')
Downloader which asynchronously downloads the data for the domain_record
- Parameters:
dao (ICC_Dao) – Data access object to use for downloading
digest_verification (bool, optional) – Whether to verify the digest of the downloaded data. Defaults to True.
max_retry (int, optional) – Maximum number of retries. Defaults to 5.
sleep_base (float, optional) – Base sleep time for exponential backoff in retries. Defaults to 1.3.
max_requests_per_second (int, optional) – Maximum number of requests per second. Defaults to 20.
encoding (str, optional) – Default encoding to be used. Defaults to “latin-1”.
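The retry parameters can be read as an exponential backoff schedule. The sketch below is not the library's implementation: it assumes the delay before retry n is sleep_base ** n, and the helper name backoff_delays is hypothetical.

```python
from typing import List


def backoff_delays(max_retry: int = 5, sleep_base: float = 1.3) -> List[float]:
    """Return the assumed sleep time (in seconds) before each retry.

    Assumption: the delay before retry n (1-based) is sleep_base ** n.
    The library may use a different formula or add jitter.
    """
    return [sleep_base ** attempt for attempt in range(1, max_retry + 1)]


# Delays for the documented defaults (max_retry=5, sleep_base=1.3).
delays = backoff_delays()
print([round(d, 3) for d in delays])
```

Under this assumption, five retries with the default base wait roughly 11.8 seconds in total, so a larger sleep_base mainly lengthens the later retries.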
- class cmoncrawl.processor.pipeline.downloader.DownloaderLocalFiles(files: List[Path], url: str | None = None, date: datetime | None = None)
Local file downloader and metadata extractor for testing. It doesn’t download anything; it passes local files further into the pipeline and extracts metadata from each file.
- Parameters:
files (List[Path]) – List of local files to pass
url (str, optional) – Url to use for metadata. Defaults to None.
date (datetime, optional) – Date to add to metadata. Defaults to None.
- class cmoncrawl.processor.pipeline.downloader.DummyDownloader
A dummy downloader class that does not perform any actual downloading. It simply adds an empty string as the content and passes the domain record further into the pipeline.
- async download(domain_record: DomainRecord | None)
Downloads the content for the given domain record.
- Parameters:
domain_record (DomainRecord | None) – The domain record to download.
- Returns:
A list containing a single tuple with an empty string as the first element and the pipe metadata as the second element.
- Return type:
List[Tuple[str, PipeMetadata]]
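The documented return shape can be illustrated without installing the library. In this minimal sketch, PipeMetadataStandIn and DummyDownloaderSketch are hypothetical stand-ins, and a plain dict takes the place of DomainRecord; none of these are the library's real classes.

```python
import asyncio
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class PipeMetadataStandIn:
    """Hypothetical stand-in for cmoncrawl's PipeMetadata."""

    domain_record: Optional[dict]


class DummyDownloaderSketch:
    """Mimics the documented return shape of DummyDownloader.download."""

    async def download(
        self, domain_record: Optional[dict]
    ) -> List[Tuple[str, PipeMetadataStandIn]]:
        # No I/O happens: the content is an empty string and the record
        # is carried along in the metadata.
        return [("", PipeMetadataStandIn(domain_record=domain_record))]


# download is a coroutine, so it has to be awaited or run in an event loop.
result = asyncio.run(
    DummyDownloaderSketch().download({"url": "https://example.com"})
)
content, metadata = result[0]
```

Because the result is a one-element list of (content, metadata) tuples, code further along the pipeline can treat a dummy download the same way as a real one.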
- class cmoncrawl.processor.pipeline.downloader.IDownloader
Base class for all downloaders
- class cmoncrawl.processor.pipeline.downloader.WarcIterator(file: Path, encoding: str = 'latin-1', show_progress: bool = False)
WarcIterator is a local downloader that iterates over the specified WARC file.
- Parameters:
file (Path) – Path to the warc file
encoding (str, optional) – Encoding to be used. Defaults to “latin-1”.
show_progress (bool, optional) – Whether to show progress while iterating. Defaults to False.
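One property of the “latin-1” default is worth keeping in mind: every byte value maps to a code point, so decoding never raises, but content that is actually UTF-8 comes out garbled. The snippet below uses only the Python standard library and does not call cmoncrawl.

```python
# Every possible byte value has a latin-1 code point, so decoding cannot fail.
raw = bytes(range(256))
text = raw.decode("latin-1")
roundtrip = text.encode("latin-1")

# UTF-8 bytes decoded as latin-1 come out as mojibake, not as an exception.
mojibake = "é".encode("utf-8").decode("latin-1")
```

If the records are known to be UTF-8, passing encoding="utf-8" instead avoids the garbling, at the cost of possible decode errors on malformed input.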