cmoncrawl.processor.dao.s3
Classes
- class cmoncrawl.processor.dao.s3.S3Dao(aws_profile: str | None = None, bucket_name: str = 'commoncrawl')
S3Dao provides methods for downloading warc files from the commoncrawl bucket on AWS S3.
- Parameters:
aws_profile (str, optional) – The AWS profile to use for the download. Defaults to None.
bucket_name (str, optional) – The name of the S3 bucket. Defaults to “commoncrawl”.
- bucket_name
The name of the S3 bucket.
- Type:
str
- aws_profile
The AWS profile to use for the download.
- Type:
str | None
- client
The S3 client.
- Type:
aioboto3.client
- __aenter__()
Asynchronous context manager method to initialize the S3 client.
- __aexit__(exc_type, exc, tb)
Asynchronous context manager method to clean up the S3 client.
- fetch(domain_record)
Downloads a warc file from the commoncrawl bucket using S3 and returns its bytes.
- Raises:
ValueError – If the S3Dao client is not initialized.
- async fetch(domain_record: DomainRecord) → bytes
Downloads a warc file from the commoncrawl bucket using S3 and returns its bytes.
- Parameters:
domain_record (DomainRecord) – The domain record to use for the download.
- Returns:
The bytes of the downloaded warc file.
- Return type:
bytes
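The lifecycle described above (enter the async context to initialize the client, call fetch, exit to clean up) might be used roughly as in the sketch below. The helper fetch_warc_bytes and the StubDao class are hypothetical names introduced only for illustration. StubDao mimics the documented behaviour, including the ValueError when the client is not initialized, so the example runs without AWS access. It does not reproduce the real implementation.

```python
import asyncio
from typing import Any


async def fetch_warc_bytes(dao: Any, domain_record: Any) -> bytes:
    """Open the DAO, fetch one record, and close the DAO again.

    `dao` is assumed to expose the documented interface:
    __aenter__, __aexit__, and an async fetch(domain_record).
    """
    # Entering the context initializes the client; the documentation states
    # that fetch() raises ValueError when the client is not initialized.
    async with dao:
        return await dao.fetch(domain_record)


class StubDao:
    """Hypothetical stand-in that mimics the documented S3Dao lifecycle."""

    def __init__(self) -> None:
        self.client = None

    async def __aenter__(self) -> "StubDao":
        self.client = object()  # placeholder for the real S3 client
        return self

    async def __aexit__(self, exc_type, exc, tb) -> None:
        self.client = None  # clean up, as __aexit__ is documented to do

    async def fetch(self, domain_record: Any) -> bytes:
        if self.client is None:
            raise ValueError("S3Dao client is not initialized")
        return b"WARC/1.0\r\n"  # made-up payload, not real crawl data


if __name__ == "__main__":
    print(asyncio.run(fetch_warc_bytes(StubDao(), domain_record=None)))
```

With the real class, the same helper would presumably be called as fetch_warc_bytes(S3Dao(aws_profile="my-profile"), record), where "my-profile" is a placeholder profile name and record is a DomainRecord obtained elsewhere.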