cmoncrawl.processor.dao.s3

Classes

class cmoncrawl.processor.dao.s3.S3Dao(aws_profile: str | None = None, bucket_name: str = 'commoncrawl')

S3Dao provides methods for interacting with AWS S3 to download WARC files from the commoncrawl bucket.

Parameters:
  • aws_profile (str, optional) – The AWS profile to use for the download. Defaults to None.

  • bucket_name (str, optional) – The name of the S3 bucket. Defaults to “commoncrawl”.

bucket_name

The name of the S3 bucket.

Type:

str

aws_profile

The AWS profile to use for the download.

Type:

str | None

client

The S3 client.

Type:

aioboto3.client

__aenter__()

Asynchronous context manager entry method; initializes the S3 client.

__aexit__(exc_type, exc, tb)

Asynchronous context manager exit method; cleans up the S3 client.

fetch(domain_record)

Downloads a WARC file from the commoncrawl bucket using S3 and returns its bytes.

Raises:

ValueError – If the S3Dao client is not initialized.
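
The lifecycle described above (client initialized on entry, cleaned up on exit, and fetch refusing to run without a client) can be illustrated with a small stand-in class. This is only a sketch of the documented contract, not cmoncrawl's implementation: InMemoryDao, its blobs dictionary, and the string keys are hypothetical, and no S3 calls are made.

```python
import asyncio
from typing import Dict, Optional


class InMemoryDao:
    """Hypothetical stand-in that mimics the documented S3Dao lifecycle."""

    def __init__(self, blobs: Dict[str, bytes]) -> None:
        self.blobs = blobs
        self.client: Optional[Dict[str, bytes]] = None  # set in __aenter__

    async def __aenter__(self) -> "InMemoryDao":
        # Stands in for initializing the S3 client.
        self.client = self.blobs
        return self

    async def __aexit__(self, exc_type, exc, tb) -> None:
        # Stands in for cleaning up the S3 client.
        self.client = None

    async def fetch(self, key: str) -> bytes:
        if self.client is None:
            raise ValueError("client is not initialized; use 'async with'")
        return self.client[key]


async def demo() -> bytes:
    dao = InMemoryDao({"example.warc": b"WARC/1.0"})
    try:
        await dao.fetch("example.warc")  # called outside the context manager
    except ValueError:
        pass  # expected here: the client has not been initialized yet
    async with dao:
        return await dao.fetch("example.warc")


result = asyncio.run(demo())
```

The point of the sketch is the ordering: fetch is only valid between entry and exit of the asynchronous context, which matches the ValueError documented above.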

async fetch(domain_record: DomainRecord) → bytes

Downloads a WARC file from the commoncrawl bucket using S3 and returns its bytes.

Parameters:
  • domain_record (DomainRecord) – The domain record to use for the download.

Returns:

The bytes of the downloaded WARC file.

Return type:

bytes
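
A minimal usage sketch for fetch, assuming cmoncrawl is installed and a DomainRecord has already been obtained elsewhere. The helper name download_warc and its dao_cls parameter are hypothetical and exist only so the function can be exercised with a stub; the constructor argument and the fetch call follow the signatures documented above.

```python
import asyncio
from typing import Any, Optional


async def download_warc(
    domain_record: Any,
    aws_profile: Optional[str] = None,
    dao_cls: Any = None,  # hypothetical hook for substituting a stub class
) -> bytes:
    """Fetch one record's bytes through a DAO used as an async context manager."""
    if dao_cls is None:
        # Imported lazily so the helper itself does not require cmoncrawl
        # until it is actually called without a stub.
        from cmoncrawl.processor.dao.s3 import S3Dao

        dao_cls = S3Dao
    dao = dao_cls(aws_profile=aws_profile)
    async with dao:  # entry initializes the client, exit cleans it up
        return await dao.fetch(domain_record)
```

With AWS credentials configured, a call such as asyncio.run(download_warc(record, aws_profile="default")) would perform the download for an existing record; the profile name "default" is a placeholder.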