cmoncrawl.common.types

Functions

parse_timestamp(v)

Classes

class cmoncrawl.common.types.DomainCrawl(url: str = '', cdx_server: str = '', page: int = 0)

Domain crawl.

class cmoncrawl.common.types.DomainRecord(*, filename: str, url: str | None, offset: int, length: int, digest: str | None = None, encoding: str | None = None, timestamp: datetime | None = None)

Domain record.

model_config: ClassVar[ConfigDict] = {}

Configuration for the model; it should be a dictionary conforming to pydantic.config.ConfigDict.

model_fields: ClassVar[dict[str, FieldInfo]] = {'digest': FieldInfo(annotation=Union[str, NoneType], required=False), 'encoding': FieldInfo(annotation=Union[str, NoneType], required=False), 'filename': FieldInfo(annotation=str, required=True), 'length': FieldInfo(annotation=int, required=True), 'offset': FieldInfo(annotation=int, required=True), 'timestamp': FieldInfo(annotation=Union[datetime, NoneType], required=False), 'url': FieldInfo(annotation=Union[str, NoneType], required=True)}

Metadata about the fields defined on the model: a mapping of field names to pydantic.fields.FieldInfo.

This replaces Model.__fields__ from Pydantic V1.
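The offset and length fields of DomainRecord can be read as a byte pointer into the file named by filename. The following is a minimal sketch of that idea, under assumptions: RecordPointer is a hypothetical standard-library stand-in that mirrors only three of the documented fields (it is not the pydantic model), range_header is an illustrative helper that is not part of cmoncrawl, and the file name is made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecordPointer:
    """Hypothetical stand-in mirroring DomainRecord's filename, offset and length."""

    filename: str
    offset: int
    length: int


def range_header(pointer: RecordPointer) -> dict:
    """Build an HTTP Range header covering exactly `length` bytes from `offset`."""
    # HTTP byte ranges are inclusive on both ends, hence the "- 1".
    last_byte = pointer.offset + pointer.length - 1
    return {"Range": f"bytes={pointer.offset}-{last_byte}"}


# Made-up values, used only to show the arithmetic.
pointer = RecordPointer(filename="example.warc.gz", offset=1000, length=250)
print(range_header(pointer))
```

A header built this way asks a server for a single record rather than the whole archive file, which is why an index entry only needs to carry the file name, the offset and the length.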

class cmoncrawl.common.types.ExtractConfig(*, extractors_path: Path, routes: List[RoutesConfig])

Configuration for a run.

model_config: ClassVar[ConfigDict] = {}

Configuration for the model; it should be a dictionary conforming to pydantic.config.ConfigDict.

model_fields: ClassVar[dict[str, FieldInfo]] = {'extractors_path': FieldInfo(annotation=Path, required=True), 'routes': FieldInfo(annotation=List[RoutesConfig], required=True)}

Metadata about the fields defined on the model: a mapping of field names to pydantic.fields.FieldInfo.

This replaces Model.__fields__ from Pydantic V1.

class cmoncrawl.common.types.ExtractorConfig(*, name: str, since: datetime | None = None, to: datetime | None = None)

Configuration for an extractor.

model_config: ClassVar[ConfigDict] = {}

Configuration for the model; it should be a dictionary conforming to pydantic.config.ConfigDict.

model_fields: ClassVar[dict[str, FieldInfo]] = {'name': FieldInfo(annotation=str, required=True), 'since': FieldInfo(annotation=Union[datetime, NoneType], required=False), 'to': FieldInfo(annotation=Union[datetime, NoneType], required=False)}

Metadata about the fields defined on the model: a mapping of field names to pydantic.fields.FieldInfo.

This replaces Model.__fields__ from Pydantic V1.

class cmoncrawl.common.types.MatchType(value)

Match type for cdx server. See https://github.com/internetarchive/wayback/blob/master/wayback-cdx-server/README.md#url-match-scope

Example:

Query: example.com/abc

Matches:

EXACT: (www.)?example.com/abc
PREFIX: (www.)?example.com/abc(/.*)?
HOST: (www.)?example.com(/.*)?
DOMAIN: (.*.)?example.com(/.*)?
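To make the four match scopes concrete, the sketch below turns the patterns from the example into Python regular expressions and reports which scopes accept a given candidate. It is an illustration only: scope_patterns and matching_scopes are hypothetical helpers, not part of cmoncrawl or of the CDX server, and the sketch escapes literal dots, which the informal patterns above do not.

```python
import re


def scope_patterns(host: str, path: str) -> dict:
    """Return one compiled regex per match scope for the query host + path."""
    h = re.escape(host)
    p = re.escape(path)
    return {
        "EXACT": re.compile(rf"(www\.)?{h}{p}"),
        "PREFIX": re.compile(rf"(www\.)?{h}{p}(/.*)?"),
        "HOST": re.compile(rf"(www\.)?{h}(/.*)?"),
        "DOMAIN": re.compile(rf"(.*\.)?{h}(/.*)?"),
    }


def matching_scopes(candidate: str, host: str = "example.com", path: str = "/abc") -> list:
    """List the scopes whose pattern matches the whole candidate string."""
    patterns = scope_patterns(host, path)
    return [name for name, pattern in patterns.items() if pattern.fullmatch(candidate)]


for candidate in (
    "www.example.com/abc",
    "example.com/abc/def",
    "example.com/xyz",
    "blog.example.com/xyz",
):
    print(candidate, matching_scopes(candidate))
```

Each scope in the sketch is strictly wider than the one before it: anything accepted by EXACT is also accepted by PREFIX, HOST and DOMAIN, while a subdomain such as blog.example.com is accepted by DOMAIN only.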

class cmoncrawl.common.types.PipeMetadata(domain_record: ~cmoncrawl.common.types.DomainRecord, article_data: ~typing.Dict[~typing.Any, ~typing.Any] = <factory>, warc_header: ~typing.Dict[str, ~typing.Any] = <factory>, http_header: ~typing.Dict[str, ~typing.Any] = <factory>, rec_type: str | None = None, encoding: str = 'latin-1', name: str | None = None)

Metadata for a pipe.

Attributes:

domain_record: DomainRecord

An instance of the DomainRecord class representing the associated domain record, e.g. a pointer to the WARC file.

article_data: Dict[Any, Any] = field(default_factory=dict)

A dictionary of article data with keys and values of any type. It holds the data extracted by the extractors.

warc_header: Dict[str, Any] = field(default_factory=dict)

A dictionary storing the WARC header metadata.

http_header: Dict[str, Any] = field(default_factory=dict)

A dictionary storing the HTTP header information.

rec_type: str | None = None

A string or None representing the type of record.

encoding: str = 'latin-1'

A string representing the character encoding used for the record. The default value is 'latin-1'.

name: str | None = None

A string or None representing the name associated with the record.
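The three dictionary attributes are declared with field(default_factory=dict), so each instance starts with its own empty dictionaries. The sketch below shows that behavior with a standard-library stand-in. PipeMetadataSketch is hypothetical: it copies the documented attribute names and defaults but is not the cmoncrawl class, and domain_record is typed as Any so the example stays self-contained.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class PipeMetadataSketch:
    """Hypothetical stand-in that copies the documented attributes and defaults."""

    domain_record: Any
    article_data: Dict[Any, Any] = field(default_factory=dict)
    warc_header: Dict[str, Any] = field(default_factory=dict)
    http_header: Dict[str, Any] = field(default_factory=dict)
    rec_type: Optional[str] = None
    encoding: str = "latin-1"
    name: Optional[str] = None


first = PipeMetadataSketch(domain_record=None)
second = PipeMetadataSketch(domain_record=None)

# default_factory=dict gives every instance a fresh dictionary,
# so writing to one instance does not leak into another.
first.article_data["title"] = "Hello"

print(first.article_data)
print(second.article_data)
print(first.encoding)
```

Using a factory instead of a shared literal default is what keeps data extracted for one record from appearing in the metadata of another.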

class cmoncrawl.common.types.RetrieveResponse(content: Any)

Response from retrieve.

class cmoncrawl.common.types.RoutesConfig(*, regexes: List[str] = [], extractors: List[ExtractorConfig] = [])

Configuration for extractors.

model_config: ClassVar[ConfigDict] = {}

Configuration for the model; it should be a dictionary conforming to pydantic.config.ConfigDict.

model_fields: ClassVar[dict[str, FieldInfo]] = {'extractors': FieldInfo(annotation=List[ExtractorConfig], required=False, default=[]), 'regexes': FieldInfo(annotation=List[str], required=False, default=[])}

Metadata about the fields defined on the model: a mapping of field names to pydantic.fields.FieldInfo.

This replaces Model.__fields__ from Pydantic V1.
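The configuration classes nest as follows: ExtractConfig holds routes, each RoutesConfig holds regexes and extractors, and each ExtractorConfig holds a name with an optional since/to window. The sketch below mirrors that shape with plain dictionaries and shows one plausible way such a configuration could be evaluated. Everything beyond the field names is an assumption: the values are made up, select_extractors and in_window are hypothetical helpers, and the half-open [since, to) window and the use of re.match are guesses for illustration, not cmoncrawl's documented behavior.

```python
import re
from datetime import datetime
from typing import List, Optional

# Plain-dict mirror of the documented fields; all values are made up.
config = {
    "extractors_path": "./extractors",
    "routes": [
        {
            "regexes": [r".*example\.com/news/.*"],
            "extractors": [
                {"name": "news_v1", "since": None, "to": datetime(2020, 1, 1)},
                {"name": "news_v2", "since": datetime(2020, 1, 1), "to": None},
            ],
        }
    ],
}


def in_window(ts: datetime, since: Optional[datetime], to: Optional[datetime]) -> bool:
    # Assumption: half-open window [since, to); None means unbounded on that side.
    return (since is None or since <= ts) and (to is None or ts < to)


def select_extractors(url: str, ts: datetime) -> List[str]:
    """Return names of extractors whose route matches url and whose window covers ts."""
    names: List[str] = []
    for route in config["routes"]:
        if any(re.match(rx, url) for rx in route["regexes"]):
            names += [
                e["name"]
                for e in route["extractors"]
                if in_window(ts, e["since"], e["to"])
            ]
    return names


print(select_extractors("https://example.com/news/a", datetime(2019, 6, 1)))
print(select_extractors("https://example.com/news/a", datetime(2021, 6, 1)))
```

Under these assumptions, a since/to pair lets two extractors cover the same URLs for different crawl dates, for example before and after a site redesign.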