cmoncrawl.common.types
Classes
- class cmoncrawl.common.types.DomainCrawl(url: str = '', cdx_server: str = '', page: int = 0)
Domain crawl.
- class cmoncrawl.common.types.DomainRecord(*, filename: str, url: str | None, offset: int, length: int, digest: str | None = None, encoding: str | None = None, timestamp: datetime | None = None)
Domain record.
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- model_fields: ClassVar[dict[str, FieldInfo]] = {'digest': FieldInfo(annotation=Union[str, NoneType], required=False), 'encoding': FieldInfo(annotation=Union[str, NoneType], required=False), 'filename': FieldInfo(annotation=str, required=True), 'length': FieldInfo(annotation=int, required=True), 'offset': FieldInfo(annotation=int, required=True), 'timestamp': FieldInfo(annotation=Union[datetime, NoneType], required=False), 'url': FieldInfo(annotation=Union[str, NoneType], required=True)}
Metadata about the fields defined on the model: a mapping of field names to pydantic.fields.FieldInfo.
This replaces Model.__fields__ from Pydantic V1.
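The filename, offset, and length fields together locate one record inside a larger archive file. The following is a minimal sketch of how such a pointer could be turned into an HTTP byte-range request header. `RecordPointer` and `byte_range_header` are hypothetical names used only for illustration; they are not part of cmoncrawl, and the library may fetch records differently.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class RecordPointer:
    """Hypothetical stand-in mirroring the required DomainRecord fields."""
    filename: str
    url: Optional[str]
    offset: int
    length: int


def byte_range_header(record: RecordPointer) -> Dict[str, str]:
    """Build an HTTP Range header covering exactly `length` bytes.

    HTTP byte ranges are inclusive on both ends, so the last byte
    is offset + length - 1.
    """
    last_byte = record.offset + record.length - 1
    return {"Range": f"bytes={record.offset}-{last_byte}"}


# Assumed example values, not taken from a real crawl.
pointer = RecordPointer(
    filename="example.warc.gz",
    url="https://example.com/abc",
    offset=1000,
    length=500,
)
headers = byte_range_header(pointer)
```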
- class cmoncrawl.common.types.ExtractConfig(*, extractors_path: Path, routes: List[RoutesConfig])
Configuration for a run.
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- model_fields: ClassVar[dict[str, FieldInfo]] = {'extractors_path': FieldInfo(annotation=Path, required=True), 'routes': FieldInfo(annotation=List[RoutesConfig], required=True)}
Metadata about the fields defined on the model: a mapping of field names to pydantic.fields.FieldInfo.
This replaces Model.__fields__ from Pydantic V1.
- class cmoncrawl.common.types.ExtractorConfig(*, name: str, since: datetime | None = None, to: datetime | None = None)
Configuration for an extractor.
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- model_fields: ClassVar[dict[str, FieldInfo]] = {'name': FieldInfo(annotation=str, required=True), 'since': FieldInfo(annotation=Union[datetime, NoneType], required=False), 'to': FieldInfo(annotation=Union[datetime, NoneType], required=False)}
Metadata about the fields defined on the model: a mapping of field names to pydantic.fields.FieldInfo.
This replaces Model.__fields__ from Pydantic V1.
- class cmoncrawl.common.types.MatchType(value)
Match type for the CDX server. See https://github.com/internetarchive/wayback/blob/master/wayback-cdx-server/README.md#url-match-scope
Example query: example.com/abc
Matches:
EXACT: (www.)?example.com/abc
PREFIX: (www.)?example.com/abc(/.*)?
HOST: (www.)?example.com(/.*)?
DOMAIN: (.*.)?example.com(/.*)?
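The four match scopes can be illustrated with ordinary regular expressions. This is a sketch only: `MatchScope`, `scope_pattern`, and `matches` are hypothetical helpers, not cmoncrawl's API, and the actual matching is performed by the CDX server. The sketch also assumes the dots in the documented patterns are meant as literal dots, so they are escaped here.

```python
import re
from enum import Enum


class MatchScope(Enum):
    """Hypothetical stand-in; member names follow the documented scopes."""
    EXACT = "exact"
    PREFIX = "prefix"
    HOST = "host"
    DOMAIN = "domain"


def scope_pattern(query: str, scope: MatchScope) -> "re.Pattern[str]":
    """Translate a query such as 'example.com/abc' into a regex for a scope."""
    host = query.partition("/")[0]      # part before the first slash
    host_re = re.escape(host)
    full_re = re.escape(query)
    if scope is MatchScope.EXACT:
        body = rf"(www\.)?{full_re}"
    elif scope is MatchScope.PREFIX:
        body = rf"(www\.)?{full_re}(/.*)?"
    elif scope is MatchScope.HOST:
        body = rf"(www\.)?{host_re}(/.*)?"
    else:  # MatchScope.DOMAIN: any subdomain is accepted
        body = rf"(.*\.)?{host_re}(/.*)?"
    return re.compile(body)


def matches(url: str, query: str, scope: MatchScope) -> bool:
    """Return True if the whole URL (without scheme) falls in the scope."""
    return scope_pattern(query, scope).fullmatch(url) is not None


query = "example.com/abc"
exact_hit = matches("www.example.com/abc", query, MatchScope.EXACT)
prefix_hit = matches("example.com/abc/def", query, MatchScope.PREFIX)
host_hit = matches("example.com/other", query, MatchScope.HOST)
domain_hit = matches("blog.example.com/x", query, MatchScope.DOMAIN)
```

Each scope is strictly wider than the previous one: a URL on a subdomain such as blog.example.com falls only under DOMAIN, while a different path on the same host falls under HOST and DOMAIN but not under EXACT or PREFIX.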
- class cmoncrawl.common.types.PipeMetadata(domain_record: ~cmoncrawl.common.types.DomainRecord, article_data: ~typing.Dict[~typing.Any, ~typing.Any] = <factory>, warc_header: ~typing.Dict[str, ~typing.Any] = <factory>, http_header: ~typing.Dict[str, ~typing.Any] = <factory>, rec_type: str | None = None, encoding: str = 'latin-1', name: str | None = None)
Metadata for a pipe.
Attributes:
- domain_record: DomainRecord
An instance of the DomainRecord class representing the associated domain record, e.g. a pointer to the WARC file.
- article_data: Dict[Any, Any] = field(default_factory=dict)
A dictionary storing article data with keys and values of any type. This is the data extracted by the extractors.
- warc_header: Dict[str, Any] = field(default_factory=dict)
A dictionary storing the WARC header metadata.
- http_header: Dict[str, Any] = field(default_factory=dict)
A dictionary storing the HTTP header information.
- rec_type: str | None = None
A string or None representing the type of record.
- encoding: str = "latin-1"
A string representing the character encoding used for the record. The default value is "latin-1".
- name: str | None = None
A string or None representing the name associated with the record.
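The attribute list above uses field(default_factory=dict) for the three dictionaries, which gives every instance its own empty dict rather than one shared default. The sketch below reproduces that layout with standard-library dataclasses. `PipeMetadataSketch` and `RecordPointer` are hypothetical stand-ins that mirror the documented signature; they are not the cmoncrawl classes themselves.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class RecordPointer:
    """Hypothetical stand-in for DomainRecord (required fields only)."""
    filename: str
    url: Optional[str]
    offset: int
    length: int


@dataclass
class PipeMetadataSketch:
    """Mirrors the documented PipeMetadata fields and defaults."""
    domain_record: RecordPointer
    article_data: Dict[Any, Any] = field(default_factory=dict)
    warc_header: Dict[str, Any] = field(default_factory=dict)
    http_header: Dict[str, Any] = field(default_factory=dict)
    rec_type: Optional[str] = None
    encoding: str = "latin-1"
    name: Optional[str] = None


# Assumed example values for illustration.
pointer = RecordPointer("example.warc.gz", "https://example.com/abc", 0, 100)
first = PipeMetadataSketch(domain_record=pointer)
second = PipeMetadataSketch(domain_record=pointer)

# Writing to one instance's dict leaves the other instance untouched.
first.article_data["title"] = "Hello"
```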
- class cmoncrawl.common.types.RetrieveResponse(content: Any)
Response from retrieve.
- class cmoncrawl.common.types.RoutesConfig(*, regexes: List[str] = [], extractors: List[ExtractorConfig] = [])
Configuration for extractors.
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- model_fields: ClassVar[dict[str, FieldInfo]] = {'extractors': FieldInfo(annotation=List[ExtractorConfig], required=False, default=[]), 'regexes': FieldInfo(annotation=List[str], required=False, default=[])}
Metadata about the fields defined on the model: a mapping of field names to pydantic.fields.FieldInfo.
This replaces Model.__fields__ from Pydantic V1.
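ExtractConfig, RoutesConfig, and ExtractorConfig nest inside each other: a run configuration holds routes, and each route pairs URL regexes with extractors. The sketch below shows that shape as a plain dictionary using the documented field names. The values, the `pick_extractors` helper, and the selection rules (regex match on the URL, `since` inclusive, `to` exclusive) are assumptions made for illustration and may differ from how cmoncrawl actually routes records.

```python
import re
from datetime import datetime
from typing import Any, Dict, List

# Field names follow the documented models; all values are made up.
config: Dict[str, Any] = {
    "extractors_path": "./extractors",
    "routes": [
        {
            "regexes": [r".*example\.com/news/.*"],
            "extractors": [
                {"name": "news_v1", "since": None, "to": "2020-01-01T00:00:00"},
                {"name": "news_v2", "since": "2020-01-01T00:00:00", "to": None},
            ],
        }
    ],
}


def pick_extractors(cfg: Dict[str, Any], url: str, when: datetime) -> List[str]:
    """Hypothetical helper: names of extractors whose route and dates fit."""
    names: List[str] = []
    for route in cfg["routes"]:
        # Assumption: a route applies if any of its regexes matches the URL.
        if not any(re.match(pattern, url) for pattern in route["regexes"]):
            continue
        for extractor in route["extractors"]:
            since, to = extractor.get("since"), extractor.get("to")
            # Assumption: `since` is inclusive and `to` is exclusive.
            if since is not None and when < datetime.fromisoformat(since):
                continue
            if to is not None and when >= datetime.fromisoformat(to):
                continue
            names.append(extractor["name"])
    return names


old_page = pick_extractors(config, "https://example.com/news/a", datetime(2019, 5, 1))
new_page = pick_extractors(config, "https://example.com/news/a", datetime(2021, 5, 1))
other_page = pick_extractors(config, "https://example.com/about", datetime(2021, 5, 1))
```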