cmoncrawl.processor.pipeline.extractor

Classes

class cmoncrawl.processor.pipeline.extractor.BaseExtractor(encoding: str | None = None, raise_on_encoding: bool = False, parser: str = 'html.parser')

Base class for all soup extractors

Parameters:

encoding (str, optional) – Default encoding to be used. Defaults to None.
raise_on_encoding (bool, optional) – If True, the extractor will raise ValueException if it fails to decode the response. Defaults to False.
parser (str, optional) – Name of the parser used to build the soup. Defaults to 'html.parser'.

extract(response: str, metadata: PipeMetadata) → Dict[str, Any] | None

Extracts the data from the response. If the extractor fails to extract the data, it should return None.

Parameters:

response (str) – response from the downloader
metadata (PipeMetadata) – Metadata of the response
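
The effect of a switch like raise_on_encoding can be pictured with a small standard-library sketch. The helper name decode_or_none is hypothetical, and working on raw bytes with a UTF-8 fallback is an assumption made for illustration; this is not cmoncrawl's internal code. The built-in UnicodeDecodeError stands in for the exception the parameter description refers to.

```python
from typing import Optional


def decode_or_none(raw: bytes, encoding: Optional[str] = None,
                   raise_on_encoding: bool = False) -> Optional[str]:
    """Hypothetical helper: decode raw bytes with a default encoding."""
    # Assumption: fall back to UTF-8 when no default encoding is given.
    codec = encoding or "utf-8"
    try:
        return raw.decode(codec)
    except UnicodeDecodeError:
        if raise_on_encoding:
            # Strict mode: surface the decoding failure to the caller.
            raise
        # Lenient mode: signal failure with None, as extract() does.
        return None
```

In lenient mode a bad payload is simply skipped, which suits bulk processing; strict mode is more useful when a decoding failure should stop the run and be investigated.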

class cmoncrawl.processor.pipeline.extractor.DomainRecordExtractor(filter_non_ok: bool = True)

Dummy Extractor which simply extracts the domain record

Parameters:

filter_non_ok (bool, optional) – If True, only 200 status codes will be extracted. Defaults to True.

class cmoncrawl.processor.pipeline.extractor.HTMLExtractor(filter_non_ok: bool = True, encoding: str | None = None)

Dummy Extractor which simply extracts the html

Parameters:

filter_non_ok (bool, optional) – If True, only 200 status codes will be extracted. Defaults to True.
encoding (str, optional) – Default encoding to be used. Defaults to None. When set, the extractor raises ValueException if it fails to decode the response.
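
The filter_non_ok behaviour described above can be sketched without the library. FakeMetadata is a hypothetical stand-in for PipeMetadata that models only a status code, and both the http_status attribute name and the "html" output key are assumptions chosen for illustration.

```python
from typing import Any, Dict, Optional


class FakeMetadata:
    """Hypothetical stand-in for PipeMetadata; only the status is modelled."""

    def __init__(self, http_status: int) -> None:
        self.http_status = http_status


def extract_html(response: str, metadata: FakeMetadata,
                 filter_non_ok: bool = True) -> Optional[Dict[str, Any]]:
    """Illustrative extractor function, not the library's implementation."""
    # Skip anything that is not a 200 response when filtering is enabled.
    if filter_non_ok and metadata.http_status != 200:
        return None
    # The "html" key is illustrative, not the library's documented output.
    return {"html": response}
```

With the default setting, a 404 response yields None; passing filter_non_ok=False lets the same response through unchanged.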

class cmoncrawl.processor.pipeline.extractor.IExtractor

Base class for all extractors

abstract extract(response: str, metadata: PipeMetadata) → Dict[str, Any] | None

Extracts the data from the response. If the extractor fails to extract the data, it should return None.

Parameters:

response (str) – response from the downloader
metadata (PipeMetadata) – Metadata of the response
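
The extract contract (return a dictionary on success, None on failure) can be illustrated with a self-contained sketch. ExtractorSketch is a hypothetical stand-in that mirrors the documented signature so the example runs without cmoncrawl installed; TitleExtractor and its regular expression are likewise illustrative only.

```python
import re
from abc import ABC, abstractmethod
from typing import Any, Dict, Optional


class ExtractorSketch(ABC):
    """Hypothetical stand-in mirroring the documented extract() contract."""

    @abstractmethod
    def extract(self, response: str, metadata: Any) -> Optional[Dict[str, Any]]:
        ...


class TitleExtractor(ExtractorSketch):
    """Illustrative extractor: returns the <title> text, or None if absent."""

    _TITLE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)

    def extract(self, response: str, metadata: Any) -> Optional[Dict[str, Any]]:
        match = self._TITLE.search(response)
        if match is None:
            # Nothing to extract: signal failure with None.
            return None
        return {"title": match.group(1).strip()}
```

Returning None rather than raising lets a caller skip pages that do not match without treating them as errors.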

class cmoncrawl.processor.pipeline.extractor.PageExtractor(header_css_dict: Dict[str, str] = {}, header_extract_dict: Dict[str, Callable[[Any], Any] | List[Callable[[Any], Any]]] = {}, content_css_selector: str = 'body', content_css_dict: Dict[str, str] = {}, content_extract_dict: Dict[str, Callable[[Any], Any] | List[Callable[[Any], Any]]] = {}, css_selectors_must_exist: List[str] = [], css_selectors_must_not_exist: List[str] = [], allowed_domain_prefixes: List[str] | None = None, is_valid_extraction: Callable[[Dict[Any, Any], PipeMetadata], bool] | None = None, encoding: str | None = None)

The PageExtractor is designed to extract specific elements from a web page, while adding the ability to choose when to extract the data.

Parameters:

header_css_dict (Dict[str, str]) – A dictionary specifying the CSS selectors for the header elements.
header_extract_dict (Dict[str, List[Callable[[Any], Any]] | Callable[[Any], Any]]) – A dictionary specifying the extraction functions for the header elements. The keys must match the keys in the header_css_dict. The functions are applied in the order they are specified in the list.
content_css_selector (str) – The CSS selector specifying where the content elements are located.
content_css_dict (Dict[str, str]) – A dictionary specifying the CSS selectors for the content elements. Selectors must be relative to the content_css_selector.
content_extract_dict (Dict[str, List[Callable[[Any], Any]] | Callable[[Any], Any]]) – A dictionary specifying the extraction functions for the content elements. The keys must match the keys in the content_css_dict. The functions are applied in the order they are specified in the list.
css_selectors_must_exist (List[str]) – A list of CSS selectors that must exist for the extraction to proceed.
css_selectors_must_not_exist (List[str]) – A list of CSS selectors that must not exist for the extraction to proceed.
allowed_domain_prefixes (List[str] | None) – A list of allowed domain prefixes. If None, all domain prefixes are allowed.
is_valid_extraction (Callable[[Dict[Any, Any], PipeMetadata], bool]) – A function that takes in the extracted data and the metadata and returns True if the extraction is valid, False otherwise.
encoding (str | None) – The encoding to be used. If None, the default encoding is used.

Returns:

A dictionary containing the extracted data, or None if the extraction failed.

Return type:

Dict[Any, Any] | None
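
The rule that extraction functions "are applied in the order they are specified in the list" can be sketched as follows. The helper apply_extractors is hypothetical, plain strings stand in for the elements that CSS selection would return, and passing a value through unchanged when its key has no functions is an assumption of this sketch rather than documented behaviour.

```python
from typing import Any, Callable, Dict, List, Union

# A single callable or an ordered list of callables, as in the signature above.
Extractor = Union[Callable[[Any], Any], List[Callable[[Any], Any]]]


def apply_extractors(values: Dict[str, Any],
                     extract_dict: Dict[str, Extractor]) -> Dict[str, Any]:
    """Hypothetical helper: run each key's function(s) over its selected value."""
    result: Dict[str, Any] = {}
    for key, value in values.items():
        funcs = extract_dict.get(key, [])
        if callable(funcs):
            # Normalise a single function to a one-element list.
            funcs = [funcs]
        for func in funcs:
            # Each function receives the previous function's output.
            value = func(value)
        result[key] = value
    return result


# Stand-ins for values selected via header_css_dict or content_css_dict.
selected = {"title": "  Hello World  ", "views": "1,234"}
extract_dict = {
    "title": [str.strip, str.lower],
    "views": lambda text: int(text.replace(",", "")),
}
cleaned = apply_extractors(selected, extract_dict)
```

Here the title is stripped and then lower-cased, while the view count passes through a single conversion function.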

extract(response: str, metadata: PipeMetadata) → Dict[Any, Any] | None

Extracts the data from the response. If the extractor fails to extract the data, it should return None.

Parameters:

response (str) – response from the downloader
metadata (PipeMetadata) – Metadata of the response