cmoncrawl.processor.pipeline.extractor

Classes

class cmoncrawl.processor.pipeline.extractor.BaseExtractor(encoding: str | None = None, raise_on_encoding: bool = False, parser: str = 'html.parser')

Base class for all soup extractors

Parameters:
  • encoding (str, optional) – Default encoding to be used. Defaults to None.

  • raise_on_encoding (bool, optional) – If True, the extractor will raise ValueException if it fails to decode the response. Defaults to False.

  • parser (str, optional) – Name of the parser used to build the soup. Defaults to 'html.parser'.
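
The interaction between encoding and raise_on_encoding can be sketched with the standard library alone. The helper below is hypothetical: decode_response is not part of cmoncrawl, raw bytes are an assumed input, the UTF-8 fallback is an assumption, and ValueError is used as a stand-in for the exception named above. It only illustrates the choice between raising and returning None.

```python
from typing import Optional


def decode_response(
    raw: bytes,
    encoding: Optional[str] = None,
    raise_on_encoding: bool = False,
) -> Optional[str]:
    """Hypothetical helper: decode raw bytes, honoring a raise-on-failure flag."""
    # Assumption: fall back to UTF-8 when no default encoding is given.
    codec = encoding or "utf-8"
    try:
        return raw.decode(codec)
    except (UnicodeDecodeError, LookupError) as exc:
        if raise_on_encoding:
            # Stand-in for the exception the extractor is documented to raise.
            raise ValueError(f"Failed to decode response with {codec!r}") from exc
        # Otherwise signal failure the same way extract() does: with None.
        return None
```

With raise_on_encoding left at False, a bad byte sequence or an unknown codec name produces None instead of an exception, so a pipeline can skip the record and continue.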

extract(response: str, metadata: PipeMetadata) → Dict[str, Any] | None

Extracts the data from the response. If the extractor fails to extract the data, it should return None.

Parameters:
  • response (str) – response from the downloader

  • metadata (PipeMetadata) – Metadata of the response

class cmoncrawl.processor.pipeline.extractor.DomainRecordExtractor(filter_non_ok: bool = True)

Dummy extractor that simply extracts the domain record.

Parameters:

filter_non_ok (bool, optional) – If True, only 200 status codes will be extracted. Defaults to True.

class cmoncrawl.processor.pipeline.extractor.HTMLExtractor(filter_non_ok: bool = True, encoding: str | None = None)

Dummy extractor that simply extracts the HTML.

Parameters:
  • filter_non_ok (bool, optional) – If True, only 200 status codes will be extracted. Defaults to True.

  • encoding (str, optional) – Default encoding to be used. Defaults to None. If set, the extractor will raise ValueException if it fails to decode the response.
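
A minimal sketch of the filter_non_ok behavior described above, written against stand-in types so that it runs without cmoncrawl. FakeStatusMetadata, its http_status field, and the "html" output key are all assumptions made for illustration; the real PipeMetadata and output layout may differ.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class FakeStatusMetadata:
    """Stand-in for PipeMetadata; the http_status field is an assumption."""

    http_status: int


def extract_html(
    response: str,
    metadata: FakeStatusMetadata,
    filter_non_ok: bool = True,
) -> Optional[Dict[str, Any]]:
    """Hypothetical function mimicking a pass-through HTML extractor."""
    # With filtering on, anything other than HTTP 200 yields no extraction.
    if filter_non_ok and metadata.http_status != 200:
        return None
    # Assumed output shape: the raw HTML under a single key.
    return {"html": response}
```

Passing filter_non_ok=False keeps error pages such as 404 responses, which can be useful when the error pages themselves are the object of study.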

class cmoncrawl.processor.pipeline.extractor.IExtractor

Base class for all extractors

abstract extract(response: str, metadata: PipeMetadata) → Dict[str, Any] | None

Extracts the data from the response. If the extractor fails to extract the data, it should return None.

Parameters:
  • response (str) – response from the downloader

  • metadata (PipeMetadata) – Metadata of the response
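
To illustrate the extract contract, here is a sketch of a custom extractor. So that it runs standalone, ExtractorInterface and FakeMetadata are local stand-ins for IExtractor and PipeMetadata (the url field is an assumption), and TitleExtractor is a hypothetical example class. In real code one would subclass the library's IExtractor instead.

```python
import re
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class FakeMetadata:
    """Stand-in for PipeMetadata; the url field is an assumption."""

    url: str


class ExtractorInterface(ABC):
    """Local stand-in mirroring the documented IExtractor.extract signature."""

    @abstractmethod
    def extract(
        self, response: str, metadata: FakeMetadata
    ) -> Optional[Dict[str, Any]]:
        ...


class TitleExtractor(ExtractorInterface):
    """Hypothetical extractor that pulls the <title> text out of a page."""

    _TITLE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)

    def extract(
        self, response: str, metadata: FakeMetadata
    ) -> Optional[Dict[str, Any]]:
        match = self._TITLE.search(response)
        if match is None:
            # Per the contract, a failed extraction returns None.
            return None
        return {"title": match.group(1).strip(), "url": metadata.url}
```

Returning None rather than raising keeps a single bad page from aborting a batch; callers only need to check for None before using the result.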

class cmoncrawl.processor.pipeline.extractor.PageExtractor(header_css_dict: Dict[str, str] = {}, header_extract_dict: Dict[str, Callable[[Any], Any] | List[Callable[[Any], Any]]] = {}, content_css_selector: str = 'body', content_css_dict: Dict[str, str] = {}, content_extract_dict: Dict[str, Callable[[Any], Any] | List[Callable[[Any], Any]]] = {}, css_selectors_must_exist: List[str] = [], css_selectors_must_not_exist: List[str] = [], allowed_domain_prefixes: List[str] | None = None, is_valid_extraction: Callable[[Dict[Any, Any], PipeMetadata], bool] | None = None, encoding: str | None = None)

The PageExtractor is designed to extract specific elements from a web page, with the ability to control when the data is extracted.

Parameters:
  • header_css_dict (Dict[str, str]) – A dictionary specifying the CSS selectors for the header elements.

  • header_extract_dict (Dict[str, List[Callable[[Any], Any]] | Callable[[Any], Any]]) – A dictionary specifying the extraction functions for the header elements. The keys must match the keys in the header_css_dict. The functions are applied in the order they are specified in the list.

  • content_css_selector (str) – The CSS selector specifying where the content elements are located.

  • content_css_dict (Dict[str, str]) – A dictionary specifying the CSS selectors for the content elements. Selectors must be relative to the content_css_selector.

  • content_extract_dict (Dict[str, List[Callable[[Any], Any]] | Callable[[Any], Any]]) – A dictionary specifying the extraction functions for the content elements. The keys must match the keys in the content_css_dict. The functions are applied in the order they are specified in the list.

  • css_selectors_must_exist (List[str]) – A list of CSS selectors that must exist for the extraction to proceed.

  • css_selectors_must_not_exist (List[str]) – A list of CSS selectors that must not exist for the extraction to proceed.

  • allowed_domain_prefixes (List[str] | None) – A list of allowed domain prefixes. If None, all domain prefixes are allowed.

  • is_valid_extraction (Callable[[Dict[Any, Any], PipeMetadata], bool]) – A function that takes in the extracted data and the metadata and returns True if the extraction is valid, False otherwise.

  • encoding (str | None) – The encoding to be used. If None, the default encoding is used.

Returns:

A dictionary containing the extracted data, or None if the extraction failed.

Return type:

Dict[Any, Any] | None
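
The extraction dictionaries accept either a single callable or a list of callables applied in order. The sketch below shows one way such a chain could be applied. apply_transforms is a hypothetical helper, not the library's implementation, and the sample functions operate on plain strings rather than on parsed HTML elements.

```python
from typing import Any, Callable, Dict, List, Union

Transform = Callable[[Any], Any]


def apply_transforms(
    value: Any, transforms: Union[Transform, List[Transform]]
) -> Any:
    """Hypothetical helper: run one callable, or a list of callables in order."""
    if callable(transforms):
        transforms = [transforms]
    for fn in transforms:
        value = fn(value)
    return value


# Keys here would match the keys of the corresponding CSS dictionary.
extract_dict: Dict[str, Union[Transform, List[Transform]]] = {
    "title": [str.strip, str.lower],  # strip first, then lowercase
    "word_count": lambda text: len(text.split()),
}

# Illustrative raw values standing in for what the CSS selectors matched.
raw_values = {"title": "  Hello World ", "word_count": "one two three"}
result = {
    key: apply_transforms(raw_values[key], fns)
    for key, fns in extract_dict.items()
}
# result == {"title": "hello world", "word_count": 3}
```

Order matters when the steps do not commute: a function that expects cleaned text should come after the one that cleans it.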

extract(response: str, metadata: PipeMetadata) → Dict[Any, Any] | None

Extracts the data from the response. If the extractor fails to extract the data, it should return None.

Parameters:
  • response (str) – response from the downloader

  • metadata (PipeMetadata) – Metadata of the response
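
Because extract returns None on failure, a validator passed as is_valid_extraction is a natural place to reject incomplete results. The sketch below assumes a stand-in metadata type (FakeMetadata with a url field) and an illustrative rule that the title must be non-empty. finalize is a hypothetical helper showing how a validator might gate the result; it is not the library's code.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional


@dataclass
class FakeMetadata:
    """Stand-in for PipeMetadata; the url field is an assumption."""

    url: str


Validator = Callable[[Dict[Any, Any], FakeMetadata], bool]


def has_title(extracted: Dict[Any, Any], metadata: FakeMetadata) -> bool:
    """Example validator with the documented (data, metadata) -> bool shape."""
    title = extracted.get("title")
    return isinstance(title, str) and bool(title.strip())


def finalize(
    extracted: Dict[Any, Any],
    metadata: FakeMetadata,
    is_valid_extraction: Optional[Validator] = None,
) -> Optional[Dict[Any, Any]]:
    """Hypothetical final step: drop results the validator rejects."""
    if is_valid_extraction is not None and not is_valid_extraction(
        extracted, metadata
    ):
        return None
    return extracted
```

Leaving is_valid_extraction as None in this sketch accepts every result, mirroring the constructor's default.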