cmoncrawl.processor.pipeline.extractor
Classes
- class cmoncrawl.processor.pipeline.extractor.BaseExtractor(encoding: str | None = None, raise_on_encoding: bool = False, parser: str = 'html.parser')
Base class for all soup extractors
- Parameters:
encoding (str, optional) – Default encoding to be used. Defaults to None.
raise_on_encoding (bool, optional) – If True, the extractor raises ValueError when it fails to decode the response. Defaults to False.
parser (str, optional) – Name of the parser used to build the soup. Defaults to 'html.parser'.
- extract(response: str, metadata: PipeMetadata) → Dict[str, Any] | None
Extracts the data from the response. If extraction fails, the method should return None.
- Parameters:
response (str) – response from the downloader
metadata (PipeMetadata) – Metadata of the response
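One plausible reading of the `encoding` and `raise_on_encoding` parameters is sketched below. It is an illustration only, not the library's implementation: `decode_sketch` is a hypothetical helper, and both the UTF-8 fallback and the choice to return None in lenient mode are assumptions.

```python
from typing import Optional


def decode_sketch(raw: bytes, encoding: Optional[str] = None,
                  raise_on_encoding: bool = False) -> Optional[str]:
    """Hypothetical helper: decode raw bytes with a default encoding."""
    try:
        # Assumption: fall back to UTF-8 when no default encoding is given.
        return raw.decode(encoding or "utf-8")
    except (UnicodeDecodeError, LookupError) as exc:
        if raise_on_encoding:
            # Strict mode: surface the decoding failure to the caller.
            raise ValueError(f"failed to decode response: {exc}") from exc
        # Lenient mode (assumed): treat the response as not extractable.
        return None
```

With `raise_on_encoding=False` a bad payload is skipped quietly, which suits bulk processing; the strict mode is more useful when debugging a single record.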
- class cmoncrawl.processor.pipeline.extractor.DomainRecordExtractor(filter_non_ok: bool = True)
Dummy Extractor which simply extracts the domain record
- Parameters:
filter_non_ok (bool, optional) – If True, only 200 status codes will be extracted. Defaults to True.
- class cmoncrawl.processor.pipeline.extractor.HTMLExtractor(filter_non_ok: bool = True, encoding: str | None = None)
Dummy Extractor which simply extracts the html
- Parameters:
filter_non_ok (bool, optional) – If True, only 200 status codes will be extracted. Defaults to True.
encoding (str, optional) – Default encoding to be used. Defaults to None. If set, the extractor raises ValueError when it fails to decode the response.
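The effect of `filter_non_ok`, shared by DomainRecordExtractor and HTMLExtractor, can be sketched as a simple status check. The function below is a hypothetical stand-in, not library code: the explicit `status` argument and the `"html"` result key are assumptions made so the example runs on its own.

```python
from typing import Any, Dict, Optional


def html_extract_sketch(response: str, status: int,
                        filter_non_ok: bool = True) -> Optional[Dict[str, Any]]:
    """Hypothetical stand-in: pass the HTML through unless the status is filtered."""
    if filter_non_ok and status != 200:
        # Only 200 responses are extracted when filtering is enabled.
        return None
    # Assumed output shape; the real extractor's keys may differ.
    return {"html": response}
```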
- class cmoncrawl.processor.pipeline.extractor.IExtractor
Base class for all extractors
- abstract extract(response: str, metadata: PipeMetadata) → Dict[str, Any] | None
Extracts the data from the response. If extraction fails, the method should return None.
- Parameters:
response (str) – response from the downloader
metadata (PipeMetadata) – Metadata of the response
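The contract above (return a dictionary on success, None on failure) can be sketched with a small subclass. The snippet is a minimal illustration that runs without the library installed: `PipeMetadataStub` and `ExtractorSketch` are hypothetical stand-ins defined locally, and a real extractor would subclass IExtractor and receive a real PipeMetadata.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Optional


class PipeMetadataStub:
    """Hypothetical stand-in for PipeMetadata; the real class has other fields."""

    def __init__(self, url: str) -> None:
        self.url = url


class ExtractorSketch(ABC):
    """Hypothetical stand-in mirroring the documented extract signature."""

    @abstractmethod
    def extract(self, response: str,
                metadata: PipeMetadataStub) -> Optional[Dict[str, Any]]:
        ...


class LengthExtractor(ExtractorSketch):
    """Returns the response length, or None when there is nothing to extract."""

    def extract(self, response: str,
                metadata: PipeMetadataStub) -> Optional[Dict[str, Any]]:
        if not response.strip():
            return None  # failure is signalled with None, not an exception
        return {"url": metadata.url, "length": len(response)}
```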
- class cmoncrawl.processor.pipeline.extractor.PageExtractor(header_css_dict: Dict[str, str] = {}, header_extract_dict: Dict[str, Callable[[Any], Any] | List[Callable[[Any], Any]]] = {}, content_css_selector: str = 'body', content_css_dict: Dict[str, str] = {}, content_extract_dict: Dict[str, Callable[[Any], Any] | List[Callable[[Any], Any]]] = {}, css_selectors_must_exist: List[str] = [], css_selectors_must_not_exist: List[str] = [], allowed_domain_prefixes: List[str] | None = None, is_valid_extraction: Callable[[Dict[Any, Any], PipeMetadata], bool] | None = None, encoding: str | None = None)
The PageExtractor is designed to extract specific elements from a web page, with conditions that control when the extraction takes place.
- Parameters:
header_css_dict (Dict[str, str]) – A dictionary specifying the CSS selectors for the header elements.
header_extract_dict (Dict[str, List[Callable[[Any], Any]] | Callable[[Any], Any]]) – A dictionary specifying the extraction functions for the header elements. The keys must match the keys in the header_css_dict. The functions are applied in the order they are specified in the list.
content_css_selector (str) – The CSS selector specifying where the content elements are located.
content_css_dict (Dict[str, str]) – A dictionary specifying the CSS selectors for the content elements. Selectors must be relative to the content_css_selector.
content_extract_dict (Dict[str, List[Callable[[Any], Any]] | Callable[[Any], Any]]) – A dictionary specifying the extraction functions for the content elements. The keys must match the keys in the content_css_dict. The functions are applied in the order they are specified in the list.
css_selectors_must_exist (List[str]) – A list of CSS selectors that must exist for the extraction to proceed.
css_selectors_must_not_exist (List[str]) – A list of CSS selectors that must not exist for the extraction to proceed.
allowed_domain_prefixes (List[str] | None) – A list of allowed domain prefixes. If None, all domain prefixes are allowed.
is_valid_extraction (Callable[[Dict[Any, Any], PipeMetadata], bool]) – A function that takes in the extracted data and the metadata and returns True if the extraction is valid, False otherwise.
encoding (str | None) – The encoding to be used. If None, the default encoding is used.
- Returns:
A dictionary containing the extracted data, or None if the extraction failed.
- Return type:
Dict[Any, Any] | None
- extract(response: str, metadata: PipeMetadata) → Dict[Any, Any] | None
Extracts the data from the response. If extraction fails, the method should return None.
- Parameters:
response (str) – response from the downloader
metadata (PipeMetadata) – Metadata of the response
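The header_extract_dict and content_extract_dict parameters accept either a single function or a list of functions applied in order. A minimal sketch of that chaining follows. It is not the library's code: `apply_extractors` is a hypothetical helper, the input dictionary stands in for values already selected via CSS, and leaving values untouched when a key has no functions is an assumption.

```python
from typing import Any, Callable, Dict, List, Union

Transform = Callable[[Any], Any]


def apply_extractors(
    selected: Dict[str, Any],
    extract_dict: Dict[str, Union[Transform, List[Transform]]],
) -> Dict[str, Any]:
    """Hypothetical helper: run each key's function chain over its selected value."""
    result: Dict[str, Any] = {}
    for key, value in selected.items():
        funcs = extract_dict.get(key, [])
        if callable(funcs):
            funcs = [funcs]  # a single function is treated as a one-item chain
        for func in funcs:
            value = func(value)  # functions run in the order they are listed
        result[key] = value
    return result


# Usage: strip and lowercase a title, parse a spaced number into an int.
cleaned = apply_extractors(
    {"title": "  Hello World  ", "views": "1 234"},
    {"title": [str.strip, str.lower], "views": lambda s: int(s.replace(" ", ""))},
)
```

Accepting both a bare callable and a list keeps simple cases short while still allowing multi-step cleanup per field.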