Extraction utils

The extraction utilities are defined in cmoncrawl.processor.extraction. The module provides helper functions for both filtering and extraction.

Filtering

  • must_exist_filter: filters out the URLs whose pages don’t contain the CSS selector

  • must_not_exist_filter: filters out the URLs whose pages contain the CSS selector

Extraction

check_required: Creates a function that checks whether all the required fields are present in the extracted data.
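The idea of a "required fields" check can be illustrated without the library. The sketch below is not cmoncrawl's implementation or signature: the name required_checker_sketch is hypothetical, and it assumes that a field counts as present only when its key exists and its value is not None.

```python
from typing import Any, Callable, Dict, Iterable


def required_checker_sketch(required: Iterable[str]) -> Callable[[Dict[str, Any]], bool]:
    """Build a checker for extracted data (hypothetical helper, not the library API)."""
    required_fields = tuple(required)

    def check(extracted: Dict[str, Any]) -> bool:
        # Assumption: a missing key and a None value both mean "not present".
        return all(extracted.get(name) is not None for name in required_fields)

    return check


has_required = required_checker_sketch(["title", "author"])
complete = has_required({"title": "Hello", "author": "Jane"})
incomplete = has_required({"title": "Hello", "author": None})
```

Building the checker once and reusing it for every extracted record keeps the list of required fields in a single place.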

chain_transform: Creates a function that chains multiple transformation functions. If any of them returns None, the chain is broken and None is returned. Especially useful with soup select etc.
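The short-circuit behaviour described above can be sketched in plain Python. This is an illustration of the chaining idea only, not the library's code: chain_sketch, parse_int and to_cents are hypothetical names introduced for the example.

```python
from typing import Any, Callable, Optional


def chain_sketch(*steps: Callable[[Any], Any]) -> Callable[[Any], Optional[Any]]:
    """Compose steps left to right, stopping at the first None (hypothetical helper)."""

    def run(value: Any) -> Optional[Any]:
        for step in steps:
            value = step(value)
            if value is None:
                # One step failed, so the remaining steps are skipped.
                return None
        return value

    return run


def parse_int(text: str) -> Optional[int]:
    # Returns None instead of raising when the text is not a plain number.
    return int(text) if text.isdigit() else None


to_cents = chain_sketch(str.strip, parse_int, lambda amount: amount * 100)
ok = to_cents(" 42 ")      # every step succeeds
broken = to_cents("n/a")   # parse_int returns None, so the chain stops
```

Because a failing step ends the chain, later steps never have to guard against a None input, which is what makes this pattern convenient after a selector lookup that may find nothing.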

extract_transform: Creates a function that extracts the data from the soup tag using the CSS selector and transforms it using your transformation functions.