Extraction utils
The utilities for extraction are defined in cmoncrawl.processor.extraction.
The module provides helper functions for both filtering and extraction.
Filtering
– must_exist_filter: filters out URLs whose page does not contain a match for the CSS selector
– must_not_exist_filter: filters out URLs whose page contains a match for the CSS selector
Extraction
– check_required: Creates a function that checks whether all the required fields are present in the extracted data.
– chain_transform: Creates a function that chains multiple transformation functions. If any of them returns None, the chain is broken and None is returned. Especially useful with soup select and similar calls.
– extract_transform: Creates a function that extracts the data from the soup tag using the CSS selector and transforms it using your transformation functions.
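The behaviour described for chain_transform and check_required can be illustrated with a small standalone sketch. The functions below are hypothetical stand-ins written for this example only; they are not the cmoncrawl implementations, and the real signatures may differ. The sketch also assumes that a field whose value is None counts as missing.

```python
from typing import Any, Callable, Dict, List, Optional


def chain_transform_sketch(*funcs: Callable[[Any], Any]) -> Callable[[Any], Optional[Any]]:
    """Hypothetical stand-in: apply funcs left to right, stop at the first None."""

    def run(value: Any) -> Optional[Any]:
        for func in funcs:
            value = func(value)
            if value is None:
                # The chain is broken: skip the remaining transformations.
                return None
        return value

    return run


def check_required_sketch(required: List[str]) -> Callable[[Dict[str, Any]], bool]:
    """Hypothetical stand-in: build a checker for the required field names."""

    def check(extracted: Dict[str, Any]) -> bool:
        # Assumption: a key that is absent or mapped to None counts as missing.
        return all(extracted.get(field) is not None for field in required)

    return check


# Look up a raw value, strip whitespace, then convert it to a float.
parse_price = chain_transform_sketch(
    lambda data: data.get("price"),
    str.strip,
    float,
)

has_title_and_price = check_required_sketch(["title", "price"])
```

Because the first step returns None when the key is absent, str.strip and float are never called on a missing value, which is the point of breaking the chain early.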