cmoncrawl.processor.extraction.utils

Functions

all_same_transform(dict, fc)

Applies fc to every value in dict and returns a dict with the same keys and the transformed values.
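
The behaviour described above can be illustrated with a minimal sketch. The name all_same_transform_sketch is a hypothetical stand-in written for this example, not the library's code; it only assumes what the description states, namely that the keys are kept and fc is applied to each value.

```python
from typing import Any, Callable, Dict


def all_same_transform_sketch(
    data: Dict[str, Any], fc: Callable[[Any], Any]
) -> Dict[str, Any]:
    # Hypothetical stand-in: keep every key, replace each value with fc(value).
    return {key: fc(value) for key, value in data.items()}


raw = {"title": "  Hello ", "author": " Jane  "}
cleaned = all_same_transform_sketch(raw, str.strip)
print(cleaned)
```

Whether the real function builds a new dict or updates its argument in place is not stated in this summary; the sketch builds a new one.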

chain_transforms(trans)

Chains the transforms in trans together into a single transform.
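
One plausible reading of chaining is ordinary function composition. The sketch below uses a hypothetical chain_transforms_sketch and assumes the transforms are applied left to right, each receiving the previous result; the summary does not confirm the order or how intermediate None values are handled.

```python
from typing import Any, Callable, List


def chain_transforms_sketch(
    trans: List[Callable[[Any], Any]]
) -> Callable[[Any], Any]:
    # Hypothetical stand-in: build one callable out of several.
    def chained(value: Any) -> Any:
        # Assumption: apply in list order, feeding each result to the next.
        for t in trans:
            value = t(value)
        return value

    return chained


normalize = chain_transforms_sketch([str.strip, str.lower, str.split])
tokens = normalize("  Common Crawl  ")
print(tokens)
```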

check_required(required_fields, extractor_name)

Checks if required fields are present in the extracted dict.
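
The summary does not say what shape required_fields takes or what the function returns. The sketch below is therefore built on assumptions: required_fields is taken to be a mapping from field name to a "required" flag, the function is taken to return a checker that is later called on the extracted dict, and a missing field is taken to mean an absent key or a None value. check_required_sketch and the logger name are hypothetical.

```python
import logging
from typing import Any, Callable, Dict

logger = logging.getLogger("extractor_sketch")  # hypothetical logger name


def check_required_sketch(
    required_fields: Dict[str, bool], extractor_name: str
) -> Callable[[Dict[str, Any]], bool]:
    # Hypothetical stand-in: returns a checker for extracted dicts.
    def check(extracted: Dict[str, Any]) -> bool:
        for field, is_required in required_fields.items():
            # Assumption: a field counts as missing if absent or None.
            if is_required and extracted.get(field) is None:
                logger.warning(
                    "%s: missing required field %r", extractor_name, field
                )
                return False
        return True

    return check


checker = check_required_sketch(
    {"title": True, "subtitle": False}, "ArticleExtractor"
)
complete = checker({"title": "Hello"})
incomplete = checker({"subtitle": "World"})
```

Passing extractor_name in up front lets the warning identify which extractor rejected a record.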

combine_dicts(dicts)

Combines a list of dictionaries into one.
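
A minimal sketch of such a merge follows. combine_dicts_sketch is a hypothetical stand-in, and it assumes that when the same key appears in several dictionaries the later one wins; the summary does not specify how collisions are resolved.

```python
from typing import Any, Dict, List


def combine_dicts_sketch(dicts: List[Dict[str, Any]]) -> Dict[str, Any]:
    # Hypothetical stand-in: merge left to right into a fresh dict.
    combined: Dict[str, Any] = {}
    for d in dicts:
        # Assumption: on a key collision the later dict overrides the earlier.
        combined.update(d)
    return combined


merged = combine_dicts_sketch(
    [{"title": "Hello"}, {"author": "Jane"}, {"title": "Hello again"}]
)
print(merged)
```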

extract_transform(tag, extract_dict, ...)

Extracts data from tag using extract_dict, which defines what to extract and how to name it, and extract_transform_dict, which defines how to transform the extracted data.

get_attribute_transform(attr_name)

Returns a function that takes a bs4 tag and returns the value of the attribute attr_name or None if the attribute doesn't exist.

get_tag_transform(tag_desc)

Returns a function that takes a bs4 tag and returns the first tag that matches tag_desc.

get_tags_transform(tag_desc)

Returns a function that takes a bs4 tag and returns a list of tags that match tag_desc.

get_text_list_transform([sep])

Returns a function that takes a list of bs4 tags and returns a string with all the text from the tags joined with sep.

get_text_transform(tag[, recursive])

Returns text from tag.

transform(dict, transforms)

Transforms the values in dict using the functions in the transforms dict.
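
A per-key transform can be sketched as below. transform_sketch is a hypothetical stand-in; it assumes that transforms maps a key to the function applied to that key's value, and that keys without an entry in transforms pass through unchanged. Neither detail is confirmed by the summary.

```python
from typing import Any, Callable, Dict


def transform_sketch(
    data: Dict[str, Any], transforms: Dict[str, Callable[[Any], Any]]
) -> Dict[str, Any]:
    # Hypothetical stand-in: apply the matching function to each value.
    # Assumption: keys with no transform are copied through as-is.
    return {
        key: transforms[key](value) if key in transforms else value
        for key, value in data.items()
    }


record = {"title": "  Hello ", "views": "42", "lang": "en"}
result = transform_sketch(record, {"title": str.strip, "views": int})
print(result)
```

Compared with applying one function to every value, a per-key mapping lets each field be cleaned with its own rule.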