cmoncrawl.processor.pipeline.router

Classes

class cmoncrawl.processor.pipeline.router.IRouter

Base class for all routers

abstract route(url: str | None, time: datetime | None, metadata: PipeMetadata) → IExtractor

Routes the URL to the correct extractor. Concrete routers must implement this method.
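A minimal sketch of what a concrete router could look like. It uses a stand-in abstract base class rather than the library's own, so it runs without cmoncrawl installed. `IRouterSketch` and `FixedRouter` are hypothetical names, and a plain string stands in for an `IExtractor` object.

```python
from abc import ABC, abstractmethod
from datetime import datetime
from typing import Any, Optional


class IRouterSketch(ABC):
    """Hypothetical stand-in that mirrors the documented IRouter interface."""

    @abstractmethod
    def route(self, url: Optional[str], time: Optional[datetime], metadata: Any) -> Any:
        """Return the extractor responsible for `url` at `time`."""


class FixedRouter(IRouterSketch):
    """Hypothetical router that ignores its inputs and always returns one extractor."""

    def __init__(self, extractor: Any) -> None:
        self.extractor = extractor

    def route(self, url: Optional[str], time: Optional[datetime], metadata: Any) -> Any:
        return self.extractor


# A string stands in for an IExtractor; metadata is left as None for brevity.
router = FixedRouter(extractor="my_extractor")
chosen = router.route("https://example.com/a", datetime(2021, 1, 1), metadata=None)
```

Because `route` is abstract, instantiating the base class directly raises `TypeError`; only subclasses that implement `route` can be created.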

class cmoncrawl.processor.pipeline.router.Route(name: str, regexes: List[re.Pattern[str]], since: datetime.datetime, to: datetime.datetime)

class cmoncrawl.processor.pipeline.router.Router

load_module_as_extractor(module_path: Path)

Loads a module and returns its extractor
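The general technique of loading an extractor from a file path can be sketched with the standard `importlib` machinery. This is not the library's implementation: the helper name `load_extractor_from_file` is hypothetical, and the assumption that the module exposes its extractor as a module-level attribute called `extractor` is made only for illustration.

```python
import importlib.util
from pathlib import Path
from typing import Any


def load_extractor_from_file(module_path: Path) -> Any:
    """Import the file at `module_path` and return its `extractor` attribute.

    The attribute name `extractor` is an assumption made for this sketch.
    """
    # Build an import spec from the file path, using the file stem as module name.
    spec = importlib.util.spec_from_file_location(module_path.stem, module_path)
    if spec is None or spec.loader is None:
        raise ImportError(f"Cannot load module from {module_path}")

    # Create the module object and execute the file's code inside it.
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)

    extractor = getattr(module, "extractor", None)
    if extractor is None:
        raise ValueError(f"{module_path} does not define `extractor`")
    return extractor
```

Calling `load_extractor_from_file(Path("my_extractor.py"))` would execute that file and return whatever object it bound to `extractor`.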

register_route(name: str, regex: str | List[str], since: datetime | None = None, to: datetime | None = None)

Registers a route that sends URLs matching regex to the extractor registered under name

Parameters:
  • name (str) – The name of the extractor

  • regex (str | List[str]) – The regex, or list of regexes, to match the URL against

  • since (datetime | None, optional) – The earliest time to route to this extractor. Defaults to None.

  • to (datetime | None, optional) – The latest time to route to this extractor. Defaults to None.
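The parameters above suggest that a single regex string and a list of strings are both accepted and stored as compiled patterns on a `Route`. A minimal sketch of that normalization follows. `RouteSketch`, `routes`, and `register_route_sketch` are hypothetical names, and storing `None` for an unset bound is an assumption.

```python
import re
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional, Union


@dataclass
class RouteSketch:
    """Hypothetical mirror of the documented Route fields."""
    name: str
    regexes: List[re.Pattern]
    since: Optional[datetime]  # assumption: None means "no lower bound"
    to: Optional[datetime]     # assumption: None means "no upper bound"


routes: List[RouteSketch] = []


def register_route_sketch(
    name: str,
    regex: Union[str, List[str]],
    since: Optional[datetime] = None,
    to: Optional[datetime] = None,
) -> None:
    # Accept either one pattern or several, then compile each once up front.
    patterns = [regex] if isinstance(regex, str) else regex
    routes.append(RouteSketch(name, [re.compile(p) for p in patterns], since, to))


register_route_sketch("news", r".*example\.com/news/.*")
register_route_sketch("blog", [r".*/blog/.*", r".*/posts/.*"], since=datetime(2020, 1, 1))
```

Compiling at registration time means each pattern is parsed once rather than on every routed URL.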

route(url: str | None, time: datetime | None, metadata: PipeMetadata) → IExtractor

Routes the URL to the correct extractor based on the URL and the time

Parameters:
  • url (str | None) – The URL to route

  • time (datetime | None) – The time associated with the URL, checked against each route's since and to bounds

  • metadata (PipeMetadata) – The metadata for the current pipeline
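Routing on both URL and time could work roughly as sketched below. Several details here are assumptions rather than documented behavior: the first matching route wins, `since` is inclusive and `to` is exclusive, a `None` bound or `None` time skips the time check, a `None` URL yields no match, and the function returns the extractor name rather than an `IExtractor`. `pick_extractor` and the tuple layout are hypothetical.

```python
import re
from datetime import datetime
from typing import List, Optional, Tuple

# (extractor name, compiled patterns, since, to); None means "no bound".
RouteTuple = Tuple[str, List[re.Pattern], Optional[datetime], Optional[datetime]]


def pick_extractor(
    routes: List[RouteTuple], url: Optional[str], time: Optional[datetime]
) -> Optional[str]:
    """Return the name of the first route whose time window and regex match."""
    if url is None:
        return None
    for name, patterns, since, to in routes:
        if time is not None:
            if since is not None and time < since:
                continue  # too early for this route
            if to is not None and time >= to:
                continue  # too late for this route
        if any(p.match(url) for p in patterns):
            return name
    return None


# Two routes for the same site, split by date.
routes = [
    ("old_layout", [re.compile(r".*example\.com/.*")], None, datetime(2020, 1, 1)),
    ("new_layout", [re.compile(r".*example\.com/.*")], datetime(2020, 1, 1), None),
]
old = pick_extractor(routes, "https://example.com/a", datetime(2019, 6, 1))
new = pick_extractor(routes, "https://example.com/a", datetime(2021, 6, 1))
```

Splitting routes by date like this is one way to handle a site whose page layout changed between crawls: the same URL pattern maps to a different extractor depending on when the page was captured.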