cmoncrawl.processor.pipeline.router
Classes
- class cmoncrawl.processor.pipeline.router.IRouter
Base class for all routers
- abstract route(url: str | None, time: datetime | None, metadata: PipeMetadata) → IExtractor
Routes the url to the correct extractor
- class cmoncrawl.processor.pipeline.router.Route(name: str, regexes: List[re.Pattern[str]], since: datetime.datetime, to: datetime.datetime)
- class cmoncrawl.processor.pipeline.router.Router
- load_module_as_extractor(module_path: Path)
Loads a module and returns its extractor
- register_route(name: str, regex: str | List[str], since: datetime | None = None, to: datetime | None = None)
Registers a route for a given extractor name and regex
- Parameters:
name (str) – The name of the extractor
regex (str | List[str]) – A regex, or a list of regexes, to match the url against
since (datetime | None, optional) – The earliest time to route to this extractor. Defaults to None.
to (datetime | None, optional) – The latest time to route to this extractor. Defaults to None.
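The parameters above can be illustrated with a minimal, self-contained sketch. It is not the library's implementation: `SketchRoute` and `make_route` are hypothetical names that only mirror the documented `Route` fields, and treating a missing `since`/`to` as `datetime.min`/`datetime.max` is an assumption about how an open-ended window might be represented.

```python
from __future__ import annotations

import re
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional, Union


@dataclass
class SketchRoute:
    # Hypothetical stand-in with the same fields as the documented Route.
    name: str
    regexes: List[re.Pattern[str]]
    since: datetime
    to: datetime


def make_route(
    name: str,
    regex: Union[str, List[str]],
    since: Optional[datetime] = None,
    to: Optional[datetime] = None,
) -> SketchRoute:
    # Accept either a single pattern or a list of patterns.
    patterns = [regex] if isinstance(regex, str) else list(regex)
    return SketchRoute(
        name=name,
        regexes=[re.compile(p) for p in patterns],
        # Assumption: None means "no lower/upper bound".
        since=since if since is not None else datetime.min,
        to=to if to is not None else datetime.max,
    )


# One pattern, bounded only from below.
route = make_route("bbc_extractor", r".*bbc\.com.*", since=datetime(2020, 1, 1))
# Several patterns, no time bounds at all.
multi = make_route("news_extractor", [r".*cnn\.com.*", r".*npr\.org.*"])
```

Normalizing `regex` to a list of compiled patterns up front keeps the later matching step uniform, whether the caller passed one pattern or several.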
- route(url: str | None, time: datetime | None, metadata: PipeMetadata) → IExtractor
Routes the url to the correct extractor based on the url and time
- Parameters:
url (str | None) – The url to route
time (datetime | None) – The time associated with the url, checked against each route's since/to range
metadata (PipeMetadata) – The metadata for the current pipeline
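As a rough illustration of routing by url and time, the sketch below picks the first registered entry whose pattern matches the url and whose time window contains the given time. It is a standalone approximation, not the library's code: `pick_extractor_name` and the tuple-based route table are hypothetical, and the inclusive window bounds, the use of `re.match`, first-match-wins ordering, and raising `ValueError` when nothing matches are all assumptions. It also returns the extractor name rather than an `IExtractor` instance and ignores `metadata`.

```python
import re
from datetime import datetime
from typing import List, Optional, Tuple

# Hypothetical route table entry: (extractor name, compiled patterns, since, to)
RouteEntry = Tuple[str, List[re.Pattern], datetime, datetime]


def pick_extractor_name(
    routes: List[RouteEntry], url: Optional[str], time: Optional[datetime]
) -> str:
    # Assumption: both values are required to make a routing decision.
    if url is None or time is None:
        raise ValueError("url and time are both required for routing")
    for name, patterns, since, to in routes:
        in_window = since <= time <= to  # assumption: inclusive bounds
        if in_window and any(p.match(url) for p in patterns):
            return name  # first matching route wins
    raise ValueError(f"No route found for {url}")


pattern = re.compile(r"https?://example\.com/.*")
routes: List[RouteEntry] = [
    ("old_layout", [pattern], datetime.min, datetime(2019, 12, 31)),
    ("new_layout", [pattern], datetime(2020, 1, 1), datetime.max),
]

chosen = pick_extractor_name(
    routes, "https://example.com/article/1", datetime(2021, 5, 4)
)
```

Registering the same pattern under two names with disjoint windows, as here, is one way to handle a site whose page layout changed at a known date.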