cmoncrawl.processor.pipeline.streamer
Classes
- class cmoncrawl.processor.pipeline.streamer.BaseStreamerFile(root: Path, max_directory_size: int, max_file_size: int, extension: str, directory_prefix: str = 'directory_', max_retries: int = 3)
Abstract class that defines the basic functionality of a file streamer.
- class cmoncrawl.processor.pipeline.streamer.IStreamer
Base class for all output streamers. It streams the data out and returns an identifier for the data if successful; otherwise it returns None.
- class cmoncrawl.processor.pipeline.streamer.MemoryStreamer
Memory streamer that keeps the output in memory.
- class cmoncrawl.processor.pipeline.streamer.StreamerFileHTML(root: Path, max_directory_size: int)
- class cmoncrawl.processor.pipeline.streamer.StreamerFileJSON(root: Path, max_directory_size: int, max_file_size: int, pretty: bool = False)
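The contract described for IStreamer (return an identifier on success, None otherwise) and the in-memory variant can be illustrated with a minimal, self-contained sketch. It does not import cmoncrawl. The class names `SketchStreamer` and `SketchMemoryStreamer`, the `stream` method name and signature, the `records` attribute, and the use of a list index as the identifier are all assumptions made for illustration, not the library's actual API. The listing above does not say whether the real method is a coroutine, so the sketch uses plain synchronous methods.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Optional


class SketchStreamer(ABC):
    """Hypothetical stand-in for the IStreamer contract (not the library class)."""

    @abstractmethod
    def stream(self, data: Dict[str, Any]) -> Optional[str]:
        """Write `data` out; return an identifier on success, None on failure."""


class SketchMemoryStreamer(SketchStreamer):
    """Hypothetical in-memory streamer: every output is appended to a list."""

    def __init__(self) -> None:
        self.records: List[Dict[str, Any]] = []

    def stream(self, data: Dict[str, Any]) -> Optional[str]:
        # Assumption: anything that is not a dict counts as a failed write.
        if not isinstance(data, dict):
            return None
        self.records.append(data)
        # Assumption: the identifier is the record's position in the list.
        return str(len(self.records) - 1)


streamer = SketchMemoryStreamer()
first_id = streamer.stream({"url": "https://example.com", "title": "Example"})
failed_id = streamer.stream("not a dict")  # type: ignore[arg-type]
```

A caller would check the return value for None to detect a failed write, then use the identifier to locate the stored record (here, `streamer.records[int(first_id)]`).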