cmoncrawl.processor.pipeline.streamer

Classes

class cmoncrawl.processor.pipeline.streamer.BaseStreamerFile(root: Path, max_directory_size: int, max_file_size: int, extension: str, directory_prefix: str = 'directory_', max_retries: int = 3)

Abstract class that defines the basic functionality of a file streamer.

class cmoncrawl.processor.pipeline.streamer.IStreamer

Base class for all outstreamers. It streams the data out and returns an identifier for the data if successful; otherwise it returns None.

class cmoncrawl.processor.pipeline.streamer.MemoryStreamer

Memory streamer that keeps the output in memory.
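
The outstreamer contract (return an identifier on success, None on failure) and an in-memory variant can be illustrated with a minimal, self-contained sketch. It does not import cmoncrawl. The class names SketchStreamer and InMemorySketchStreamer, the stream method and its signature, and the choice of the list index as identifier are all assumptions made for illustration, not the library's actual API.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Optional


class SketchStreamer(ABC):
    """Hypothetical stand-in for the outstreamer contract (not the cmoncrawl class)."""

    @abstractmethod
    def stream(self, data: Dict[str, Any]) -> Optional[str]:
        """Write `data` out; return its identifier, or None on failure."""


class InMemorySketchStreamer(SketchStreamer):
    """Hypothetical streamer that keeps every record in a list."""

    def __init__(self) -> None:
        self.records: List[Dict[str, Any]] = []

    def stream(self, data: Dict[str, Any]) -> Optional[str]:
        if not isinstance(data, dict):
            # Signal failure with None instead of raising.
            return None
        self.records.append(data)
        # Assumption: the position in the list serves as the identifier.
        return str(len(self.records) - 1)


streamer = InMemorySketchStreamer()
first_id = streamer.stream({"url": "https://example.com", "title": "Example"})
failed_id = streamer.stream(None)  # type: ignore[arg-type]
```

In this sketch the caller only needs to check the return value for None to detect a failed write. The real classes may differ in method names, arguments, and whether they are asynchronous.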

class cmoncrawl.processor.pipeline.streamer.StreamerFileHTML(root: Path, max_directory_size: int)

class cmoncrawl.processor.pipeline.streamer.StreamerFileJSON(root: Path, max_directory_size: int, max_file_size: int, pretty: bool = False)