How to extract from Common Crawl (theory)
The process of extracting one parsed web page from Common Crawl can be described as a pipeline:
1. Query Common Crawl to find a link to a file that contains the web page we want.
2. Download the file.
3. Choose an extractor for the web page.
4. Filter out the web page if it does not match the conditions.
5. Extract fields from the page.
6. Save the fields to a file.
The first step is handled by the Aggregator, while the rest is handled by the Processor.
1. Querying CommonCrawl
WARC File
WARC is a file format used for storing large numbers of web resources. In our case, these files contain a batch of downloaded web pages and their metadata. It is possible to retrieve only part of a file by specifying the offset and length of the part we want.
Common Crawl Index
A Common Crawl index is a collection that maps crawled URLs to the WARC files which contain the crawls of those URLs.
Every month, Common Crawl releases a new index containing links to all web pages that were crawled that month.
Warning
It is important to understand that even if an index was released in a certain month, it can contain links to web pages that are older.
Thus, in order to download a page, we query the index to get a link to the respective WARC file along with the offset and length of the page within it. Since there are multiple indexes, we should query all of them to make sure we don't miss the page. With the link to the WARC file, the offset, and the length, we can continue to the next step.
All of this is handled by cmoncrawl.aggregator.gateway_query.GatewayAggregator, but for basic use you will not need to use it directly.
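To make this concrete, here is a minimal sketch of what such a query looks like against the public index server, using plain requests rather than the library; the index name is only an example, and GatewayAggregator does the equivalent for you across all the indexes you ask for.

```python
import json

import requests

# Example only: one monthly index; in practice you query every index you need.
INDEX = "CC-MAIN-2023-50"
resp = requests.get(
    f"https://index.commoncrawl.org/{INDEX}-index",
    params={"url": "example.com", "output": "json"},
    timeout=60,
)
resp.raise_for_status()

# The index returns one JSON record per line; each record points into a WARC file.
for line in resp.text.splitlines():
    record = json.loads(line)
    print(record["filename"], record["offset"], record["length"])
```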
2. Downloading a file
The Processor node then takes the URL and the related information from the queue and downloads the appropriate part of the WARC file.
This step is handled by cmoncrawl.processor.pipeline.downloader.AsyncDownloader.
It simply downloads and extracts the page from the WARC file. For downloading we use two data access objects (DAOs, cmoncrawl.processor.dao.base.ICC_Dao):
- cmoncrawl.processor.dao.s3, which downloads the file directly from AWS S3
- cmoncrawl.processor.dao.api, which downloads the file through the Common Crawl API Gateway
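Conceptually, the downloader turns one index record into the HTML of the page. The sketch below continues from the record obtained in the sketch of the previous step and uses plain requests and gzip for illustration; the DAOs above do the same work against S3 or the API gateway.

```python
import gzip

import requests

# Continuing from the `record` obtained in the previous sketch.
filename = record["filename"]
offset, length = int(record["offset"]), int(record["length"])

# Fetch only the byte range that holds our page, not the whole multi-GB WARC file.
headers = {"Range": f"bytes={offset}-{offset + length - 1}"}
resp = requests.get(
    f"https://data.commoncrawl.org/{filename}", headers=headers, timeout=60
)
resp.raise_for_status()

# Each record is stored as an independent gzip member, so the slice decompresses
# on its own into the WARC headers, the HTTP response headers and the HTML.
warc_record = gzip.decompress(resp.content)
print(warc_record[:200].decode("utf-8", errors="replace"))
```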
3. Choose extractor
Once the page is downloaded, we first need to choose an extractor for it.
Extractors are dynamically loaded based on the definitions in the Extractor config file.
All loaded extractors are then matched against the URL and crawl date, and the first match is used.
This functionality is handled by cmoncrawl.processor.pipeline.router.Router.
For development of extractors refer to Extractor types.
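The following is a conceptual model of that matching, not the library's API: each route pairs a URL pattern and a validity window with an extractor, and the first route that matches both the URL and the crawl date wins. In the library, the routes come from the Extractor config file and the matching is done by the Router.

```python
import re
from datetime import datetime

# Hypothetical routes for illustration: (url regex, valid since, valid to, extractor name).
ROUTES = [
    (re.compile(r"https?://(www\.)?example\.com/article/.*"),
     datetime(2019, 1, 1), datetime(2021, 1, 1), "article_extractor_v1"),
    (re.compile(r"https?://(www\.)?example\.com/article/.*"),
     datetime(2021, 1, 1), datetime(2100, 1, 1), "article_extractor_v2"),
]


def choose_extractor(url: str, crawl_date: datetime) -> str | None:
    for pattern, since, to, extractor in ROUTES:
        if pattern.match(url) and since <= crawl_date < to:
            return extractor
    return None  # no matching extractor -> the page is skipped


# The 2022 crawl of this URL is routed to the second (newer) extractor.
print(choose_extractor("https://example.com/article/1", datetime(2022, 5, 1)))
```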
4. Filtering out the web page
Once the extractor is chosen, its filtering function is used to either drop or pass the page.
To filter, you can use either cmoncrawl.processor.pipeline.extractor.BaseExtractor.filter_raw() for filtering based on the raw HTML page (fast), or wait for the conversion to soup and then filter using cmoncrawl.processor.pipeline.extractor.BaseExtractor.filter_soup() (slow).
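As a sketch, a custom extractor could override either hook; both return True to keep the page and False to drop it. The parameter names below are assumptions made for illustration, so check BaseExtractor in your installed version for the exact signatures.

```python
from bs4 import BeautifulSoup

from cmoncrawl.processor.pipeline.extractor import BaseExtractor


class ArticleExtractor(BaseExtractor):
    # Parameter names are assumed for illustration; see BaseExtractor for the
    # real signatures. Both hooks return True to keep the page, False to drop it.

    def filter_raw(self, response, metadata):
        # Cheap check on the raw HTML string, before any parsing happens.
        return "<article" in response

    def filter_soup(self, soup: BeautifulSoup, metadata):
        # Slower check that needs the parsed tree.
        return soup.find("h1") is not None
```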
5. Extract fields from the page
The extracting function defined by the extractor is used to extract the fields from the page. Simply extract the values and return them in a dict.
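For example, an extracting function for a simple article page might look like the sketch below; it is a standalone function for illustration, while in the library this logic lives on your extractor (see Extractor types).

```python
from bs4 import BeautifulSoup


def extract_article(soup: BeautifulSoup) -> dict:
    # Pull out whatever fields you care about and return them as a plain dict.
    title = soup.find("h1")
    return {
        "title": title.get_text(strip=True) if title else None,
        "paragraphs": [p.get_text(strip=True) for p in soup.find_all("p")],
    }


print(extract_article(BeautifulSoup("<h1>Hello</h1><p>World</p>", "html.parser")))
```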
6. File saving
With the fields extracted, we need to save them to a file.
By default, the fields are saved to a JSON file.
The way the file is saved is defined by streamers.
All of the currently implemented streamers are derived from cmoncrawl.processor.pipeline.streamer.BaseStreamerFile, which defines how the files are saved, while the content serialization is left to the derived classes.
Currently we support 2 streamers:
- JSON (cmoncrawl.processor.pipeline.streamer.StreamerFileJSON), which creates a JSON-per-line output and outputs all the extracted data
- HTML (cmoncrawl.processor.pipeline.streamer.StreamerFileHTML), which creates an HTML file (assuming the HTML is stored in the extracted data['html'])
If you want to debug, you might want to use cmoncrawl.processor.pipeline.streamer.MemoryStreamer, which outputs the data to memory instead of a file.
If you would like a different format, you can create your own saver by inheriting from cmoncrawl.processor.pipeline.streamer.IStreamer and then changing the pipeline creation to use your new outstreamer.
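As an illustration of what such a custom saver might do, the sketch below appends extracted fields to a CSV file. It is not the library's API, only the kind of writing logic your IStreamer subclass would wrap; see cmoncrawl.processor.pipeline.streamer for the interface you need to implement.

```python
import csv
from pathlib import Path


def append_csv(record: dict, out_file: Path) -> None:
    # Append one extracted page as a CSV row, writing the header on first use.
    is_new = not out_file.exists()
    with out_file.open("a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=sorted(record))
        if is_new:
            writer.writeheader()
        writer.writerow(record)


append_csv({"title": "Example", "url": "https://example.com"}, Path("extracted.csv"))
```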