Domain Record

A domain record is a structure that contains the information needed to download a crawl of a URL. It contains the following fields:

url: the url to crawl
filename: the warc filename
offset: the offset in the warc file
length: the length of the html crawl
digest [optional]: the digest of the html crawl
encoding [optional]: the encoding of the html crawl
timestamp [optional]: the timestamp of the crawl
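
As a rough illustration of how these fields fit together, the sketch below models a record as a Python dataclass. The class name `DomainRecordSketch` and the `byte_range` helper are hypothetical and not part of the library. The helper assumes that offset and length are byte counts within the WARC file and that the record would be fetched with an HTTP range request.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DomainRecordSketch:
    """Hypothetical container mirroring the fields listed above."""

    # Required fields
    url: str
    filename: str
    offset: int
    length: int
    # Optional fields
    digest: Optional[str] = None
    encoding: Optional[str] = None
    timestamp: Optional[str] = None

    def byte_range(self) -> str:
        """Return an HTTP Range header value covering this record.

        Assumes offset and length are expressed in bytes. HTTP ranges
        are inclusive on both ends, hence the -1.
        """
        return f"bytes={self.offset}-{self.offset + self.length - 1}"


record = DomainRecordSketch(
    url="http://example.com",
    filename="crawl.warc.gz",
    offset=123,
    length=456,
)
print(record.byte_range())  # bytes=123-578
```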

Domain Record JSONL format

To use your own domain records with the extract mode of the CLI, you must format them as the following JSON:

{
    "domain_record":
    {
        "url": "http://example.com",
        "filename": "crawl.warc.gz",
        "offset": 123,
        "length": 456,
        "digest": "sha1:1234567890abcdef",
        "encoding": "utf-8",
        "timestamp": "2018-01-01T00:00:00Z"
    },
    "additional_info":
    {
        "key1": "value1",
        "key2": "value2"
    }
}

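Each such JSON object must be on a separate line in a file; the example above is spread over several lines only for readability. You don’t have to provide all the fields; only url, filename, offset and length are required. The Athena SQL keys for url, filename, offset, length, digest and timestamp are, respectively: u.url, cc.warc_filename, cc.warc_record_offset, cc.warc_record_length, cc.content_digest, cc.fetch_time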
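
If the records come from an Athena query, the rows need to be renamed into domain record fields. The sketch below shows one way this might look. It assumes that each row is a dictionary keyed by the column name without the table alias (for example `warc_filename` rather than `cc.warc_filename`), and that the keys map to fields in the order they are listed. The function name `athena_row_to_domain_record` is hypothetical.

```python
from typing import Any, Dict

# Athena column name -> domain record field.
# Assumption: columns are returned without the "u." / "cc." table alias.
ATHENA_TO_FIELD = {
    "url": "url",
    "warc_filename": "filename",
    "warc_record_offset": "offset",
    "warc_record_length": "length",
    "content_digest": "digest",
    "fetch_time": "timestamp",
}
REQUIRED_FIELDS = ("url", "filename", "offset", "length")


def athena_row_to_domain_record(row: Dict[str, Any]) -> Dict[str, Any]:
    """Rename Athena columns to domain record fields, skipping empty values."""
    record = {
        field: row[column]
        for column, field in ATHENA_TO_FIELD.items()
        if row.get(column) not in (None, "")
    }
    missing = [field for field in REQUIRED_FIELDS if field not in record]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    # Query results exported as CSV arrive as strings, so cast the numeric fields.
    record["offset"] = int(record["offset"])
    record["length"] = int(record["length"])
    return record


row = {
    "url": "http://example.com",
    "warc_filename": "crawl.warc.gz",
    "warc_record_offset": "123",
    "warc_record_length": "456",
    "content_digest": "",
    "fetch_time": "2018-01-01T00:00:00Z",
}
domain_record = athena_row_to_domain_record(row)
print(domain_record)
```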

The additional_info field is optional and can contain any additional information. It is added to the extracted fields as is. It is useful when, for example, you want to record which set the URL belongs to.
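
A file in this format could be produced with a few lines of Python. The following is a minimal sketch, not the library's own tooling: the helper `to_jsonl_line` and the `"set"` key inside additional_info are hypothetical names chosen for illustration.

```python
import json
from typing import Any, Dict, Optional

REQUIRED_FIELDS = ("url", "filename", "offset", "length")


def to_jsonl_line(
    domain_record: Dict[str, Any],
    additional_info: Optional[Dict[str, Any]] = None,
) -> str:
    """Serialize one entry as a single JSON line."""
    missing = [field for field in REQUIRED_FIELDS if field not in domain_record]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    entry: Dict[str, Any] = {"domain_record": domain_record}
    if additional_info:
        entry["additional_info"] = additional_info
    # json.dumps without an indent argument never emits newlines,
    # so each entry stays on one line.
    return json.dumps(entry)


lines = [
    to_jsonl_line(
        {"url": "http://example.com", "filename": "crawl.warc.gz", "offset": 123, "length": 456},
        additional_info={"set": "train"},
    ),
    to_jsonl_line(
        {"url": "http://example.com/about", "filename": "crawl.warc.gz", "offset": 579, "length": 321},
    ),
]
jsonl_text = "\n".join(lines) + "\n"
print(jsonl_text, end="")
```

Writing `jsonl_text` to a file then gives one entry per line, with additional_info present only where it was supplied.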