Domain Record

By domain record we refer to a structure that contains the information needed to download a crawl of a URL. It contains the following fields:

  • url: the url to crawl

  • filename: the warc filename

  • offset: the offset in the warc file

  • length: the length of the html crawl

  • digest [optional]: the digest of the html crawl

  • encoding [optional]: the encoding of the html crawl

  • timestamp [optional]: the timestamp of the crawl
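
The field list above can be sketched as a small Python data structure. This is only an illustration: the class name `DomainRecord` and the helper `byte_range_header` are hypothetical, and the helper assumes that a record is fetched with an HTTP range request covering `length` bytes starting at `offset` in the WARC file, which is an assumption rather than something stated here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DomainRecord:
    # Hypothetical container mirroring the fields listed above.
    # Required fields
    url: str
    filename: str
    offset: int
    length: int
    # Optional fields
    digest: Optional[str] = None
    encoding: Optional[str] = None
    timestamp: Optional[str] = None


def byte_range_header(record: DomainRecord) -> str:
    """Return an HTTP Range header value covering exactly this record's bytes.

    Assumption: the record occupies `length` bytes starting at `offset`.
    """
    last_byte = record.offset + record.length - 1  # the end of a byte range is inclusive
    return f"bytes={record.offset}-{last_byte}"


record = DomainRecord(
    url="http://example.com",
    filename="crawl.warc.gz",
    offset=123,
    length=456,
)
print(byte_range_header(record))  # bytes=123-578
```

Keeping the optional fields as `None` by default mirrors the fact that only url, filename, offset and length need to be known to locate a record.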

Domain Record JSONL format

To use your own domain records with the extract mode of the CLI, you must format them in the following JSON format:

{
    "domain_record":
    {
        "url": "http://example.com",
        "filename": "crawl.warc.gz",
        "offset": 123,
        "length": 456,
        "digest": "sha1:1234567890abcdef",
        "encoding": "utf-8",
        "timestamp": "2018-01-01T00:00:00Z"
    },
    "additional_info":
    {
        "key1": "value1",
        "key2": "value2"
    }
}

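Each such JSON object must be on a separate line in a file (the example above is spread over several lines only for readability). You don’t have to provide all the fields; only url, filename, offset and length are required. The corresponding Athena SQL keys for url, filename, offset, length, digest and timestamp are: u.url, cc.warc_filename, cc.warc_record_offset, cc.warc_record_length, cc.content_digest, cc.fetch_time.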
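
Before passing a file to the CLI, it can help to check each line yourself. The following is a minimal sketch of such a check, not the validation the CLI performs; the function name `parse_domain_record_line` and the constant `REQUIRED_FIELDS` are hypothetical.

```python
import json

# The four fields described above as required.
REQUIRED_FIELDS = ("url", "filename", "offset", "length")


def parse_domain_record_line(line: str) -> dict:
    """Parse one JSONL line and check that the required fields are present."""
    entry = json.loads(line)
    record = entry["domain_record"]
    missing = [name for name in REQUIRED_FIELDS if name not in record]
    if missing:
        raise ValueError(f"missing required fields: {', '.join(missing)}")
    return entry


# One record per line: the whole JSON object sits on a single line.
line = (
    '{"domain_record": {"url": "http://example.com", '
    '"filename": "crawl.warc.gz", "offset": 123, "length": 456}}'
)
entry = parse_domain_record_line(line)
print(entry["domain_record"]["offset"])  # 123
```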

The additional_info field is optional and can contain any additional information. It is added to the extracted fields as is. It is useful, for example, when you want to record which set a URL belongs to.
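
As a rough sketch of that use case, the snippet below builds JSONL lines in which each record is tagged with the set it belongs to. The helper `to_jsonl_line`, the `set` key, the second URL and its offset and length are all illustrative assumptions; any keys can be placed in additional_info.

```python
import json
from typing import Optional


def to_jsonl_line(domain_record: dict, additional_info: Optional[dict] = None) -> str:
    """Serialize one entry as a single-line JSON object."""
    entry = {"domain_record": domain_record}
    if additional_info is not None:
        entry["additional_info"] = additional_info
    # json.dumps without an indent argument emits no newlines,
    # so each entry stays on one line.
    return json.dumps(entry)


records = [
    (
        {"url": "http://example.com", "filename": "crawl.warc.gz", "offset": 123, "length": 456},
        {"set": "train"},
    ),
    (
        {"url": "http://example.com/about", "filename": "crawl.warc.gz", "offset": 579, "length": 100},
        {"set": "test"},
    ),
]

lines = [to_jsonl_line(record, info) for record, info in records]
jsonl_text = "\n".join(lines) + "\n"  # this text can be written to a .jsonl file
print(jsonl_text, end="")
```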