Understanding the "content type" for PDFs in crawling output

242 views Asked by At

Using heritrix, I have crawled a site which contained some PDF files. The crawl log shows that the content type for the pdf link is "application/pdf", whereas the response in .warc file (crawl output) shows that the content type is "application/http" as well as "application/pdf" (see the example below:).

WARC/1.0^M
WARC-Type: response^M
WARC-Target-URI: `http://example.com/b/c/files/abc.pdf`^M
WARC-Date: 2014-05-29T10:48:03Z^M
WARC-Payload-Digest: sha1:JMRPMGSNIPHBPSBNPD2VJ2NIOGD75UUK^M
WARC-IP-Address: 86.36.67.50^M
WARC-Record-ID: <urn:uuid:00c8b80f-2851-42a1-a449-3cd9e238bfe9>^M
**Content-Type: application/http; msgtype=response^M**
Content-Length: 592173^M
WARC-Block-Digest: sha256:0a56d251257dbcbd6a54e19a528a56aae3e0c9e92a6702f4048e3b69bb3e0920^M
^M
HTTP/1.1 200 OK^M
Date: Thu, 29 May 2014 10:48:04 GMT^M
Server: Apache/2.4.4 (Unix) OpenSSL/0.9.7d PHP/5.3.12 mod_jk/1.2.35^M
Last-Modified: Wed, 20 Nov 2013 08:13:50 GMT^M
ETag: "90805-4eb975c6bcb80"^M
Accept-Ranges: bytes^M
Content-Length: 591877^M
Connection: close^M
**Content-Type: application/pdf^M** 
followed by the content of the PDF file

I do not understand how this is happening. Can anyone please explain?

1

There are 1 answers

0
Nytux On

The WARC file contains:

First the WARC-Header-Metadata, from the beginning to the first empty line. This header describes what follows, ie. a full http response, with header and content. Hence the content-type to application/http.

Then comes the HTTP-Response-Metadata. This header is the actual HTTP header and describes what follows, ie. a PDF document.