Splitting complex PDF files using Watson Document Conversion Service


We are implementing a question-answering system using Watson Discovery Service (WDS). We require each answer unit to be available as a single document. Our corpus consists of complex PDF files that contain two-column text, tables, and images. Instead of ingesting whole PDF files into WDS and relying on passage retrieval, we use the Watson Document Conversion Service (WDC) to split each PDF file into answer units and then ingest these answer units into WDS.

We are facing two issues when splitting complex PDFs with the Watson Document Conversion service.

  1. We expect each heading to become a title and the text under it to become the data (answer). However, the service splits each chapter into a single answer unit. Is there any way to split a two-column document based on its headings?
  2. When the input PDF file contains a table, the document conversion service reads the structured data as plain text (the table formatting is lost). Is there any way to carry structured data from the PDF into an answer unit?

Answered by Anton Prevosti

I would recommend that you first convert your PDF to normalized HTML by using this setting:

   "conversion_target": "normalized_html"

and inspect the generated HTML. Look for the places where headings (<h1>, <h2>, ..., <h6>) are detected. Those are the tags that will be used to split the document into answer units when you switch back to answer_units. You are currently seeing each chapter come out as one answer unit probably because each chapter starts with a detected heading, but no headings are detected within the chapter.
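To make that inspection less manual, the detected headings can be listed with a few lines of Python. This is only a sketch using the standard library; the `HeadingLister` class, the `list_headings` helper, and the sample HTML are hypothetical and not part of the service. It assumes you have already saved the normalized HTML returned by the conversion.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class HeadingLister(HTMLParser):
    """Collects (tag, text) pairs for every heading in an HTML document."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._current = None  # heading tag currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self._current = tag
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open heading tag.
        if self._current is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            self.headings.append((tag, "".join(self._text).strip()))
            self._current = None


def list_headings(html):
    """Return the headings found in the given HTML string, in document order."""
    parser = HeadingLister()
    parser.feed(html)
    parser.close()
    return parser.headings


# Hypothetical sample: the section title "1.1 Setup" came out as bold text,
# not as a heading, so it would not start a new answer unit.
sample = (
    "<html><body><h1>Chapter 1</h1><p>Intro text.</p>"
    "<p><b>1.1 Setup</b></p><p>More text.</p></body></html>"
)
for tag, text in list_headings(sample):
    print(tag, text)
```

If a section title you expect is missing from the list, as "1.1 Setup" is in this sample, that is a place where the heading detection needs tuning.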

To generate more answer units, you will need to tweak the PDF input configuration, as described in the service documentation, so that the PDF-to-HTML conversion step produces more headings and hence more answer units.

For example, the following configuration will detect headings at 6 different levels, based on certain font characteristics for each level:

{
  "conversion_target": "normalized_html",
  "pdf": {
    "heading": {
      "fonts": [
        {"level": 1, "min_size": 24},
        {"level": 2, "min_size": 18, "max_size": 23, "bold": true},
        {"level": 3, "min_size": 14, "max_size": 17, "italic": false},
        {"level": 4, "min_size": 12, "max_size": 13, "name": "Times New Roman"},
        {"level": 5, "min_size": 10, "max_size": 12, "bold": true},
        {"level": 6, "min_size": 9, "max_size": 10, "bold": true}
      ]
    }
  }
}
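While tweaking the font rules, it can help to know which levels share font sizes, because for those levels only the other attributes (bold, italic, name) can tell them apart. The following is a rough sketch of such a check; the `overlapping_size_ranges` helper is hypothetical, and it assumes that `min_size` and `max_size` are inclusive bounds. How the service itself resolves overlapping rules is not covered here.

```python
import math


def overlapping_size_ranges(fonts):
    """Return (level_a, level_b) pairs whose size ranges intersect.

    Assumption: min_size and max_size are inclusive; a missing bound
    is treated as unbounded on that side.
    """
    pairs = []
    for i, a in enumerate(fonts):
        for b in fonts[i + 1:]:
            a_lo, a_hi = a.get("min_size", 0), a.get("max_size", math.inf)
            b_lo, b_hi = b.get("min_size", 0), b.get("max_size", math.inf)
            if a_lo <= b_hi and b_lo <= a_hi:
                pairs.append((a["level"], b["level"]))
    return pairs


# The font rules from the configuration above, as Python dictionaries.
fonts = [
    {"level": 1, "min_size": 24},
    {"level": 2, "min_size": 18, "max_size": 23, "bold": True},
    {"level": 3, "min_size": 14, "max_size": 17, "italic": False},
    {"level": 4, "min_size": 12, "max_size": 13, "name": "Times New Roman"},
    {"level": 5, "min_size": 10, "max_size": 12, "bold": True},
    {"level": 6, "min_size": 9, "max_size": 10, "bold": True},
]
print(overlapping_size_ranges(fonts))  # [(4, 5), (5, 6)]
```

In this configuration, levels 4 and 5 both accept size 12 and levels 5 and 6 both accept size 10, so those sizes are worth a closer look in the generated HTML.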

You can start with a configuration like this and keep tweaking it until the produced normalized HTML contains the headings at the places that you expect the answer units to be. Then, take the tweaked configuration, switch to answer_units and put it all together:

{
  "conversion_target": "answer_units",
  "answer_units": {
    "selector_tags": ["h1", "h2", "h3", "h4", "h5", "h6"]
  },
  "pdf": {
    "heading": {
      "fonts": [
        {"level": 1, "min_size": 24},
        {"level": 2, "min_size": 18, "max_size": 23, "bold": true},
        {"level": 3, "min_size": 14, "max_size": 17, "italic": false},
        {"level": 4, "min_size": 12, "max_size": 13, "name": "Times New Roman"},
        {"level": 5, "min_size": 10, "max_size": 12, "bold": true},
        {"level": 6, "min_size": 9, "max_size": 10, "bold": true}
      ]
    }
  }
}
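If you keep the tuned configuration in a script, the switch from normalized_html to answer_units can be done programmatically so that the heading rules are not retyped by hand. This is a minimal sketch; `to_answer_units_config` is a hypothetical helper, and it only rearranges the keys already shown in the configurations above.

```python
import copy
import json


def to_answer_units_config(html_config, selector_tags=None):
    """Return a copy of a tuned normalized_html config that targets answer_units."""
    if selector_tags is None:
        selector_tags = ["h1", "h2", "h3", "h4", "h5", "h6"]
    config = copy.deepcopy(html_config)  # leave the tuned config untouched
    config["conversion_target"] = "answer_units"
    config["answer_units"] = {"selector_tags": list(selector_tags)}
    return config


# Shortened version of the tuned configuration, for illustration only.
tuned = {
    "conversion_target": "normalized_html",
    "pdf": {
        "heading": {
            "fonts": [
                {"level": 1, "min_size": 24},
                {"level": 2, "min_size": 18, "max_size": 23, "bold": True},
            ]
        }
    },
}
print(json.dumps(to_answer_units_config(tuned), indent=2))
```

Passing a shorter `selector_tags` list, for example only h1 and h2, would split on fewer heading levels and give larger answer units.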

Regarding your second question about tables, unfortunately there is no way to convert table content into separate answer units. As explained above, answer unit generation is based on heading detection. That said, if there is a table between two detected headings, that table will be part of the answer unit, like any other content between the two headings.
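The grouping described above can be illustrated locally. The sketch below is not the service's implementation; `split_by_headings` is a hypothetical helper that uses a simple regular expression on sample HTML, only to show that whatever sits between two headings, including a table, ends up in the same unit. It says nothing about how the service renders the table content inside the answer unit.

```python
import re

# Matches an opening heading tag, its content, and the matching closing tag.
HEADING_RE = re.compile(r"<(h[1-6])\b[^>]*>(.*?)</\1>", re.IGNORECASE | re.DOTALL)


def split_by_headings(html):
    """Split HTML into (title, content) pairs, one pair per heading."""
    units = []
    matches = list(HEADING_RE.finditer(html))
    for i, m in enumerate(matches):
        # A unit runs from the end of its heading to the start of the next one.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(html)
        title = re.sub(r"<[^>]+>", "", m.group(2)).strip()
        units.append((title, html[m.end():end].strip()))
    return units


# Hypothetical sample with a table between two headings.
sample = (
    "<h1>Pricing</h1><p>See the table.</p>"
    "<table><tr><td>Basic</td><td>10</td></tr></table>"
    "<h1>Support</h1><p>Contact us.</p>"
)
for title, content in split_by_headings(sample):
    print(title, "->", content)
```

Here the table lands in the "Pricing" unit together with the paragraph before it, and the "Support" unit starts only at the next heading.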