Extract each sentence from a PDF file to a separate cell in Excel?


As the title suggests, I have a PDF file from which each sentence needs to be extracted to a cell in Excel, one sentence per cell.

The sentence extraction can be as simple as finding the next ". " and extracting everything up to it into a cell. The problem is that I don't really know any programming language aside from MATLAB (I'm a mechanical engineer).

If it can ignore tables/pictures that's awesome, if not it's fine so long as it doesn't screw up when it encounters a table/picture. I know I'm not giving you a lot to work on but any help is appreciated.

There is 1 answer

Answer by Kurt Pfeifle

You didn't say how you want your "sentence cells" to be laid out...

  1. Short answer: This is not possible.

  2. Extended answer: This is quite difficult, and it also depends on your specific PDF file. Some PDFs do not lend themselves at all to text extraction.

  3. You could try the following command, which attempts to capture each sentence into a field of a CSV-type table (with only one column, and a number of rows equal to the total number of sentences):

    pdftotext -layout -x 10 -y 20 -W 400 -H 490 the.pdf - \
      | tr "\\n" " "            \
      | perl -pe 's#\f# #g'     \
      | perl -pe 's#\. #.\n#g'  \
      | perl -pe 's#\? #?\n#g'  \
      | perl -pe 's#\! #!\n#g'  \
      | sed 's#^#"#'            \
      | sed 's#$#",#'           \
      | tee myvalues.csv
    

    This example works with a sample 2-page PDF which I created to quick-test the command above.

    [Screenshot of the 2-page PDF]

    The above command works on Linux and Mac OS X. (Sorry, no time to come up with an equivalent Windows version!)

    To understand how (and IF) this command works for your PDF, go forward step by step:

    • Execute the first line on its own as a first attempt (drop the final \ sign, which is only a line continuation marker). This first line only extracts the text from the PDF and prints it to standard output. If this doesn't work, none of the other lines will either. The -x .. -y .. -W .. -H .. parameters try to get rid of page headers and footers (such as the page numbering in the example PDF) by selecting the top-left corner of a rectangle (x and y) together with its width (W) and height (H), which limits text extraction to exactly that area.

    • Execute the first two lines in a second attempt (keep the line continuation marker on the first line, get rid of the marker on the second). The second line takes the output from the first and replaces each newline character by a space character. Hence, you'll have all contents of a page on a single line.

    • Execute the first three lines in a third attempt (keep the line continuation markers on the first and second lines, get rid of the marker on the third). The third line takes the output from the first two lines and replaces each formfeed character by a space character. These formfeed characters may occur in the original output where page breaks occur, sometimes within a sentence. (Alternatively, you could add -nopgbrk to the original pdftotext command to avoid the insertion of page breaks altogether.) Hence, you'll have all contents of all pages on a single line.

    • Lastly, execute all lines as they are given above. The fourth line replaces every occurrence of ". " (a period followed by a space) with a period followed by a newline character. The fifth and sixth lines do the same for sentences that end in a question mark or an exclamation mark. The seventh and eighth lines wrap each line in double quotes and end it with a comma. The last line writes the result to a file, myvalues.csv, while still printing it to the terminal.

    This is how the output will look:

    "this is a paragraph.",
    "this is a sentence.",
    "this is a sentence.",
    "this is a sentence.",
    "this is a sentence.",
    "this is a sentence.",
    "this is a sentence.",
    "this is a sentence.",
    "this is a sentence.",
    "this is a sentence.",
    "this is a paragraph.",
    "this is a sentence.",
    "this is a sentence.",
    "this is a sentence.",
    [....]
    "this is a sentence.",
    

If the command works as intended for you, it will produce a CSV (comma-separated values) text file. This type of text file can easily be imported into Excel.