I am trying to extract the text from a PDF file.
I found iText and have been using it with decent success. The one remaining problem is ligatures.
At first I only noticed that some characters were missing. After some searching I came across this: http://support.itextpdf.com/node/25
Once I knew that the missing characters were ligatures, I searched for ways to solve the problem, but I haven't been able to come up with a solution yet.
Here is my code:
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;
import com.itextpdf.text.pdf.parser.SimpleTextExtractionStrategy;

import java.io.BufferedWriter;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;

public class ReadPdf {

    private static final String INPUTFILE = "F:/Users/jmack/Webwork/Redglue_PDF/live/ADP/APR/ADP_41.pdf";

    public static void writeTextFile(String fileName, String s) {
        // Replace ligature code points with the letters they represent.
        // replace() does a literal replacement; replaceAll() would treat the argument as a regex.
        // s = s.replace("\u0063\u006B", "just a test");
        s = s.replace("\uFB00", "ff");
        s = s.replace("\uFB01", "fi");
        s = s.replace("\uFB02", "fl");
        s = s.replace("\uFB03", "ffi");
        s = s.replace("\uFB04", "ffl");
        s = s.replace("\uFB05", "st"); // U+FB05 is the long-s + t ligature, so "st" rather than "ft"
        s = s.replace("\uFB06", "st");
        s = s.replace("\u0132", "IJ");
        s = s.replace("\u0133", "ij");

        // try-with-resources closes the writer even if write() throws
        try (BufferedWriter writer = new BufferedWriter(
                new OutputStreamWriter(new FileOutputStream(fileName), "UTF-8"))) {
            writer.write(s);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) {
        try {
            PdfReader reader = new PdfReader(INPUTFILE);
            // Extract the text of page 1 only
            String str = PdfTextExtractor.getTextFromPage(reader, 1, new SimpleTextExtractionStrategy());
            reader.close();
            writeTextFile("F:/Users/jmack/Webwork/Redglue_PDF/live/itext/read_test.txt", str);
        } catch (Exception e) {
            System.out.println(e);
        }
    }
}
In the PDF referenced above one line reads:
part of its design difference is a roofline
But when I run the Java class above, the text output contains:
part of its design diference is a roofine
Notice that "difference" became "diference" and "roofline" became "roofine".
Interestingly, when I copy and paste from the PDF into Stack Overflow's text field, I get the same result as the second sentence: the two ligatures "ff" and "fl" are each reduced to a single "f".
I am hoping that someone here can help me figure out how to catch the ligatures and replace them with the characters they represent, for example replacing the ligature "fl" with an actual "f" followed by an "l".
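To illustrate the kind of replacement I mean, here is a minimal sketch that uses the JDK's `java.text.Normalizer` instead of a hand-written list of replacements. It assumes the ligature code points (such as U+FB00) are actually present in the extracted string, which may not be the case for this PDF. The class name `LigatureExpander` and the method `expand` are hypothetical names chosen for this example.

```java
import java.text.Normalizer;

public class LigatureExpander {

    // Expands compatibility characters such as the "ff" and "fl" ligatures
    // into their plain-letter equivalents using Unicode NFKC normalization.
    // Note: NFKC also rewrites other compatibility characters (for example
    // full-width forms), so it is broader than a ligature-only replacement.
    public static String expand(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // The input contains U+FB00 (ff ligature) and U+FB02 (fl ligature)
        System.out.println(expand("di\uFB00erence roo\uFB02ine"));
    }
}
```

This only helps when the extractor returns the ligature code points in the first place; it cannot recover characters that are absent from the string.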
I ran some tests on the output from PdfTextExtractor and tried to replace the ligature Unicode characters with the actual characters, but discovered that the Unicode characters for those ligatures do not appear anywhere in the string it returns.
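A check along the following lines shows whether any ligature code points survive in a string. It is a simplified sketch rather than my exact test code; `CodePointDump` and `nonAscii` are hypothetical names, and the sample strings stand in for the value returned by the extractor.

```java
public class CodePointDump {

    // Returns a space-separated list of "U+XXXX" entries, one for each
    // non-ASCII code point in s. An empty result means the string
    // contains no ligature code points (or any other non-ASCII character).
    public static String nonAscii(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints()
         .filter(cp -> cp > 0x7F)
         .forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // A string that still contains the ff ligature
        System.out.println(nonAscii("di\uFB00erence"));
        // A string like the one I actually get back: nothing to report
        System.out.println(nonAscii("diference").isEmpty());
    }
}
```

An empty result for the extracted text would mean the characters are dropped before the string ever reaches my replacement code.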
It seems that it must be something in iText itself that is not reading those ligatures correctly. I am hopeful that someone knows how to work around that.
Thank you for any help you can give!
TL;DR: I am converting a PDF to text with iText and found characters missing. They turned out to be ligatures, and now I need to capture those ligatures but am not sure how to go about it.