I'm currently trying to automatically extract important keywords from a PDF file. I am able to get the text information out of the PDF document. But now I need to know, which font size and font family these keywords have.
The following code I already have:
Main
public static void main(String[] args) throws IOException {
String src = "SEM_081145.pdf";
PdfReader reader = new PdfReader(src);
SemTextExtractionStrategy semTextExtractionStrategy = new SemTextExtractionStrategy();
PrintWriter out = new PrintWriter(new FileOutputStream(src + ".txt"));
Rectangle rect = new Rectangle(70, 80, 490, 580);
RenderFilter filter = new RegionTextRenderFilter(rect);
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
// strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter);
out.println(PdfTextExtractor.getTextFromPage(reader, i, semTextExtractionStrategy));
}
out.flush();
out.close();
}
And I have implemented the TextExtraction Strategy SemTextExtractionStrategy
which looks like this:
public class SemTextExtractionStrategy implements TextExtractionStrategy {
private String text;
@Override
public void beginTextBlock() {
}
@Override
public void renderText(TextRenderInfo renderInfo) {
text = renderInfo.getText();
System.out.println(renderInfo.getFont().getFontType());
System.out.print(text);
}
@Override
public void endTextBlock() {
}
@Override
public void renderImage(ImageRenderInfo renderInfo) {
}
@Override
public String getResultantText() {
return text;
}
}
I can get the FontType but there is no method to get the font size. Is there another way or how can I get the font size of the current text segment?
Or are there any other libraries which can fetch out the font size from TextSegments? I already had a look into PDFBox, and PDFTextStream. The PDF Shareware Library from Aspose would perfectly do the job. But it's very expensive and I need to use an open source project.
You can adapt the code provided in this answer, in particular this code snippet:
This answer is in C#, but the API is so similar that the conversion to Java should be straightforward.