I am trying to use borb to extract text from PDFs. Some PDFs work well, but for others I get extra spaces between all letters and between words. The output looks like:
I N B E T A L N I N G / G I R E R I N G A V
If I count the spaces and notice that there are more than usual, can I use a regex in some way to remove one space everywhere?
So that it looks like:
INBETALNING / GIRERING AV
Disclaimer: I am the author of borb.
A PDF document doesn't really contain text as such. It contains rendering instructions that a program like Adobe Reader executes. These instructions yield something a human might interpret as text.
For instance, in the rendering instructions for "Hello World", the space is not necessarily present as an explicit character. It could be, but it doesn't need to be. Many PDF creation tools choose not to insert a space, but rather move the drawing cursor along.
What that means for text extraction is that software such as borb has to guess when to insert a space. It can tell how far apart the bounding boxes of two characters are.
Of course, if the space character is not used in the rendering instructions, it might not be included in the font information either. This is called font subsetting: a specialised font is created that contains only the characters actually in use.
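To make the idea of guessing spaces from bounding boxes concrete, here is a minimal sketch. It is not borb's actual code: the `Glyph` tuple, the function name `join_glyphs`, the `ratio` parameter and the "average glyph width" fallback are all assumptions made for illustration. It inserts a space whenever the horizontal gap between two consecutive glyphs exceeds a fraction of the space width, and estimates that width when the font does not provide it.

```python
from typing import List, Optional, Tuple

# Hypothetical glyph record: (character, left x, right x) of its bounding box.
Glyph = Tuple[str, float, float]


def join_glyphs(
    glyphs: List[Glyph],
    space_width: Optional[float] = None,
    ratio: float = 0.5,
) -> str:
    """Join glyphs on one line, inserting a space where the gap is large enough."""
    if not glyphs:
        return ""
    if space_width is None:
        # Assumed fallback for subsetted fonts without a space glyph:
        # treat a space as being about as wide as the average glyph.
        space_width = sum(x1 - x0 for _, x0, x1 in glyphs) / len(glyphs)
    out = [glyphs[0][0]]
    for (_, _, prev_right), (char, left, _) in zip(glyphs, glyphs[1:]):
        # Gap between the previous glyph's right edge and this glyph's left edge.
        if left - prev_right > ratio * space_width:
            out.append(" ")
        out.append(char)
    return "".join(out)


glyphs = [("H", 0, 6), ("i", 6, 8), ("y", 12, 18), ("o", 18, 24)]
print(join_glyphs(glyphs))                  # estimated space width
print(join_glyphs(glyphs, space_width=10))  # user-supplied space width
```

With a threshold that is too small for the document, every inter-letter gap counts as a space, which would produce the kind of output shown in the question. Letting the caller pass `space_width` is one way to avoid that.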
When this happens, borb doesn't know how wide a space character is supposed to be. borb will then try different heuristics. If you look at the code of SimpleTextExtraction, you will be able to see this logic in action.

I suggest you subclass that class and modify it to allow you (the user) to define an acceptable space character width.
In particular have a look at this line.
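If you would rather post-process the extracted string, as the question suggests, a regex can work, but only under one assumption: the real word gaps must survive as runs of two or more spaces, while the spurious gaps are single spaces. If every gap is a single space, the word boundaries cannot be recovered this way. The sketch below is plain Python, independent of borb; the function names and the 0.8 threshold are hypothetical choices.

```python
import re


def looks_letter_spaced(text: str, threshold: float = 0.8) -> bool:
    """Return True when most whitespace-separated tokens are single characters."""
    tokens = text.split()
    if not tokens:
        return False
    single = sum(1 for token in tokens if len(token) == 1)
    return single / len(tokens) >= threshold


def collapse_letter_spacing(text: str) -> str:
    """Drop single spaces and shrink longer runs of spaces to one space."""
    if not looks_letter_spaced(text):
        # Leave normally spaced text untouched.
        return text
    return re.sub(r" +", lambda m: "" if len(m.group()) == 1 else " ", text)


spaced = "I N B E T A L N I N G   /   G I R E R I N G   A V"
print(collapse_letter_spacing(spaced))
```

The check in `looks_letter_spaced` keeps the substitution from being applied to pages that were extracted correctly.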