I am using the .NET port of PDFBox to extract text from a PDF along with each piece of text's location. While searching, I found the following Java code:
    PDFTextStripper stripper = new PDFTextStripper()
    {
        @Override
        protected void writeString(String text, List<TextPosition> textPositions) throws IOException
        {
            super.writeString(text, textPositions);
            TextPosition firstProsition = textPositions.get(0);
            TextPosition lastPosition = textPositions.get(textPositions.size() - 1);
            writeString(String.format("[%s - %s / %s]", firstProsition.getXDirAdj(), lastPosition.getXDirAdj() + lastPosition.getWidthDirAdj(), firstProsition.getYDirAdj()));
        }
    };
    stripper.setSortByPosition(true);
    return stripper.getText(document);
I converted it to .NET as follows:
    class PDFTextLocationStripper : PDFTextStripper
    {
        public string textWithPostion = "";

        protected override void processTextPosition(TextPosition text)
        {
            textWithPostion += "String[" + text.getXDirAdj() + "," +
                text.getYDirAdj() + " fs=" + text.getFontSize() + " xscale=" +
                text.getXScale() + " height=" + text.getHeightDir() + " space=" +
                text.getWidthOfSpace() + " width=" +
                text.getWidthDirAdj() + "]" + text.getCharacter();
        }

        protected override void writeString(java.lang.String text, java.util.List textPositions)
        {
            base.writeString(text, textPositions);
            TextPosition firstProsition = (TextPosition)textPositions.get(0);
            TextPosition lastPosition = (TextPosition)textPositions.get(textPositions.size() - 1);
            // C# String.Format uses {0}-style placeholders, not Java's %s
            writeString(String.Format("[{0} - {1} / {2}]", firstProsition.getXDirAdj(), lastPosition.getXDirAdj() + lastPosition.getWidthDirAdj(), firstProsition.getYDirAdj()));
        }
    }
But I get the following compilation errors for the above code:
Error 1 No overload for method 'writeString' takes 2 arguments
Error 2 'PDFTextLocationStripper.writeString(java.lang.String, java.util.List)': no suitable method found to override
So, how do I override the writeString method so that I can extract text along with its location?
Since I wasn't able to override the writeString method, I used processTextPosition to extract words from a PDF along with their positions. Here is the code:
And here is the PdfWord class.
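
The grouping step itself does not depend on PDFBox, so it can be sketched on its own. Below is a minimal Java sketch of the idea, under stated assumptions: `WordGrouper`, `Glyph`, and `WordBox` are hypothetical names made up for illustration, and `Glyph` merely stands in for the values a `TextPosition` exposes (`getXDirAdj()`, `getYDirAdj()`, `getWidthDirAdj()`). Characters are accumulated into a word until whitespace, a change of line, or a horizontal gap larger than an assumed tolerance is seen.

```java
import java.util.ArrayList;
import java.util.List;

final class WordGrouper {

    /** Hypothetical stand-in for the per-character data taken from a TextPosition. */
    static final class Glyph {
        final String ch; final float x; final float y; final float width;
        Glyph(String ch, float x, float y, float width) {
            this.ch = ch; this.x = x; this.y = y; this.width = width;
        }
    }

    /** Hypothetical result type: a word and its horizontal extent on a line. */
    static final class WordBox {
        final String text; final float xStart; final float xEnd; final float y;
        WordBox(String text, float xStart, float xEnd, float y) {
            this.text = text; this.xStart = xStart; this.xEnd = xEnd; this.y = y;
        }
    }

    /**
     * Groups glyphs (assumed to arrive in reading order) into words.
     * A word ends at whitespace, at a change in y, or when the gap to the
     * previous glyph exceeds the tolerance (an assumed tuning value).
     */
    static List<WordBox> group(List<Glyph> glyphs, float tolerance) {
        List<WordBox> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        float xStart = 0f, xEnd = 0f, y = 0f;

        for (Glyph g : glyphs) {
            boolean blank = g.ch.trim().isEmpty();
            boolean jump = current.length() > 0
                    && (Math.abs(g.y - y) > tolerance || g.x - xEnd > tolerance);

            // Close the word in progress before handling this glyph.
            if ((blank || jump) && current.length() > 0) {
                words.add(new WordBox(current.toString(), xStart, xEnd, y));
                current.setLength(0);
            }
            if (blank) {
                continue; // whitespace only separates words
            }
            if (current.length() == 0) {
                xStart = g.x; // first glyph fixes the word's left edge and line
                y = g.y;
            }
            current.append(g.ch);
            xEnd = g.x + g.width; // right edge follows the last glyph
        }
        if (current.length() > 0) {
            words.add(new WordBox(current.toString(), xStart, xEnd, y));
        }
        return words;
    }
}
```

In a real stripper subclass, each `Glyph` would be built inside `processTextPosition` and `group` would be called once per page. The tolerance would probably need tuning per document, for example relative to `getWidthOfSpace()`.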