The question seems to be weird, but I need to ask this, since I am witnessing a quite interesting output when I compare text as image and graphics as image.
Ideally I am in process of identifying an tool, or algorithm to compare two pdfs, generate output which will highlight the difference between them.
There are possibilities in pdfs, which will have text as image format (legacy text on papers, are converted to pdfs).
and we are doing migration of those legacy pdfs, and finally we are comparing with legacy and converted pdf output.
I am evaluating couple of tools like Adobe dc pro, i-net pdfc and power pdf etc, for comparing two pdfs.
While evaluating, I am able to see graphic images are getting compared(not accurate either) on either side of the pdfs. Where as text as images are completely ignored, unanimously same results in all the tools.
But I am more interested in text as image, since we deal more of legacy text pdfs.
Below, is attached graphic image comparison result, where it could able to capture the differences between the images.
But when I compare text image, differences are not highlighted in the tool.
What I understand from this, text is not compared as image graphics, and tool is completely ignoring the comparison. I would like have clarification whether my assumption is correct.
Secondly, I would like to know how to compare text image in pdfs to generate the differences?.
I'm working for the company that is author of i-net PDFC so I'll answer your first question as well:
Your assumption is correct. i-net PDFC is able to compare images and shapes, but it cannot detect if some content completely changed it's meaning, e.G. a line shape that is used to draw a letter or in your case an image that has to be recognized as text. Recognizing ASCII art as image won't work for the same reason either. Such cases will always be detected as differences even though their visual appearance is similar.
On your second question: Using an OCR conversion tool for one or both documents is a common solution to this problem. A simple image comparison of the compared pages in unlikely to work due to the different font styles and line wrappings in the converted file. Please note that most OCR applications will use the rendered page images for the recognition. This may lead to incorrect recognition results even if there are no images in the PDF file.
i-net Software is aware of this general issue and an OCR module is currently in development. It'll provide an option to apply the recognition solely to the images in the PDF files.