Fighting with OCR reverse engineering

I am referring to software-based OCR (image-to-text conversion tools). Stack Overflow has plenty of posts on building OCR, but I am looking for the opposite: guidance on how to protect my images from being reverse engineered.

For example, I have images containing only text. How can I make it difficult for anyone to decode the data? Is there an image format that can do this, or can we obfuscate the images?

Can special fonts or distortion guarantee OCR protection? My requirements do not allow heavily distorted text to be served, though.

Any direction would be very helpful.

There are 4 answers

2
Hank On

As I and others have said, making a large amount of text obscure enough that OCR can't read it will also make it impractical for humans to read.

Is there a specific threat you're trying to beat? Simple web crawlers often don't execute JavaScript, so a crude way to make your text harder to scrape would be to load it with an AJAX request and insert it into the DOM.

Or, if you want to go further, you could display the text in a Flash or Silverlight control. That is still not OCR-proof, but it would make it non-trivial to grab large amounts of text automatically, particularly if you have a Flash scrollbar and/or pagination. (I should point out that Flash controls for something as simple as text sound annoying to use, won't be searchable or bookmarkable, and obviously won't work on the majority of mobile devices.)

1
Tomato On

As I understand it, you have a collection of copyrighted text that should be clearly readable by humans, but you don't want it to leak from your server in electronic form. I don't think it's a good idea to obfuscate the text to make it harder to OCR, since that will also make it unreadable for humans, especially if the texts are really long. Basically, whatever is easy for humans to read can be OCR-ed perfectly well, and whatever is difficult to OCR is difficult for people too. In the worst case, an attacker may hire an Indian company to retype the text manually, which is actually not that expensive.

I would suggest looking at other aspects to get good protection. What does your use case look like? How do users end up with your texts as images on their PCs? Do they simply download them as PDF or image files? In that case it would be much simpler to fight the ability to DOWNLOAD your files than to make them unreadable.

For example, you could avoid giving access to the whole file at once and instead show it page by page, with human interaction required to get to the next page. You could even scramble your web interface so that typical site download utilities cannot fetch everything. Each page should be displayed at the same URL, while the actual navigation talks to the server through AJAX or even some proprietary interface.
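
A minimal sketch of the page-by-page idea, assuming a server that hands out each page together with a token for the next one. The names SECRET_KEY, page_token and fetch_page are made up for illustration, and how the token travels (AJAX body, header) is left open. This only stops clients that guess or enumerate page URLs; a script that follows the tokens the way a browser does still gets through, so it raises the effort rather than preventing download.

```python
import hashlib
import hmac
import secrets

# Hypothetical server-side secret; a real deployment would load it from configuration.
SECRET_KEY = secrets.token_bytes(32)


def page_token(session_id: str, page: int) -> str:
    """Return the token that authorises `session_id` to fetch `page`."""
    message = f"{session_id}:{page}".encode("utf-8")
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


def fetch_page(pages: list[str], session_id: str, page: int, token: str) -> dict:
    """Serve one page if the token matches, plus the token for the next page."""
    if not 0 <= page < len(pages):
        raise IndexError("no such page")
    expected = page_token(session_id, page)
    # Constant-time comparison so the token cannot be guessed byte by byte.
    if not hmac.compare_digest(token.encode("utf-8"), expected.encode("utf-8")):
        raise PermissionError("invalid or missing page token")
    has_next = page + 1 < len(pages)
    return {
        "text": pages[page],
        "next_page": page + 1 if has_next else None,
        "next_token": page_token(session_id, page + 1) if has_next else None,
    }
```

The first token would be issued when the reader opens the document; every later page is reachable only through the response to the previous one.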

Another way is to put a lot of false links on every page that are not visible to humans. They will mislead download utilities, making them download tons of wrong content, or download it in the wrong order so that it is unusable.
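
The false-link idea could look roughly like this. It is a sketch, not a tested defence: HoneypotTracker, the /articles/ path prefix and the client_id value are assumptions made for the example, and a real site would have to decide what identifies a client (session, IP address, and so on).

```python
import secrets


class HoneypotTracker:
    """Hands out hidden trap links and remembers which clients followed one."""

    def __init__(self) -> None:
        self.trap_paths: set[str] = set()
        self.flagged_clients: set[str] = set()

    def make_trap_link(self) -> str:
        """Return an HTML link that a person browsing normally never sees or clicks."""
        path = f"/articles/{secrets.token_hex(8)}"  # hypothetical URL scheme
        self.trap_paths.add(path)
        # display:none hides the link visually; aria-hidden and tabindex keep it
        # away from screen readers and keyboard navigation.
        return (
            f'<a href="{path}" style="display:none" aria-hidden="true" '
            f'tabindex="-1" rel="nofollow">more</a>'
        )

    def record_request(self, client_id: str, path: str) -> bool:
        """Return True if this client should be treated as an automated downloader."""
        if path in self.trap_paths:
            self.flagged_clients.add(client_id)
        return client_id in self.flagged_clients
```

Once a client is flagged, the server can feed it decoy pages or stop serving it altogether.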

And if you succeed in fighting automated download, you won't even have to provide your content as an image. It can be plain text, served only in small pieces, which will be unusable to a downloader anyway.

Hope that gives you some idea which way to go.

0
Hjulle On

I have seen some pages obfuscate text by inserting invisible letters and other "noise" into it. This way you can still display it as text while making it a lot harder to copy.
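
As a rough illustration of the noise idea, assuming the page is generated server-side and a stylesheet hides the decoys (for example a rule such as .n { display: none }). The function name add_invisible_noise and the class name n are made up. This sketch targets scrapers that strip tags from the raw markup; whether the decoys also end up in a clipboard copy depends on the browser and on how they are hidden.

```python
import html
import random
import string


def add_invisible_noise(text: str, rng: random.Random, rate: float = 0.3) -> str:
    """Return HTML for `text` with hidden decoy letters interleaved."""
    parts = []
    for ch in text:
        parts.append(html.escape(ch))  # the real, visible character
        if rng.random() < rate:
            decoy = rng.choice(string.ascii_lowercase)
            # Hidden by CSS; aria-hidden keeps screen readers from reading the noise.
            parts.append(f'<span class="n" aria-hidden="true">{decoy}</span>')
    return "".join(parts)
```

A scraper that simply drops the tags gets the decoy letters mixed into the text, while the rendered page shows only the original characters.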

Another idea might be to watermark the text in some way so that you can recognize where a "stolen" copy came from. Whether this is useful depends on exactly what you want to be protected from. As has already been mentioned, if it is readable, someone could copy it manually.
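
One possible way to watermark text served as text, sketched under the assumption that each recipient has a small numeric ID: hide the ID as zero-width characters after the first word. The names embed_watermark and extract_watermark and the 16-bit width are arbitrary choices for the example. The mark survives copy and paste in many tools, but not retyping, OCR of a screenshot, or any step that strips non-printing characters.

```python
ZERO = "\u200b"  # zero-width space, encodes bit 0
ONE = "\u200c"   # zero-width non-joiner, encodes bit 1


def embed_watermark(text: str, recipient_id: int, bits: int = 16) -> str:
    """Hide `recipient_id` as invisible characters after the first word."""
    if not 0 <= recipient_id < 2 ** bits:
        raise ValueError("recipient_id does not fit in the requested bit width")
    # Most significant bit first.
    payload = "".join(
        ONE if (recipient_id >> i) & 1 else ZERO for i in reversed(range(bits))
    )
    head, sep, tail = text.partition(" ")
    return head + payload + sep + tail


def extract_watermark(text: str, bits: int = 16):
    """Recover the hidden ID, or return None if no complete mark is present."""
    marks = [ch for ch in text if ch in (ZERO, ONE)]
    if len(marks) < bits:
        return None
    value = 0
    for ch in marks[:bits]:
        value = (value << 1) | int(ch == ONE)
    return value
```

For example, extract_watermark(embed_watermark("some protected text", 1234)) gives back 1234, which could then be looked up against the list of recipients.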

3
starmole On

I do not think you can do that. For CAPTCHAs, yes, and there is a lot of research on them, but you will also know from personal experience how annoying they are to read. For longer text it is impossible. I would seriously question the use case or business model here, though. You have some content that for some reason needs protection from OCR. That means somebody would be willing to spend resources to OCR your content. Why would you fight those people? Make them customers and offer the content in plain text for a fee. If that fee is less than their OCR cost, you have a win-win. What you are trying to implement sounds like a lose-lose.