I have a document image, such as a scanned newspaper or magazine page. I want to remove all (or most) of the text and keep the pictures in the document. Does anyone know how to detect the text regions in the document? Below is an example. Thanks in advance!
example image: https://www.mathworks.com/matlabcentral/answers/uploaded_files/21044/6ce011abjw1elr8moiof7j20jg0w9jyt.jpg
The usual object recognition pattern will work here: threshold, detect regions, filter regions, then do what you need with the remaining regions.
Thresholding is easy here. The background is pure white (or can be filtered to be pure white), so any pixel that is above 0 in the inverted grayscale image belongs to either text or a picture. Regions (connected components) can then be detected within this thresholded binary image.
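As a rough illustration of the thresholding and region detection step, here is a minimal pure-Python sketch. It is not the MATLAB code from this answer. It assumes the grayscale page is a nested list of integers in 0..255 with a white (255) background, and the names `threshold_inverted` and `label_regions`, the `threshold` parameter, and the choice of 8-connectivity are all assumptions made for the example.

```python
from collections import deque

def threshold_inverted(gray, threshold=0, white=255):
    """Binary mask: True where the inverted grayscale value exceeds `threshold`.

    `threshold=0` follows the "anything above 0" rule; a noisy scan would
    probably need a larger value (assumption).
    """
    return [[(white - px) > threshold for px in row] for row in gray]

def label_regions(mask):
    """Label 8-connected foreground regions with 1..count (0 = background)."""
    h, w = len(mask), len(mask[0])
    labels = [[0] * w for _ in range(h)]
    count = 0
    for r in range(h):
        for c in range(w):
            if not mask[r][c] or labels[r][c]:
                continue
            # New region found: flood-fill it with a breadth-first search.
            count += 1
            labels[r][c] = count
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and mask[ny][nx] and not labels[ny][nx]):
                            labels[ny][nx] = count
                            queue.append((ny, nx))
    return labels, count

# Tiny made-up "page": one isolated dark pixel and one two-pixel blob.
gray = [
    [255, 255, 255, 255],
    [255,   0, 255, 255],
    [255, 255, 255,  10],
    [255, 255, 255,  20],
]
labels, count = label_regions(threshold_inverted(gray))
```

On a real scan the same idea applies, only with an image library doing the labeling instead of a hand-written flood fill.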
To filter the regions, we just have to identify what makes the text different from the pictures. Text regions are going to be small, since every letter is its own region; pictures are big regions in comparison. Filtering by region area with a suitable threshold will keep all of the pictures and remove all of the text, assuming no picture on the page is about the size of a single letter. If some are, then other filtering criteria can be used (saturation, hue variance, ...).
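The area filter could look something like the sketch below, again in plain Python rather than the MATLAB from this answer. It assumes a label image like the one a connected-component step produces (0 for background, positive integers for regions); `region_stats`, `keep_large_regions`, and the `min_area` value are hypothetical names and numbers chosen for the example, and the right threshold depends on the scan resolution.

```python
def region_stats(labels):
    """Map each nonzero label to its pixel area and inclusive bounding box.

    The bounding box is stored as [top, left, bottom, right].
    """
    stats = {}
    for r, row in enumerate(labels):
        for c, lab in enumerate(row):
            if lab == 0:
                continue
            s = stats.setdefault(lab, {"area": 0, "bbox": [r, c, r, c]})
            s["area"] += 1
            box = s["bbox"]
            box[0] = min(box[0], r)
            box[1] = min(box[1], c)
            box[2] = max(box[2], r)
            box[3] = max(box[3], c)
    return stats

def keep_large_regions(stats, min_area):
    """Keep only regions whose area is at least `min_area` (the pictures)."""
    return {lab: s for lab, s in stats.items() if s["area"] >= min_area}

# Hand-written label image: regions 1 and 3 are letter-sized, region 2 is big.
labels = [
    [1, 0, 0, 0, 0],
    [0, 0, 2, 2, 2],
    [0, 0, 2, 2, 2],
    [3, 0, 2, 2, 2],
]
kept = keep_large_regions(region_stats(labels), min_area=4)
```

Additional criteria such as saturation would be applied the same way, as one more condition in the filter.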
Once the regions are filtered by area (and by saturation or other criteria, if needed), a new image can be created by copying the pixels of the original image that fall within the bounding boxes of the remaining regions onto a blank image.
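A minimal Python sketch of that last step is shown here; it is an illustration, not the MATLAB code from this answer. It assumes the image is a nested list of pixel values, that bounding boxes are inclusive `(top, left, bottom, right)` tuples, and that the blank output should be white (255). The function name `copy_boxes` is made up for the example.

```python
def copy_boxes(image, boxes, background=255):
    """Return a blank image with only the pixels inside `boxes` copied over."""
    h, w = len(image), len(image[0])
    out = [[background] * w for _ in range(h)]
    for top, left, bottom, right in boxes:
        for r in range(top, bottom + 1):
            # Copy one row of the bounding box from the original image.
            out[r][left:right + 1] = image[r][left:right + 1]
    return out

# Made-up 4x5 image; only the box covering rows 1..3, columns 2..4 is kept.
image = [
    [ 10, 255, 255, 255, 255],
    [255, 255,  50,  60,  70],
    [255, 255,  80,  90, 100],
    [ 20, 255, 110, 120, 130],
]
result = copy_boxes(image, [(1, 2, 3, 4)])
```

Copying whole bounding boxes rather than only the labeled pixels keeps any light or white areas inside a picture intact.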
MATLAB implementation:
As you can see in the image linked below, it does what is needed: everything except the pictures and the masthead is removed. The good thing is that this works just as well with color and grayscale images, and the masthead is not an issue if you're working with newspaper pages other than the front page.
Results:
https://i.stack.imgur.com/VtWjU.png