How to detect text regions in a document image?


I have a document image, such as a scanned newspaper or magazine page. I want to remove all or most of the text and keep the pictures in the document. Does anyone know how to detect the text regions in the document? An example is below. Thanks in advance!

example image: https://www.mathworks.com/matlabcentral/answers/uploaded_files/21044/6ce011abjw1elr8moiof7j20jg0w9jyt.jpg

Answer by Staus (best answer)

The usual pattern of object recognition will work here: threshold, detect regions, filter regions, then do what you need with the remaining regions.

Thresholding is easy here. The background is pure white (or can be filtered to be pure white) so anything that is above 0 in the inverted grayscale image is either text or an image. Then regions can be detected within this thresholded binary image.
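As a small illustration of the thresholding and region detection steps, here is a sketch on a synthetic page rather than a real scan. The image `page` and its single dark block are made up for the example, and `bwconncomp` assumes the Image Processing Toolbox is available.

```matlab
% Synthetic stand-in for a scanned page: white background, one dark block.
page = 255 * ones(100, 100, 'uint8');
page(20:40, 30:60) = 50;            % dark "ink" pixels (assumed test data)

inv  = 255 - page;                  % invert: background becomes 0, ink becomes bright
mask = inv > 0;                     % anything above 0 is either text or a picture

% Find connected regions in the binary mask (Image Processing Toolbox)
cc = bwconncomp(mask);
fprintf('%d region(s), %d foreground pixels\n', cc.NumObjects, nnz(mask));
```

On a real scan the background is rarely exactly white, so an automatic threshold such as `graythresh` is usually a safer choice than comparing against 0.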

For filtering the regions, we just have to identify what makes the text different from the pictures. Text regions are small, since every letter is its own region, while pictures are large regions in comparison. Filtering by region area with a suitable threshold keeps the pictures and removes the text, assuming none of the pictures are about the size of a single letter. If some are, then other filtering criteria can be used (saturation, hue variance, ...).
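If area alone is not enough, one possible way to add a saturation criterion is to pass the saturation channel to `regionprops` as the intensity image and read back `MeanIntensity` per region. This is only a sketch on a synthetic image: the two colored blocks and the cutoff `MinSat` are hypothetical values chosen for illustration, not something tuned on the example scan.

```matlab
% Synthetic RGB page: white background, one gray block and one red block.
rgb = 255 * ones(60, 120, 3, 'uint8');
rgb(10:50, 10:50, :)    = 80;       % gray block (no color)
rgb(10:50, 70:110, 1)   = 200;      % red block: strong red channel ...
rgb(10:50, 70:110, 2:3) = 30;       % ... weak green and blue

mask   = (255 - rgb2gray(rgb)) > 0; % foreground mask, as in the thresholding step
hsvImg = rgb2hsv(rgb);              % channels: hue, saturation, value (all in [0, 1])
satImg = hsvImg(:, :, 2);

% With an intensity image supplied, 'MeanIntensity' is the mean of satImg
% over the pixels of each region.
regs = regionprops(mask, satImg, 'Area', 'MeanIntensity');

MinSat = 0.3;                       % hypothetical cutoff; tune for your scans
isColorful = [regs.MeanIntensity] > MinSat;
disp(isColorful);
```

A region could then be kept when it passes either test, for example `[regs.Area] > MinArea | isColorful`, so that small but strongly colored pictures survive the area filter.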

Once the regions are filtered by area (and by saturation or any other criteria you add), a new image can be created by copying the pixels of the original image that fall within the bounding boxes of the kept regions into a blank image.
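One detail worth knowing when turning a bounding box into pixel indices: `regionprops` reports `BoundingBox` as `[x y width height]`, where `x` and `y` lie on the edge of the first pixel (so they end in .5) and the width and height are pixel counts. The sketch below, on a made-up 10-by-10 mask, shows one way to convert that into row and column ranges that stay inside the image.

```matlab
% Tiny mask with one region spanning rows 3-5 and columns 4-8 (assumed test data)
mask = false(10, 10);
mask(3:5, 4:8) = true;

s  = regionprops(mask, 'BoundingBox');
bb = s.BoundingBox;                 % [x y width height] = [3.5 2.5 5 3]

% First pixel is ceil() of the corner; the last is first + size - 1
rows = ceil(bb(2)):(ceil(bb(2)) + bb(4) - 1);
cols = ceil(bb(1)):(ceil(bb(1)) + bb(3) - 1);

% Copy the boxed pixels from a source image into a blank (white) one
src = uint8(magic(10));
dst = 255 * ones(size(src), 'uint8');
dst(rows, cols) = src(rows, cols);
```

Using `floor` on the corner and adding the full width instead widens the box by one pixel on each side, which can index outside the image when a region touches the border.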

MATLAB implementation:

%%%%%%%%%%%%
% Set these values depending on your input image

img = imread('https://www.mathworks.com/matlabcentral/answers/uploaded_files/21044/6ce011abjw1elr8moiof7j20jg0w9jyt.jpg');

MinArea = 2000; % Minimum area to consider, in pixels
%%%%%%%%%
% End User inputs

gsImg = 255 - rgb2gray(img); % convert to grayscale (and invert 'cause that's how I think)
threshImg = gsImg > graythresh(gsImg)*max(gsImg(:)); % Threshold automatically

% Detect connected regions and measure their bounding box and area
regs = regionprops(threshImg, 'BoundingBox', 'Area');

% Keep only the regions that exceed the area threshold
regKeep = false(length(regs), 1);
for k = 1:length(regs)

    regKeep(k) = (regs(k).Area > MinArea);

end

regs(~regKeep) = []; % Delete those regions that don't pass qualifications for image

% Make a new blank image to hold the passed regions
newImg = 255*ones(size(img), 'uint8');

for k = 1:length(regs)

    boxHere = regs(k).BoundingBox; % [x y width height]; x and y lie on pixel edges (e.g. 0.5)
    rows = ceil(boxHere(2)):(ceil(boxHere(2)) + boxHere(4) - 1); % Rows covered by the box
    cols = ceil(boxHere(1)):(ceil(boxHere(1)) + boxHere(3) - 1); % Columns covered by the box
    % Insert pixels within bounding box from original image into the new
    % image (ranges stay inside the image even for regions touching the border)
    newImg(rows, cols, :) = img(rows, cols, :);

end

% Display
figure()
image(newImg);

As you can see in the image linked below, it does what is needed. Everything except the pictures and the masthead is removed. A nice property is that this works just as well with color and grayscale pictures, which helps if you're working with newspaper pages other than the front page.

Results:

https://i.stack.imgur.com/VtWjU.png