Grouping together of lines while doing line segmentation of printed text

I have been trying to segment lines from a printed text document. I have followed the following paper:

"A Hough Transform based Technique for Text Segmentation" by Satadal Saha, Subhadip Basu, Mita Nasipuri and Dipak Kr. Basu

As per the paper, I used the Hough transform to generate straight lines over the text, restricting the angles to the vicinity of 90 degrees, and then used a connected component algorithm to group the generated straight lines and separate out the lines of text.

The Hough transform output is given below:

[Image: Hough transform output]

However, the generated straight lines sometimes overlap two adjacent text lines, so more than one text line gets grouped into a single component.

The bounding boxes of the lines in the text are shown below:

[Image: bounding boxes of the line segments of the text]

Can anybody please help me avoid this grouping together of lines of text? Please suggest a method so that the connected component analysis treats the lines of text as separate components.

There is 1 answer

Shai (best answer)

You are using connected components to group your Hough lines into text lines. This process is very sensitive to noise: even one misdetected pixel can join two lines into a single component.
You can make this process more robust by looking at the fraction of "on" pixels in each row of the image:

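bw = imread('https://i.stack.imgur.com/tg2xN.png');
bw = bw > 100;   % binarize: pixels covered by Hough lines become "on"
% mean(bw,2) is the fraction of "on" pixels in each image row
figure; plot( mean(bw,2) ); xlabel('image row'); ylabel('fraction of "on" pixels');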

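[Image: plot of the fraction of "on" pixels per image row, with the threshold drawn as a red line]

The red line shows a 7.5% threshold on the fraction of "on" pixels per row. As you can see, it helps distinguish rows that belong to well-connected Hough lines from rows where only a few pixels falsely connect two text lines.
Use this threshold to amend the mask, zeroing out every row that falls below it: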

msk = bsxfun(@times, bw, mean(bw,2)>0.075);
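
To see why the row threshold breaks false connections, here is a minimal sketch on a made-up 8x20 mask (the toy data and the variable names rowFrac, keepRow, bandStart and bandEnd are assumptions for illustration, not part of the original image). Two bands of "on" pixels are bridged by a single stray column, and the 7.5% threshold removes the bridge:

```matlab
% Toy mask: two horizontal bands of "on" pixels (rows 2-3 and 6-7)
% falsely joined by a one-pixel-wide bridge in rows 4-5.
bw = false(8, 20);
bw(2:3, :) = true;
bw(6:7, :) = true;
bw(4:5, 10) = true;            % the bridge that merges the two bands

rowFrac = mean(bw, 2);         % fraction of "on" pixels in each row
keepRow = rowFrac > 0.075;     % same 7.5% threshold as above

% Zero out every row below the threshold (same effect as the bsxfun call).
msk = bw;
msk(~keepRow, :) = false;

% Runs of consecutive kept rows: each run is one separated band.
d = diff([false; keepRow; false]);
bandStart = find(d == 1);
bandEnd   = find(d == -1) - 1;
disp([bandStart bandEnd]);     % one row per band: [first row, last row]
```

The bridge rows have only 1 "on" pixel out of 20 (5%), so they fall below the threshold and the two bands come apart. The threshold value itself will likely need tuning for other images.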

Now you can get the proper bounding boxes:

bb=regionprops(bwlabel(msk,8),'BoundingBox');
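
If the Image Processing Toolbox is not available, a rough alternative (a swapped-in technique, not the regionprops call above) is to treat each run of consecutive non-empty rows in the thresholded mask as one text line and take its column extent. The sketch below uses a made-up mask and hypothetical variable names. It reports boxes as [x y width height] with the half-pixel offset that regionprops uses, but it yields one box per row band rather than one per connected component:

```matlab
% Toy thresholded mask with two separated bands of different widths.
msk = false(8, 20);
msk(2:3, 3:18) = true;
msk(6:7, 5:12) = true;

% Find runs of consecutive rows that contain any "on" pixel.
keepRow  = any(msk, 2);
d        = diff([false; keepRow; false]);
rowStart = find(d == 1);
rowEnd   = find(d == -1) - 1;

% One bounding box per band: [x y width height].
boxes = zeros(numel(rowStart), 4);
for k = 1:numel(rowStart)
    cols = find(any(msk(rowStart(k):rowEnd(k), :), 1));
    boxes(k, :) = [cols(1) - 0.5, rowStart(k) - 0.5, ...
                   cols(end) - cols(1) + 1, rowEnd(k) - rowStart(k) + 1];
end
disp(boxes);
```

Because every row band becomes a single box, this variant also merges fragments of the same text line that are not connected to each other, which may or may not be what you want.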

The result:

[Image: bounding boxes computed from the thresholded mask]