I am trying to segment the lines of text in a printed document, following this paper:
"A Hough Transform based Technique for Text Segmentation" by Satadal Saha, Subhadip Basu, Mita Nasipuri and Dipak Kr. Basu
As per the paper, I used the Hough transform, restricted to angles in the vicinity of 90 degrees, to generate straight lines over the text, and then a connected-component algorithm to group those straight lines so that each line of text is separated out.
The Hough transform output is given below:
However, the generated straight lines sometimes bridge two adjacent lines of text, so more than one text line gets grouped into a single component.
The bounding boxes of the lines in the text are given below:
Can anybody help me avoid this grouping of separate text lines? Please suggest a method so that the connected-component analysis treats each line of text as a separate component.
You are using connected components to group your Hough lines into text lines. This process is very sensitive to noise: even one misdetected pixel can join two text lines into a single component.
You can make this process more robust by looking at the fraction of "on" pixels in each row of the image:
The red line shows a 7.5% threshold on the fraction of "on" pixels per row. As you can see, it helps distinguish well-connected Hough lines from falsely connected ones.
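A minimal sketch of that row profile in plain Python, assuming the Hough-line mask is available as a list of rows of 0/1 values. The name `row_on_fraction` and the toy mask are illustrative assumptions, not something from the paper:

```python
def row_on_fraction(mask):
    """Return, for each row of a binary mask, the fraction of pixels that are on."""
    return [sum(1 for px in row if px) / len(row) for row in mask]


WIDTH = 20
# Toy mask: two bands of Hough lines bridged by one stray pixel in row 2.
mask = [
    [1] * WIDTH,
    [1] * 18 + [0] * 2,
    [0] * 3 + [1] + [0] * 16,  # the falsely connecting row
    [1] * WIDTH,
    [1] * 15 + [0] * 5,
]

THRESHOLD = 0.075  # the 7.5% level drawn as the red line
profile = row_on_fraction(mask)
weak_rows = [y for y, fraction in enumerate(profile) if fraction <= THRESHOLD]
print(weak_rows)  # [2]
```

With an array library the same profile is just a per-row mean of the mask; the explicit loop is only there to keep the sketch dependency-free.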
Use this threshold to amend the mask:
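One way this could look, again as a hedged sketch on a list-of-rows mask: every row whose "on" fraction does not exceed the threshold is cleared, which cuts the thin bridges between text lines. `suppress_weak_rows` is a hypothetical helper name, and the 0.075 default simply mirrors the 7.5% figure above:

```python
def suppress_weak_rows(mask, threshold=0.075):
    """Return a copy of mask in which rows at or below the on-fraction threshold are cleared."""
    amended = []
    for row in mask:
        fraction = sum(1 for px in row if px) / len(row)
        if fraction > threshold:
            amended.append(list(row))       # well-populated row: keep as is
        else:
            amended.append([0] * len(row))  # sparse row: treat as noise
    return amended


# Toy mask: row 2 holds a single stray pixel joining two bands.
mask = [
    [1] * 20,
    [1] * 18 + [0] * 2,
    [0] * 3 + [1] + [0] * 16,
    [1] * 20,
    [1] * 15 + [0] * 5,
]
amended = suppress_weak_rows(mask)
print(amended[2] == [0] * 20)  # True
```

The threshold will likely need tuning per document: too high and short text lines disappear, too low and the bridges survive.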
Now you can get the proper bounding boxes
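As an illustration, here is a simplified stand-in for the connected-component step: since the amended mask has empty rows between text lines, each run of consecutive non-empty rows is treated as one text line. This is an assumption made to keep the sketch short, not the labelling used in the paper, and `line_bounding_boxes` is a hypothetical name:

```python
def line_bounding_boxes(mask):
    """Return inclusive (top, left, bottom, right) boxes, one per run of non-empty rows."""
    boxes = []
    top = None
    # A trailing empty row acts as a sentinel that closes the last run.
    for y, row in enumerate(list(mask) + [[]]):
        if any(row):
            if top is None:
                top = y  # a new text line starts here
        elif top is not None:
            cols = [x for r in mask[top:y] for x, px in enumerate(r) if px]
            boxes.append((top, min(cols), y - 1, max(cols)))
            top = None
    return boxes


# Amended toy mask: two bands separated by a cleared row.
amended = [
    [0] * 2 + [1] * 18,
    [0] * 2 + [1] * 15 + [0] * 3,
    [0] * 20,
    [1] * 12 + [0] * 8,
    [1] * 10 + [0] * 10,
]
boxes = line_bounding_boxes(amended)
print(boxes)  # [(0, 2, 1, 19), (3, 0, 4, 11)]
```

If a page has several columns of text, a real connected-component labelling on the amended mask would be needed instead, because row runs alone cannot split components that sit side by side.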
This results in: