Python 2.6
Recently I have been doing some text mining work on resumes. The objective is to divide each resume into sections based on its headings and contents, and then classify it against the required job descriptions (JDs). For example, we know that a resume generally contains these sections:
1) Personal Information
2) Summary
3) Technical Skills
4) Earlier Projects and Experience
5) Education.
What I want is to build a database that stores, for every resume, the content found under each of these categories.
The structure is something like this:
            Personal Information   Summary         Technical Skills   Experience/Projects   Education
Resume 1    Relevant info          Relevant info   Relevant info      Relevant info         Relevant info
Resume 2    "                      "               "                  "                     "
Resume 3    "                      "               "                  "                     "
The relevant info is the content found under the corresponding section of each resume.
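To make the target structure concrete, here is a minimal sketch of one database row per resume. The names `COLUMNS` and `make_row` are hypothetical, and I am assuming the sections have already been extracted into a dict keyed by column name, which is exactly the part I cannot do yet.

```python
# Fixed column order for the database; names taken from the table above.
COLUMNS = [
    "Personal Information",
    "Summary",
    "Technical Skills",
    "Experience/Projects",
    "Education",
]


def make_row(resume_id, sections):
    """Build one row; sections missing from a resume become empty strings."""
    row = {"Resume": resume_id}
    for column in COLUMNS:
        row[column] = sections.get(column, "")
    return row


row = make_row("Resume 1", {"Summary": "Data analyst", "Education": "B.Tech"})
```

Every row ends up with the same keys, so the rows can be written to a CSV file or a database table without special cases.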
After some research, my problem boils down to identifying the section names. The idea is to find where one section name begins and where the next one begins, so that the text in between belongs to the first section. Herein lies the problem.
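A minimal sketch of that interval idea, assuming the section names are already known and each one sits on its own line (the function name `split_sections` and the sample text are made up for illustration):

```python
def split_sections(text, headings):
    """Map each known heading to the text between it and the next heading."""
    wanted = set(h.lower() for h in headings)
    sections = {}
    current = None
    for line in text.splitlines():
        # Normalize the line so "Technical Skills:" matches "technical skills".
        key = line.strip().rstrip(":").strip().lower()
        if key in wanted:
            current = key
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    # Join the collected lines of every section into a single string.
    return dict((name, "\n".join(body).strip())
                for name, body in sections.items())


resume = "John Doe\nTechnical Skills:\nPython, SQL\nExperience\nAcme Corp, 2 years\n"
result = split_sections(resume, ["Technical Skills", "Experience"])
```

Text before the first recognized heading is dropped here, and the whole thing only works when the heading strings are known in advance.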
Problem: Suppose Resume 1 has the section names "Technical skills" and "Experience". We take the text between the two and put it under the Technical Skills column for Resume 1. But in Resume 2 the same sections are named "Expertise in Software" and "Earlier works & Project Profiles", so the keywords we used earlier no longer find them. To extract the sections from different CVs I therefore have to use different section names each time, and I cannot generalize my code.
I have tried using a dictionary of similar terms, i.e. synonyms: for "Soft Skills" there are "Technical Expertise", "Software Expertise", "Technical Knowledge", etc.; similarly "Academics", "Educational Qualification" and "Education"; and the same for Experience, Projects and the other sections. But this list is not exhaustive. There can always be other names for these sections, because people can write anything in their CV. A section can also contain subsections with different names.
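Roughly, the dictionary approach looks like the sketch below. The variant lists are only the examples mentioned in this post, and the names `SYNONYMS`, `LOOKUP` and `canonical_section` are hypothetical. The `difflib` fallback is an extra assumption added to tolerate small spelling differences. It does not solve the problem of completely new section names.

```python
import difflib

# Canonical column name -> known variants (far from exhaustive).
SYNONYMS = {
    "Technical Skills": ["technical skills", "technical expertise",
                         "software expertise", "technical knowledge",
                         "expertise in software"],
    "Education": ["education", "academics", "educational qualification"],
    "Experience/Projects": ["experience", "projects",
                            "earlier works & project profiles"],
}

# Reverse index: variant -> canonical column name.
LOOKUP = dict((variant, canon)
              for canon, variants in SYNONYMS.items()
              for variant in variants)


def canonical_section(line, cutoff=0.8):
    """Return the canonical section for a heading line, or None if unknown."""
    key = line.strip().rstrip(":;").strip().lower()
    if key in LOOKUP:
        return LOOKUP[key]
    # Fuzzy fallback for near misses such as plurals or small typos.
    close = difflib.get_close_matches(key, list(LOOKUP), n=1, cutoff=cutoff)
    return LOOKUP[close[0]] if close else None
```

With this, "Academics:" and "Educational Qualifications" both map to Education, but a heading that shares no wording with the known variants still returns None, which is the gap I am stuck on.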
Generally a section name ends with a colon or a semicolon, so we can also look for that.
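A minimal sketch of that punctuation heuristic, assuming a heading is a short line made only of letters, spaces, "&" and "/" that ends with a colon or semicolon (the pattern and the name `candidate_headings` are my own guesses, not something validated on real resumes):

```python
import re

# A short alphabetic phrase followed by ":" or ";" and nothing else on the line.
HEADING_RE = re.compile(r"^\s*([A-Za-z][A-Za-z &/]{1,40}?)\s*[:;]\s*$")


def candidate_headings(text):
    """Return (line_number, heading_text) pairs for lines that look like headings."""
    found = []
    for number, line in enumerate(text.splitlines()):
        match = HEADING_RE.match(line)
        if match:
            found.append((number, match.group(1)))
    return found


sample = "Technical Skills:\nPython, SQL\nEmail: a@b.com\nEducation ;\nB.Tech 2010"
headings = candidate_headings(sample)
```

Lines such as "Email: a@b.com" are skipped because the colon is not at the end of the line, but headings written without any punctuation are missed entirely.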
These are just approaches, though, and none of them is concrete enough to build the database I want. Most of the resumes are in PDF format, so I convert them to text first and then read the text. Section names that are set in a larger or different font from the rest of the resume end up in the same font as everything else after conversion, so there is no way to identify them by those criteria.
These are the problems I am facing. A general algorithm for picking out the section names would ease my work a lot. I know this is a forum for coding problems, and it has been a great help to me since I started my career, but I am posting this here in case anyone can give me some insight on how to proceed. I am coding in Python, but suggestions for other languages such as R or SAS would also help. Mostly, a general algorithm for choosing the section names would suit me best. Please help if you have any ideas, perhaps tagging with a conditional random field. Thanks in advance.
PS: I have already tried an NER approach, and also converting all formats to HTML to extract the headers, but all efforts were fruitless...