I need to break a line of text into separate columns in Excel. Here is the input that I get.
Input:
- 37006 II Semester P.G. Diploma in Clinical Research and Clinical Data Management Examination, July/August 2012 Pharma Regulatory Affairs Time : 3 Hours Max. Marks : 100
Output: CSV record with structure (Code, Sem/Year, Subject, Course, Exam Date, Time, Marks)
- 37006 , II Semester, P.G. Diploma in Clinical Research and Clinical Data Management, Pharma Regulatory Affairs, July/August 2012 , 3 Hours , 100
I have sets of known values from which lines like the one above are built. For example:
Grammar (this is an array / dictionary):
- Semesters[I,II,III,IV,V,VI,VII,VIII,IX,X,1,2,3,4,5,6,7,8,9,10]
- Years[I,II,III,IV,V,VI,VII,VIII,IX,X,1,2,3,4,5,6,7,8,9,10]
- Subjects[P.G. Diploma in Clinical Research and Clinical Data Management, LL.B]
- Courses[Pharma Regulatory Affairs,Law - Jurisprudence]
- ExamDates[ July/August 2012 , Jan./Feb. 2013 ]
- Time[3 Hours]
- MaxMarks[30,40,50,60,70,80,90,100]
FYI,
- I'm not sure I can rely on any delimiter to split the line, since the delimiters are highly unpredictable and not dependable.
- I'm not sure the fields will appear in the same order on each line, and there is no fixed length in characters or words.
My assumption is that I should read the line word by word and try to match each word against the entries in the arrays I have. If a word matches, I categorize it accordingly and add it to the relevant column in Excel.
I know how to handle the data and everything else; what I don't know is the best (most efficient) way to work out which category each word falls under.
Is there a lexical analysis expert who can share some thoughts on this?
You should use regular expressions to match a complicated text pattern like this.
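As a rough illustration, here is a minimal Python sketch of that idea. It turns each array from the question into one alternation pattern and searches the whole line for it, so the order of the fields does not matter. The names `alternation`, `FIELD_PATTERNS`, `parse_line` and `to_csv` are made up for this example. It also assumes that the code is the first run of digits on the line, and that the time and marks always follow the labels "Time :" and "Max. Marks :" as in the sample; adjust those patterns if your real data differs.

```python
import csv
import io
import re

# Vocabularies copied from the question; extend them as new values appear.
NUMERALS = ["I", "II", "III", "IV", "V", "VI", "VII", "VIII", "IX", "X"] + [
    str(n) for n in range(1, 11)
]
SUBJECTS = [
    "P.G. Diploma in Clinical Research and Clinical Data Management",
    "LL.B",
]
COURSES = ["Pharma Regulatory Affairs", "Law - Jurisprudence"]
EXAM_DATES = ["July/August 2012", "Jan./Feb. 2013"]
TIMES = ["3 Hours"]
MAX_MARKS = ["30", "40", "50", "60", "70", "80", "90", "100"]


def alternation(words):
    """Join literal words into a regex alternation, longest first,
    so that "VIII" is tried before "V" and "100" before "10"."""
    ordered = sorted(words, key=len, reverse=True)
    return "|".join(re.escape(word) for word in ordered)


# One pattern per output column; group 1 holds the captured value.
FIELD_PATTERNS = {
    # Assumption: the code is the first run of digits on the line.
    "code": re.compile(r"^\W*(\d+)"),
    "sem_year": re.compile(
        rf"\b((?:{alternation(NUMERALS)})\s+(?:Semester|Year))\b"
    ),
    "subject": re.compile(rf"({alternation(SUBJECTS)})"),
    "course": re.compile(rf"({alternation(COURSES)})"),
    "exam_date": re.compile(rf"({alternation(EXAM_DATES)})"),
    # Assumption: time and marks follow their labels, as in the sample line.
    "time": re.compile(rf"Time\s*:\s*({alternation(TIMES)})"),
    "marks": re.compile(rf"Max\.\s*Marks\s*:\s*({alternation(MAX_MARKS)})\b"),
}


def parse_line(line):
    """Return a dict with one entry per column; unmatched fields are empty."""
    record = {}
    for field, pattern in FIELD_PATTERNS.items():
        match = pattern.search(line)
        record[field] = match.group(1) if match else ""
    return record


def to_csv(records):
    """Serialize parsed records as CSV text that Excel can open."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(FIELD_PATTERNS))
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


LINE = (
    "37006 II Semester P.G. Diploma in Clinical Research and Clinical Data "
    "Management Examination, July/August 2012 Pharma Regulatory Affairs "
    "Time : 3 Hours Max. Marks : 100"
)
record = parse_line(LINE)
print(to_csv([record]))
```

Sorting the alternatives longest first matters because a regex alternation takes the first alternative that matches, not the longest one. A field left empty in the result is a hint that the corresponding array needs another entry.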