How to break a line into known words


I need to split a line of text into separate columns in Excel. Here is the input that I get.

Input:

  • 37006 II Semester P.G. Diploma in Clinical Research and Clinical Data Management Examination, July/August 2012 Pharma Regulatory Affairs Time : 3 Hours Max. Marks : 100

Output: CSV record with structure (Code, Sem/Year, Subject, Course, Exam Date, Time, Marks)

  • 37006 , II Semester, P.G. Diploma in Clinical Research and Clinical Data Management, Pharma Regulatory Affairs, July/August 2012 , 3 Hours , 100

I have the data from which these lines are constructed in separate sets. For example:

Grammar (each entry is an array / dictionary):

  • Semesters[I,II,III,IV,V,VI,VII,VIII,IX,X,1,2,3,4,5,6,7,8,9,10]
  • Years[I,II,III,IV,V,VI,VII,VIII,IX,X,1,2,3,4,5,6,7,8,9,10]
  • Subjects[P.G. Diploma in Clinical Research and Clinical Data Management, LL.B]
  • Courses[Pharma Regulatory Affairs,Law - Jurisprudence]
  • ExamDates[ July/August 2012 , Jan./Feb. 2013 ]
  • Time[3 Hours]
  • MaxMarks[30,40,50,60,70,80,90,100]

FYI,

  • I'm not sure that I can use any delimiters to split the line, as they are highly unpredictable and not dependable.
  • I'm not sure that the text will be in the same order in each line, and there is no fixed length in characters or words.

My assumption is to read the line word by word and try to match each word against the entries in the arrays that I have. If a word matches an entry, I categorize it under that entry's category and add it to the relevant column in Excel.
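The word-by-word idea above can be sketched as a greedy longest-phrase match. This is only a minimal sketch: the `GRAMMAR` dictionary, `build_index` and `categorize` are hypothetical names, and the single-token sets (Semesters, Years, MaxMarks) are deliberately left out because a bare token such as `3` or `II` is ambiguous without its neighbouring words.

```python
# Hypothetical category dictionaries, copied from the grammar in the question.
GRAMMAR = {
    "Subject": ["P.G. Diploma in Clinical Research and Clinical Data Management", "LL.B"],
    "Course": ["Pharma Regulatory Affairs", "Law - Jurisprudence"],
    "Exam Date": ["July/August 2012", "Jan./Feb. 2013"],
    "Time": ["3 Hours"],
}


def build_index(grammar):
    """Map each known phrase (as a tuple of words) to its category."""
    return {
        tuple(phrase.split()): category
        for category, phrases in grammar.items()
        for phrase in phrases
    }


def categorize(line, grammar=GRAMMAR):
    """Return (found, leftover): matched phrases by category, unmatched words."""
    index = build_index(grammar)
    longest = max(len(key) for key in index)
    words = line.split()
    found, leftover = {}, []
    i = 0
    while i < len(words):
        # Try the longest candidate first so a multi-word phrase wins
        # over any shorter phrase starting at the same word.
        for size in range(min(longest, len(words) - i), 0, -1):
            category = index.get(tuple(words[i:i + size]))
            if category is not None:
                found[category] = " ".join(words[i:i + size])
                i += size
                break
        else:
            # No known phrase starts here; keep the word for later handling.
            leftover.append(words[i])
            i += 1
    return found, leftover
```

The leftover words (the code, `II Semester`, the `Max. Marks : 100` part) still need context-aware handling, which is where the ambiguity of single tokens shows up.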

I know how to handle the data and everything else; what I don't know is the best (most optimized) way to work out which category each word falls under.

Is there any lexical analysis expert that can share some thoughts on this?


There are 2 answers

5
Xardas On BEST ANSWER

You should use regular expressions to match such complicated text patterns.
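As a rough illustration of the regular-expression route, here is a minimal Python sketch. It assumes the labels seen in the sample line (`Semester`/`Year`, `Time :`, `Max. Marks :`) are always present and that the line starts with the numeric code; `FIELD_PATTERNS`, `alternation` and `parse_line` are hypothetical names. Each field is searched for independently, so the order of the fields within the line does not matter.

```python
import re

# Known phrases, copied from the grammar in the question.
SUBJECTS = ["P.G. Diploma in Clinical Research and Clinical Data Management", "LL.B"]
COURSES = ["Pharma Regulatory Affairs", "Law - Jurisprudence"]
EXAM_DATES = ["July/August 2012", "Jan./Feb. 2013"]


def alternation(phrases):
    """Build an escaped 'a|b|c' alternation, longest phrase first."""
    ordered = sorted(phrases, key=len, reverse=True)
    return "|".join(re.escape(p) for p in ordered)


# One pattern per output column; group 1 is the value to keep.
FIELD_PATTERNS = {
    "Code": re.compile(r"^\s*(\d+)\b"),
    "Sem/Year": re.compile(r"\b((?:[IVX]+|\d+)\s+(?:Semester|Year))\b"),
    "Subject": re.compile("(" + alternation(SUBJECTS) + ")"),
    "Course": re.compile("(" + alternation(COURSES) + ")"),
    "Exam Date": re.compile("(" + alternation(EXAM_DATES) + ")"),
    "Time": re.compile(r"Time\s*:\s*(\d+\s+Hours?)"),
    "Marks": re.compile(r"Max\.\s*Marks\s*:\s*(\d+)"),
}


def parse_line(line):
    """Return a dict with one entry per column; missing fields are empty."""
    record = {}
    for field, pattern in FIELD_PATTERNS.items():
        match = pattern.search(line)
        record[field] = match.group(1) if match else ""
    return record
```

Fields that are not found come back as empty strings, which makes lines that deviate from the assumed labels easy to spot afterwards.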

0
rajah9 On

Please take a look at a lexer and parser generator like ANTLR. If you know Java or another language that supports regular expressions, you will be able to parse these lines with ease after an afternoon (or week) of torture. You can also write the regular expressions directly in Java, but I would nudge you toward the ANTLR interface, which you may use from Eclipse. It will show you how the lines are being parsed.

Have ANTLR or your Java code write the output to a CSV file. The CSV file will be your vehicle for getting the data into the Excel spreadsheet.
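The CSV step is independent of how the lines are parsed. As a minimal sketch in Python rather than Java, assuming each parsed line is already a dictionary keyed by the column names from the question (`FIELDS`, `write_csv` and `to_csv_text` are hypothetical names), the standard `csv` module handles quoting of values that contain commas:

```python
import csv
import io

# Column order taken from the output structure in the question.
FIELDS = ["Code", "Sem/Year", "Subject", "Course", "Exam Date", "Time", "Marks"]


def write_csv(records, stream):
    """Write a header row plus one row per record to an open text stream."""
    writer = csv.DictWriter(stream, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(records)


def to_csv_text(records):
    """Convenience wrapper that returns the CSV content as a string."""
    buffer = io.StringIO()
    write_csv(records, buffer)
    return buffer.getvalue()
```

When writing to a real file, open it with `newline=""` (for example `open("exams.csv", "w", newline="")`, where the file name is just an example) so the `csv` module controls the line endings itself.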