I have a file written in a kind of metalanguage that describes the procedure needed to validate some data. From this description I need to generate validation functions; the data are already stored in a structure.
Steps I have taken so far:
- Split the text into a string[], using delimiter characters such as ' ', '.', ',', ';', '==', '>='
- Remove articles, prepositions, and other filler words
- Normalize the text (how?)
- Match words against tokens using regex or plain text matching
- Match patterns using the token types
- Generate functions based on the matched pattern rule (a sketch of the whole pipeline follows this list)
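
For illustration, here is a minimal sketch of the pipeline in Python, assuming a hypothetical rule such as "age must be >= 18" and data held in a dict. The token names, the single pattern (WORD OP NUMBER), and the shape of the generated function are illustrative assumptions, not part of the original metalanguage:

```python
import re

TOKEN_SPEC = [
    ("OP",     r"==|>=|<=|!=|>|<"),   # comparison operators
    ("NUMBER", r"\d+(?:\.\d+)?"),
    ("WORD",   r"[A-Za-z_]\w*"),
    ("SKIP",   r"[\s.,;]+"),          # the split characters from step 1
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

STOP_WORDS = {"the", "a", "an", "of", "must", "be"}  # articles, auxiliaries, ...

def tokenize(rule_text):
    """Steps 1-4: split, drop stop words, normalize case, tag token types."""
    tokens = []
    for m in TOKEN_RE.finditer(rule_text):
        kind, text = m.lastgroup, m.group()
        if kind == "SKIP":
            continue
        if kind == "WORD":
            text = text.lower()           # step 3: case normalization
            if text in STOP_WORDS:
                continue                  # step 2: remove filler words
        tokens.append((kind, text))
    return tokens

OPS = {">=": lambda a, b: a >= b, "<=": lambda a, b: a <= b,
       "==": lambda a, b: a == b, "!=": lambda a, b: a != b,
       ">":  lambda a, b: a > b,  "<":  lambda a, b: a < b}

def generate_validator(rule_text):
    """Steps 5-6: match the token pattern and emit a validation function."""
    tokens = tokenize(rule_text)
    kinds = [k for k, _ in tokens]
    if kinds == ["WORD", "OP", "NUMBER"]:
        field, op, limit = tokens[0][1], tokens[1][1], float(tokens[2][1])
        return lambda data: OPS[op](data[field], limit)
    raise ValueError(f"no pattern matches token sequence {kinds}")

validate = generate_validator("age must be >= 18")
print(validate({"age": 21}))  # True
print(validate({"age": 15}))  # False
```
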
What would you use for step 3, or in general, to improve this procedure?
As noted on Wikipedia, regex is one of the techniques used to achieve text normalization.
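
For instance, a couple of regex rewrites can canonicalize case, whitespace, and operator spellings before tokenization. A sketch, where the phrase-to-operator mappings are illustrative assumptions:

```python
import re

def normalize(text):
    # Lower-case, collapse runs of whitespace, and rewrite spelled-out
    # comparisons into a canonical operator form.
    text = text.lower()
    text = re.sub(r"\s+", " ", text).strip()
    text = re.sub(r"\bgreater than or equal to\b", ">=", text)
    text = re.sub(r"\bequals?\b", "==", text)
    return text

print(normalize("Age   EQUALS 18"))                   # age == 18
print(normalize("age greater than or equal to  18"))  # age >= 18
```
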
It seems to me that the data involves linguistic annotations. You could check out tools such as the IMS Open Corpus Workbench (CWB). There is also another article (with sample code) that you may find useful: What Is Text Normalization?