What is the formal process of cleaning unstructured data?

Question

639 views · Asked by Karan Kothari on 21 December 2016 at 15:12

I need help with a couple of things. I am new to NLP and unstructured data cleaning. Can someone answer the following questions? Thanks.

I need help with a regex to identify words like _male and female_, or more generically _word and word_ or _something_something_something, and to get rid of the underscore at the beginning or the end of a word but not in the middle.
I want to know the formal process for cleaning the data: are there standard steps to follow when cleaning unstructured data? I'm asking because I am doing lemmatization (with POS tags) and replacing commonly occurring word pairs like (something, something) with something_something. What steps should I follow? This is what I am doing right now: tokenize_clean > remove_numbers > remove_url > remove_slash > remove_cross > remove_garbage > replace_hypen_with_underscore > lemmatize_sentence > change_words_to_bigrams > remove_smaller_than_3 (words shorter than 3 characters) > remove_simlutaneous (the same word repeated back to back many times, e.g. death death death) > remove_location > remove_bullets > remove_stop > remove_simlutaneous

Should I do something different in these steps?
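For reference, a chain like the one listed above can be sketched as a list of small functions applied in order. This is only a minimal illustration, assuming Python's standard `re` module; the function names (`remove_urls`, `remove_numbers`, `collapse_repeats`, `normalize_whitespace`) are hypothetical stand-ins, not the asker's actual code, and the regular expressions are deliberately simple.

```python
import re

def remove_urls(text):
    # Drop anything that looks like an http(s) or www link.
    return re.sub(r"(?:https?://|www\.)\S+", " ", text)

def remove_numbers(text):
    # Drop standalone runs of digits.
    return re.sub(r"\b\d+\b", " ", text)

def collapse_repeats(text):
    # Collapse a word repeated back to back: "death death death" -> "death".
    return re.sub(r"\b(\w+)(?:\s+\1\b)+", r"\1", text)

def normalize_whitespace(text):
    # Squeeze the gaps left behind by the earlier steps.
    return " ".join(text.split())

# Order matters: URLs go first, because removing numbers first would
# break up links that contain digits and leave fragments behind.
PIPELINE = [remove_urls, remove_numbers, collapse_repeats, normalize_whitespace]

def clean(text, steps=PIPELINE):
    for step in steps:
        text = step(text)
    return text

print(clean("see http://example.com death death death 42 times"))
# see death times
```

Keeping each step as its own function makes it easy to reorder the steps or drop one and compare the effect on the final classifier.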

I also have words like (group'shealthplanbecauseeitheroneofthefollowingqualifyingeventshappens), (whenyouuseanon_networkprovider), (per\xad), (vlfldq\x10vxshuylvhg). How should I handle them? Should I ignore them completely or try to repair them?

My final goal is to classify the documents into a Yes class and a No class. Any suggestions are welcome.

I will provide more examples and explanation if required.

There is 1 answer

**Dmitry** · Answer 1 · 2016-12-21T15:58:00+00:00

Must the regular expression also allow something like __abc__? If not: (\b_[a-zA-Z]+\s)|(\s[a-zA-Z]+_\b)|(\s_[a-zA-Z]+_\b)
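As a sketch of the removal step itself (an alternative to the pattern above, not a drop-in equivalent), the edge underscores can be deleted with a single substitution in Python. It relies on the fact that `_` counts as a word character in `re`, so `\b` only matches next to an underscore at the start or end of a token, never in the middle. The helper name `strip_edge_underscores` is made up for illustration.

```python
import re

# \b_+ : underscores at the start of a token
# _+\b : underscores at the end of a token
EDGE_UNDERSCORES = re.compile(r"\b_+|_+\b")

def strip_edge_underscores(text):
    # Underscores between letters are untouched because there is
    # no word boundary on either side of them.
    return EDGE_UNDERSCORES.sub("", text)

print(strip_edge_underscores("_male and female_"))
# male and female
print(strip_edge_underscores("_something_something_something"))
# something_something_something
print(strip_edge_underscores("__abc__"))
# abc
```

If the text is already tokenized, calling `token.strip("_")` on each token gives the same result without a regex.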
What problem are you solving? Are you preparing the texts for classification, or for something else?
You have to distinguish mistakes from meaningless symbol sequences. There are established ways to do this, for example comparing tokens against the words of a corpus, annotated suffix trees, etc.
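A minimal sketch of the corpus-comparison idea, assuming a vocabulary set is available: a token is "known" if it is in the vocabulary, "glued" if it can be split entirely into vocabulary words, and "unknown" otherwise. The tiny `VOCAB` set and the names `segment` and `classify` are hypothetical; in practice the vocabulary would come from a word list built from a real corpus.

```python
# Toy vocabulary, a stand-in for a real corpus word list (assumption).
VOCAB = {"when", "you", "use", "provider", "network"}

def segment(token, vocab):
    """Split token into vocabulary words, or return None if impossible."""
    # best[i] holds one segmentation of token[:i], or None if none exists.
    best = [None] * (len(token) + 1)
    best[0] = []
    for end in range(1, len(token) + 1):
        for start in range(end):
            if best[start] is not None and token[start:end] in vocab:
                best[end] = best[start] + [token[start:end]]
                break
    return best[len(token)]

def classify(token, vocab):
    if token in vocab:
        return "known"
    if segment(token, vocab):
        return "glued"    # several real words run together: worth repairing
    return "unknown"      # likely a garbage symbol sequence: candidate for removal

print(segment("whenyouuse", VOCAB))   # ['when', 'you', 'use']
print(classify("provider", VOCAB))    # known
print(classify("vlfldq", VOCAB))      # unknown
```

With a large vocabulary many short strings become valid splits, so a real implementation would usually score candidate splits (for example by word frequency) rather than accept the first one found.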