Can't get the text separated by words when I'm doing data cleaning in NLP

I'm working through an NLP exercise on Kaggle. When I clean the text that I have to use to predict the output, the result is not separated into words; instead I get one long sentence with all the words joined together.

This is my text_cleaner function:

import re
import string
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

def text_cleaner(text):
    text = str(text).lower() #lowercase
    text = re.sub('\d+', '', text) #remove numbers
    text = re.sub('\[.*?\]','', text) #remove text in square brackets
    text = re.sub(r'https?://\S+|www\.\S+','',text) #remove url
    text = re.sub(r'\bhtml\b', '', text) #remove html word
    
    text = re.sub(r'['
                           u'\U0001F600-\U0001F64F'  # emoticons
                           u'\U0001F300-\U0001F5FF'  # symbols & pictographs
                           u'\U0001F680-\U0001F6FF'  # transport & map symbols
                           u'\U0001F1E0-\U0001F1FF'  # flags (iOS)
                           u'\U00002702-\U000027B0' 
                           u'\U000024C2-\U0001F251'  #removes emojis
                           ']+', '',text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text) #removes punctuation
    text = re.sub('[^a-z]','',text) #removes non-alphabeticals
    text = text.replace('#', '')
    text = text.replace('@', '')
    text = stop_words(text)
    
    return text

def stop_words(text):
    lem = WordNetLemmatizer()
    stop = set(stopwords.words('english'))
    stop.remove('not')
    punctuation = list(string.punctuation)
    stop.update(punctuation)
    
    text =text.split()
    text= [lem.lemmatize(word) for word in text if word not in stop]
    text = ' '.join(text)
    
    return text

And this is the result that I got:

ourdeedsarethereasonofthisearthquakemayallahfo...

instead of:

deed reason earthquake may allah forgive u...

Thanks!

1 Answer

ewz93 (accepted answer):

The line text = re.sub('[^a-z]','',text) #removes non-alphabeticals removes every character that is not a lowercase letter a to z, including whitespace, which is why all the words end up joined together.

If you replace it with re.sub('[^a-z ]','',text), i.e. "remove everything except a to z and spaces", the words stay separated and it should work.
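
A quick way to see the difference (a minimal sketch; the sample sentence is only assumed from the output shown in the question):

import re

sample = "Our Deeds are the Reason of this earthquake".lower()
print(re.sub('[^a-z]', '', sample))   # ourdeedsarethereasonofthisearthquake (spaces removed too)
print(re.sub('[^a-z ]', '', sample))  # our deeds are the reason of this earthquake (spaces kept)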

Also all of this:

text = re.sub(r'['
                           u'\U0001F600-\U0001F64F'  # emoticons
                           u'\U0001F300-\U0001F5FF'  # symbols & pictographs
                           u'\U0001F680-\U0001F6FF'  # transport & map symbols
                           u'\U0001F1E0-\U0001F1FF'  # flags (iOS)
                           u'\U00002702-\U000027B0' 
                           u'\U000024C2-\U0001F251'  #removes emojis
                           ']+', '',text)
text = re.sub('[%s]' % re.escape(string.punctuation), '', text) #removes punctuation

and this:

text = text.replace('#', '')
text = text.replace('@', '')

will not do anything, because those lines only remove certain single characters, and all of those characters are already removed by re.sub('[^a-z ]','',text).
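
With the fix applied and those redundant steps dropped, the cleaner can be trimmed to something like this (a sketch, assuming stop_words stays exactly as defined in the question):

import re

def text_cleaner(text):
    text = str(text).lower()                           # lowercase
    text = re.sub(r'\d+', '', text)                    # remove numbers
    text = re.sub(r'\[.*?\]', '', text)                # remove text in square brackets
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # remove urls
    text = re.sub(r'\bhtml\b', '', text)               # remove the word "html"
    text = re.sub('[^a-z ]', '', text)                 # keep only a-z and spaces
    text = stop_words(text)                            # stop_words as defined in the question
    return text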