Can't get the text separated by words when I'm doing data cleaning in NLP

I'm working through an NLP exercise on Kaggle. When I clean the text that I have to use to predict the output, the result is not separated into words; instead I get one long sentence with all the words joined together.

This is my text_cleaner function:

import re
import string
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

def text_cleaner(text):
    text = str(text).lower() #lowercase
    text = re.sub('\d+', '', text) #remove numbers
    text = re.sub('\[.*?\]','', text) #remove text in square brackets
    text = re.sub(r'https?://\S+|www\.\S+','',text) #remove url
    text = re.sub(r'\bhtml\b', '', text) #remove html word
    
    text = re.sub(r'['
                           u'\U0001F600-\U0001F64F'  # emoticons
                           u'\U0001F300-\U0001F5FF'  # symbols & pictographs
                           u'\U0001F680-\U0001F6FF'  # transport & map symbols
                           u'\U0001F1E0-\U0001F1FF'  # flags (iOS)
                           u'\U00002702-\U000027B0' 
                           u'\U000024C2-\U0001F251'  #removes emojis
                           ']+', '',text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text) #removes punctuation
    text = re.sub('[^a-z]','',text) #removes non-alphabeticals
    text = text.replace('#', '')
    text = text.replace('@', '')
    text = stop_words(text)
    
    return text

def stop_words(text):
    lem = WordNetLemmatizer()
    stop = set(stopwords.words('english'))
    stop.remove('not')
    punctuation = list(string.punctuation)
    stop.update(punctuation)
    
    text =text.split()
    text= [lem.lemmatize(word) for word in text if word not in stop]
    text = ' '.join(text)
    
    return text

And this is the result that I got:

ourdeedsarethereasonofthisearthquakemayallahfo...

instead of:

deed reason earthquake may allah forgive u...

Thanks!

1 Answer

ewz93 (accepted answer):

The line text = re.sub('[^a-z]','',text) #removes non-alphabeticals removes every character that is not a lowercase letter a to z, including whitespace, which is why all the words end up joined together.

If you replace it with re.sub('[^a-z ]','',text), i.e. "remove everything except a to z and spaces", the words stay separated and it should work.
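
A quick way to see the difference (a minimal sketch; the sample sentence is only assumed from the output shown in the question):

import re

sample = "Our Deeds are the Reason of this earthquake".lower()
print(re.sub('[^a-z]', '', sample))   # ourdeedsarethereasonofthisearthquake (spaces removed too)
print(re.sub('[^a-z ]', '', sample))  # our deeds are the reason of this earthquake (spaces kept)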

Also all of this:

text = re.sub(r'['
                           u'\U0001F600-\U0001F64F'  # emoticons
                           u'\U0001F300-\U0001F5FF'  # symbols & pictographs
                           u'\U0001F680-\U0001F6FF'  # transport & map symbols
                           u'\U0001F1E0-\U0001F1FF'  # flags (iOS)
                           u'\U00002702-\U000027B0' 
                           u'\U000024C2-\U0001F251'  #removes emojis
                           ']+', '',text)
text = re.sub('[%s]' % re.escape(string.punctuation), '', text) #removes punctuation

and this:

text = text.replace('#', '')
text = text.replace('@', '')

will not do anything, because those lines only remove certain single characters, and all of those characters are already removed by re.sub('[^a-z ]','',text).
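
With the fix applied and those redundant steps dropped, the cleaner can be trimmed to something like this (a sketch, assuming stop_words stays exactly as defined in the question):

import re

def text_cleaner(text):
    text = str(text).lower()                           # lowercase
    text = re.sub(r'\d+', '', text)                    # remove numbers
    text = re.sub(r'\[.*?\]', '', text)                # remove text in square brackets
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # remove urls
    text = re.sub(r'\bhtml\b', '', text)               # remove the word "html"
    text = re.sub('[^a-z ]', '', text)                 # keep only a-z and spaces
    text = stop_words(text)                            # stop_words as defined in the question
    return text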