I'm working on an NLP exercise on Kaggle. While cleaning the text that I use to predict the output, I can't get the result to stay separated into words; instead I get one string with all the words stuck together.
This is my text_cleaner function:
import re
import string
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

def text_cleaner(text):
    text = str(text).lower() #lowercase
    text = re.sub(r'\d+', '', text) #remove numbers
    text = re.sub(r'\[.*?\]','', text) #remove text in square brackets
    text = re.sub(r'https?://\S+|www\.\S+','',text) #remove url
    text = re.sub(r'\bhtml\b', '', text) #remove html word
    text = re.sub(r'['
                  u'\U0001F600-\U0001F64F' # emoticons
                  u'\U0001F300-\U0001F5FF' # symbols & pictographs
                  u'\U0001F680-\U0001F6FF' # transport & map symbols
                  u'\U0001F1E0-\U0001F1FF' # flags (iOS)
                  u'\U000024C2-\U0001F251' #removes emojis
                  ']+', '',text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text) #removes punctuation
    text = re.sub('[^a-z]','',text) #removes non-alphabeticals
    text = text.replace('#', '')
    text = text.replace('@', '')
    text = stop_words(text)
    return text

def stop_words(text):
    lem = WordNetLemmatizer()
    stop = set(stopwords.words('english'))
    punctuation = list(string.punctuation)
    text = text.split()
    text = [lem.lemmatize(word) for word in text if word not in stop]
    text = ' '.join(text)
    return text
And this is the result that I got:
instead of:
deed reason earthquake may allah forgive u...
This line
text = re.sub('[^a-z]','',text) #removes non-alphabeticals
will remove everything except the lowercase letters a to z, and that includes whitespace. If you replace it with
text = re.sub('[^a-z ]','',text)
, i.e. "remove everything except a to z or spaces", it should work. Also, all of this:
and this:
will not do anything, because all these lines do is remove certain single characters, and all of those characters are already removed by this:
re.sub('[^a-z ]','',text)
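To make the difference concrete, here is a minimal sketch that compares the two patterns and shows one way the cleaner could be shortened. The helper names `strip_non_letters` and `simple_cleaner` and the sample string are hypothetical, not from your code, and the stop-word/lemmatization step is left out so the snippet runs without NLTK. One assumption worth flagging: `[^a-z ]` keeps only the plain space character, so newlines and tabs would still be deleted and glue words together; the sketch turns all whitespace into single spaces first to avoid that.

```python
import re

def strip_non_letters(text, keep_spaces=True):
    """Lowercase the text and delete every character that is not a-z
    (and, when keep_spaces is True, not a plain space either)."""
    pattern = r'[^a-z ]' if keep_spaces else r'[^a-z]'
    return re.sub(pattern, '', str(text).lower())

def simple_cleaner(text):
    """Hypothetical shortened cleaner: one character filter replaces the
    separate digit, emoji, punctuation, '#' and '@' steps."""
    text = str(text).lower()
    text = re.sub(r'\s+', ' ', text)                   # newlines/tabs become single spaces
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # remove URLs while they are still intact
    text = re.sub(r'\[.*?\]', '', text)                # remove text in square brackets
    text = re.sub(r'[^a-z ]', '', text)                # keep only letters and spaces
    return ' '.join(text.split())                      # collapse repeated spaces, trim the ends

# Made-up sample input, for illustration only
sample = "Our Deeds are the Reason\nof this #earthquake http://t.co/x 123"

print(strip_non_letters(sample, keep_spaces=False))  # words run together
print(simple_cleaner(sample))                        # words stay separated
```

The result of `simple_cleaner` can then be passed to your `stop_words` function unchanged, since that function only needs a string it can `split()` on whitespace.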