Python error: TypeError: Expected string or bytes-like object


I'm working on a sentiment analysis project using NLTK in Python. I can't get my script to tokenize rows of text read from a CSV file, although if I pass the text in one entry at a time it works fine. When I pass in the whole column, I get a persistent error: 'TypeError: expected string or bytes-like object'. Here are the printed data frame and the Python code I'm using. Any help resolving this would be great.

                              abstract
0    Allergic diseases are often triggered by envir...
1    omal lymphopoietin (TSLP) has important roles ...
2    of atrial premature beats, and a TSLP was high...
3     deposition may play an important role in the ...
4    ted by TsPLP was higher than that mediated by ...
5    nal Stat5 transcription factor in that TSLP st...
import pandas as pd
import nltk

data = pd.read_csv('text.csv', sep=';', encoding='utf-8')
x = data.loc[:, 'abstract']
print(x.head())
tokens = nltk.word_tokenize(x)
print(tokens)

The full stack trace and the output of the print statement were attached as screenshots (not reproduced here).


There are 2 answers

Ta_Req (score 7):

tokens = [nltk.word_tokenize(line) for line in x]
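A minimal, self-contained sketch of this per-row approach. The sample rows are hypothetical, and `str.split` stands in for `nltk.word_tokenize` so the snippet runs without the NLTK punkt data installed:

```python
# Hypothetical sample rows standing in for the 'abstract' Series
x = ["Allergic diseases are often triggered", "TSLP has important roles"]

# Tokenize each row separately: the tokenizer expects one string at a time,
# so iterating over the column sidesteps the TypeError.
# str.split stands in for nltk.word_tokenize in this sketch.
tokens = [line.split() for line in x]
print(tokens[0])  # → ['Allergic', 'diseases', 'are', 'often', 'triggered']
```

This keeps one token list per row, which is usually what you want for per-document sentiment analysis.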

0buz (score 7):

The nltk documentation gives an example of nltk.word_tokenize usage; note that "sentence" there is a single string.

In your situation, x is a pandas Series (of strings), which you need to turn into a single string before passing it to nltk.word_tokenize.

One way to deal with this is to create your nltk "sentence" from x:

x = data.loc[:, 'abstract']
sentence = ' '.join(x)
tokens = nltk.word_tokenize(sentence)
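A runnable sketch of the join approach, with hypothetical sample rows and `str.split` standing in for `nltk.word_tokenize`. Note that joining first produces one flat token list, so the row boundaries between abstracts are lost:

```python
# Hypothetical sample rows standing in for x
x = ["allergic diseases are triggered", "TSLP has important roles"]

# Join all rows into the single string the tokenizer expects
sentence = ' '.join(x)
tokens = sentence.split()  # str.split stands in for nltk.word_tokenize
print(len(tokens))  # → 8
```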

EDIT: As per the further comments, try this instead (note the result is a Series of token lists, to be accessed accordingly):

tokens = x.apply(lambda sentence: nltk.word_tokenize(sentence))
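A sketch of the apply approach on a hypothetical Series, again with `str.split` standing in for `nltk.word_tokenize` so it runs without the punkt data:

```python
import pandas as pd

# Hypothetical Series standing in for the 'abstract' column
x = pd.Series(["first abstract here", "second abstract here"])

# apply runs the tokenizer on each element, yielding a Series of token lists
tokens = x.apply(str.split)  # str.split stands in for nltk.word_tokenize
print(tokens[0])  # → ['first', 'abstract', 'here']
```

This preserves the one-row-per-document structure while keeping the code vectorized in pandas style.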