Accessing Project Gutenberg text by URL


I'm trying to read a text file from a Project Gutenberg URL. I copied the code from the NLTK book, but the result was different.

from urllib import request
url = "http://www.gutenberg.org/files/2554/2554-0.txt"
response = request.urlopen(url)
raw = response.read().decode('utf8')
raw[:75]

This is from the NLTK book. When it works properly, it should print:

'The Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky\r\n'

But when I ran the same code on my computer, I got this:

'\ufeffThe Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky\r'

I think it's a problem with the headers in the Project Gutenberg file. How can I deal with this?


There is 1 answer

gcharbon

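The response appears to be encoded as UTF-8 with a byte order mark (BOM). Decoding with 'utf8' keeps the BOM as the leading '\ufeff' character, while the 'utf-8-sig' codec strips it.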

Try:

from urllib import request

url = "http://www.gutenberg.org/files/2554/2554-0.txt"

response = request.urlopen(url)
raw = response.read()           # bytes; the file starts with a UTF-8 BOM
text = raw.decode("utf-8-sig")  # utf-8-sig drops a leading BOM if present

See this answer for more information
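
To see the difference between the two codecs without a network request, here is a minimal sketch. The `body` bytes are a hand-built stand-in for what `response.read()` returns (an assumption based on the output shown in the question), not the actual file contents.

```python
import codecs

# Hypothetical stand-in for response.read(): a UTF-8 BOM followed by
# the first line of the text, as suggested by the output in the question.
first_line = "The Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky\r\n"
body = codecs.BOM_UTF8 + first_line.encode("utf-8")

plain = body.decode("utf-8")      # BOM survives as the character '\ufeff'
clean = body.decode("utf-8-sig")  # leading BOM is removed

print(repr(plain[:5]))  # '\ufeffThe '
print(repr(clean[:5]))  # 'The P'

# utf-8-sig leaves text without a BOM unchanged, so it can be used
# whether or not the server sends one.
no_bom = "Plain text".encode("utf-8")
print(no_bom.decode("utf-8-sig"))  # Plain text
```

Because the BOM counts as one character, `raw[:75]` in the question is shifted by one position, which is why the output ends in '\r' instead of '\r\n'.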