I'm trying to read a text file from a Project Gutenberg URL. I copied the code from the NLTK book, but the result was different.
from urllib import request
url = "http://www.gutenberg.org/files/2554/2554-0.txt"
response = request.urlopen(url)
raw = response.read().decode('utf8')
raw[:75]
This is from the NLTK book. When it works properly, it should print:
'The Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky\r\n'
But when I ran the same code on my computer, I got this:
'\ufeffThe Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky\r'
I think the problem is with the headers from Project Gutenberg. How can I deal with this?
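The response text appears to be encoded as UTF-8 with a BOM (byte order mark). Decoding with 'utf8' keeps the BOM as a leading '\ufeff' character, and because it takes up one position, raw[:75] stops at '\r' instead of '\r\n'.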
Try:
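A minimal sketch of one way to handle this, assuming the only issue is the leading BOM: decode with Python's built-in `utf-8-sig` codec, which drops a leading BOM if one is present and otherwise behaves like plain UTF-8. The `data` bytes below are a made-up stand-in for `response.read()` so the example runs without a network connection.

```python
import codecs

# Hypothetical stand-in for response.read(): UTF-8 bytes that start with a BOM.
data = codecs.BOM_UTF8 + "The Project Gutenberg EBook".encode("utf-8")

# Plain 'utf8' keeps the BOM as a leading '\ufeff' character.
with_bom = data.decode("utf8")

# 'utf-8-sig' strips a leading BOM if present; otherwise it decodes as UTF-8.
raw = data.decode("utf-8-sig")

print(repr(with_bom[:12]))
print(repr(raw[:11]))
```

Applied to the code in the question, this would mean changing `.decode('utf8')` to `.decode('utf-8-sig')` and leaving the rest as it is.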
See this answer for more information