Project Gutenberg accessing text with url

Question

1.9k views Asked by Lee At 19 May 2020 at 18:13

I'm trying to access a text file from a Project Gutenberg URL. I copied the code from the NLTK book, but my result is different.

from urllib import request
url = "http://www.gutenberg.org/files/2554/2554-0.txt"
response = request.urlopen(url)
raw = response.read().decode('utf8')
raw[:75]

This is from the NLTK book. When it works properly, it should print:

'The Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky\r\n'

But when I ran the same code on my computer, I got this:

'\ufeffThe Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky\r'

I think it's a problem with the headers on Project Gutenberg. How can I deal with this?

Answer

**gcharbon** · Answer 1 · 2020-05-19T18:19:23+00:00

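The response text appears to be encoded as UTF-8 with a BOM (byte order mark). Decoding it as plain UTF-8 leaves the BOM in the string as the leading '\ufeff' character.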

Try:

from urllib import request

url = "http://www.gutenberg.org/files/2554/2554-0.txt"

response = request.urlopen(url)
raw = response.read()
text = raw.decode("utf-8-sig")

See this answer for more information
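
To see the difference without downloading anything, here is a minimal sketch that runs offline. The `sample` bytes are a made-up stand-in for the start of the file (a UTF-8 BOM followed by some text), not the actual response from Project Gutenberg.

```python
import codecs

# Hypothetical stand-in for the first bytes of the downloaded file:
# the UTF-8 BOM (b'\xef\xbb\xbf') followed by ordinary UTF-8 text.
sample = codecs.BOM_UTF8 + "The Project Gutenberg EBook".encode("utf-8")

# Plain "utf-8" keeps the BOM as a leading U+FEFF character.
plain = sample.decode("utf-8")

# "utf-8-sig" drops a leading BOM if one is present.
stripped = sample.decode("utf-8-sig")

print(repr(plain[:4]))
print(repr(stripped[:4]))

# "utf-8-sig" also accepts input that has no BOM at all,
# so it should be safe for files that do not start with one.
no_bom = "The Project".encode("utf-8")
print(repr(no_bom.decode("utf-8-sig")))
```

If the text has already been decoded with plain UTF-8, removing the marker afterwards with `raw.lstrip('\ufeff')` should give the same result.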
