I am trying to read a dataset into a DataFrame called df1, but it does not work:
import pandas as pd
df1=pd.read_csv("https://raw.githubusercontent.com/tuyenhavan/Statistics/Dataset/World_Life_Expectancy.csv",sep=";")
df1.head()
The code above produces a long traceback, but this is the most relevant part:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 18: invalid start byte
The data is indeed not encoded as UTF-8; everything is ASCII except for that single 0x92 byte:
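One way to confirm this kind of claim is to scan the raw bytes for anything outside the ASCII range. The sketch below uses a made-up sample (`raw` is hypothetical, not the actual file contents) that contains a single 0x92 byte, and shows that strict UTF-8 decoding fails exactly at that byte:

```python
# Hypothetical sample bytes: plain ASCII plus one stray 0x92 byte.
# This is NOT the real file content, only an illustration.
raw = b"name;note\nA;it\x92s fine\n"

bad_position = None
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    # exc.start is the index of the first byte that could not be decoded.
    bad_position = exc.start
    print(f"UTF-8 decoding fails on byte {raw[bad_position]:#x} "
          f"at position {bad_position}")

# List every byte outside the 7-bit ASCII range, with its offset.
non_ascii = [(i, hex(b)) for i, b in enumerate(raw) if b > 0x7F]
print(non_ascii)
```

For the real file you would read the bytes from disk (opened in binary mode) instead of using a literal.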
Decode it as Windows codepage 1252 instead, where 0x92 is a fancy quote (’), by passing encoding="cp1252" to pd.read_csv().
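A minimal sketch of that decoding step, using a small in-memory sample rather than the real file (the `raw` bytes and column names are hypothetical), might look like this:

```python
import io

import pandas as pd

# Hypothetical sample: semicolon-separated data with one 0x92 byte.
raw = b"name;note\nA;it\x92s fine\n"

# In cp1252, byte 0x92 maps to U+2019 (right single quotation mark).
quote = b"\x92".decode("cp1252")

# Tell pandas explicitly which codec to use instead of the UTF-8 default.
df = pd.read_csv(io.BytesIO(raw), sep=";", encoding="cp1252")
note = df.loc[0, "note"]
print(note)
```

The same `encoding="cp1252"` argument should apply when reading the file from a local path.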
I note, however, that Pandas seems to take the HTTP headers at face value too and produces mojibake when you load your data from a URL. When I save the data directly to disk and then load it with pd.read_csv(), the data is correctly decoded, but loading from the URL produces re-coded data.
This is a known bug in Pandas. You can work around it by using urllib.request to load the URL and passing the result to pd.read_csv() instead.
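The workaround could be sketched roughly as follows. The helper name `read_csv_via_urllib` is made up for illustration, and buffering the whole response in a `BytesIO` is an assumption chosen for robustness; it is reasonable for small files, but not necessarily how the response was originally passed along:

```python
import io
import urllib.request

import pandas as pd


def read_csv_via_urllib(url, **kwargs):
    """Fetch the URL ourselves, then let pandas parse the raw bytes.

    Hypothetical helper: pandas never sees the HTTP headers, so the
    `encoding` keyword argument alone decides how bytes are decoded.
    """
    with urllib.request.urlopen(url) as response:
        payload = response.read()  # raw, undecoded bytes
    return pd.read_csv(io.BytesIO(payload), **kwargs)
```

With the URL from the question, this would be called as `read_csv_via_urllib(url, sep=";", encoding="cp1252")`.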