I learnt how to scrap website from https://automatetheboringstuff.com. I wanted to scrap http://www.piaotian.net/html/3/3028/1473227.html in which the contents is in chinese and write its contents into a .txt file. However, the .txt file contains random symbols which I assume is a encoding/decoding problem.
I've read this thread "how to decode and encode web page with python?" and figured the encoding method for my site is "gb2312" and "windows-1252". I tried decoding in those two encoding methods but failed.
Can someone kindly explain to me the problem with my code? I'm very new to programming so please let me know my misconceptions as well!
Also, when I remove the "html.parser" from the code, the .txt file turns out to be empty instead of having at least symbols. Why is this the case?
import bs4, requests, sys
reload(sys)
sys.setdefaultencoding("utf-8")
novel = requests.get("http://www.piaotian.net/html/3/3028/1473227.html")
novel.raise_for_status()
novelSoup = bs4.BeautifulSoup(novel.text, "html.parser")
content = novelSoup.select("br")
novelFile = open("novel.txt", "w")
for i in range(len(content)):
novelFile.write(str(content[i].getText()))
out: