I am trying to get the source code of Barack Obama's Wikipedia page and save it to a file.
Everything works well until I open the file and see some weird characters in it:
As you can see, EOT1024 appears in the file, but it does not appear in the website's actual source code, which I checked using Google Chrome. I would like to know why this is happening and how I can stop it.
My code:
#include <iostream>
#include <windows.h>
#include <wininet.h>
#include <fstream>

int main(){
    std::string textLink = "https://en.wikipedia.org/wiki/Barack_Obama";
    std::ofstream file;
    HINTERNET hInternet, hFile;
    char buf[1024];
    DWORD bytes_read;
    int finished = 0;
    bool e=false;
    std::string waste;

    file.open("data.txt",std::ios::out);
    hInternet = InternetOpenW(L"Whatever", INTERNET_OPEN_TYPE_PRECONFIG, NULL, NULL, 0);
    if (hInternet == NULL) {
        printf("InternetOpen failed\n");
    }
    hFile = InternetOpenUrl(hInternet, textLink.c_str(), NULL, 0L, 0, 0);
    if (hFile == NULL) {
        printf("InternetOpenUrl failed\n");
    }
    while (!finished) {
        if (InternetReadFile(hFile, buf, sizeof(buf), &bytes_read)) {
            if (bytes_read > 0) {
                file << bytes_read << buf;
            }
            else {
                finished = 1;
            }
        }
        else {
            printf("InternetReadFile failed\n");
            finished = 1;
        }
    }
    InternetCloseHandle(hInternet);
    InternetCloseHandle(hFile);
    file.close();
}
I have the text file as I view it in Notepad++ right here:
https://drive.google.com/open?id=1Ty-a1o29RWSQiO1zTLym6XH4dJvUjpTO
I don't understand why I would get those characters in the data.txt file that I write to.
NOTE: occasionally, instead of seeing EOT1024, I even get EOT21, EOT1016, and other seemingly random characters.
You're literally writing the integer bytes_read to the file:
file << bytes_read << buf;
There's your "1024" (on the occasions that 1024 bytes were read).
Don't do that.
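To see where the digits come from without touching the network, here is a small sketch that streams a count and a buffer into a std::ostringstream the same way the loop streams them into the file. The helper name format_like_question and the use of std::uint32_t in place of DWORD are assumptions made for illustration only.

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical helper: mimics `file << bytes_read << buf` from the question,
// but writes into a string so the result can be inspected.
// std::uint32_t stands in for the Windows DWORD type (an assumption).
std::string format_like_question(std::uint32_t bytes_read, const char* buf) {
    std::ostringstream out;
    // operator<< formats the integer as decimal text, then appends buf
    // up to its first '\0' byte (so buf must be null-terminated here).
    out << bytes_read << buf;
    return out.str();
}
```

Calling format_like_question(1024, "<!DOCTYPE html>") returns "1024<!DOCTYPE html>": the count ends up in the output as text, in front of the page data.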
Furthermore, it looks like you're assuming buf is null-terminated, but InternetReadFile does not add a terminator. Instead, stream only the first bytes_read bytes of buf; that is what the integer is for. So:
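A minimal sketch of that fix, written against standard streams so it runs without WinINet. The names write_chunk and demo_partial_chunk are hypothetical, and the 5-byte "read" is made-up data standing in for what InternetReadFile would report.

```cpp
#include <cstddef>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical helper: writes exactly `count` bytes of `buf` to `out`.
// ostream::write does not look for a terminating '\0', so it never reads
// past the bytes that were actually filled in.
void write_chunk(std::ostream& out, const char* buf, std::size_t count) {
    out.write(buf, static_cast<std::streamsize>(count));
}

// Hypothetical demo: a 1024-byte buffer with no '\0' anywhere, of which
// only the first 5 bytes count as "read".
std::string demo_partial_chunk() {
    char buf[1024];
    for (char& c : buf) c = '#';  // stale data left over in the buffer

    const char page[] = {'<', 'h', 't', 'm', 'l'};
    for (std::size_t i = 0; i < sizeof(page); ++i) buf[i] = page[i];

    std::ostringstream out;
    write_chunk(out, buf, sizeof(page));  // only the 5 valid bytes
    return out.str();
}
```

In the loop from the question, the equivalent change would be to replace the line file << bytes_read << buf; with file.write(buf, bytes_read);, so that neither the count nor any bytes beyond it reach the file.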
Consult the documentation: