I am trying to get the source code of Barack Obama's Wikipedia page and save it to a file.

Everything works well until I open the file and see some weird characters in it:

EOT1024 appears in the file, but it does not appear in the website's actual source code, which I checked using Google Chrome. I would like to know why this is happening and how I can stop it.

My code:

#include <cstdio>
#include <iostream>
#include <string>
#include <windows.h>
#include <wininet.h>
#include <fstream>
int main(){
    std::string textLink = "https://en.wikipedia.org/wiki/Barack_Obama";
    std::ofstream file;
    HINTERNET hInternet, hFile;
    char buf[1024];
    DWORD bytes_read;
    int finished = 0;
    bool e=false;
    std::string waste;

        file.open("data.txt",std::ios::out);
        hInternet = InternetOpenW(L"Whatever", INTERNET_OPEN_TYPE_PRECONFIG, NULL, NULL, 0);
        if (hInternet == NULL) {
            printf("InternetOpen failed\n");
        }
        hFile = InternetOpenUrl(hInternet, textLink.c_str(), NULL, 0L, 0, 0);
        if (hFile == NULL) {
            printf("InternetOpenUrl failed\n");
        }
        while (!finished) {
            if (InternetReadFile(hFile, buf, sizeof(buf), &bytes_read)) {
                if (bytes_read > 0) {
                    file  << bytes_read << buf;
                }
                else {
                    finished = 1;
                }
            }
            else {
                printf("InternetReadFile failed\n");
                finished = 1;
            }
        }
        // Close the URL handle before the session handle that owns it.
        InternetCloseHandle(hFile);
        InternetCloseHandle(hInternet);
        file.close();
}

I have the text file as I view it in Notepad++ right here:

https://drive.google.com/open?id=1Ty-a1o29RWSQiO1zTLym6XH4dJvUjpTO

I don't understand why I would get those characters in the data.txt file that I write to.

NOTE: occasionally, instead of seeing EOT1024, I even get EOT21, EOT1016, and other seemingly random characters.

1 Answer

Lightness Races in Orbit

You're literally writing the integer bytes_read to the file:

file  << bytes_read << buf;

There's your "1024" (on the occasions that 1024 bytes were read).

Don't do that.
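To see the effect in isolation, here is a minimal sketch that uses a string stream in place of the file. The function name format_like_original is hypothetical and exists only for this illustration; unsigned long stands in for DWORD.

```cpp
#include <sstream>
#include <string>

// Mimics the original statement `file << bytes_read << buf;`, with a
// string stream standing in for the output file.
// `format_like_original` is a hypothetical name for illustration only.
std::string format_like_original(unsigned long bytes_read, const char* text) {
    std::ostringstream out;
    out << bytes_read << text;  // operator<< formats the integer as decimal digits
    return out.str();
}
```

Calling format_like_original(1024, "<html>") yields the text "1024<html>": the count is written into the output as characters, right in front of the data.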

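Furthermore, it looks like you're assuming buf is null-terminated. It isn't: InternetReadFile only fills in bytes_read bytes, so streaming buf with << keeps going until it happens to hit a zero byte. Instead, write just the first bytes_read bytes of buf; that's what that integer is for.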

So:

file.write(&buf[0], bytes_read);
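The difference can be checked without any network code. The following is a small sketch, assuming a string stream in place of the file; append_chunk and demo_partial_read are hypothetical helper names, not part of any library.

```cpp
#include <cstddef>
#include <ostream>
#include <sstream>
#include <string>

// Writes exactly `count` bytes from `data`. Unlike `out << data`, this never
// searches for a terminating '\0', so it is safe for raw buffers.
void append_chunk(std::ostream& out, const char* data, std::size_t count) {
    out.write(data, static_cast<std::streamsize>(count));
}

// Hypothetical demo: the buffer is completely full and has no terminator,
// and we pretend the last read delivered only 3 bytes.
std::string demo_partial_read() {
    char buf[8] = {'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'};
    std::ostringstream out;
    append_chunk(out, buf, 3);  // only the first 3 bytes are valid
    return out.str();
}
```

It may also be worth opening the output file with std::ios::binary. On Windows, text mode translates each '\n' written into "\r\n", which would alter the downloaded bytes.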

Consult the documentation for InternetReadFile:

A normal read retrieves the specified dwNumberOfBytesToRead for each call to InternetReadFile until the end of the file is reached. To ensure all data is retrieved, an application must continue to call the InternetReadFile function until the function returns TRUE and the lpdwNumberOfBytesRead parameter equals zero.
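A read loop that follows this contract might look like the sketch below. The names ReadFn, read_all, make_string_reader and make_wininet_reader are assumptions introduced here for illustration; only InternetReadFile itself comes from WinINet, and that part is compiled only on Windows (where the program also needs to link against wininet.lib).

```cpp
#include <cstddef>
#include <functional>
#include <string>

// Shaped like InternetReadFile: returns true on success and reports how many
// bytes were placed in the buffer. (Hypothetical alias for this sketch.)
using ReadFn = std::function<bool(char* buf, unsigned long size,
                                  unsigned long* bytes_read)>;

// Keeps reading until a call succeeds AND reports zero bytes.
// Returns false as soon as any read fails.
bool read_all(const ReadFn& read, std::string& out) {
    char buf[1024];
    for (;;) {
        unsigned long bytes_read = 0;
        if (!read(buf, static_cast<unsigned long>(sizeof(buf)), &bytes_read))
            return false;              // the read itself failed
        if (bytes_read == 0)
            return true;               // success with zero bytes: end of data
        out.append(buf, bytes_read);   // append only the bytes actually read
    }
}

// Stand-in data source for trying the loop without a network connection.
ReadFn make_string_reader(const std::string& src) {
    return [src, pos = std::size_t{0}](char* buf, unsigned long size,
                                       unsigned long* bytes_read) mutable {
        std::size_t copied = src.copy(buf, size, pos);  // copies at most `size`
        pos += copied;
        *bytes_read = static_cast<unsigned long>(copied);
        return true;
    };
}

#ifdef _WIN32
#include <windows.h>
#include <wininet.h>

// Adapter for a handle obtained from InternetOpenUrl.
ReadFn make_wininet_reader(HINTERNET hFile) {
    return [hFile](char* buf, unsigned long size, unsigned long* bytes_read) {
        return InternetReadFile(hFile, buf, size, bytes_read) != FALSE;
    };
}
#endif
```

On Windows, something like read_all(make_wininet_reader(hFile), page) would collect the whole response into a std::string, which can then be written to the file in one file.write call. Separating the loop from the data source is just one possible design; it keeps the termination logic testable without a network.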