C# - Korean Encoding


This might be different from other Korean encoding questions.

There is a site I have to scrape, and it is in Korean.

An example sentence from the site is: "개인정보보호를 위해 뒤로가기 버튼 대신 검색결과 화면 상단과 하단의 이전 버튼을 사용하시기 바랍니다."

I am using HttpWebRequest and HttpWebResponse to scrape the site.

This is how I retrieve the HTML:

-- partial code --

using (Stream data = resp.GetResponseStream())
{
    response.Append(new StreamReader(data, Encoding.GetEncoding(code), true).ReadToEnd());
}

Now my problem is that I am not getting the correct Korean characters. For my "code" variable, I am using the code pages listed on MSDN at http://msdn.microsoft.com/en-us/library/system.text.encoding.aspx (let me narrow them down).

Here are the Korean code pages: 51949, 50225, 20949, 20833, 10003, 949

But I am still not getting the correct Korean characters. What do you think the problem is?

2 Answers

3
Oded On Best Solutions

It is very likely that the page is not in a specific Korean encoding, but one of the Unicode encodings.

Try Encoding.UTF8 or Encoding.Unicode (UTF-16) instead of the specific code pages. There are also Encoding.UTF7 and Encoding.UTF32, but they are not as common.

To be certain, examine the meta tags and headers for the content-type returned by the server.
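As a rough sketch of that check, the charset can be pulled out of the Content-Type header value and mapped to an Encoding. The CharsetSniffer class and FromContentType method are hypothetical names for illustration, the UTF-8 fallback is an assumption, and the RegisterProvider call assumes .NET Core / .NET 5+ (on .NET Framework the code page encodings are built in and that line is unnecessary).

```csharp
using System;
using System.Net.Mime;
using System.Text;

static class CharsetSniffer
{
    // Hypothetical helper: maps a Content-Type header value to an Encoding.
    // Falls back to UTF-8 (an assumption) when no usable charset is declared.
    public static Encoding FromContentType(string contentType)
    {
        // Assumption: running on .NET Core / .NET 5+, where legacy code pages
        // such as EUC-KR must be registered before GetEncoding can find them.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        if (string.IsNullOrWhiteSpace(contentType))
            return Encoding.UTF8;

        try
        {
            // ContentType parses values like "text/html; charset=EUC-KR".
            string charset = new ContentType(contentType).CharSet;
            return string.IsNullOrEmpty(charset)
                ? Encoding.UTF8
                : Encoding.GetEncoding(charset);
        }
        catch (FormatException) { return Encoding.UTF8; }   // malformed header
        catch (ArgumentException) { return Encoding.UTF8; } // unknown charset name
    }
}
```

With the question's code, this could be called as CharsetSniffer.FromContentType(resp.ContentType) before constructing the StreamReader. HttpWebResponse also exposes a CharacterSet property, which may be enough on its own; neither covers a charset declared only in an HTML meta tag.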


Update (gleaned from comments):

Since the Content-Type header specifies EUC-KR, the corresponding code page is 51949, and this is what you need to use to retrieve the page.
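A minimal sketch of reading the response stream as EUC-KR might look like this. EucKrReader and Decode are hypothetical names; the RegisterProvider line assumes .NET Core / .NET 5+ and can be dropped on .NET Framework.

```csharp
using System;
using System.IO;
using System.Text;

static class EucKrReader
{
    // Hypothetical helper: decodes a stream of EUC-KR bytes into a string.
    public static string Decode(Stream data)
    {
        // Assumption: .NET Core / .NET 5+, where code page 51949 is only
        // available after registering the code pages provider.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding eucKr = Encoding.GetEncoding(51949);

        // Byte-order-mark detection is disabled so the declared encoding
        // is always the one used.
        using (var reader = new StreamReader(data, eucKr, false))
        {
            return reader.ReadToEnd();
        }
    }
}
```

In the question's code, the stream passed in would be the one returned by resp.GetResponseStream().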

It was not clear that you are writing this out to a file. You need to use the same encoding when writing the file, or convert the byte[] from the original encoding to the output file's encoding (using Encoding.Convert).
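The conversion route could be sketched as follows, assuming the downloaded bytes are EUC-KR and the output file should be UTF-8. Transcoder and EucKrToUtf8 are hypothetical names, and the RegisterProvider line again assumes .NET Core / .NET 5+.

```csharp
using System;
using System.Text;

static class Transcoder
{
    // Hypothetical helper: re-encodes EUC-KR bytes as UTF-8 bytes.
    public static byte[] EucKrToUtf8(byte[] eucKrBytes)
    {
        // Assumption: .NET Core / .NET 5+; not needed on .NET Framework.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding eucKr = Encoding.GetEncoding(51949);

        // Encoding.Convert decodes with the source encoding and
        // re-encodes with the destination encoding in one call.
        return Encoding.Convert(eucKr, Encoding.UTF8, eucKrBytes);
    }
}
```

The result can then be written with File.WriteAllBytes. If the page is already held as a correctly decoded string, File.WriteAllText(path, text, Encoding.UTF8) does the same job without the intermediate byte array.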

0
Aleksey Dr. On

I had the exact same issue and solved it with the code below:

string html = Encoding.UTF8.GetString(new WebClient().DownloadData(URL));

This decodes the raw bytes returned by the WebClient GET request as UTF-8, which gives correct text only if the page is actually served as UTF-8.