HtmlUnit: Encoding for Chinese Website

I expect this is pretty basic:

When downloading pages from a Chinese website, all Chinese characters appear as "?" in the saved file (written via Java NIO Files.write).

I know the Chinese webpage is retrieved as UTF-8 (page.getPageEncoding() returns "UTF-8"), but something goes wrong in my saving of the webpage.

My code is as follows:

    final WebClient webClient = new WebClient(BrowserVersion.FIREFOX_45);
    webClient.getOptions().setThrowExceptionOnScriptError(false);
    webClient.getOptions().setTimeout(15000);
    final HtmlPage page = webClient.getPage(urlNow);

    pageAsXml = page.asXml();

    NioLog.getLogger().debug(page.getPageEncoding());

    Files.write(Paths.get(outputPath + File.separator + fileNameTruncated + TXT), pageAsXml.getBytes());

There is 1 answer

Jake (best answer)

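The page itself is fine; the characters are lost when the string is turned into bytes. Calling getBytes() with no argument encodes with the platform default charset, and if that charset cannot represent Chinese characters, each one is replaced with "?". Pass the charset explicitly when encoding: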

    // Encode explicitly as UTF-8 instead of relying on the platform default charset
    byte[] barrayXml = page.asXml().getBytes(Charset.forName("UTF-8"));
    Files.write(Paths.get(outputPath + File.separator + fileNameTruncated + TXT), barrayXml);
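
To see the effect in isolation, without HtmlUnit, here is a minimal sketch. The class and method names (EncodingDemo, encode, roundTripUtf8) are made up for illustration, and US-ASCII stands in for a platform default charset that cannot represent Chinese; your actual default may differ. The sample text is written with Unicode escapes so the result does not depend on the source file's encoding.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo class, not part of HtmlUnit or the original code.
class EncodingDemo {

    // Encodes text with the given charset. String.getBytes(Charset) replaces
    // characters the charset cannot represent with its replacement byte.
    public static byte[] encode(String text, Charset charset) {
        return text.getBytes(charset);
    }

    // Writes text to a temporary file as UTF-8, reads it back as UTF-8,
    // and returns what was read.
    public static String roundTripUtf8(String text) {
        try {
            Path file = Files.createTempFile("encoding-demo", ".txt");
            try {
                Files.write(file, text.getBytes(StandardCharsets.UTF_8));
                return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String sample = "\u4e2d\u6587"; // two Chinese characters

        // A charset that cannot represent the characters loses them.
        byte[] lossy = encode(sample, StandardCharsets.US_ASCII);
        System.out.println(new String(lossy, StandardCharsets.US_ASCII)); // prints "??"

        // UTF-8 uses three bytes for each of these characters.
        System.out.println(encode(sample, StandardCharsets.UTF_8).length); // prints 6

        // Writing and reading with the same explicit charset preserves the text.
        System.out.println(sample.equals(roundTripUtf8(sample))); // prints true
    }
}
```

When checking the saved file, also make sure the editor or viewer opens it as UTF-8; otherwise correctly written bytes can still be displayed as garbage.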