I'm using HtmlCleaner to clean an HTML file that contains characters such as '€' (byte value 128) and '™' (byte value 153), that is, characters from the extended 8-bit range beyond 7-bit ASCII.
HtmlCleaner does not handle these characters and replaces them with '?' (ASCII decimal 63).
Is there a flag I can set in HtmlCleaner so that these characters are processed correctly?
Thanks in advance.
EDIT: The variable "encoding" is "iso-8859-1", just like the source file encoding.
    try {
        System.out.print("Parsing and cleaning:" + fileStr);
        URL url = new File(this.fileStr).toURI().toURL();
        // create an instance of HtmlCleaner
        HtmlCleaner cleaner = new HtmlCleaner();
        // default properties
        CleanerProperties props = cleaner.getProperties();
        // do parsing
        TagNode tagNode = new HtmlCleaner(props).clean(url);
        // serialize to XML file
        new PrettyXmlSerializer(props).writeToFile(tagNode, fileStr, encoding);
        System.out.println("Output: " + fileStr);
    } catch (MalformedURLException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
I've just figured this out. The line:
    TagNode tagNode = new HtmlCleaner(props).clean(url);
should be replaced by:
    TagNode tagNode = new HtmlCleaner(props).clean(url, encoding);
where 'encoding' is the name of the charset of the source file, passed as a string.
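For anyone wondering where the '?' comes from, here is a minimal stand-alone sketch that does not use HtmlCleaner at all. It assumes (I have not verified this against the library source) that the one-argument clean(url) reads the file with some default charset such as UTF-8, while the serializer writes ISO-8859-1. The class and method names (CharsetRoundTrip, roundTrip) are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Stand-alone illustration; it does not use HtmlCleaner itself.
final class CharsetRoundTrip {

    // Decode the input with readCharset, then re-encode it with writeCharset,
    // which is roughly what a read-clean-serialize pipeline does to each byte.
    static byte[] roundTrip(byte[] input, Charset readCharset, Charset writeCharset) {
        String decoded = new String(input, readCharset);
        return decoded.getBytes(writeCharset);
    }

    public static void main(String[] args) {
        // The two byte values mentioned in the question.
        byte[] source = { (byte) 128, (byte) 153 };

        // Assumption: the reader falls back to UTF-8, which does not match the file.
        byte[] wrong = roundTrip(source, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        // Reader and writer both use ISO-8859-1, as in the corrected call.
        byte[] right = roundTrip(source, StandardCharsets.ISO_8859_1, StandardCharsets.ISO_8859_1);

        System.out.println("mismatched: " + (wrong[0] & 0xFF) + " " + (wrong[1] & 0xFF));
        System.out.println("matching:   " + (right[0] & 0xFF) + " " + (right[1] & 0xFF));
    }
}
```

With mismatched charsets, both bytes come back as 63 ('?'), because the bytes are not valid UTF-8, get decoded to the replacement character, and that character cannot be represented in ISO-8859-1. With matching charsets, the bytes survive as 128 and 153.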
Thank you!
Did you try setting the charset?