Hot deploying HTML templates generates question marks in place of Chinese characters - only on CentOS


I am currently testing a Java web project that displays Chinese characters using FreeMarker templates. My development environment is Ubuntu 14.04 and the project is deployed on a JBoss 4 application server. In all templates the HTTP header "Content-Type: text/html; charset=UTF-8" is set.

In my development environment, hot deploying HTML or FreeMarker templates does not generate question marks in place of Chinese characters, nor does the encoding need to be specified explicitly in server.xml.

The staging server runs CentOS 6.3 and the application server is also JBoss 4. First, merely to display Chinese characters properly, an additional encoding-related entry is required in server.xml (e.g. URIEncoding="UTF-8"), which was not required in the development environment.

Additionally, if an HTML or FreeMarker template is hot deployed to the staging server, question marks appear in place of the Chinese characters in the templates. To recover, a server restart is required after deleting the work and tmp folders in the JBoss deployment.

What could cause this awkward behavior on CentOS only? I have failed to reproduce it in the Ubuntu test environment, but it is easily reproduced on the staging server. Is there any additional character-encoding configuration on CentOS that I may have overlooked?

I did look at many similar questions but decided to post this one, as none of them offered sufficient insight into the problem at hand.

Stack Overflow resources referenced

Freemarker encoding - question marks in the place of accented characters

FreeMarker Not Able Display Chinese Character

Freemarker utf-8 encoding problems on t.page

FreeMarker encoding confusion

Freemarker resources

Why do I have ``?''-s in the output instead of character X?

Charset issues

Update: based on the suggested changes and comments, I made some changes to the code.

To set the encoding details explicitly:

In .bashrc, set: export LC_ALL=en_US.UTF-8
In the JAVA_OPTS section of run.sh, set: JAVA_OPTS="$JAVA_OPTS -Dfile.encoding=UTF-8"

To see the locale details used by the JVM, I added this utility code block:

System.out.println("Init file.encoding= " +  System.getProperty("file.encoding"));
System.out.println("Init Default Charset=" + Charset.defaultCharset());
System.out.println("Init Default Charset in Use=" + getDefaultCharSet());

The locale information can be obtained in FreeMarker using the getLocale() method:

/*Encoding properties Check - Using getEncoding()*/
Locale locale = cfg.getLocale();
String encodingWithLocale = cfg.getEncoding(locale);

In the FreeMarker init(), I set the default encoding to UTF-8:

cfg.setDefaultEncoding("UTF-8");

Before the template operation, I set and checked the locale-based encoding details using the FreeMarker methods:

/*Specific Encoding Properties*/
  Locale locale = cfg.getLocale();
  cfg.setEncoding(locale, "UTF-8");

I call setOutputEncoding() before the output operation:

cfg.setOutputEncoding("UTF-8");

I set the encoding in all HTML pages/templates using:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta charset="utf-8"/>

But I have a concern regarding output stream encoding when using Freemarker in my program.

According to the FreeMarker documentation:

The charset used for the output stream is not decided by FreeMarker, but by you, when you create the Writer that you pass to the process method of the template.

and

Note that the charset of the template is independent from the charset of the output that the template generates (unless the enclosing software deliberately sets the output charset to the same as the template charset).

I have used a StringWriter, which offers the required writer functionality, but specifying the encoding seems to be a problem.

StringWriter sw = new StringWriter();
Template tmpl = cfg_components.getTemplate(template,"utf-8");
cfg_components.setOutputEncoding("UTF-8");
rootMap.put("content_cdn_path", getContent_cdn_path());
...
tmpl.process(rootMap, sw);
return sw.getBuffer();

I also set the character encoding of the HttpServletRequest and HttpServletResponse to UTF-8 in the ActionServlet, and that seems to solve the problem in the Ubuntu 14.04 development environment.

@Override
protected void process(HttpServletRequest request, HttpServletResponse response) throws IOException, ServletException {

request.setCharacterEncoding("UTF8");
response.setCharacterEncoding("UTF8");

But what else might be required on CentOS 6.3 for the same program?

Any suggestions on how I can specify the output stream encoding, or on an alternative to StringWriter that could achieve the same?

Answer by Stephen C

What could cause this awkward behavior in CentOS only?

Possibly your code depends on the default character encoding being UTF-8. If the default character encoding on your CentOS system is (for example) LATIN-1 rather than UTF-8, then any Chinese characters will be replaced with question marks.
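
As a rough illustration of that effect (a minimal sketch, not taken from the asker's code): when a Java string is encoded with a charset that cannot represent its characters, String.getBytes substitutes the charset's replacement byte, which for ISO-8859-1 is '?'. The class and method names here are hypothetical, and the sample text is written with Unicode escapes so the source file's own encoding does not matter.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of the asker's project.
final class QuestionMarkDemo {

    // Encodes text with the given charset. String.getBytes(Charset) replaces
    // unmappable characters with the charset's default replacement bytes.
    static byte[] encode(String text, Charset charset) {
        return text.getBytes(charset);
    }

    public static void main(String[] args) {
        // Two Chinese characters, U+4E2D and U+6587, as Unicode escapes.
        String chinese = "\u4e2d\u6587";

        byte[] latin1 = encode(chinese, StandardCharsets.ISO_8859_1);
        byte[] utf8 = encode(chinese, StandardCharsets.UTF_8);

        // ISO-8859-1 cannot represent these characters, so each becomes '?'.
        System.out.println(new String(latin1, StandardCharsets.ISO_8859_1)); // prints "??"

        // UTF-8 needs three bytes for each of these characters.
        System.out.println(utf8.length); // prints 6

        // The charset used whenever code does not name one explicitly.
        System.out.println(Charset.defaultCharset());
    }
}
```

If the last line prints something other than UTF-8 on the staging server, any code path that encodes without naming a charset is a candidate for the question marks.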

If this is the problem then the solution is to use an explicit character encoding scheme at the appropriate point.

Without seeing the relevant parts of your code, it is hard to predict where the mistake has been made. However, characters being replaced by question marks is a solid indicator that an incorrect encoding is being used ... somewhere.


Actually, there is a simple way to confirm this theory: look at the locale environment variables that are in effect when you launch JBoss. For example, run the locale command. For me, it says:

$ locale
LANG=en_US.utf8
LC_CTYPE="en_US.utf8"
LC_NUMERIC="en_US.utf8"
LC_TIME="en_US.utf8"
LC_COLLATE="en_US.utf8"
LC_MONETARY="en_US.utf8"
LC_MESSAGES="en_US.utf8"
LC_PAPER="en_US.utf8"
LC_NAME="en_US.utf8"
LC_ADDRESS="en_US.utf8"
LC_TELEPHONE="en_US.utf8"
LC_MEASUREMENT="en_US.utf8"
LC_IDENTIFICATION="en_US.utf8"
LC_ALL=

If your locale settings are incorrect, try changing them (for the current shell) before launching JBoss.


UPDATE - Looking at your code snippets, you seem to have a misconception about StringWriter. A StringWriter accumulates Java characters ... as characters. It doesn't encode them. Then when you do this:

    tmpl.process(rootMap, sw);
    return sw.getBuffer();

what gets returned is a StringBuffer which contains a sequence of Java characters. Again the characters are not encoded yet.
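
A small sketch of this point, using a hypothetical render helper in place of the FreeMarker call: the buffer holds the same characters whatever the platform default charset is, and the byte representation only comes into existence when some later code picks a charset.

```java
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; render() stands in for tmpl.process(rootMap, sw).
final class StringWriterDemo {

    // Writes text into a StringWriter and returns its buffer, unencoded.
    static StringBuffer render(String text) {
        StringWriter sw = new StringWriter();
        sw.write(text);
        return sw.getBuffer();
    }

    public static void main(String[] args) {
        // Two Chinese characters, U+4E2D and U+6587, as Unicode escapes.
        String chinese = "\u4e2d\u6587";
        StringBuffer buf = render(chinese);

        // The buffer contains two Java chars; no charset has been applied.
        System.out.println(buf.length());                    // prints 2
        System.out.println(buf.toString().equals(chinese));  // prints true

        // Bytes only appear once a charset is chosen, and they differ per charset.
        System.out.println(buf.toString().getBytes(StandardCharsets.UTF_8).length);    // prints 6
        System.out.println(buf.toString().getBytes(StandardCharsets.UTF_16BE).length); // prints 4
    }
}
```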

The encoding of characters as bytes (apparently using the wrong encoding scheme) is happening later than this; i.e. either in some code of yours where you are converting the StringBuffer content to bytes, or maybe in the servlet infrastructure itself.
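
One way to make that later step explicit, sketched under assumptions: the writeUtf8 helper and the class name are hypothetical, and a ByteArrayOutputStream stands in for whatever stream the application really writes to. The point is only that the charset is named where characters become bytes, instead of being inherited from the platform default.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of the asker's project.
final class ExplicitEncodingDemo {

    // Encodes already-rendered template output as UTF-8 onto the given stream.
    static void writeUtf8(CharSequence rendered, OutputStream out) throws IOException {
        Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8);
        writer.append(rendered);
        writer.flush(); // push buffered characters through the encoder
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the StringBuffer returned after tmpl.process(rootMap, sw).
        StringBuffer rendered = new StringBuffer("\u4e2d\u6587");

        // Stand-in for the real output stream.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeUtf8(rendered, out);

        // Decoding with the same charset recovers the original characters.
        String roundTrip = new String(out.toByteArray(), StandardCharsets.UTF_8);
        System.out.println(roundTrip.equals(rendered.toString())); // prints true
    }
}
```

In a servlet the equivalent is usually to call response.setCharacterEncoding("UTF-8") before response.getWriter() is first obtained, since the writer's charset is fixed at that point; whether that ordering holds in this application cannot be told from the snippets shown.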