ANSI vs UTF-8 in web Browser


My requirement is to allow users to type ANSI characters instead of UTF-8 when they are typing into the text fields of my web pages.

I looked at setting the character set in the HTML meta tag:

 <meta charset="ISO-8859-1"> 

That helped to display the content in ANSI instead of UTF-8, but it does not stop users from typing in UTF-8. Any help is appreciated.

2

There are 2 answers

2
deceze On BEST ANSWER

Let's distinguish between two things here: characters the user can type and the encoding used to send this data to the server. These are two separate issues.

A user can type anything they want into a form in their browser. For all intents and purposes these characters have no encoding at this point, they're pure "text"; encodings do not play a role just yet and you cannot restrict the set of available characters with encodings.
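To make the text-versus-encoding distinction concrete, here is a minimal Python sketch (the sample string is just an illustration, not something from the question). The same characters only turn into different bytes at the moment an encoding is applied:

```python
# The same text, before any encoding is involved.
text = "café"

# Encoding happens only when the text has to become bytes.
latin1_bytes = text.encode("iso-8859-1")
utf8_bytes = text.encode("utf-8")

print(latin1_bytes)  # b'caf\xe9'
print(utf8_bytes)    # b'caf\xc3\xa9'
```

Both byte strings decode back to the same four characters, which is why picking an encoding does nothing to limit what can be typed.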

Once the user submits the form, the browser will have to encode this data into binary somehow, which is where an encoding comes in. Ultimately the browser decides how to encode the data, but it will choose the encoding specified in the HTTP headers, meta elements and/or accept-charset attribute of the form. The latter should always be the deciding factor, but you'll find buggy behaviour in the real world (*cough*cough*IE*cough*). In practice, all three character set definitions should be identical so as not to cause any confusion there.

Now, if your user typed in some "exotic" characters and the browser has decided to encode the data in "ANSI" and the chosen encoding cannot represent those exotic characters, then the browser will typically replace those characters with HTML entities (numeric character references such as &#8364;). So, even in this case it doesn't restrict the allowed characters, it simply finds a different way to encode them.
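You can roughly imitate that fallback on the server side to see what arrives. The sketch below uses Python's built-in `xmlcharrefreplace` error handler; the helper name `encode_like_browser` is made up for this example, and a real browser additionally percent-encodes the result for a URL-encoded form, so treat it as an approximation only:

```python
def encode_like_browser(text: str, encoding: str = "iso-8859-1") -> bytes:
    # Hypothetical helper: characters the target encoding cannot represent
    # become numeric character references instead of raising an error.
    return text.encode(encoding, errors="xmlcharrefreplace")


print(encode_like_browser("naïve 日本"))  # b'na\xefve &#26085;&#26412;'
```

The "ï" fits into ISO-8859-1 and is sent as a single byte; the two Japanese characters do not, so they show up as `&#...;` references that your server code has to be prepared for.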

How can I know what encoding is used by the user?

You cannot. You can only specify which character set you would like to receive and then double check that that's actually what you did receive. If the expectation doesn't match, reject the input (an HTTP 400 Bad Request response may be in order).
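A minimal sketch of that "double check", assuming you asked for UTF-8 and have the raw request bytes at hand. The function name and the `(status, text)` return shape are invented for illustration and are not tied to any web framework:

```python
def decode_or_reject(raw: bytes, expected: str = "utf-8"):
    # Hypothetical helper: strict decoding fails on bytes that are
    # not valid in the expected encoding.
    try:
        return 200, raw.decode(expected)
    except UnicodeDecodeError:
        return 400, None  # caller would answer with 400 Bad Request


print(decode_or_reject(b"caf\xc3\xa9"))  # (200, 'café')
print(decode_or_reject(b"caf\xe9"))      # (400, None)
```

Note that this kind of check only tells you something for encodings like UTF-8 that have invalid byte sequences. Every possible byte decodes successfully as ISO-8859-1 in Python, so a strict decode cannot detect wrongly encoded input there.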

If you want to limit the acceptable set of characters a user may input, you need to do this by checking and rejecting characters independent of their encoding. You can do this in JavaScript at input time, and will ultimately need to do this on the server again (since browser-side JavaScript ultimately has no influence on what can get submitted to the server).
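For the server-side half, one possible approach is to work on the already decoded text and test each character against the repertoire you want to allow. The sketch below assumes "ANSI" means whatever ISO-8859-1 can represent; `disallowed_characters` is a hypothetical name, and you would substitute your own rule if your allowed set is different:

```python
def disallowed_characters(text: str, encoding: str = "iso-8859-1") -> list:
    # Hypothetical helper: collect every character that the target
    # encoding cannot represent, regardless of how the text arrived.
    bad = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            bad.append(ch)
    return bad


print(disallowed_characters("café"))       # []
print(disallowed_characters("price: 5€"))  # ['€']
```

If the returned list is not empty, reject the input or show the offending characters back to the user.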

0
Pradeep Singh On

If you set the encoding of the page to UTF-8 in a <meta> element and/or HTTP header, it will be interpreted as UTF-8, unless the user deliberately goes to the View->Encoding menu and selects a different encoding, overriding the one you specified.

In that case, accept-charset would have the effect of setting the submission encoding back to UTF-8 in the face of the user messing about with the page encoding. However, this still won't work in IE, due to the previously discussed problems with accept-charset in that browser.

So, IMO, it's doubtful whether it's worth including accept-charset to fix the case where a non-IE user has deliberately sabotaged the page encoding.