There are some similar questions out there, but none that are quite the same or that have an answer that works for me.
I need a JavaScript function which validates whether a text field contains only valid Latin characters, so no Cyrillic or Chinese, just Latin; specifically:
Basic Latin (excluding the C0 control characters), Latin-1 (excluding the C1 control characters), Latin Extended A, Latin Extended B and Latin Extended Additional. This set corresponds to Unicode code points U+0020 to U+007E, U+00A0 to U+024F and U+1E00 to U+1EFF.
Some of the answers out there seem to check the first character in the text field but miss out others, so these are no good.
This is what I have tried so far (this doesn't work!):
var value = 'abcdef' // from text field
var re = '\u0000-\u007F|\u0100-\u017F|\u0180-\u024F|\u1E00-\u1EFF|\u0080-\u00FF'; // latin regexp string
// var re = '\\w+/'; // alternative
if (new RegExp(re).test(value)) {
result = false;
}
The following sort of works but only for the first character:
//var re = '\u0000-\u007F|\u0100-\u017F|\u0180-\u024F|\u1E00-\u1EFF|\u0080-\u00FF'; // latin regexp string
// couldn't get the above to work so using the following:
var re = '\\w+';
if (!value.match(re)) {
message = 'Please enter valid latin characters only';
$focusField = $this;
}
What is the right way to do this?
I really need code, rather than an explanation, but both would be better.
Thanks
EDIT: Note that the solution given in the accepted answer is incorrect. It is full of false positives and false negatives. The exact numeric code point numbers needed are given at the bottom of this post.
The example given in the question mistakenly attempts to use Block rather than Script properties!
You do not want to use Unicode block character properties here; you want to use Unicode script character properties. In other words, you really want Script=Latin and not to try to use Block=Basic_Latin plus Block=Latin_1 plus Block=Latin_1_Supplement plus Block=Latin_Extended_A plus Block=Latin_Extended_Additional.
Note also that the question neglected other Latin blocks: Block=Latin_Extended_C and Block=Latin_Extended_D.
Even if you used the correct blocks, you would get 145 false positives that were in those blocks but which were not Latin script characters:
Furthermore, you would miss 403 false negatives that are indeed Latin script characters but which are not in those blocks:
You virtually never want to use Blocks; you want to use Scripts. That’s why Level 1 conformance of UTS#18 requires in Requirement 1.2 that the Script character property be supported, but says nothing of the Block property until Requirement 2.7: Full Properties.
See UTS#18 Annex A, Character Blocks, for more pitfalls that come of using Blocks instead of Scripts.
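To make the Block-versus-Script difference concrete, here is a minimal sketch. It assumes a JavaScript engine that implements ES2018 Unicode property escapes (`\p{...}` with the `u` flag), which is a newer language feature and not something this answer relies on elsewhere. The `compare` helper and the three sample characters are chosen only for illustration.

```javascript
// Block-style check: the three code-point ranges listed in the question.
const inQuestionRanges = /^[\u0020-\u007E\u00A0-\u024F\u1E00-\u1EFF]+$/;

// Script-style check. Assumption: the engine supports ES2018 property escapes.
const isLatinScript = /^\p{Script=Latin}+$/u;

// Hypothetical helper, for this illustration only.
function compare(ch) {
  return { ranges: inQuestionRanges.test(ch), script: isLatinScript.test(ch) };
}

// U+00D7 MULTIPLICATION SIGN: inside the Latin-1 range, but Script=Common.
const timesSign = compare('\u00D7');
// U+2C60 LATIN CAPITAL LETTER L WITH DOUBLE BAR: Script=Latin, but in Latin Extended-C.
const lDoubleBar = compare('\u2C60');
// U+00E9 LATIN SMALL LETTER E WITH ACUTE: both checks agree.
const eAcute = compare('\u00E9');

console.log(timesSign, lDoubleBar, eAcute);
```

The first sample is a false positive for the range-based check and the second is a false negative, which is the kind of mismatch described above.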
Removing the code points that lie outside the Basic Multilingual Plane due to the Javascript bug that makes it impossible to specify these by ranges, we are left with this set of insanely unmaintainable garbledy-gook needed to fish out all Unicode v6.2 code points having the Latin, Common, or Inherited script character property:
Personally, I would fire anyone who attempted to use that sort of nonsense.
Furthermore, the 3,225 code points that you miss because of the Javascript bug in handling full Unicode are the following:
The correct way to do all this is included below.
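As a small illustration of the non-BMP limitation mentioned above: a JavaScript regex without the `u` flag works on UTF-16 code units, so a single code point outside the Basic Multilingual Plane looks like two characters to it. The sketch below assumes an engine with ES2015 support for the `u` flag and `\u{...}` escapes; the sample character is arbitrary.

```javascript
// U+1D400 MATHEMATICAL BOLD CAPITAL A lies outside the BMP,
// so it is stored as a surrogate pair (two UTF-16 code units).
const boldA = '\u{1D400}';
const codeUnits = boldA.length;

// Without the `u` flag, `.` matches one code unit, so one astral
// character does not match "exactly one character".
const matchesWithoutU = /^.$/.test(boldA);

// With the `u` flag, the regex works on whole code points.
const matchesWithU = /^.$/u.test(boldA);

// With the `u` flag, ranges over astral code points can be written directly.
const astralOnly = /^[\u{10000}-\u{10FFFF}]+$/u;
const boldAIsAstral = astralOnly.test(boldA);
const asciiIsAstral = astralOnly.test('A');

console.log(codeUnits, matchesWithoutU, matchesWithU, boldAIsAstral, asciiIsAstral);
```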
If you are going to be playing around with Unicode character properties, it is tantamount to hopeless to hardcode code-point numbers like this. What you really want is to be able to say something like:
However, Javascript regexes are still completely antemillennial in this regard, and are far from complying with Unicode Technical Standard #18: Unicode Regular Expressions, even at its most basic compliance level, Level 1:
Because even the most rudimentary compliance level for Unicode regular expressions is still far beneath Javascript’s capabilities, I strongly recommend running whatever Unicode-aware regexes you need on the server in some language that actually supports them.
However, in the event that this is not practical, a sanity-saving workaround is the Javascript XRegExp plugin, which provides a saner regex library that also allows for access to certain essential character properties such as you are attempting to use.
As of v2.0, the “XRegExp All” add-on supports all these:
Which means that once you have it loaded, you will be able to get at the properties you need this way:
Please note very carefully that as of Unicode v6.2, any and all of the following code points and code-point ranges are deemed to have the Script=Latin character property:
Whereas these are the code points that have the Script=Common character property:
And these are the code points that have the Script=Inherited character property:
I hope the terrible maintenance, upkeep, legibility, and indeed writability problems that come of using literal code-point numbers like these make it clear that you want to at a bare minimum use the XRegExp add-ons.
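For readers whose JavaScript engine implements ES2018 Unicode property escapes, the same script-based test can be sketched without any literal code-point numbers. This is not the XRegExp approach described above but a swapped-in native alternative, and it assumes a runtime that supports `\p{...}` with the `u` flag. The names `LATIN_TEXT` and `isLatinText` are hypothetical.

```javascript
// Assumption: the engine supports ES2018 Unicode property escapes.
// Accepts code points whose script is Latin, Common, or Inherited.
// The ^ and $ anchors force the WHOLE string to match, not just its start.
const LATIN_TEXT = /^[\p{Script=Latin}\p{Script=Common}\p{Script=Inherited}]+$/u;

// Hypothetical validator for a text-field value; rejects the empty string.
function isLatinText(value) {
  return LATIN_TEXT.test(value);
}

const results = {
  ascii: isLatinText('abcdef'),
  accented: isLatinText('Cr\u00E8me br\u00FBl\u00E9e, 2 \u20AC'), // € and digits are Common
  combining: isLatinText('e\u0301'),   // U+0301 COMBINING ACUTE ACCENT is Inherited
  cyrillic: isLatinText('abc\u0416'),  // trailing Cyrillic letter
  han: isLatinText('\u4E2D\u6587'),    // Han characters
  empty: isLatinText(''),
};

console.log(results);
```

Note that Script=Common is broad: it covers digits, punctuation, and symbols, and also many characters you may not want in a form field. If only letters should pass, dropping Common and Inherited from the class is the stricter choice.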