Below is an example string:
$string = "abcde वायरस abcde"
I need to check whether this string contains any Hindi (Devanagari) content and, if so, count the characters and words. I guess a regex with a Unicode character class can work (http://www.regular-expressions.info/unicode.html), but I am not able to figure out the correct regex statement.
To find out whether a string contains a Hindi (Devanagari) character, you need the full list of Hindi characters. According to this website, the Hindi characters are the code points between 0x0900 and 0x097F (decimal 2304 to 2431). The regular expression needs to match if the string contains any character from that set. Therefore, you can use a pattern (actually a set of characters) that looks like this:
[\u0900\u0901\u0902...\u097D\u097E\u097F]
Because writing this list out by hand is rather cumbersome, you can generate the string by iterating over the code points from 2304 to 2431 (0x0900 to 0x097F).
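As a minimal sketch of that generation step, the loop below builds the character class from the code point range and uses it to test a string. The names buildDevanagariClass, containsDevanagari and containsDevanagariRange are made up for illustration; the range form [\u0900-\u097F] is a standard character-class shorthand that should match the same set without listing every code point.

```javascript
// Hypothetical helper: builds "[\u0900\u0901...\u097F]" by iterating over the block.
function buildDevanagariClass() {
  let chars = "";
  for (let cp = 0x0900; cp <= 0x097f; cp++) {
    // Emit each code point as a \uXXXX escape so the pattern stays ASCII-only.
    chars += "\\u" + cp.toString(16).toUpperCase().padStart(4, "0");
  }
  return "[" + chars + "]";
}

const devanagariClass = buildDevanagariClass();
const containsDevanagari = new RegExp(devanagariClass);

// Shorthand for the same set: a range inside the character class.
const containsDevanagariRange = /[\u0900-\u097F]/;

console.log(containsDevanagari.test("abcde वायरस abcde")); // true
console.log(containsDevanagariRange.test("abcde")); // false
```

Both expressions only answer whether at least one such character is present; counting needs the global flag or a different approach.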
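To count the words written in Hindi characters, you can use the following pattern. It requires whitespace (\s), the beginning of the string (^), or the end of the string ($) on either side of the word, and uses the global flag (/g) to match every occurrence:
/(?:^|\s)[\u0900\u0901\u0902...\u097D\u097E\u097F]+?(?:\s|$)/g
Note that this pattern only matches words made up entirely of Hindi characters. Each match also consumes the surrounding whitespace, so when two Hindi words are separated by a single space, they share that space and only the first one is counted.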
Here is a live implementation in JavaScript:
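A minimal sketch of such an implementation, not necessarily the code originally posted here: the helper name countDevanagari is hypothetical, and the range [\u0900-\u097F] stands in for the full character list. Instead of matching whole words with whitespace on both sides, it splits on whitespace and keeps every word that contains at least one character from the block. Characters are counted per code point, so a consonant and its vowel sign count as two.

```javascript
// Hypothetical helper: counts Devanagari characters and the words containing them.
function countDevanagari(text) {
  // All code points in U+0900..U+097F; the global flag returns every match.
  const chars = text.match(/[\u0900-\u097F]/g) || [];
  // Split on whitespace, then keep words with at least one Devanagari character.
  const words = text
    .split(/\s+/)
    .filter((word) => /[\u0900-\u097F]/.test(word));
  return { characters: chars.length, words: words.length };
}

const result = countDevanagari("abcde वायरस abcde");
console.log(result); // { characters: 5, words: 1 }
```

Because the words are split first, adjacent Hindi words separated by a single space are each counted.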