I have a comment form where I'd like people to be able to write in foreign languages too. But, for example, my spam-filtering mechanism would block something as naive as the word "été" simply because it has no vowels in it (English vowels, that is).
My question is, when using regex for detecting vowels like:
$pattern = '/[aeiou]/';
I cannot simply write
$pattern = '/[aeiouéáíúó...]/';
and expect the server to interpret it correctly. How can I write this so that it IS interpreted correctly?
For non-Latin alphabets like Russian and Hebrew, is there a way to detect which language the content is written in and apply an appropriate spam-filtering mechanism?
The purpose of the whole spam filter is to block anything like "gjkdkgahg" or "ttt"; it's a publicly visible page.
Use the u modifier to get a Unicode-aware regex, and that should work, assuming you're working with UTF-8 strings throughout your app, which you really should be.
Basic Russian is found in the Unicode range U+0400–U+04FF; its vowels are аэыуояеёюи. Hebrew is in the range U+0590–U+05FF and doesn't use vowels in the same way. I don't think detecting vowels is terribly useful... you might have more luck with a simple dictionary covering many languages, as long as you stick to languages that have clear word boundaries. Not much use for Chinese.
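As a rough illustration of the u modifier and the ranges mentioned above, here is a minimal sketch. The function names hasVowel and detectScript are made up for this example, the list of accented Latin vowels is only a sample and not exhaustive, and the source file itself is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical helpers; assumes the file and all input strings are UTF-8.

// Sample Latin vowels (plain and accented) plus the Russian vowels.
// The trailing "u" makes PCRE read pattern and subject as UTF-8, not bytes.
const VOWEL_PATTERN = '/[aeiouáéíóúàèìòùаэыуояеёюи]/u';

function hasVowel(string $text): bool
{
    return preg_match(VOWEL_PATTERN, $text) === 1;
}

// Very coarse script detection based on the Unicode blocks named above.
function detectScript(string $text): string
{
    if (preg_match('/[\x{0400}-\x{04FF}]/u', $text) === 1) {
        return 'cyrillic';
    }
    if (preg_match('/[\x{0590}-\x{05FF}]/u', $text) === 1) {
        return 'hebrew';
    }
    return 'other';
}

var_dump(hasVowel('été'));          // bool(true)
var_dump(hasVowel('ttt'));          // bool(false)
var_dump(detectScript('привет'));   // string(8) "cyrillic"
```

Note that "gjkdkgahg" from the question contains an "a", so a vowel check alone would let it through, which is part of why this test is weak.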
I don't think that this sort of thing is a good anti-spam mechanism at all. It's as likely to produce false positives as it is to catch spam, which after all very often consists of proper words. Varying spoiler fields (CSS-hidden inputs that must be left blank but won't be by bots) and one-use or limited-time submission tokens are much more likely to be effective.
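A minimal sketch of those two ideas follows, under some loud assumptions: the field names "website" and "token", the function names, and the $secret key are all hypothetical, and the token here is only time-limited. A true one-use token would also need server-side storage to remember which tokens were already spent, and the spoiler field name is fixed here instead of varying per request.

```php
<?php
// Hypothetical names throughout; a sketch, not a hardened implementation.

// Issue a token when rendering the form: timestamp plus an HMAC over it,
// so the client cannot forge or alter the issue time.
function makeToken(string $secret, int $now): string
{
    return $now . ':' . hash_hmac('sha256', (string) $now, $secret);
}

// Accept the token only if the HMAC matches and it is not older than $maxAge.
function isTokenValid(string $token, string $secret, int $now, int $maxAge = 3600): bool
{
    $parts = explode(':', $token, 2);
    if (count($parts) !== 2 || !ctype_digit($parts[0])) {
        return false;
    }
    [$issued, $mac] = $parts;
    $expected = hash_hmac('sha256', $issued, $secret);
    $age = $now - (int) $issued;

    return hash_equals($expected, $mac) && $age >= 0 && $age <= $maxAge;
}

// $post is the submitted form data, e.g. $_POST.
function looksLikeBot(array $post, string $secret, int $now): bool
{
    // The "website" input is hidden with CSS; a human leaves it empty.
    if (($post['website'] ?? '') !== '') {
        return true;
    }
    return !isTokenValid((string) ($post['token'] ?? ''), $secret, $now);
}
```

When rendering the form, the output of makeToken($secret, time()) would go into a hidden input, and looksLikeBot($_POST, $secret, time()) would be checked on submission.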