PHP - detecting non-English letters and filtering input


There's a comment form where I'd like people to be able to write in foreign languages too. But, for example, my spam-filtering mechanism would block something as innocent as the word "été" simply because it has no vowels in it (English vowels, that is).

My question is, when using regex for detecting vowels like:

$pattern = '/[aeiou]/';

I cannot simply write

$pattern = '/[aeiouéáíúó...]/';

and have the server interpret it correctly. How can I write this so that it IS interpreted correctly?

For non-Latin alphabets like Russian and Hebrew, is there a way to detect which language the content belongs to and apply an appropriate spam-filtering mechanism?

The purpose of the whole spam filter is to block anything like "gjkdkgahg" or "ttt", since it's a publicly visible page.

3

There are 3 answers

1
bobince On BEST ANSWER
$pattern = '/[aeiouéáíúó]/u';

Use the u modifier to get Unicode-aware regex and that should work, assuming you're working with UTF-8 strings throughout your app, which you should be really.
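To illustrate why the modifier matters, here is a minimal sketch, assuming both the source file and the input are UTF-8. The function name has_vowel is made up for this example. Without u, PCRE treats the character class as a set of single bytes, so a different accented letter that shares a lead byte with é can match by accident.

```php
<?php
// Hypothetical helper: true if $text contains a plain or acute-accented vowel.
// The "u" modifier makes PCRE read pattern and subject as UTF-8 characters;
// "i" makes the match case-insensitive.
function has_vowel(string $text): bool
{
    return preg_match('/[aeiouáéíóú]/iu', $text) === 1;
}

var_dump(has_vowel('été'));  // bool(true)
var_dump(has_vowel('ttt'));  // bool(false)

// Byte mode: "é" is the bytes C3 A9 and "ñ" is C3 B1,
// so the class matches the shared byte C3.
var_dump(preg_match('/[é]/', 'ñ'));   // int(1), a false positive
var_dump(preg_match('/[é]/u', 'ñ'));  // int(0)
```

Listing every accented vowel by hand gets long quickly, so the class above only covers the acute accents from the question.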

For non-Latin alphabets like Russian and Hebrew, is there a way to detect which language the content belongs to and apply an appropriate spam-filtering mechanism?

Basic Russian is found in Unicode range U+0400–U+04FF; vowels are аэыуояеёюи. Hebrew is in range U+0590–U+05FF and doesn't use vowels in the same way. I don't think detecting vowels is terribly useful... you might have more luck with a simple dictionary covering many languages, as long as you stick to languages that have clear word boundaries. Not much use for Chinese.
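If you do want to guess the script, PCRE's script properties save you from spelling out the ranges yourself. The following is a rough sketch rather than real language detection: the function names are invented for this example, and it only tells you which of three scripts has the most letters in the input.

```php
<?php
// Hypothetical helper: names the script with the most letters in $text.
// \p{Latin}, \p{Cyrillic} and \p{Hebrew} are PCRE script properties
// and need the "u" modifier.
function dominant_script(string $text): string
{
    $best = 'Unknown';
    $bestCount = 0;
    foreach (['Latin', 'Cyrillic', 'Hebrew'] as $script) {
        // preg_match_all() returns the number of matches, or false on invalid UTF-8.
        $count = preg_match_all('/\p{' . $script . '}/u', $text);
        if ($count !== false && $count > $bestCount) {
            $best = $script;
            $bestCount = $count;
        }
    }
    return $best;
}

// Vowel check using the Russian vowels listed above.
function has_russian_vowel(string $text): bool
{
    return preg_match('/[аэыуояеёюи]/iu', $text) === 1;
}

echo dominant_script('Привет, мир'), "\n"; // Cyrillic
echo dominant_script('שלום'), "\n";        // Hebrew
var_dump(has_russian_vowel('привет'));      // bool(true)
```

A script is not a language (Cyrillic covers Russian, Ukrainian, Bulgarian and more), so this can only pick which per-script rule to try.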

I don't think that this sort of thing is a good anti-spam mechanism at all. It's as likely to false-positive as it is to spot spam, which is after all very often proper words. Varying spoiler fields (CSS-hidden inputs that must be left blank but won't be by bots) and one-use or limited-time submission tokens are much more likely to be effective.
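A minimal sketch of those two ideas, under some assumptions: the field names website and token, the function names and the secret are all hypothetical, and the token here is only time-limited. A strictly one-use token would also need server-side storage of the tokens already spent.

```php
<?php
// Hypothetical: issue a token of the form "<timestamp>:<HMAC of timestamp>".
function make_form_token(string $secret, int $now): string
{
    return $now . ':' . hash_hmac('sha256', (string) $now, $secret);
}

// Hypothetical: check the honeypot field and the token age.
function is_plausible_submission(array $post, string $secret, int $now, int $maxAge = 3600): bool
{
    // Honeypot: the CSS-hidden "website" input must come back empty.
    if (($post['website'] ?? '') !== '') {
        return false;
    }
    $token = $post['token'] ?? '';
    if (!is_string($token) || !preg_match('/^(\d{1,12}):([0-9a-f]{64})\z/', $token, $m)) {
        return false;
    }
    // Reject tokens whose signature does not match the timestamp.
    if (!hash_equals(hash_hmac('sha256', $m[1], $secret), $m[2])) {
        return false;
    }
    // Reject tokens from the future or older than $maxAge seconds.
    $age = $now - (int) $m[1];
    return $age >= 0 && $age <= $maxAge;
}

$secret = 'replace-with-a-real-secret'; // assumption: kept out of version control
$token  = make_form_token($secret, time());
var_dump(is_plausible_submission(['website' => '', 'token' => $token], $secret, time()));
```

Passing the current time in as a parameter keeps the check easy to test. In a real form you would render the token into a hidden input and call the check with $_POST.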

2
erenon On

Hmm, personally I don't find a spam filter like yours too effective. IMO it is much better to watch for links, strong words, and sexual/warez-related words, since spam often contains them. You could restrict the right to comment to registered users, and as a moderator you could delete comments before they show up if they come from an untrusted (i.e. unregistered) source.
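As a rough sketch of that kind of check, something like the following could count links and blocklisted words. The function name spam_signals and the word list are placeholders, and the thresholds for acting on the counts are left to you.

```php
<?php
// Hypothetical helper: counts links and blocklisted words in a comment.
function spam_signals(string $text, array $blockedWords): array
{
    // Each URL is consumed whole, so "http://www.example.com" counts once.
    $links = preg_match_all('~(?:https?://|www\.)\S+~i', $text);

    $hits = 0;
    foreach ($blockedWords as $word) {
        // \b keeps "war" from matching inside "software".
        if (preg_match('/\b' . preg_quote($word, '/') . '\b/iu', $text) === 1) {
            $hits++;
        }
    }
    return ['links' => (int) $links, 'blocked_words' => $hits];
}

print_r(spam_signals('Free WAREZ at http://www.example.com', ['warez', 'casino']));
```

A comment with non-zero counts from an unregistered user could then be held for moderation instead of being rejected outright.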

0
jheddings On

You could use the normalizer to find strings with accented characters:

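<?php
    // normalizer_is_normalized() alone returns true for precomposed accents (form C),
    // so decompose first and look for combining marks instead
    $decomposed = normalizer_normalize($input, Normalizer::FORM_D);
    if ($decomposed !== false && preg_match('/\p{M}/u', $decomposed)) {
        // handle input containing accented characters
    }
?>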

If needed, you could also use this class to normalize strings to search for vowels:

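<?php
    // FORM_D decomposes "é" into "e" plus a combining accent, so the plain vowel class can match
    $norm = normalizer_normalize($input, Normalizer::FORM_D);
    if ($norm === false || ! preg_match('/[aeiou]/', $norm)) {
        // handle no-vowels in input (or input that is not valid UTF-8)
    }
?>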

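You'll also want to read about the default normalization form and make sure that it satisfies your requirements. The default is form C (composed), which keeps "é" as a single character, so matching plain vowels requires asking for form D explicitly.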