Why are mb_strlen and strlen incorrect for a POST value in PHP?


With this code, when I type 漢字 into an input element with type text and name text and press the submit button, it shows mb_strlen : 16 and strlen : 16

<?php
include("connect.php");
if(isset($_POST["submit"]))
{
    $string = mysqli_real_escape_string($db_mysqli,$_POST['text']);
    //$string = "漢字";

    echo $string."<BR>";
    echo "mb_strlen : ".mb_strlen($string, 'utf-8')."<BR>";
    echo "strlen : ".strlen($string)."<BR>";

    if(strlen($string) != mb_strlen($string, 'utf-8'))
    { 
        echo "Please enter English words only:(";
    }
    else 
    {
        echo "OK, English Detected!";
    }
}
?>

<form method="post" ENCTYPE = "multipart/form-data">
<input type="text" name="text">
<input type="submit" name="submit" value="OK" id="button-blue" style=" float: none; ">
</form>

I want to know why the values from the code above are incorrect and how to fix them.

When I use the following code instead, with the string hard-coded, it shows mb_strlen : 2 and strlen : 6

<?php
    $string = "漢字";

    echo $string."<BR>";
    echo "mb_strlen : ".mb_strlen($string, 'utf-8')."<BR>";
    echo "strlen : ".strlen($string)."<BR>";

    if(strlen($string) != mb_strlen($string, 'utf-8'))
    { 
        echo "Please enter English words only:(";
    }
    else 
    {
        echo "OK, English Detected!";
    }
?>

There is 1 answer

Muhammad Abdul-Rahim

This answer likely has some gotchas and may need later revision, but instead of comparing strlen values we can use a regular expression to check whether the input string consists entirely of non-Latin characters.

Code:

$string = '漢字';
$matches = array();
$pattern = '/^[^\p{Latin}]+$/u';
preg_match($pattern, $string, $matches);
print_r($matches);

Results:

Array
(
    [0] => 漢字
)

When I tested with the string This is a Latin string jasDLFKL@##$&()@!!! I got an empty array back, because it contains Latin letters. I don't believe this is a foolproof solution; it is more of a good first step.
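To show where the pattern falls short, here is a rough sketch that runs it against a few more inputs and compares it with a plain printable-ASCII check. The helper names has_no_latin_letters and is_printable_ascii are hypothetical, made up for this illustration, and the file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical helpers for illustration; not part of the original answer.

// True when the string contains no Latin-script letters at all.
// Digits and punctuation are not Latin letters, so they pass too.
function has_no_latin_letters(string $s): bool
{
    return preg_match('/^[^\p{Latin}]+$/u', $s) === 1;
}

// True when every byte is printable ASCII (space through tilde).
function is_printable_ascii(string $s): bool
{
    return preg_match('/^[\x20-\x7E]*$/', $s) === 1;
}

$samples = ['漢字', 'hello', 'abc漢字', '12345'];
foreach ($samples as $s) {
    printf(
        "%s: no-latin=%s ascii=%s\n",
        $s,
        var_export(has_no_latin_letters($s), true),
        var_export(is_printable_ascii($s), true)
    );
}
```

The mixed string abc漢字 fails both checks, and 12345 passes both. So if the goal is "English only", the ASCII check may be closer to what is wanted than the non-Latin pattern.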

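Please note that \p{Latin} is the Unicode script property for Latin-script letters (A to Z, a to z, and accented letters such as é), not the range U+0000–U+007F. Digits, spaces, and punctuation belong to the Common script, so this pattern treats them as non-Latin. This Regex Tutorial Page goes into detail about Unicode. Also note that my pattern has a u flag, for Unicode. That is necessary so the pattern and the subject are treated as UTF-8.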
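As for why the form version reports 16 for both functions, one plausible explanation is the page encoding. This is an assumption, since the question does not show which charset the page is served with. When a form is submitted in an encoding that cannot represent a character, browsers replace that character with a numeric character reference. 漢字 would then arrive as &#28450;&#23383;, which is 16 ASCII characters. A minimal sketch of that arithmetic, with the posted value hard-coded as an assumption:

```php
<?php
// Assumption: what the browser would send for 漢字 from a non-UTF-8 page.
$posted = '&#28450;&#23383;';

echo strlen($posted), "\n";              // 16 bytes
echo mb_strlen($posted, 'UTF-8'), "\n";  // 16 characters, all ASCII

// Decoding the references recovers the two original characters.
$decoded = html_entity_decode($posted, ENT_QUOTES | ENT_HTML5, 'UTF-8');

echo mb_strlen($decoded, 'UTF-8'), "\n"; // 2 characters
echo strlen($decoded), "\n";             // 6 bytes (3 per character in UTF-8)
```

If this is the cause, serving the page as UTF-8 should make the browser post the raw UTF-8 bytes. For example, send header('Content-Type: text/html; charset=utf-8') before any output, or add accept-charset="UTF-8" to the form.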