How to remove these iconv-translated ASCII question mark characters from this string?


I'm translating user-submitted strings from UTF-8 to ASCII-Printable:

$str = 'Thê qúïck  brõwn fõx júmps? Óvér thé lázy dõg?';

$out = iconv('UTF-8', 'ASCII//TRANSLIT', $str);

var_dump($out);

// var_dump() shows: "The quick ? brown fox jumps?? Over the lazy dog??"

I want to remove the extra question marks that iconv added to $out, while keeping the question marks that were already in $str.

if ($out !== $str && strpos($out, '?') !== false) {
    // The input string was modified and contains at least one question mark
    //
    // Not even really sure where to begin
    //
    // Do we need to compare the position of every character from the
    // original string to every position of the new string and replace
    // where the original string did not contain a question mark?
    //
    // That's all I can think of, but there has to be a better way.
}

I want to keep all //TRANSLIT characters, including the few in the example $str above, e.g. áéïõú = aeiou. There is no other nuance to this question. I think it boils down to a string comparison and replacement problem.

I'm not necessarily looking for someone to write the entire code, just a pointer in the right direction of how you'd tackle this.


There are 2 answers

3
Olivier (BEST ANSWER)

Here is a solution based on transliterator_transliterate(), which requires the intl extension:

$str = transliterator_transliterate('Latin-ASCII', 'Thê qúïck  brõwn fõx júmps? Óvér thé lázy dõg?');
$str = preg_replace('/[\x80-\xFF]/', '', $str);
echo $str;

Output:

The quick  brown fox jumps? Over the lazy dog?

Note that the emoji are kept by transliterator_transliterate(), so I used a regex to remove all the remaining non-ASCII characters.
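
The two steps above can be wrapped in one helper. This is only a sketch, assuming the intl extension is loaded. The name to_printable_ascii() is made up for illustration, and the regex is narrowed to printable ASCII (0x20-0x7E), which also drops control characters; that narrowing is my assumption, not part of the answer above.

```php
<?php
// Hypothetical helper (name is illustrative): transliterate to ASCII, then
// drop whatever could not be transliterated, so no '?' is ever substituted.
function to_printable_ascii(string $str): string
{
    // Requires the intl extension. 'Latin-ASCII' maps accented Latin letters
    // to ASCII and leaves other characters (such as emoji) untouched.
    $ascii = transliterator_transliterate('Latin-ASCII', $str);
    if ($ascii === false) {
        return ''; // transliteration failed; a caller may prefer to throw
    }

    // Byte-wise match (no /u flag): removes every byte outside 0x20-0x7E,
    // which covers all bytes of any remaining multi-byte UTF-8 character.
    return preg_replace('/[^\x20-\x7E]/', '', $ascii) ?? '';
}

// "\u{...}" escapes need PHP 7.0+; U+1F600 is an emoji used as sample input.
echo to_printable_ascii("caf\u{00E9} \u{1F600}?"), "\n";
```

Because no placeholder question mark is produced at any point, the only question marks in the result are the ones the user typed.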

0
Jeff

This works for me, although I'm sure there are better solutions that people can come up with.

$str = 'Thê qúïck  brõwn fõx júmps? Óvér thé lázy dõg?';
$out = 'The quick ? brown fox jumps?? Over the lazy dog??';

Output

var_dump(strip_iconv_question_marks($str, $out));

// string(46) "The quick   brown fox jumps?  Over the lazy dog? "

Function

/**
 * strip_iconv_question_marks - Remove question marks left behind by iconv()
 * after translating UTF-8 strings to ASCII strings
 *
 * @param string $str_utf8
 * @param string $str_ascii
 *
 * @return string
 */

function strip_iconv_question_marks($str_utf8, $str_ascii) {
    $arr_utf8 = mb_str_split($str_utf8);
    $arr_ascii = mb_str_split($str_ascii);

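    // Assumes a 1:1 character mapping; stop at the shorter array to avoid undefined offsets
    $count = min(count($arr_utf8), count($arr_ascii));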

    for ($i = 0; $i < $count; $i++) {
        if ($arr_ascii[$i] === '?') {
            if ($arr_utf8[$i] !== '?') {
                $arr_ascii[$i] = ' '; // Prefer blank space over removal
            }
        }
    }
    return implode($arr_ascii);
}

For PHP < 7.4.0

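// Define the polyfill only when the native function (PHP 7.4+) is missing.
// It must run before strip_iconv_question_marks() is called.
if (!function_exists('mb_str_split')) {
    function mb_str_split($str, $len = 1) {
        $arr = [];
        $len = max(1, (int) $len); // guard against a zero or negative chunk length
        $cnt = mb_strlen($str, 'UTF-8');

        // Advance by $len so chunks do not overlap when $len > 1
        for ($i = 0; $i < $cnt; $i += $len) {
            $arr[] = mb_substr($str, $i, $len, 'UTF-8');
        }
        return $arr;
    }
}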
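
The index-by-index comparison in strip_iconv_question_marks() relies on every input character producing exactly one output character. iconv's //TRANSLIT can expand a single character into several (for example, some implementations turn "ß" into "ss"), which would shift the positions. One possible way around that, sketched here under assumptions, is to transliterate one character at a time and skip any character that comes back as a lone "?". The function name translit_drop_unmappable() is hypothetical, and what iconv returns for a given character depends on the iconv implementation and the current LC_CTYPE locale.

```php
<?php
// Hypothetical variant (name is illustrative): transliterate one character
// at a time so input and output never have to be aligned by index.
function translit_drop_unmappable(string $utf8): string
{
    // Split into UTF-8 characters; preg_split() returns false on invalid UTF-8.
    $chars = preg_split('//u', $utf8, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return '';
    }

    $out = '';
    foreach ($chars as $char) {
        if (strlen($char) === 1) {
            $out .= $char; // single byte = already ASCII, so a typed '?' is kept
            continue;
        }
        // Assumption: an unmappable character yields '?' or false here.
        $ascii = @iconv('UTF-8', 'ASCII//TRANSLIT', $char);
        if ($ascii !== false && $ascii !== '?') {
            $out .= $ascii; // may be longer than one character
        }
    }
    return $out;
}

var_dump(translit_drop_unmappable('jumps? Over the lazy dog?'));
```

This drops the character instead of substituting a space, and it also drops any non-ASCII character whose genuine transliteration is "?" (an inverted question mark, for instance). Calling iconv() once per character is slower than a single call, which may matter for long strings.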