Sanitizing sentence in PHP with preg_replace


This is my current sentence sanitizing function:

# sanitize sentence
function sanitize_sentence($string) {
    $string = preg_replace("/(?<!\d)[.,!?](?!\d)/", '$0 ', $string); # word,word. > word, word.
    $string = preg_replace("/(^\s+)|(\s+$)/us", "", preg_replace('!\s+!', ' ', $string)); # " hello    hello " > "hello hello"
    return $string;
}

Running some tests with this string:

$string = '     Helloooooo my frieeend!!!What are you doing??    Tell me what you like...........,please. ';

The result is:

echo sanitize_sentence($string);  
Helloooooo my frieeend! ! ! What are you doing? ? Tell me what you like. . . . . . . . . . . , please.

As you can see, I have already managed to meet some of the requirements, but I'm still stuck on some details. The final result should be:

Helloo my frieend! What are you doing? Tell me what you like..., please.

This means that all of these requirements should be met:

  1. There can be only one or three consecutive periods . or ...
  2. There can be only one consecutive comma ,
  3. There can be only one consecutive question mark ?
  4. There can be only one consecutive exclamation mark !
  5. A letter cannot repeat itself more than 2 times in a word. E.g.: mass (right), masss (wrong, and should be converted to mass)
  6. A space should be added always after these characters .,!? This is already working fine!
  7. In the case of 3 consecutive periods, the space is added only after the last period.
  8. Extra spaces (more than one space) should be eliminated and trimmed from both ends of the sentence. This is already working fine!

There are 4 answers

code_monk (best answer)

I think regex is a very appropriate technology for this. It's sanitisation, after all, not grammar or syntax correction.

function sanitize_sentence($i) {

    $o = $i;

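    //  There can be only one or three consecutive periods . or ...
    //  Three or more periods become a temporary ellipsis char, so the next
    //  rule cannot split an exact "..." into ". ."
    $o = preg_replace('/\.{3,}/','… ',$o);
    $o = preg_replace('/\.{2}/','. ',$o);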

    //  There can be only one consecutive ","
    $o = preg_replace('/,+/',', ',$o);

    //  There can be only one consecutive "!"
    $o = preg_replace('/\!+/','! ',$o);

    //  There can be only one consecutive "?"
    $o = preg_replace('/\?+/','? ',$o);  

    //  we just preemptively added a bunch of spaces.
    //  Let's remove any spaces between punctuation marks we may have added
    $o = preg_replace('/([^\s\w])\s+([^\s\w])/', '$1$2', $o);

    //  A letter cannot repeat itself more than 2 times in a word
    $o = preg_replace('/(\w)\1{2,}/','$1$1',$o);

    //  Extra spaces should be eliminated
    $o = preg_replace('/\s+/', ' ', $o);
    $o = trim($o);

    // we want three literal periods, not an ellipsis char
    $o = str_replace('…','...',$o);

    return $o;
}
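
One caveat about the repeated-letter rule: \w also matches digits and underscores, so a number such as 1000 would be shortened to 100. A minimal sketch of a letter-only variant follows. The helper name squeeze_repeated_letters is hypothetical, and the use of \p{L} with the u flag is an assumption that the input is valid UTF-8.

```php
<?php
// Hypothetical helper, not part of the answer above.
function squeeze_repeated_letters(string $s): string {
    // \p{L} matches any Unicode letter, so digits and underscores are left alone.
    // The "u" flag makes PCRE treat pattern and subject as UTF-8; preg_replace
    // returns null on invalid UTF-8, in which case the input is returned unchanged.
    return preg_replace('/(\p{L})\1{2,}/u', '$1$1', $s) ?? $s;
}

echo squeeze_repeated_letters('Helloooooo, that costs 1000 dollars'), PHP_EOL;
// prints: Helloo, that costs 1000 dollars
```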
OnlineCop

I think I'll answer the questions one at a time, since it makes more sense to focus on a single task at a time instead of munging them all together.

For #5, I suggest ([a-z])(\1{0,1})\1* replaced with $1$2 as seen in this example.

It takes the input

     Helloooooo my frieeend!!!What are you doing??    Tell me what you like...........,please. 

and produces output

     Helloo my frieend!!!What are you doing??    Tell me what you like...........,please. 
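
In the same style as the generated code in the other answers, the replacement for #5 could be written roughly as follows. The i flag is an added assumption so that uppercase letters are squeezed too; it is not part of the pattern as quoted above.

```php
<?php
// Pattern from the answer; the trailing "i" flag is an assumption added here.
$re = '/([a-z])(\1{0,1})\1*/i';
$str = '     Helloooooo my frieeend!!!What are you doing??    Tell me what you like...........,please. ';
// $1 is the first letter of a run, $2 is an optional second copy;
// any further copies are matched by \1* and dropped.
$subst = '$1$2';
$result = preg_replace($re, $subst, $str);
echo $result, PHP_EOL;
```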
OnlineCop

For #1 (. or ...), (?<!\.)(\.{3}|\.)\.*\s* can be replaced with $1 (note the trailing space), as can be seen in this example.

This takes

     Helloooooo my frieeend!!!What are you doing??    Tell me what you like...........,please. 

and produces the output

     Helloooooo my frieeend!!!What are you doing??    Tell me what you like... ,please. 

As you can see, you'll get a funky "... ," sequence, which is one more thing you may need to check for. You can check for the occurrence of ".," before you do this cleanup, or ". ," (with a space between) afterward, unless you have another rule that removes multiple punctuation occurrences.

The generated code for this, from the regex101.com site, is the following:

$re = "/(?<!\\.)(\\.{3}|\\.)\\.*\\s*/"; 
$str = "     Helloooooo my frieeend!!!What are you doing??    Tell me what you like...........,please. "; 
$subst = "$1 "; 
$result = preg_replace($re, $subst, $str);
OnlineCop

For #2, #3 and #4, you can search for ([,?!])\1+\s* and replace with $1 (note the space afterward) as in this example.

This takes

     Helloooooo my frieeend!!!What are you doing??    Tell me what you like...........,please. 

and produces

     Helloooooo my frieeend! What are you doing? Tell me what you like...........,please. 

The generated code would look like:

$re = "/([,?!])\\1+\\s*/"; 
$str = "     Helloooooo my frieeend!!!What are you doing??    Tell me what you like...........,please. "; 
$subst = "$1 "; 
$result = preg_replace($re, $subst, $str);
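
The three per-rule patterns from these answers can be chained into one function. This is only a sketch: the function name sanitize_sentence_combined is hypothetical, the i flag on the letter rule is an assumption, and the last three steps (removing a space before punctuation, adding a missing space after it, and collapsing whitespace) are additions needed to get from the intermediate "... ,please" to the result asked for in the question.

```php
<?php
// Hypothetical helper that chains the per-rule patterns from the answers above.
function sanitize_sentence_combined(string $s): string {
    // #5: keep at most two of the same letter in a row ("i" flag is an assumption).
    $s = preg_replace('/([a-z])(\1{0,1})\1*/i', '$1$2', $s);
    // #1: reduce a run of periods to "." or "...", followed by one space.
    $s = preg_replace('/(?<!\.)(\.{3}|\.)\.*\s*/', '$1 ', $s);
    // #2-#4: reduce repeated , ? ! to a single mark, followed by one space.
    $s = preg_replace('/([,?!])\1+\s*/', '$1 ', $s);
    // Added step: drop whitespace in front of punctuation ("... ," becomes "...,").
    $s = preg_replace('/\s+([.,!?])/', '$1', $s);
    // Added step: insert a space after punctuation that is directly followed by
    // something other than whitespace, a digit, or more punctuation.
    $s = preg_replace('/([.,!?])(?=[^\s\d.,!?])/', '$1 ', $s);
    // Added step: collapse runs of whitespace and trim both ends.
    return trim(preg_replace('/\s+/', ' ', $s));
}

$string = '     Helloooooo my frieeend!!!What are you doing??    Tell me what you like...........,please. ';
echo sanitize_sentence_combined($string), PHP_EOL;
// prints: Helloo my frieend! What are you doing? Tell me what you like..., please.
```

The order matters here: the period rule runs before the generic spacing steps so that "..." is never split into separate periods.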