This is my current sentence sanitizing function:
# sanitize sentence
function sanitize_sentence($string) {
    $string = preg_replace("/(?<!\d)[.,!?](?!\d)/", '$0 ', $string); # word,word. > word, word.
    $string = preg_replace("/(^\s+)|(\s+$)/us", "", preg_replace('!\s+!', ' ', $string)); # " hello hello " > "hello hello"
    return $string;
}
Running some tests with this string:
$string = ' Helloooooo my frieeend!!!What are you doing?? Tell me what you like...........,please. ';
The result is:
echo sanitize_sentence($string);
Helloooooo my frieeend! ! ! What are you doing? ? Tell me what you like. . . . . . . . . . . , please.
As you can see, I have already managed to meet some of the requirements, but I'm still stuck on a few details. The final result should be:
Helloo my frieend! What are you doing? Tell me what you like..., please.
This means that all of the following requirements should be met:
- Periods may appear only singly or as a group of three: . or ...
- Commas may not repeat: only a single , is allowed
- Question marks may not repeat: only a single ? is allowed
- Exclamation marks may not repeat: only a single ! is allowed
- A letter cannot repeat more than twice in a row within a word. E.g.: mass (right), masss (wrong, and should be converted to mass)
- A space should always be added after these characters: .,!? This is already working fine!
- In the case of 3 consecutive periods, the space is added only after the last period.
- Extra spaces (more than one space) should be eliminated, and whitespace should be trimmed from both ends of the sentence. This is already working fine!
I think regex is a very appropriate technology for this. It's sanitisation, after all, not grammar or syntax correction.
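To make the requirements above concrete, here is a minimal sketch of one possible regex pipeline, not a definitive solution. The function name sanitize_sentence_sketch is hypothetical. It rests on a few assumptions: the input is valid UTF-8, a run of exactly two periods should collapse to one, and the space is appended once after a whole run of punctuation (so "...," becomes "..., " rather than "... , "), which is what the expected result seems to imply.

```php
<?php
// Hypothetical helper: one possible way to meet the listed requirements.
// Assumes $string is valid UTF-8 (the /u patterns would fail otherwise).
function sanitize_sentence_sketch(string $string): string
{
    // A letter repeated 3+ times in a row is cut down to 2 (masss > mass).
    $string = preg_replace('/(\p{L})\1{2,}/u', '$1$1', $string);

    // Runs of 3 or more periods become exactly "...".
    $string = preg_replace('/\.{3,}/', '...', $string);

    // A run of exactly two periods becomes one (assumption, see lead-in).
    $string = preg_replace('/(?<!\.)\.\.(?!\.)/', '.', $string);

    // Repeated commas, exclamation marks and question marks become one.
    $string = preg_replace('/([,!?])\1+/', '$1', $string);

    // Append one space after each run of punctuation, leaving numbers
    // such as 3.14 or 1,000 alone (same digit lookarounds as the original).
    $string = preg_replace('/(?<!\d)[.,!?]+(?!\d)/', '$0 ', $string);

    // Collapse whitespace and trim both ends.
    return trim(preg_replace('/\s+/u', ' ', $string));
}

$string = ' Helloooooo my frieeend!!!What are you doing?? Tell me what you like...........,please. ';
echo sanitize_sentence_sketch($string), "\n";
// Helloo my frieend! What are you doing? Tell me what you like..., please.
```

The order matters in this sketch: the repeated characters are collapsed first, so that the spacing step sees "!" instead of "!!!" and no longer produces "! ! !". Mixed runs such as "?!" are left untouched, since the requirements do not say how they should be handled.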