php sentence boundaries detection

8.6k views Asked by At

I would like to divide a text into sentences in PHP. I'm currently using a regex, which brings ~95% accuracy and would like to improve by using a better approach. I've seen NLP tools that do that in Perl, Java, and C but didn't see anything that fits PHP. Do you know of such a tool?

6

There are 6 answers

0
LeMoussel On

@ridgerunner I wrote your PHP code in C #

I get like 2 sentences as result :

  • Mr. J. Dujardin régle sa T.V.
  • A. en esp. uniquement

The correct result should be the sentence : Mr. J. Dujardin régle sa T.V.A. en esp. uniquement

and with our test paragraph

string sText = "This is sentence one. Sentence two! Sentence three? Sentence \"four\". Sentence \"five\"! Sentence \"six\"? Sentence \"seven.\" Sentence 'eight!' Dr. Jones said: \"Mrs. Smith you have a lovely daughter!\" The T.V.A. is a big project!";

The result is

index: 0 sentence: This is sentence one.
index: 22 sentence: Sentence two!
index: 36 sentence: Sentence three?
index: 52 sentence: Sentence "four".
index: 69 sentence: Sentence "five"!
index: 86 sentence: Sentence "six"?
index: 102 sentence: Sentence "seven.
index: 118 sentence: " Sentence 'eight!'
index: 136 sentence: ' Dr. Jones said: "Mrs. Smith you have a lovely daughter!
index: 193 sentence: " The T.V.
index: 203 sentence: A. is a big project!

C# code :

                string sText = "Mr. J. Dujardin régle sa T.V.A. en esp. uniquement";
                Regex rx = new Regex(@"(\S.+?
                                       [.!?]               # Either an end of sentence punct,
                                       | [.!?]['""]         # or end of sentence punct and quote.
                                       )
                                       (?<!                 # Begin negative lookbehind.
                                          Mr.                   # Skip either Mr.
                                        | Mrs.                  # or Mrs.,
                                        | Ms.                   # or Ms.,
                                        | Jr.                   # or Jr.,
                                        | Dr.                   # or Dr.,
                                        | Prof.                 # or Prof.,
                                        | Sr.                   # or Sr.,
                                        | \s[A-Z].              # or initials ex: George W. Bush,
                                        | T\.V\.A\.             # or "T.V.A."
                                       )                    # End negative lookbehind.
                                       (?=|\s+|$)", 
                                       RegexOptions.CultureInvariant | RegexOptions.IgnorePatternWhitespace | RegexOptions.Compiled);
                foreach (Match match in rx.Matches(sText))
                {
                    Console.WriteLine("index: {0}  sentence: {1}", match.Index, match.Value);
                }
2
Trav On

As a low-tech approach, you might want to consider using a series of explode calls in a loop, using ., !, and ? as your needle. This would be very memory and processor intensive (as most text processing is). You would have a bunch of temporary arrays and one master array with all found sentences numerically indexed in the right order.

Also, you'd have to check for common exceptions (such as a . in titles like Mr. and Dr.), but with everything being in an array, these types of checks shouldn't be that bad.

I'm not sure if this is any better than regex in terms of speed and scaling, but it would be worth a shot. How big are these blocks of text you want to break into sentences?

0
jisaacstone On

I was using this regex:

preg_split('/(?<=[.?!])\s(?=[A-Z"\'])/', $text);

Won't work on a sentence starting with a number, but should have very few false positives as well. Of course what you are doing matters as well. My program now uses

explode('.',$text);

because I decided speed was more important than accuracy.

1
user723220 On

Build a list of abbreviations like this

$skip_array = array ( 

'Jr', 'Mr', 'Mrs', 'Ms', 'Dr', 'Prof', 'Sr' , etc.

Compile them into a an expression

$skip = '';
foreach($skip_array as $abbr) {
$skip = $skip . (empty($skip) ? '' : '|') . '\s{1}' . $abbr . '[.!?]';
}

Last run this preg_split to break into sentences.

$lines = preg_split ("/(?<!$skip)(?<=[.?!])\s+(?=[^a-z])/",
                     $txt, -1, PREG_SPLIT_NO_EMPTY);

And if you're processing HTML, watch for tags getting deleted which eliminate the space between sentences.<p></p> If you have situations.Like this where.They stick together it becomes immensely more difficult to parse.

7
ridgerunner On

An enhanced regex solution

Assuming you do care about handling: Mr. and Mrs. etc. abbreviations, then the following single regex solution works pretty well:

<?php // test.php Rev:20160820_1800
$split_sentences = '%(?#!php/i split_sentences Rev:20160820_1800)
    # Split sentences on whitespace between them.
    # See: http://stackoverflow.com/a/5844564/433790
    (?<=          # Sentence split location preceded by
      [.!?]       # either an end of sentence punct,
    | [.!?][\'"]  # or end of sentence punct and quote.
    )             # End positive lookbehind.
    (?<!          # But don\'t split after these:
      Mr\.        # Either "Mr."
    | Mrs\.       # Or "Mrs."
    | Ms\.        # Or "Ms."
    | Jr\.        # Or "Jr."
    | Dr\.        # Or "Dr."
    | Prof\.      # Or "Prof."
    | Sr\.        # Or "Sr."
    | T\.V\.A\.   # Or "T.V.A."
                 # Or... (you get the idea).
    )             # End negative lookbehind.
    \s+           # Split on whitespace between sentences,
    (?=\S)        # (but not at end of string).
    %xi';  // End $split_sentences.

$text = 'This is sentence one. Sentence two! Sentence thr'.
        'ee? Sentence "four". Sentence "five"! Sentence "'.
        'six"? Sentence "seven." Sentence \'eight!\' Dr. '.
        'Jones said: "Mrs. Smith you have a lovely daught'.
        'er!" The T.V.A. is a big project! '; // Note ws at end.

$sentences = preg_split($split_sentences, $text, -1, PREG_SPLIT_NO_EMPTY);
for ($i = 0; $i < count($sentences); ++$i) {
    printf("Sentence[%d] = [%s]\n", $i + 1, $sentences[$i]);
}
?>

Note that you can easily add or take away abbreviations from the expression. Given the following test paragraph:

This is sentence one. Sentence two! Sentence three? Sentence "four". Sentence "five"! Sentence "six"? Sentence "seven." Sentence 'eight!' Dr. Jones said: "Mrs. Smith you have a lovely daughter!" The T.V.A. is a big project!

Here is the output from the script:

Sentence[1] = [This is sentence one.]
Sentence[2] = [Sentence two!]
Sentence[3] = [Sentence three?]
Sentence[4] = [Sentence "four".]
Sentence[5] = [Sentence "five"!]
Sentence[6] = [Sentence "six"?]
Sentence[7] = [Sentence "seven."]
Sentence[8] = [Sentence 'eight!']
Sentence[9] = [Dr. Jones said: "Mrs. Smith you have a lovely daughter!"]
Sentence[10] = [The T.V.A. is a big project!]

The essential regex solution

The author of the question commented that the above solution "overlooks many options" and is not generic enough. I'm not sure what that means, but the essence of the above expression is about as clean and simple as you can get. Here it is:

$re = '/(?<=[.!?]|[.!?][\'"])\s+(?=\S)/';
$sentences = preg_split($re, $text, -1, PREG_SPLIT_NO_EMPTY);

Note that both solutions correctly identify sentences ending with a quotation mark after the ending punctuation. If you don't care about matching sentences ending in a quotation mark the regex can be simplified to just: /(?<=[.!?])\s+(?=\S)/.

Edit: 20130820_1000 Added T.V.A. (another punctuated word to be ignored) to regex and test string. (to answer PapyRef's comment question)

Edit: 20130820_1800 Tidied and renamed regex and added shebang. Also fixed regexes to prevent splitting text on trailing whitespace.

1
clutterjoe On

Slight improvement on someone else's work:

$re = '/# Split sentences on whitespace between them.
(?<=                # Begin positive lookbehind.
  [.!?]             # Either an end of sentence punct,
| [.!?][\'"]        # or end of sentence punct and quote.
)                   # End positive lookbehind.
(?<!                # Begin negative lookbehind.
  Mr\.              # Skip either "Mr."
| Mrs\.             # or "Mrs.",
| Ms\.              # or "Ms.",
| Jr\.              # or "Jr.",
| Dr\.              # or "Dr.",
| Prof\.            # or "Prof.",
| Sr\.              # or "Sr.",
| \s[A-Z]\.              # or initials ex: "George W. Bush",
                    # or... (you get the idea).
)                   # End negative lookbehind.
\s+                 # Split on whitespace between sentences.
/ix';
$sentences = preg_split($re, $story, -1, PREG_SPLIT_NO_EMPTY);