How can I split a paragraph into sentences based on the period (.) using the Stanford Parser?


How can I split a paragraph into sentences based on the period (.)? I want to use the Stanford Parser (Java).

For instance, I have this paragraph:

Your skills of writing Paragraph will make you a perfect man. If you look at any printed prose book, you will see that each chapter is divided up into sections, the first line of each being indented slightly to the right. These sections are called Paragraph. Chapters, essays and other prose compositions are broken up into paragraphs, to make the reading of them easier.

After splitting, I want:

Your skills of writing Paragraph will make you a perfect man.

If you look at any printed prose book, you will see that each chapter is divided up into sections, the first line of each being indented slightly to the right.

These sections are called Paragraph.

Chapters, essays and other prose compositions are broken up into paragraphs, to make the reading of them easier.

This is the result I am hoping for. How can I get it using the Stanford Parser?

Best answer, by DevilsHnd - 退した

You don't need to bring in a special parser to do this when you already have the String.split() method. You just need to utilize the proper Regular Expression (RegEx) to carry out the task.

A sentence within a paragraph does not always end with a period. It could end with a Question Mark (?) or an Exclamation Mark (!) instead, and to truly pull out all sentences from a paragraph you need to account for this. Another thing to consider: what if a sentence contains a numerical value with a decimal point, or a web domain, like this:

"Hey folks, listen to this. The value of the item was $123.45 and guess what, she paid all of it in one shot! That www.ebay.com is a real great place to get stuff don't you think? I think I'll stick with www.amazon.com though. I'm not hooked on it but they've treated me great for years."

Looking at the small paragraph above, you can clearly see a few things that need to be considered when splitting it into individual sentences. We can't just base everything on a period (.). We don't want to split monetary values or web domains, and we don't want question or exclamation sentences merged into other sentences.
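To make the problem concrete, here is a minimal sketch of what a split on the bare period does to part of that paragraph. The class name NaiveSplitDemo and the method naiveSplit are made up for this illustration.

```java
class NaiveSplitDemo {
    // Hypothetical helper: splits on every literal period,
    // regardless of what follows it.
    static String[] naiveSplit(String paragraph) {
        return paragraph.trim().split("\\.");
    }

    public static void main(String[] args) {
        String paragraph = "The value of the item was $123.45 and guess what, "
                + "she paid all of it in one shot! That www.ebay.com is a real "
                + "great place to get stuff don't you think?";
        // Brackets make the boundaries of each piece visible.
        for (String piece : naiveSplit(paragraph)) {
            System.out.println("[" + piece + "]");
        }
    }
}
```

The price is cut in two at the decimal point, the domain is cut at each dot, and the exclamation sentence stays glued to the text after it, because only the period is treated as a boundary.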

To break down this example paragraph into individual sentences without damaging content with the String.split() method we can use this Regular Expression:

String[] sentences = paragraph.trim().split("(?<=\\.\\s)|(?<=[?!]\\s)");

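Did you notice that we used the String.trim() method here as well? Some paragraphs start with a Tab or spaces, so we strip those before the split is carried out (just in case). The Regular Expression used within the String.split() method, which relies on Positive Look-Behind, isn't really all that complicated. Here is what it's about:

(?<=\.\s) matches the position that is immediately preceded by a period and one whitespace character, and (?<=[?!]\s) does the same for a question mark or an exclamation mark. The | between them means either one counts as a split point. (In Java source code each backslash is doubled, as in the line above.) A look-behind consumes no characters, so the split removes nothing: every sentence keeps its punctuation and the whitespace character that follows it. The periods in $123.45 and www.ebay.com are not followed by whitespace, so they are left alone.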

If you were to now iterate through the String Array variable named sentences like this:

for (String sentence : sentences) {
    System.out.println(sentence + " \n");
}

your console output should look something like:

Hey folks, listen to this.  

The value of the item was $123.45 and guess what, she paid all of it in one shot!  

That www.ebay.com is a real great place to get stuff don't you think?  

I think I'll stick with www.amazon.com though.  

I'm not hooked on it but they've treated me great for years.
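
Putting the pieces together, here is a minimal, self-contained sketch that applies the same pattern to the paragraph from the question. The class name SentenceSplitter and the method splitSentences are hypothetical names chosen for this example. The extra trim() on each piece is an addition that drops the whitespace the look-behind leaves at the end of each sentence.

```java
import java.util.ArrayList;
import java.util.List;

class SentenceSplitter {
    // Same look-behind pattern as above: split after ". ", "? " or "! ".
    private static final String SENTENCE_BOUNDARY = "(?<=\\.\\s)|(?<=[?!]\\s)";

    // Hypothetical helper: returns the sentences with surrounding
    // whitespace removed and empty pieces skipped.
    static List<String> splitSentences(String paragraph) {
        List<String> sentences = new ArrayList<>();
        for (String part : paragraph.trim().split(SENTENCE_BOUNDARY)) {
            String sentence = part.trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        String paragraph = "Your skills of writing Paragraph will make you a perfect man. "
                + "If you look at any printed prose book, you will see that each chapter "
                + "is divided up into sections, the first line of each being indented "
                + "slightly to the right. These sections are called Paragraph. "
                + "Chapters, essays and other prose compositions are broken up into "
                + "paragraphs, to make the reading of them easier.";
        // Print each sentence followed by a blank line.
        for (String sentence : splitSentences(paragraph)) {
            System.out.println(sentence);
            System.out.println();
        }
    }
}
```

One limit of this approach: the pattern treats every period followed by whitespace as a sentence end, so an abbreviation such as "Dr. Smith" would also be split.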