How to break up a document into sentences with spaCy


How can I break a document (e.g. a paragraph, a book, etc.) into sentences?

For example, how do I turn "The dog ran. The cat jumped" into ["The dog ran", "The cat jumped"] with spaCy?


There are 6 answers

npit (score 4), BEST ANSWER

The up-to-date answer is this:

from spacy.lang.en import English

raw_text = 'Hello, world. Here are two sentences.'
nlp = English()
nlp.add_pipe('sentencizer')  # rule-based sentence segmentation
doc = nlp(raw_text)
sentences = [sent.text.strip() for sent in doc.sents]
Ulad Kasach (score 1)

From spaCy's GitHub support page. Note that this is for spaCy 1.x: `spacy.en` and `Span.string` were removed in later versions; use `spacy.lang.en` and `sent.text` instead.

from __future__ import unicode_literals, print_function
from spacy.en import English  # spaCy 1.x only; removed in 2.x and later

raw_text = 'Hello, world. Here are two sentences.'
nlp = English()
doc = nlp(raw_text)
sentences = [sent.string.strip() for sent in doc.sents]
petezurich (score 0)

For current versions (3.x and above), use the code below to get better segmentation from the statistical model than from the rule-based sentencizer component.

Also note that you can speed up processing and reduce the memory footprint if you include only the pipeline components that are needed for sentence separation.

import spacy

# instantiate pipeline with any model of your choosing
nlp = spacy.load("en_core_web_sm")

text = "The dog ran. The cat jumped. The 2. fox hides behind the house."

# select only the pipeline components needed for sentence segmentation
with nlp.select_pipes(enable=['tok2vec', 'parser', 'senter']):
    doc = nlp(text)

for sentence in doc.sents:
    print(sentence)
user8189050 (score 2)

With spaCy 3.0.1 they changed the pipeline.

from spacy.lang.en import English 

nlp = English()
nlp.add_pipe('sentencizer')


def split_in_sentences(text):
    doc = nlp(text)
    return [str(sent).strip() for sent in doc.sents]
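
As a self-contained usage sketch (repeating the setup above), applied to the question's example:

```python
from spacy.lang.en import English

nlp = English()
nlp.add_pipe('sentencizer')  # rule-based sentence segmentation

def split_in_sentences(text):
    doc = nlp(text)
    return [str(sent).strip() for sent in doc.sents]

print(split_in_sentences('The dog ran. The cat jumped.'))
# ['The dog ran.', 'The cat jumped.']
```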
notDarkMatter (score 0)

Updated to reflect the comments in the first answer

from spacy.lang.en import English

raw_text = 'Hello, world. Here are two sentences.'
nlp = English()
nlp.add_pipe('sentencizer')
doc = nlp(raw_text)
sentences = [sent.text.strip() for sent in doc.sents]
KB_ (score 0)

Answer

import spacy
nlp = spacy.load('en_core_web_sm')

text = 'My first birthday was great. My 2. was even better.'
sentences = list(nlp(text).sents)  # a list of Span objects
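
Note that `doc.sents` yields `Span` objects rather than strings; call `.text` on each span if you need plain strings. A small sketch using the rule-based sentencizer, so no model download is needed:

```python
from spacy.lang.en import English

nlp = English()
nlp.add_pipe('sentencizer')  # rule-based: splits after ., !, ?

doc = nlp('My first birthday was great. My 2. was even better.')
spans = list(doc.sents)                       # spacy.tokens.Span objects
strings = [sent.text for sent in doc.sents]   # plain Python strings
print(strings)
```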

Additional info
This assumes that you have already installed the model "en_core_web_sm" on your system. If not, you can easily install it by running the following command in your terminal:

$ python -m spacy download en_core_web_sm

(See here for an overview of all available models.)

Depending on your data, this can give better results than just using spacy.lang.en.English. One (very simple) comparison example:

import spacy
from spacy.lang.en import English

nlp_simple = English()
nlp_simple.add_pipe('sentencizer')  # spaCy 3 API; in v2 this was add_pipe(create_pipe('sentencizer'))

nlp_better = spacy.load('en_core_web_sm')


text = 'My first birthday was great. My 2. was even better.'

for nlp in [nlp_simple, nlp_better]:
    for i in nlp(text).sents:
        print(i)
    print('-' * 20)

Outputs:

>>> My first birthday was great.
>>> My 2.
>>> was even better.
>>> --------------------
>>> My first birthday was great.
>>> My 2. was even better.
>>> --------------------
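
As a quick programmatic check, `Doc.has_annotation('SENT_START')` reports whether any component has set sentence boundaries (iterating `doc.sents` without them raises an error). A sketch with the rule-based pipeline:

```python
from spacy.lang.en import English

nlp = English()
doc_plain = nlp('Hello, world. Here are two sentences.')
# no sentencizer or parser has run, so no boundaries are set
print(doc_plain.has_annotation('SENT_START'))  # False

nlp.add_pipe('sentencizer')
doc_split = nlp('Hello, world. Here are two sentences.')
print(doc_split.has_annotation('SENT_START'))  # True
print(len(list(doc_split.sents)))              # 2
```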