How can I break a document (e.g. a paragraph or a book) into sentences with spaCy?
For example, turn "The dog ran. The cat jumped"
into ["The dog ran", "The cat jumped"].
From spaCy's GitHub support page:
# Note: legacy spaCy 1.x API; in current releases use spacy.lang.en.English
# (or spacy.load) and sent.text instead of sent.string (see below).
from __future__ import unicode_literals, print_function
from spacy.en import English

raw_text = 'Hello, world. Here are two sentences.'
nlp = English()
doc = nlp(raw_text)
sentences = [sent.string.strip() for sent in doc.sents]
For current versions (3.x and above), use the code below to get the best results from the statistical model rather than the rule-based sentencizer component.
Also note that you can speed up processing and reduce the memory footprint by including only the pipeline components that are needed for sentence segmentation.
import spacy

# instantiate pipeline with any model of your choosing
nlp = spacy.load("en_core_web_sm")

text = "The dog ran. The cat jumped. The 2. fox hides behind the house."

# only select the pipeline components needed for sentence segmentation
# to speed up processing
with nlp.select_pipes(enable=["tok2vec", "parser", "senter"]):
    doc = nlp(text)

for sentence in doc.sents:
    print(sentence)
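If you never need the other components at all, you can also leave them out when loading the model instead of disabling them afterwards. A minimal sketch, assuming spaCy 3.x; the component names passed to exclude are an assumption based on the default en_core_web_sm pipeline, so check nlp.pipe_names for your model version:
import spacy

# Load the model without the components that sentence segmentation does not need.
# The excluded names below are an assumption for en_core_web_sm; adjust them to
# whatever `nlp.pipe_names` reports for your model.
nlp = spacy.load(
    "en_core_web_sm",
    exclude=["tagger", "attribute_ruler", "lemmatizer", "ner"],
)

doc = nlp("The dog ran. The cat jumped. The 2. fox hides behind the house.")
for sentence in doc.sents:
    print(sentence.text)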
Answer
import spacy

nlp = spacy.load('en_core_web_sm')

text = 'My first birthday was great. My 2. was even better.'

# doc.sents yields one Span object per detected sentence
sentences = [i for i in nlp(text).sents]
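If you want plain strings (as in the list shown in the question) rather than Span objects, take the .text of each sentence:
sentences = [sent.text for sent in nlp(text).sents]
# ['My first birthday was great.', 'My 2. was even better.']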
Additional info
This assumes that you have already installed the model "en_core_web_sm" on your system. If not, you can easily install it by running the following command in your terminal:
$ python -m spacy download en_core_web_sm
(See https://spacy.io/models for an overview of all available models.)
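If you are not sure whether the model is already installed, a quick check is simply to try loading it; spacy.load raises an OSError when the model package cannot be found:
import spacy

try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    # Model package not found; install it first with:
    #   python -m spacy download en_core_web_sm
    raise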
Depending on your data, this can lead to better results than just using spacy.lang.en.English. One (very simple) comparison example:
import spacy
from spacy.lang.en import English

nlp_simple = English()
# spaCy 3.x syntax; in spaCy 2.x this was
# nlp_simple.add_pipe(nlp_simple.create_pipe('sentencizer'))
nlp_simple.add_pipe('sentencizer')

nlp_better = spacy.load('en_core_web_sm')

text = 'My first birthday was great. My 2. was even better.'

for nlp in [nlp_simple, nlp_better]:
    for i in nlp(text).sents:
        print(i)
    print('-' * 20)
Outputs:
>>> My first birthday was great.
>>> My 2.
>>> was even better.
>>> --------------------
>>> My first birthday was great.
>>> My 2. was even better.
>>> --------------------
The up-to-date answer (spaCy 3.x) is essentially the approach already shown above: load a model and iterate over doc.sents. A minimal version:
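import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The dog ran. The cat jumped")

# sentences keep their punctuation, e.g. ['The dog ran.', 'The cat jumped']
sentences = [sent.text for sent in doc.sents]
print(sentences)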