spaCy custom component function is never called


I am adding a custom component to spaCy but it never gets called:

import spacy
from spacy.language import Language

@Language.component("custom_sentence_boundaries")
def custom_sentence_boundaries(doc):
    print(".")
    for token in doc[:-1]:
        if token.text == "\n":
            doc[token.i + 1].is_sent_start = True
    return doc

nlp = spacy.load("de_core_web_sm")
nlp.add_pipe("custom_sentence_boundaries", after="parser")
nlp.analyze_pipes(pretty=True)
doc = nlp(text)
sentences = [sent.text for sent in doc.sents]

I get a result in sentences, and the analyzer does list my component, but the custom component seems to have no effect and I never see the dots from the print statement.

Any ideas?

Answer by Talha Tayyab (best answer)

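In the code you pasted, you load the model like this:

nlp = spacy.load("de_core_web_sm")

spaCy does not publish a package with that name (the small German pipeline is de_core_news_sm). For the English pipeline it should be: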

nlp = spacy.load("en_core_web_sm")
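If the package name is in doubt, one way to surface the problem early is to catch the OSError that spacy.load raises when it cannot find a package. This is only a minimal sketch; the helper name load_pipeline is hypothetical and not part of spaCy:

```python
import spacy

def load_pipeline(name):
    """Try to load a spaCy pipeline; return None if the package cannot be found."""
    try:
        return spacy.load(name)
    except OSError as err:
        # spacy.load raises OSError when no installed package or path matches the name.
        print(f"Could not load {name!r}: {err}")
        return None

# "de_core_web_sm" is the name from the question; it is not a published spaCy package.
nlp = load_pipeline("de_core_web_sm")
print(nlp is None)
```

If the load fails here, nothing after it in the script (including the custom component) can have run against that pipeline.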

I ran your code with that change, and the component is called:

import spacy
from spacy.language import Language

@Language.component("custom_sentence_boundaries")
def custom_sentence_boundaries(doc):
    print("...$...")  # printing "...$..." so it is easy to spot in the output
    for token in doc[:-1]:
        if token.text == "\n":
            doc[token.i + 1].is_sent_start = True
    return doc

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("custom_sentence_boundaries", after="parser")
nlp.analyze_pipes(pretty=True)
text = ("When Sebastian Thrun started working on self-driving cars at "
        "Google in 2007, few people outside of the company took him "
        "seriously. “I can tell you very senior CEOs of major American "
        "car companies would shake my hand and turn away because I wasn’t "
        "worth talking to,” said Thrun, in an interview with Recode earlier "
        "this week.")
doc = nlp(text)
sentences = [sent.text for sent in doc.sents]

Output:

Note that ...$... is printed at the bottom, and custom_sentence_boundaries is listed right after parser because we passed after="parser" to add_pipe.

============================= Pipeline Overview =============================

#   Component                    Assigns               Requires   Scores             Retokenizes
-   --------------------------   -------------------   --------   ----------------   -----------
0   tok2vec                      doc.tensor                                          False      
                                                                                                
1   tagger                       token.tag                        tag_acc            False      
                                                                                                
2   parser                       token.dep                        dep_uas            False      
                                 token.head                       dep_las                       
                                 token.is_sent_start              dep_las_per_type              
                                 doc.sents                        sents_p                       
                                                                  sents_r                       
                                                                  sents_f                       
                                                                                                
3   custom_sentence_boundaries                                                       False      
                                                                                                
4   attribute_ruler                                                                  False      
                                                                                                
5   lemmatizer                   token.lemma                      lemma_acc          False      
                                                                                                
6   ner                          doc.ents                         ents_f             False      
                                 token.ent_iob                    ents_p                        
                                 token.ent_type                   ents_r                        
                                                                  ents_per_type                 

✔ No problems found.
...$...
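
To check that a component is invoked without depending on any downloaded model, the same idea can be tried on a blank pipeline. This is a rough sketch under a few assumptions: it uses spacy.blank("en"), which has only a tokenizer, and the component name newline_sentence_boundaries and the call_count dictionary are made up for this illustration:

```python
import spacy
from spacy.language import Language

call_count = {"n": 0}  # hypothetical counter, used only to show the component ran

@Language.component("newline_sentence_boundaries")  # hypothetical component name
def newline_sentence_boundaries(doc):
    call_count["n"] += 1
    for token in doc[:-1]:
        if token.text == "\n":
            # Start a new sentence on the token that follows a newline token.
            doc[token.i + 1].is_sent_start = True
    return doc

# A blank pipeline contains only the tokenizer, so there is no parser here.
nlp = spacy.blank("en")
nlp.add_pipe("newline_sentence_boundaries")

doc = nlp("First line of text\nSecond line of text")
sentences = [sent.text for sent in doc.sents]

print(nlp.pipe_names)   # components in the order they run
print(call_count["n"])  # how many times the component was called
print(sentences)
```

If the counter stays at zero after calling nlp on some text, the component was not part of the pipeline that actually processed it.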