I would like to implement some text manipulation as a pre-processing step for keyphrase extraction. Consider the example below:
import spacy
text = "conversion of existing underground gas storage facilities into storage facilities dedicated to hydrogen-storage"
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
for token in doc:
    print(f'{token.text:{8}} {token.pos_:{6}} {token.tag_:{6}} {token.dep_:{6}} {spacy.explain(token.pos_):{20}} {spacy.explain(token.tag_)}')
result:
conversion   NOUN   NN    ROOT      noun         noun, singular or mass
of           ADP    IN    prep      adposition   conjunction, subordinating or preposition
existing     VERB   VBG   amod      verb         verb, gerund or present participle
underground  ADJ    JJ    amod      adjective    adjective (English), other noun-modifier (Chinese)
gas          NOUN   NN    compound  noun         noun, singular or mass
storage      NOUN   NN    compound  noun         noun, singular or mass
facilities   NOUN   NNS   pobj      noun         noun, plural
into         ADP    IN    prep      adposition   conjunction, subordinating or preposition
storage      NOUN   NN    compound  noun         noun, singular or mass
facilities   NOUN   NNS   pobj      noun         noun, plural
dedicated    VERB   VBN   acl       verb         verb, past participle
to           ADP    IN    prep      adposition   conjunction, subordinating or preposition
hydrogen     NOUN   NN    compound  noun         noun, singular or mass
-            PUNCT  HYPH  punct     punctuation  punctuation mark, hyphen
storage      NOUN   NN    pobj      noun         noun, singular or mass
I would like to detect when a given word (for example, storage) is preceded by a NOUN (as in gas storage) so that I can replace the space character with a hyphen (as is already the case in hydrogen-storage). I don't want to change the space character when the word is preceded by a token whose POS is not NOUN (for example, into storage).
Expected output: "conversion of existing underground gas-storage facilities into storage facilities dedicated to hydrogen-storage"
Is there an efficient way to do this?
Thank you in advance for any help
spaCy provides a rule-based matcher. It lets you define rules to find patterns like a noun followed by a noun.
...which you can use to extract matching sequences (this is pretty much verbatim from the spaCy docs):
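A minimal sketch of such a matcher, assuming spaCy v3 (where Matcher.add takes a key and a list of patterns). The helper names find_noun_pairs and print_noun_pairs are hypothetical and only used for illustration:

```python
from typing import List, Tuple

import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc


def find_noun_pairs(doc: Doc) -> List[Tuple[int, int]]:
    """Return (start, end) token offsets of every NOUN + NOUN sequence in doc."""
    matcher = Matcher(doc.vocab)
    # One pattern: a token tagged NOUN immediately followed by another NOUN.
    matcher.add("NOUN_NOUN", [[{"POS": "NOUN"}, {"POS": "NOUN"}]])
    # The matcher yields (match_id, start, end); overlapping matches are all returned.
    return [(start, end) for _match_id, start, end in matcher(doc)]


def print_noun_pairs(text: str, model: str = "en_core_web_sm") -> None:
    """Tag text with the given pipeline (assumed to be installed) and print each match."""
    nlp = spacy.load(model)
    doc = nlp(text)
    for start, end in find_noun_pairs(doc):
        print(start, end, doc[start:end].text)
```

Calling print_noun_pairs(text) with the text from the question prints the token offsets and the text of each matching span. Which spans match depends on the tags the model assigns.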
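With the POS tags shown in the question, the noun-noun matches for your text are "gas storage" and "storage facilities" (the latter twice).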
There is also functionality to merge tokens using the retokenizer.merge method, but that does not work in this case. Here there are overlapping spans ("gas storage" and "storage facilities" overlap), which result in a ValueError: [E102] Can't merge non-disjoint spans.

You'd have to make sure you only get non-overlapping spans if you want to use spaCy's retokenizer, e.g., by changing the pattern to "a noun, followed by a singular noun" ([{"POS": "NOUN"}, {"TAG": "NN"}]). With the tags shown in the question, that pattern matches only "gas storage", so the merge would work.

If you only need the string, I'd recommend using the matcher to find spans and then a custom function to merge tokens based on these spans, which should be more flexible than the built-in retokenizer.
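One way such a custom function could look, as a sketch rather than a definitive implementation. It assumes spaCy v3 and reuses the "noun followed by a singular noun" pattern; the function name hyphenate_noun_compounds is hypothetical. Instead of merging tokens, it rebuilds the string and swaps the space after the first token of each match for a hyphen:

```python
from spacy.matcher import Matcher
from spacy.tokens import Doc


def hyphenate_noun_compounds(doc: Doc) -> str:
    """Rebuild the text, replacing the space inside each NOUN + NN match with a hyphen."""
    matcher = Matcher(doc.vocab)
    # Any noun followed by a singular noun (tag NN). To target one word only,
    # the second token could instead be {"LOWER": "storage"}.
    matcher.add("NOUN_NN", [[{"POS": "NOUN"}, {"TAG": "NN"}]])
    # Index of the first token of every two-token match.
    first_tokens = {start for _match_id, start, _end in matcher(doc)}

    pieces = []
    for token in doc:
        pieces.append(token.text)
        if token.i in first_tokens and token.whitespace_:
            # Only an existing space is replaced; nothing is inserted otherwise.
            pieces.append("-")
        else:
            pieces.append(token.whitespace_)
    return "".join(pieces)
```

Usage would be hyphenate_noun_compounds(nlp(text)). With the tags shown in the question, only "gas storage" matches, so the result should be the expected string with "gas-storage"; a different model or spaCy version may tag the tokens differently and give other matches.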