I am using SpaCy to find patterns in texts. For some patterns, such as single words, this is straightforward, and I am happy with the results. For example,
import re
import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
matcher.add("Month", [[{"TEXT": "January"}]])
doc = nlp("The date 1 January 2022 can also be written as 1/1/2022.")
for match_id, start, end in matcher(doc):
    match_id_string = matcher.vocab.strings[match_id]
    span = doc[start:end]
    print(match_id_string, span.text, start, end)
will find "January" as the fourth token in doc
and recognize it as a "Month" pattern.
For patterns that are dates, things are a little more complicated. I use the fact that SpaCy allows searching for patterns formulated as regular expressions. Adding the code
matcher.add("Date", [[{"TEXT": {"REGEX": "1/1/2022"}}]])
where "1/1/2022" is a (very simple) regular expression,
will find the date "1/1/2022" in the doc defined above
and recognize it as a "Date" pattern.
But adding
matcher.add("Date", [[{"TEXT": {"REGEX": "1 January 2022"}}]])
will not find the date "1 January 2022".
As explained on SpaCy's website, this is because the matcher only matches on single tokens, while "1 January 2022" spans several tokens. The solution provided by SpaCy is to "match on the doc.text with re.finditer":
for match in re.finditer("1 January 2022", doc.text):
    start, end = match.span()
    span = doc.char_span(start, end)
    print(span.text, start, end)
This finds "1 January 2022" as the character range from indices 9 to 23.
However, I would like to put the matches found using re.finditer
into the usual SpaCy match format,
which is a 3-tuple containing a match ID and the start and end indices of tokens
rather than the start and end indices of a character range.
Question: How can I transform these character indices to token indices?
Does SpaCy provide a method for doing this? I guess that would be ideal, but I didn't find one. Otherwise, are there any other clever tools for doing this? I could try to make one more or less from scratch, but that feels like reinventing the wheel.
You can get the token indices directly from the span: a Span object's start and end attributes are the token indices of its first token and of one past its last token.
Additionally, if you are trying to match a literal phrase, you can just use the PhraseMatcher, which is also supported by the EntityRuler: you pass a string as the pattern instead of a list of token dictionaries.