I am using SpaCy to find patterns in texts. For some patterns, such as single words, this is straightforward, and I am happy with the results. For example,
import re
import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
matcher.add("Month", [[{"TEXT": "January"}]])
doc = nlp("The date 1 January 2022 can also be written as 1/1/2022.")
for match_id, start, end in matcher(doc):
    match_id_string = matcher.vocab.strings[match_id]
    span = doc[start:end]
    print(match_id_string, span.text, start, end)
will find "January" as the fourth token in doc
and recognize it as a "Month" pattern.
For patterns that are dates, things are a little more complicated. I use the fact that SpaCy allows searching for patterns formulated as regular expressions. Adding the code
matcher.add("Date", [[{"TEXT": {"REGEX": "1/1/2022"}}]])
where "1/1/2022" is a (very simple) regular expression,
will find the date "1/1/2022" in the doc defined above
and recognize it as a "Date" pattern.
But adding
matcher.add("Date", [[{"TEXT": {"REGEX": "1 January 2022"}}]])
will not find the date "1 January 2022".
As explained on SpaCy's website, this is because the matcher only matches on single tokens, while "1 January 2022" spans several tokens. The solution provided by SpaCy is to "match on the doc.text with re.finditer":
for match in re.finditer("1 January 2022", doc.text):
    start, end = match.span()
    span = doc.char_span(start, end)
    print(span.text, start, end)
This finds "1 January 2022" as the character range from indices 9 to 23.
However, I would like to put the matches found using re.finditer
into the usual SpaCy match format,
which is a 3-tuple containing a match ID and the start and end indices of tokens
rather than the start and end indices of a character range.
Question: How can I transform these character indices to token indices?
Does SpaCy provide a method for doing this? I guess that would be ideal, but I didn't find one. Otherwise, are there any other clever tools for doing this? I could try to make one more or less from scratch, but that feels like reinventing the wheel.
You can get the token indices directly from the span: a Span object's start and end attributes are the token indices of its first token and of one past its last token.
Additionally, if you are trying to match a literal phrase, you can just use the PhraseMatcher, which is also supported by the EntityRuler: you pass a string as the pattern instead of a list of token dictionaries.