For integer/date values annotated using Prodigy, does the spaCy model learn the range of values as well?

I have a Prodigy session set up to annotate certain numeric values in a document as ages (ranging from 0 to 100). I am only annotating the number itself. Suppose a corrupt value creeps in (an age of 1000 or 22.7): will the model understand that it should not be picked up, even though it appears close to the age text in the document?

In other words, can it learn the range of integer values, and if it does, will that work for dates as well? For instance, take a date of birth in the format dd/mm/yyyy, where all the annotated ones are before 01/01/2000. If the document also contains the date 31/12/2020, will that get picked up too, even though it is nowhere close to the range of the annotated dates?

Thank you

Answer by polm23 (best answer)

Good question! spaCy does not internally represent numeric tokens as numbers, so it has no explicit concept of their values. In that sense it can't tell the difference between valid and invalid values for age.

However, spaCy does use "shape" features when representing tokens, and these will help it recognize valid ages. There are different kinds of shape features, but the one spaCy uses represents a word by converting each character to a symbol for its character type. It works like this:

  • spaCy → xxxXx
  • fish → xxxx
  • Fish → Xxxx
  • 23 → dd
  • 1000 → dddd
  • 22.7 → dd.d

Because of this you could expect that spaCy learns that two-digit numbers are likely to be ages, but numbers with decimals or four digits aren't likely. On the other hand, this doesn't help it differentiate between 100 and 999.
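To make the shape mapping above concrete, here is a minimal pure-Python sketch of it. This is an approximation written for illustration, not spaCy's own code: the function name `word_shape` and the rule that runs of the same symbol are cut off after four characters are assumptions about how the feature behaves. In spaCy itself you would read the value from `token.shape_`.

```python
def word_shape(text: str) -> str:
    """Approximate a spaCy-style word shape (illustrative sketch only)."""
    shape = []
    last = ""   # previous shape symbol
    run = 0     # how many times in a row that symbol has repeated
    for ch in text:
        if ch.isalpha():
            sym = "X" if ch.isupper() else "x"
        elif ch.isdigit():
            sym = "d"
        else:
            sym = ch  # punctuation and other characters are kept as-is
        if sym == last:
            run += 1
        else:
            run = 0
            last = sym
        # Assumption: runs of the same symbol are truncated after 4
        if run < 4:
            shape.append(sym)
    return "".join(shape)


for word in ["spaCy", "fish", "Fish", "23", "1000", "22.7", "100", "999"]:
    print(word, "->", word_shape(word))
```

Note that 100 and 999 come out with the same shape (`ddd`), which is why shape alone can't separate a plausible age from an implausible one with the same number of digits.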

For dates this will not help with determining valid or invalid birthdates. Shape is just one of spaCy's features, but other features like prefix and suffix aren't really going to help with this either.

Since it's easy to verify numeric values in code, what I would suggest is matching broadly in spaCy and then using your own function to check whether dates or ages are valid by parsing them.
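As a rough sketch of that kind of post-processing check, something like the following could be run on the text of each span the model predicts. The helper names `is_valid_age` and `is_valid_dob`, the 0 to 100 age range, and the 01/01/2000 cutoff are assumptions taken from the question, not part of spaCy or Prodigy.

```python
from datetime import date, datetime


def is_valid_age(text: str, low: int = 0, high: int = 100) -> bool:
    """Return True if text parses as a whole number within [low, high]."""
    try:
        value = int(text)  # rejects decimals such as "22.7"
    except ValueError:
        return False
    return low <= value <= high


def is_valid_dob(text: str, cutoff: date = date(2000, 1, 1)) -> bool:
    """Return True if text is a real dd/mm/yyyy date before the cutoff."""
    try:
        parsed = datetime.strptime(text, "%d/%m/%Y").date()
    except ValueError:  # wrong format or impossible date, e.g. 31/02/1990
        return False
    return parsed < cutoff


# Hypothetical predicted span texts, filtered after the model has run
candidate_ages = ["23", "1000", "22.7"]
candidate_dobs = ["15/06/1985", "31/12/2020", "31/02/1990"]

print([a for a in candidate_ages if is_valid_age(a)])  # ['23']
print([d for d in candidate_dobs if is_valid_dob(d)])  # ['15/06/1985']
```

Keeping the range check in plain code means the limits can be changed without re-annotating or retraining anything.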


Outside of spaCy in particular, the question of how NLP models represent numeric values is an increasingly popular research topic. If you'd like to know more about it, this is a recent article on the topic: Do Language Models Know How Heavy an Elephant Is?