How to handle spelling mistakes (typos) in entity extraction in Rasa NLU?


I have a few intents in my training set (nlu_data.md file) with a sufficient number of training examples under each intent. The following is an example:

##intent: SEARCH_HOTEL
- find good [hotel](place) for me in Mumbai

I have added multiple sentences like this. At testing time, all the sentences in the training file work fine. But if an input query contains a spelling mistake, e.g. hotol/hetel/hotele for the keyword hotel, then Rasa NLU is unable to extract it as an entity.

I want to resolve this issue. I am only allowed to change the training data, and I am also restricted from writing any custom component for this.

There are 4 answers

John On

First of all, add samples for the most common typos for your entities as advised here

Beyond this, you need a spellchecker.

I am not sure whether there is a single library that can be used in the pipeline, but if there isn't, you will need to create a custom component. Otherwise, dealing with training data alone is not feasible: you can't create samples for every typo. FuzzyWuzzy is one option, but it is generally slow and doesn't solve all the issues. The Universal Sentence Encoder is another. There are more options for spell correction, but all of them require writing code.
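To illustrate the fuzzy-matching idea named above without pulling in FuzzyWuzzy itself, here is a minimal, dependency-free sketch using Python's standard `difflib`; the vocabulary list and cutoff value are illustrative assumptions, not part of any Rasa API:

```python
import difflib

# Hypothetical vocabulary of entity values you expect to see.
VOCAB = ["hotel", "restaurant", "flight", "taxi"]

def correct(word, vocab=VOCAB, cutoff=0.75):
    """Return the closest vocabulary entry whose similarity ratio
    meets the cutoff, otherwise return the word unchanged."""
    matches = difflib.get_close_matches(word.lower(), vocab, n=1, cutoff=cutoff)
    return matches[0] if matches else word

print(correct("hotol"))  # -> "hotel"
print(correct("xyzzy"))  # -> "xyzzy" (no close match, left unchanged)
```

A real spellchecker would also weigh word frequency and context, which is why the answer above notes that none of these quick fixes solve all the issues.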

Akela Drissner On

To handle spelling mistakes like this in entities, you should add these examples to your training data. So something like this:

##intent: SEARCH_HOTEL
- find good [hotel](place) for me in Mumbai
- looking for a [hotol](place) in Chennai
- [hetel](place) in Berlin please

Once you've added enough examples, the model should be able to generalise from the sentence structure.

If you're not using it already, it also makes sense to use the character-level CountVectorsFeaturizer. That should already be part of the default pipeline described on this page.
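For reference, a pipeline with both word-level and character-level featurizers might look like the sketch below in config.yml. The component names are the ones from Rasa 1.x (the version that used the Markdown training format shown above); check your version's docs before copying:

```yaml
language: en
pipeline:
  - name: WhitespaceTokenizer
  - name: CRFEntityExtractor
  - name: CountVectorsFeaturizer      # word-level features
  - name: CountVectorsFeaturizer      # character-level features
    analyzer: char_wb
    min_ngram: 1
    max_ngram: 4
  - name: EmbeddingIntentClassifier
```

The character n-grams are what let the model treat "hotol" and "hotel" as similar inputs.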

Merve Noyan On

One thing I would highly suggest is using look-up tables with FuzzyWuzzy matching. If you have a limited number of entities (like country names), look-up tables are quite fast, and fuzzy matching catches typos of entities that exist in your look-up table (by searching for typo variations of those entities). There's a whole blog post about it on the Rasa blog. Here's a working implementation of FuzzyWuzzy as a custom component:

import os
import json

from fuzzywuzzy import process as fuzzy_process
from nltk.corpus import stopwords
from rasa.nlu.components import Component  # import path assumes Rasa 1.x

# Stop words from NLTK, used to skip uninformative tokens.
STOP_WORDS = set(stopwords.words('english'))


class FuzzyExtractor(Component):
    name = "FuzzyExtractor"
    provides = ["entities"]
    requires = ["tokens"]
    defaults = {}
    language_list = ["en"]
    threshold = 90

    def __init__(self, component_config=None, *args):
        super(FuzzyExtractor, self).__init__(component_config)

    def train(self, training_data, cfg, **kwargs):
        pass

    def process(self, message, **kwargs):

        entities = list(message.get('entities') or [])

        # Get file path of lookup table in json format
        cur_path = os.path.dirname(__file__)
        if os.name == 'nt':
            partial_lookup_file_path = '..\\data\\lookup_master.json'
        else:
            partial_lookup_file_path = '../data/lookup_master.json'
        lookup_file_path = os.path.join(cur_path, partial_lookup_file_path)

        with open(lookup_file_path, 'r') as file:
            lookup_data = json.load(file)['data']

        tokens = message.get('tokens')

        for token in tokens:

            # Skip stop words; they would only produce noisy matches
            if token.text not in STOP_WORDS:

                fuzzy_results = fuzzy_process.extract(
                                         token.text,
                                         lookup_data,
                                         processor=lambda a: a['value']
                                             if isinstance(a, dict) else a,
                                         limit=10)

                for result, confidence in fuzzy_results:
                    if confidence >= self.threshold:
                        entities.append({
                            "start": token.offset,
                            "end": token.end,
                            "value": token.text,
                            "fuzzy_value": result["value"],
                            "confidence": confidence,
                            "entity": result["entity"]
                        })

        message.set("entities", entities, add_to_output=True)

I didn't implement this myself; it was implemented and validated on the Rasa forum. You then just add it to your NLU pipeline in the config.yml file.
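Registering a custom component is a pipeline entry pointing at its module path. In the sketch below, the module path "components.fuzzy_extractor" is a hypothetical project layout, not something the component above dictates:

```yaml
language: en
pipeline:
  - name: WhitespaceTokenizer
  - name: CountVectorsFeaturizer
  # Hypothetical module path; point it at wherever FuzzyExtractor lives
  - name: "components.fuzzy_extractor.FuzzyExtractor"
```

The component declares `requires = ["tokens"]`, so a tokenizer must appear before it in the pipeline.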

Thusitha On

It's a strange request that they ask you not to change the code or write custom components.

The approach you would have to take is to use entity synonyms. A slight edit of a previous answer:

##intent: SEARCH_HOTEL
- find good [hotel](place) for me in Mumbai
- looking for a [hotol](place:hotel) in Chennai
- [hetel](place:hotel) in Berlin please

This way, even if the user enters a typo, the correct entity will be extracted. If you want this to be foolproof, I do not recommend hand-editing the intents. Use some kind of automated tool for generating the training data, e.g. one that generates misspelled words (typos).
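Such a tool can be sketched in a few lines of Python. The generator below produces simple misspellings (adjacent swaps and dropped letters) and emits them as synonym-annotated training lines in the Markdown format used above; the exact variants it emits and the `place:hotel` annotation are illustrative, and a real tool would also cover substitutions from keyboard adjacency:

```python
import random

def typo_variants(word, n=3, seed=0):
    """Generate simple misspellings of a word via adjacent-character
    swaps and dropped characters. Illustrative only."""
    rng = random.Random(seed)
    variants = set()
    # Adjacent character swaps: "hotel" -> "ohtel", "htoel", ...
    for i in range(len(word) - 1):
        variants.add(word[:i] + word[i + 1] + word[i] + word[i + 2:])
    # Dropped characters: "hotel" -> "otel", "htel", ...
    for i in range(len(word)):
        variants.add(word[:i] + word[i + 1:])
    variants.discard(word)  # a "typo" identical to the word is useless
    return rng.sample(sorted(variants), min(n, len(variants)))

# Emit synonym-annotated training lines for each generated typo.
for typo in typo_variants("hotel"):
    print(f"- looking for a [{typo}](place:hotel) in Chennai")
```

Feeding a batch of such lines into the training file gives the featurizer enough noisy variants to generalise from, without anyone hand-typing misspellings.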