How can I use a spellcheck to correctly identify that a word is missing ñ?
I have tried to use autocorrect, but it does not detect that the ñ is missing:
from autocorrect import Speller
spell = Speller(lang='es')
print(spell('gatto'))
print(spell('ano'))
print(spell('manana'))
gato
ano
manana
I have also tried spellchecker, but that does not detect that the word is misspelled:
from spellchecker import SpellChecker
spell = SpellChecker(language='es')
misspelled = ["gatto", "manana", "ano"]
misspelled = spell.unknown(misspelled)
for word in misspelled:
    print(word, spell.correction(word))
gatto gato
The data for autocorrect lists manana as a correct word, which is why it is not getting corrected. The word ano is a valid word with a somewhat different meaning from año, and a simple spellchecker can't know you don't mean that. The word gatto doesn't, and shouldn't, include ñ. However:

Now, as to why manana is in the dictionary, I can't know for sure; that is a question either for native speakers or for the person who created the frequency data that the module uses. According to that data (version downloaded at the time of this answer), mañana was found 40238 times, and manana 1853 times: much less common, but existing. Similarly, España is 1356943 and Espana is 8297.

The way the autocorrect package works is, if a word being tested is itself a candidate (i.e. if it was found in the frequency list), it is unchanged. If not, then the most frequent among the one-typo candidates is returned. If that too fails, and fast=False, then the two-typo candidates are checked. Since manana is itself in the word list, even though it is much less common than mañana, it will be returned.

The README.md of autocorrect does not specify which dataset was used to count the word frequencies, but suggests for new languages that to get "a bunch of text", "Easiest way is to download wikipedia." If Wikipedia was indeed used, then if Spanish Wikipedia includes the word manana even once, it will not be autocorrected, since it will be considered correct.

As to solutions:
You might create a new frequency list (according to the package's instructions) from a text corpus that you know does not include incorrect words. I am sure the author would value the pull request.

You might use a different spellchecker, for example aspell. The Python package relies on the aspell program, as well as the appropriate language file, being installed on your system. I have not tried using the Spanish aspell, but I believe its dictionary is likely to be more correct.

You might use a word list you know to be correct to prune the autocorrect package's word frequencies. To find the latter, use this in Python:
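A minimal sketch of the pruning idea, under stated assumptions. It locates the directory an installed package lives in using only the standard library, which is where I would start looking for the frequency data, and then shows why pruning helps using the four counts quoted above. The names package_dir, prune_counts, one_edit_away, correct and trusted_words are hypothetical helpers written for this illustration, not part of autocorrect, and correct is only a toy imitation of the lookup described earlier (known word is kept, otherwise the most frequent one-typo candidate wins).

```python
import importlib.util
from pathlib import Path

# Counts quoted in the answer above; the real list has far more entries.
counts = {"mañana": 40238, "manana": 1853, "España": 1356943, "Espana": 8297}

# Hypothetical trusted word list, e.g. loaded from a vetted dictionary file.
trusted_words = {"mañana", "España"}

# Letters tried when generating one-typo candidates (an assumption).
ALPHABET = "abcdefghijklmnñopqrstuvwxyzáéíóúü"


def package_dir(name):
    """Return the install directory of an importable package, or None."""
    spec = importlib.util.find_spec(name)
    if spec is None or spec.origin is None:
        return None
    return Path(spec.origin).parent


def prune_counts(word_counts, trusted):
    """Keep only the entries whose word appears in the trusted list."""
    return {word: n for word, n in word_counts.items() if word in trusted}


def one_edit_away(word):
    """Strings one deletion, replacement, insertion or swap away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    replaces = {a + c + b[1:] for a, b in splits if b for c in ALPHABET}
    inserts = {a + c + b for a, b in splits for c in ALPHABET}
    swaps = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    return deletes | replaces | inserts | swaps


def correct(word, word_counts):
    """Toy imitation of the lookup described above, not autocorrect's code."""
    if word in word_counts:
        return word  # a known word is returned unchanged
    candidates = [w for w in one_edit_away(word) if w in word_counts]
    return max(candidates, key=word_counts.get) if candidates else word


print(package_dir("autocorrect"))  # None if the package is not installed
print(correct("manana", counts))  # manana
print(correct("manana", prune_counts(counts, trusted_words)))  # mañana
```

With the unpruned counts, manana is treated as known and comes back unchanged; once it is pruned away, mañana is the only one-typo candidate left and is returned. How the pruned frequencies are fed back into autocorrect depends on the installed version. Recent releases appear to accept an nlp_data argument to Speller, but treat that as an assumption and check your copy.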