When comparing the similarity of two strings, I want to exclude a list of substrings, for example ignore 'Texas' and 'US'.
I tried to use the isjunk argument of difflib's SequenceMatcher:
from difflib import SequenceMatcher
exclusion = ['Texas', 'US']
sr = SequenceMatcher(lambda x: x in exclusion, 'Apple, Texas, US', 'Orange, Texas, US', autojunk=True).ratio()
print(sr)
The similarity ratio is as high as 0.72, so it is obviously not excluding the unwanted strings.
What is the right way to do this?
I'm not familiar with the package, but out of curiosity I googled it a bit and explored it with some examples of my own. I found something interesting. It is not a solution to your problem, more an explanation for the results you received.
As I found here:
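For reference, the difflib documentation describes ratio() as 2.0*M / T, where T is the total number of elements in both sequences and M is the number of matches. A minimal sketch that recomputes this by hand for the strings from the question (the variable names matched and total are just illustrative):

```python
from difflib import SequenceMatcher

exclusion = ['Texas', 'US']
a = 'Apple, Texas, US'
b = 'Orange, Texas, US'

matcher = SequenceMatcher(lambda x: x in exclusion, a, b, autojunk=True)

# M: total size of all matching blocks (the last block is a zero-length sentinel)
matched = sum(block.size for block in matcher.get_matching_blocks())
# T: total number of elements (characters) in both strings
total = len(a) + len(b)

print(matched, total)           # 12 33
print(2.0 * matched / total)    # about 0.727
print(matcher.ratio())          # same value
```

The 12 matched characters are 'e, Texas, US', so the supposedly excluded words still account for almost all of the similarity.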
So let's take a look at an example:
I got this:
So for this next example, I would have expected to get the same result:
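A minimal sketch of such a comparison. The input strings 'Apple', 'Orange' and 'AppleTexasUS' and the helper describe are assumptions chosen for illustration, not necessarily the ones I used originally:

```python
from difflib import SequenceMatcher

exclusion = ['Texas', 'US']

def is_junk(x):
    # same predicate as the lambda in the question
    return x in exclusion

def describe(a, b):
    """Return the ratio and the non-empty matching blocks as (a, b, size) tuples."""
    matcher = SequenceMatcher(is_junk, a, b, autojunk=True)
    blocks = [(m.a, m.b, m.size) for m in matcher.get_matching_blocks() if m.size]
    return matcher.ratio(), blocks

ratio_short, blocks_short = describe('Apple', 'Orange')
ratio_long, blocks_long = describe('AppleTexasUS', 'Orange')

print(ratio_short, blocks_short)  # 2*1/11, about 0.18, [(4, 5, 1)]
print(ratio_long, blocks_long)    # 2*1/18, about 0.11, [(4, 5, 1)]
```

In both runs the only match is the single 'e', but the second ratio is lower because the longer string increases the total length.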
I expected that the extra TexasUS would be ignored since it is in the exclusion list, and that the ratio would remain the same. But here is what we got: the ratio is lower than in the first example, which at first does not make any sense. If we take a close look at the output, though, we see that the matches are exactly the same! So what is the difference? The length of the strings (the ratio is calculated over the full strings, including the excluded words)! If we stick to the naming convention from the link, T is bigger now.

I suggest filtering the words out yourself before matching them.
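A minimal sketch of that filtering, assuming the strings are comma-separated as in the question. The helper name strip_excluded is hypothetical, and splitting on commas is only one possible way to tokenize:

```python
from difflib import SequenceMatcher

exclusion = {'Texas', 'US'}

def strip_excluded(text, excluded):
    """Split on commas, drop the excluded parts, and join the rest back."""
    parts = [part.strip() for part in text.split(',')]
    return ', '.join(part for part in parts if part not in excluded)

a = strip_excluded('Apple, Texas, US', exclusion)    # 'Apple'
b = strip_excluded('Orange, Texas, US', exclusion)   # 'Orange'

print(a, b)
print(SequenceMatcher(None, a, b).ratio())  # 2*1/11, about 0.18
```

Once the excluded words are removed up front, they contribute neither to the matches nor to the total length, so the ratio reflects only 'Apple' versus 'Orange'.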
Hope you'll find it useful, maybe not to solve your problem, but to understand it (understanding an issue is the first step to the solution!)