TSQL grouping on fuzzy column

155 views Asked by At

I would like to group all the merchant transactions from a single table, and just get a count. The problem is, the merchant, let's say redbox, will have a redbox plus the store number added to the end(redbox 4562,redbox*1234). I will also include the category for grouping purpose.

Category      Merchant
restaurant    bruger king 123 main st
restaurant    burger king 456 abc ave
restaurant    mc donalds * 45877d2d
restaurant    mc 'donalds *888544d
restaurant    subway 454545
travelsubway  MTA
gas station   mc donalds gas
travel        nyc taxi
travel        nyc-taxi

The question: How can I group the merchants when they have address or store locations added on to them.All I need is a count for each merchant.

3

There are 3 answers

0
Mike Christensen On BEST ANSWER

The short answer is there is no way to accurately do this, especially with just pure SQL.

You can find exact matches, and you can find wildcard matches using the LIKE operator or a (potentially huge) series of regular expressions, but you cannot find similar matches nor can you find potential misspellings of matches.

There's a few potential approaches I can think of to solve this problem, depending on what type of application you're building.

First, normalize the merchant data in your database. I'd recommend against storing the exact, unprocessed string such as Bruger King in your database. If you come across a merchant that doesn't match a known set of merchants, ask the user if it already matches something in your database. When data goes in, process it then and match it to an existing known merchant.

Store a similarity coefficient. You might have some luck using something like a Jaccard index to judge how similar two strings are. Perhaps after stripping out the numbers, this could work fairly well. At the very least, it could allow you to create a user interface that can attempt to guess what merchant it is. Also, some database engines have full-text indexing operators that can descibe things like similar to or sounds like. Those could potentially be worth investigating.

Remember merchant matches per user. If a user corrects bruger king 123 main st to Burger King, store that relation and remember it in the future without having to prompt the user. This data could also be used to help other users correct their data.

But what if there is no UI? Perhaps you're trying to do some automated data processing. I really see no way to handle this without some sort of human intervention, though some of the techniques described above could help automate this process. I'd also look at the source of your data. Perhaps there's a distinct merchant ID you can use as a key, or perhaps there exists somewhere a list of all known merchants (maybe credit card companies provide this API?) If there's boat loads of data to process, another option would be to partially automate it using a service such as Amazon's Mechanical Turk.

1
Sewder On

You can use LIKE

SELECT COUNT(*) AS "COUNT", "BURGER KING" 
FROM <tables>
WHERE restaurant LIKE "%king%"

UNION ALL
SELECT COUNT(*) AS "COUNT", "JACK IN THE BOX" 
FROM <tables>
Where resturant LIKE "jack in the box%"

You may have to move the wildcards around depending on how the records were spelled out.

1
Jim Sosa On

It depends a bit on what database you use, but most have some kind of REGEXP_INSTR or other function you can use to check for the first index of a pattern. You can then write something like this

SELECT SubStr(merchant, 1, REGEXP_INSTR(merchant, '[0-9]')), count('x')
  FROM Expenses
 GROUP BY SubStr(merchant, 1, REGEXP_INSTR(merchant, '[0-9]'))

This assumes that the merchant name doesn't have a number and the store number does. However you still may need to strip out any special chars with a replace (like *, -, etc).