I would like to group all the merchant transactions from a single table, and just get a count. The problem is, the merchant, let's say redbox, will have a redbox plus the store number added to the end(redbox 4562,redbox*1234). I will also include the category for grouping purpose.
Category Merchant
restaurant bruger king 123 main st
restaurant burger king 456 abc ave
restaurant mc donalds * 45877d2d
restaurant mc 'donalds *888544d
restaurant subway 454545
travelsubway MTA
gas station mc donalds gas
travel nyc taxi
travel nyc-taxi
The question: How can I group the merchants when they have address or store locations added on to them.All I need is a count for each merchant.
The short answer is there is no way to accurately do this, especially with just pure SQL.
You can find exact matches, and you can find wildcard matches using the
LIKE
operator or a (potentially huge) series of regular expressions, but you cannot find similar matches nor can you find potential misspellings of matches.There's a few potential approaches I can think of to solve this problem, depending on what type of application you're building.
First, normalize the merchant data in your database. I'd recommend against storing the exact, unprocessed string such as Bruger King in your database. If you come across a merchant that doesn't match a known set of merchants, ask the user if it already matches something in your database. When data goes in, process it then and match it to an existing known merchant.
Store a similarity coefficient. You might have some luck using something like a Jaccard index to judge how similar two strings are. Perhaps after stripping out the numbers, this could work fairly well. At the very least, it could allow you to create a user interface that can attempt to guess what merchant it is. Also, some database engines have full-text indexing operators that can descibe things like similar to or sounds like. Those could potentially be worth investigating.
Remember merchant matches per user. If a user corrects bruger king 123 main st to Burger King, store that relation and remember it in the future without having to prompt the user. This data could also be used to help other users correct their data.
But what if there is no UI? Perhaps you're trying to do some automated data processing. I really see no way to handle this without some sort of human intervention, though some of the techniques described above could help automate this process. I'd also look at the source of your data. Perhaps there's a distinct merchant ID you can use as a key, or perhaps there exists somewhere a list of all known merchants (maybe credit card companies provide this API?) If there's boat loads of data to process, another option would be to partially automate it using a service such as Amazon's Mechanical Turk.