I would like to create an algorithm that can detect credit card numbers (CCNs) in various types of files.
The simple approach to finding CCNs is to use regular expressions such as the following:
- Visa: `^4[0-9]{12}(?:[0-9]{3})?$` All Visa card numbers start with a 4. New cards have 16 digits. Old cards have 13.
- MasterCard: `^5[1-5][0-9]{14}$` All MasterCard numbers start with the numbers 51 through 55. All have 16 digits.
- American Express: `^3[47][0-9]{13}$` American Express card numbers start with 34 or 37 and have 15 digits.
- Diners Club: `^3(?:0[0-5]|[68][0-9])[0-9]{11}$` Diners Club card numbers begin with 300 through 305, 36 or 38. All have 14 digits. There are Diners Club cards that begin with 5 and have 16 digits. These are a joint venture between Diners Club and MasterCard, and should be processed like a MasterCard.
- Discover: `^6(?:011|5[0-9]{2})[0-9]{12}$` Discover card numbers begin with 6011 or 65. All have 16 digits.
- JCB: `^(?:2131|1800|35\d{3})\d{11}$` JCB cards beginning with 2131 or 1800 have 15 digits. JCB cards beginning with 35 have 16 digits.
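
For illustration, the patterns listed above could be collected into a lookup table along these lines. This is only a sketch: the names `CARD_PATTERNS` and `card_brand` are made up for the example, and stripping only spaces and dashes is an assumption about how candidates are formatted in the files.

```python
import re

# Patterns copied from the list above. Because of the ^...$ anchors,
# each one expects a bare digit string with no separators.
CARD_PATTERNS = {
    "Visa": re.compile(r"^4[0-9]{12}(?:[0-9]{3})?$"),
    "MasterCard": re.compile(r"^5[1-5][0-9]{14}$"),
    "American Express": re.compile(r"^3[47][0-9]{13}$"),
    "Diners Club": re.compile(r"^3(?:0[0-5]|[68][0-9])[0-9]{11}$"),
    "Discover": re.compile(r"^6(?:011|5[0-9]{2})[0-9]{12}$"),
    "JCB": re.compile(r"^(?:2131|1800|35\d{3})\d{11}$"),
}


def card_brand(candidate):
    """Return the name of the first matching brand, or None."""
    # Assumption: spaces and dashes are the only separators worth stripping.
    digits = re.sub(r"[ -]", "", candidate)
    for brand, pattern in CARD_PATTERNS.items():
        if pattern.match(digits):
            return brand
    return None
```

Something like `card_brand("4111 1111 1111 1111")` would then report `"Visa"`, while a string that matches none of the patterns gives `None`.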
Then we can check each number found with the Luhn (mod 10) algorithm, and if it passes, we can say that we have found a CCN.
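
A minimal sketch of the Luhn check mentioned here, for reference. The function name `luhn_valid` is just a placeholder, and ignoring every non-digit character is an assumption of this example rather than part of the algorithm.

```python
def luhn_valid(number):
    """Return True if the digits in `number` pass the Luhn mod 10 check."""
    # Keep ASCII digits only; separators such as spaces or dashes are ignored.
    digits = [int(ch) for ch in number if "0" <= ch <= "9"]
    if not digits:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as adding the two digits of the product
        total += digit
    return total % 10 == 0
```

Note that roughly one in ten random digit strings passes this check, so on its own it filters out only about 90% of arbitrary numbers of the right length.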
But in my experience, this simple method produces a very high number of false positives and false negatives.
What algorithms or heuristics could be used to reduce the false positives and false negatives? Advanced software like PCI Data Finder or Card Recon provides more reliable results, and those results definitely aren't achieved by simple regular-expression matching and a Luhn check.
You could use a source like BINDB.com to purchase a list of BINs (Bank Identification Numbers) and thereby reduce false positives by only considering cards where the first six (or in some cases eight) digits match an existing card-issuing bank.
If you were only looking for US-issued cards, you could reduce this number substantially further with the same approach.
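
As a rough sketch of that filter, assuming the purchased BIN list has already been loaded into a Python set of digit strings: the helper name `has_known_bin` and the sample set in the test are hypothetical, and the BIN values used there are placeholders, not real issuer data.

```python
def has_known_bin(number, known_bins, prefix_lengths=(8, 6)):
    """Return True if the leading 8 or 6 digits appear in `known_bins`.

    `known_bins` is assumed to be a set of BIN strings loaded from
    whatever list was purchased; its file format is not covered here.
    """
    # Keep ASCII digits only, so "4111 1111 ..." and "4111-1111-..." both work.
    digits = "".join(ch for ch in number if "0" <= ch <= "9")
    return any(
        digits[:length] in known_bins
        for length in prefix_lengths
        if len(digits) >= length
    )
```

Restricting the search to US-issued cards would then just mean passing a smaller `known_bins` set that holds only the BINs of US issuers.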