OCR serial number CRC / checksum algorithm


I want to read a variable-length, base-36 encoded serial number using optical character recognition (OCR). Since recognition errors can occur, one character can be appended as a cyclic redundancy check (CRC) or checksum. One base-36 encoded character can hold up to 36 different values as a checksum.

Serial number example: LNG TZ746B8 + one check character (base-36)

At the moment I have tried different algorithms, such as the Luhn mod N (N = 36) algorithm.
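For reference, here is a minimal sketch of how a Luhn mod 36 check character can be computed, assuming the charset "0-9A-Z"; the function name and error handling are illustrative and not taken from my actual implementation:

#include <stdexcept>
#include <string>

// Sketch: Luhn mod N check character for N = 36, charset "0-9A-Z".
char luhnMod36CheckChar(const std::string& input)
{
    static const std::string charset = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    const int n = static_cast<int>(charset.size());   // 36

    int factor = 2;                                    // rightmost character gets weight 2
    int sum = 0;
    for (auto it = input.rbegin(); it != input.rend(); ++it) {
        std::size_t code = charset.find(*it);
        if (code == std::string::npos)
            throw std::invalid_argument("character not in base-36 charset");
        int addend = factor * static_cast<int>(code);
        addend = addend / n + addend % n;              // digit sum in base n
        sum += addend;
        factor = (factor == 2) ? 1 : 2;                // alternate weights 2, 1, 2, ...
    }
    return charset[(n - sum % n) % n];                 // character that makes the total divisible by n
}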

CRC

CRC is most often used to detect bit errors in network transmissions, where the most likely error is a single flipped bit. In OCR serial-number recognition, a more likely error is the false recognition of an ASCII "8" character (binary 00111000) as a "B" character (binary 01000010), which changes more than one bit. For a CRC, the Hamming distance (HD) is the parameter that describes its error-detection capability. With the 0x05 polynomial described below, the HD is only about 3-4, so an error that changes more bits can go undetected. HD Source
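To make that concrete, here is a small example added only for illustration: counting the bits that differ between the ASCII bytes for "8" and "B" shows that this single OCR confusion already flips five bits.

#include <bitset>
#include <iostream>

// Count how many bits differ between the ASCII bytes '8' (0x38) and 'B' (0x42).
int main()
{
    std::bitset<8> diff('8' ^ 'B');
    std::cout << "'8' = " << std::bitset<8>('8') << "\n"
              << "'B' = " << std::bitset<8>('B') << "\n"
              << "differing bits: " << diff.count() << "\n";   // prints 5
    return 0;
}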

Which polynomial would be best, or what kind of checksum algorithm could be used? At the moment I achieve the best results using the Boost CRC implementation with 0x05 as the divisor polynomial for a 5-bit CRC (36 characters need 6 bits, but the sixth bit is not fully used; for testing I use only the full 5 bits, i.e. 32 different characters). List of polynomials: Wiki Polynomials
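For completeness, this is roughly how that 5-bit CRC can be set up with Boost (a sketch; the wrapper function is added for illustration, only boost::crc_optimal and the 0x05 polynomial are as described above):

#include <boost/crc.hpp>
#include <string>

// 5-bit CRC over the serial's ASCII bytes; the result is in 0..31 and
// therefore fits into a single base-36 check character.
unsigned crc5(const std::string& serial)
{
    boost::crc_optimal<5, 0x05> crc;
    crc.process_bytes(serial.data(), serial.size());
    return crc();
}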

Keywords

What keywords can I use to find information on the internet about error detection in optical character recognition systems? Where can I find statistics on the most likely OCR recognition errors (such as E/3 or B/8 confusions)?

Problem in the use of CRC

36 different values need 6 bits, but not all of the 6 bits are used: 6 bits can hold up to 64 values. Therefore I either have to take the CRC result modulo (%) 36 or use only a 5-bit polynomial. My accuracy drops when I use the modulo, because more bits can hold more distinct checksums, and more distinct checksums mean fewer collisions.
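For illustration only (this helper is not part of my actual code, it just makes the encoding concrete), an integer check value can be mapped to base-36 check characters such as the two-character checksums shown below:

#include <string>

// Encode an integer check value as 'width' base-36 characters,
// e.g. a value in 0..1295 becomes two characters such as "KA".
std::string toBase36(unsigned value, int width)
{
    static const char digits[] = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    std::string out(width, '0');
    for (int i = width - 1; i >= 0; --i) {
        out[i] = digits[value % 36];
        value /= 36;
    }
    return out;
}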

A further problem is that with OCR, similar-looking characters such as N/M, 3/E, and B/8 can easily be mis-recognized. Since my requirement is to recognize all characters 100% correctly, the checksum or CRC algorithm is introduced so that a false recognition cannot go undetected.

Another problem is that different serial numbers, for example "S95I" and "5951", result in the same checksum "GP". Since OCR is prone to 5/S false-recognition errors, such a checksum collision must not occur.

Further examples of different serial numbers that produce the same checksum:

Nr. 1 - Nr. 2 - Checksum 1 - Checksum 2
BSHB - 85H8 - KA - KA
BSJ8 - 85JB - IC - IC
BSQ8 - 85QB - KC - KC
BSSB - 85S8 - IA - IA
BS1B - 8518 - ES - ES
BS38 - 853B - GQ - GQ
BB7I - 8871 - E0 - E0
B930 - 89EO - BI - BI
9331 - 9EEI - EQ - EQ
9EEI - 9331 - EQ - EQ

In my implementation I use the Boost CRC library, which can be found here: Boost CRC.

#include <boost/crc.hpp>
#include <sstream>
#include <string>
using namespace std;

string data = "S95I";
boost::crc_optimal<11, 0x571> crc;            // 11-bit CRC, truncated polynomial 0x571
crc.process_bytes(data.data(), data.size());
stringstream checksum;
checksum << (int)crc() % 1296;                // 1296 = 36^2 possible two-character checksums
string resultCheck = checksum.str();

I would like to know whether there are other algorithms with a lower probability of collisions, or other possibilities for implementing the checksum.

If there are any questions, or if I haven't explained myself well enough, please don't hesitate to ask. I will do my best to reply as soon as possible.

Thank you very much, Christoph


1 Answer

Mark Adler

You don't have enough characters in your check value. Far and away the best thing you can do to reduce the collisions is add more check characters. You should have at least four.
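(For scale: a single base-36 check character can take only 36 values, so a random corruption still passes with probability about 1/36, roughly 2.8%; four check characters give 36^4 = 1,679,616 possible values, on the order of 6·10^-7.)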

I was able to reduce the number of collisions by 20% in the four-character case using a Fletcher-derived check instead of a CRC, thusly:

unsigned long check(char *seq, size_t len)
{
    unsigned ch;
    unsigned long sum = 1, val = 0;

    while (len) {
        ch = seq[--len];                 /* walk the string right to left */
        if (ch <= '9')
            ch -= '0';                   /* map '0'..'9' to 0..9 */
        else
            ch -= 'A' - 10;              /* map 'A'..'Z' to 10..35 */
        sum += ch;                       /* Fletcher-style running sums */
        val += sum;
    }
    return (sum % 36) + 36 * (val % 36); /* pack two base-36 check digits */
}
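As a usage sketch (added for illustration, not part of the original answer), the returned value can be rendered as two base-36 check characters:

#include <cstdio>
#include <cstring>

// Usage sketch: compute the check value for a serial and print it as two
// base-36 characters (the low digit comes from sum, the high digit from val).
int main()
{
    static const char digits[] = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    char serial[] = "S95I";
    unsigned long c = check(serial, std::strlen(serial));
    std::printf("check characters: %c%c\n", digits[c % 36], digits[c / 36]);
    return 0;
}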

I further reduced the number of collisions by a factor of about six by biasing the check to only distinguish the characters likely to have a recognition error, thusly:

unsigned long check(char *seq, size_t len)
{
    unsigned ch;
    unsigned long sum = 1, val = 0;
    static char incl[256] = {
        /* easily confused characters map to 1..12; all other characters map to 0 */
        ['0'] = 1, ['1'] = 2, ['3'] = 3, ['5'] = 4, ['8'] = 5, ['M'] = 6,
        ['B'] = 7, ['E'] = 8, ['I'] = 9, ['N'] = 10, ['O'] = 11, ['S'] = 12
    };

    while (len) {
        ch = incl[(int)(seq[--len])];
        sum += ch;
        val += sum;
    }
    return (sum % 36) + 36 * (val % 36);
}
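Note that characters not listed in incl map to 0, so substituting one unlisted character for another goes undetected; the full 36 x 36 range of check values is spent on distinguishing the twelve confusable characters and their positions, which is presumably where the factor-of-six improvement comes from.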