Improving on 'if/else' strcmp() ladder to determine a useable value

220 views Asked by At
//...
if( strcmp( str, "January" ) == 0 )
    month = 1;
else if( strcmp( str, "February") == 0 )
    month = 2;
//...

Q: Is there any more efficient way of determining that, for instance, "April" is the fourth month of the year? Repeated calls to strcmp() must be terribly inefficient, and if/else ladders tedious to code. Sometimes it's "March" and sometimes it's abbreviated as "MAR"... There must be a better way...

Putting the known strings in a sorted array of structs would allow, at least, binary searching, but still involves a lot of guesswork from the code.

3

There are 3 answers

14
Fe2O3 On BEST ANSWER

This is a Can I answer my own question? answer. Other answers are welcomed.


There are several ways of translating an arbitrary string from a finite set of strings into a concise, useable form. Most of these involve an iterative (or a sub-optimal linear) search involving repeated comparisons (that may need to account for case sensitivity.)

A response to my response to a recent question suggested "sharing" an (admittedly arcane) hashing function that, with awareness of false positives, return the month ordinal (1-12) when passed a string containing the name of a month (English) in 7-bit ASCII. The function performs primitive operations on the 2nd & 3rd character and out-pops the function's hash value of the string. Note, "January", "jan" and "JAN" all return the value 1. Likewise "feb", "FEBRUARY" and "Feb" would return the value 2.

static int monthOrd( char cp[] ) {
    return "DIE@CB@LJF@HAG@K"[ cp[1]/4&7 ^ cp[2]*2 &0xF ] &0xF;
}

The operations shown were uncovered through a "brute force" permutation of a number of primitive operations seeking a combination that would return 12 different values between 0x0 and 0xF (4 bits). The reader is encouraged to take apart each step of the mangling of the bits of the two ASCII characters. This result was not "invented", but was "discovered".

After the bits of two characters have been mangled, the value is used as an index into a string (aka "a cheap LUT") whose 12 letters A-L are positioned so that "?an" (January) will mangle to an index for the letter 'A'. Masking the low 4 bits of that letter yields the value 1 as the ordinal for the string "JANUARY"... 1 will be the return value when the function is passed variations of the string "Jan".

NB: Using this function allows the caller to check that the string is indeed "JAN", "jan", "January" as suits the application. The caller need not try to match any of the names of the other 11 months. This function WILL return the false positive value 1 for the string "Random", so the caller need only validate against a single month's name (length and case appropriate to the application.)


Bonus round:

static int wkdayOrd( char cp[] ) {
    return "65013427"[*cp/2 + ~cp[1] & 0x7] & 0x7;
}

An equivalent function that converts "Sun(day)" (case insensitive) to 1, "MON" to 2, "tue" to 3, etc...

Again, the caller must confirm the string against only ONE day's name to avoid "false positives".


While we're here, the following is an equivalent function for the "number names" from "zero" to "ten", again, case insensitive. (Number names are not abbreviated like month names or weekday names.)

static int numberOrd( char cp[] ) {
    return "@~IBAH~FCGE~~DJ~"[ ( cp[0] ^ cp[1]/2 + cp[2]*4 ) & 0xF ] & 0xF;
}

EDIT
To counter the nay-sayers, here's another one:

static int ZodiacOrd( char cp[] ) {
    return "BJGA@@HIECK@@DLF"[(cp[0]/2 ^ (cp[1]/2&1) + cp[2]*2) & 0xF] & 0xF;
}

Pass it the name (case ambivalent) of one of the twelve Zodiac signs, and it will return the ordinal of that star sign ("Aries" = 1, ...) Again, like any hashing function, there will be collisions with other strings. The caller need only subsequently check against a single known string; not twelve.


Update:
Drawn back to this for a recent task, an improvement to the month ordinal conversion revealed itself. The code above uses 16(+1) bytes for its LUT. The code below halves that amount. For posterity, the scheme is explained, with examples provided.

NB: Above returns 1-12 for valid a month name (or false positive) and 0 for other hash values in the range 0-15. Below returns 0-11 for valid month names (conforming to struct tm convention) and 12 for each of the four other hashes in the range 0-15. False positives don't magically go away.

/*
'#' denotes hashing the ASCII values of 2nd & 3rd letters

Apr: (p#r) =>  0   /want    3 = 0011 -> 0x3
Sep: (e#p) =>  1   /want    8 = 1000 -> 0x8
May: (a#y) =>  2   /want    4 = 0100 -> 0x4
---: (-#-) =>  3   /dont care = 1100 -> 0xc // 12 out of range
Mar: (a#r) =>  4   /want    2 = 0010 -> 0x2
Feb: (e#b) =>  5   /want    1 = 0001 -> 0x1
---: (-#-) =>  6   /dont care = 1100 -> 0xc // 12 out of range
Dec: (e#c) =>  7   /want   11 = 1011 -> 0xB
Oct: (c#t) =>  8   /want    9 = 1001 -> 0x9
Jun: (u#n) =>  9   /want    5 = 0101 -> 0x5
---: (-#-) => 10   /dont care = 1100 -> 0xc // 12 out of range
Aug: (u#g) => 11   /want    7 = 0111 -> 0x7
Jan: (a#n) => 12   /want    0 = 0000 -> 0x0
Jul: (u#l) => 13   /want    6 = 0110 -> 0x6
---: (-#-) => 14   /dont care = 1100 -> 0xc // 12 out of range
Nov: (o#v) => 15   /want   10 = 1010 -> 0xA

==>> uint64_t LUT = 0xAc607c59Bc12c483;
*/

#include <stdio.h>
#include <stdint.h>

int main( void ) {
    const uint64_t LUT = 0xAc607c59Bc12c483;

    char *mon[] = {
        "jan", "feb", "mar", "apr", "may", "jun",
        "JUL", "AUG", "SEP", "OCT", "NOV", "DEC",
        NULL
    };

    for( char **pm = mon; *pm; pm++ ) {
    //  char a = pm[0][0]; // irrelevant
        char b = pm[0][1];
        char c = pm[0][2];

        int bmngl  = b >> 2 & 0x7; // mangle the bits of 2nd character
        int cmngl  = c << 1;       // mangle the bits of 3rd character
        int LUTidx = bmngl ^ cmngl & 0xF; // combine and mask

        int rShift = 4 * LUTidx; // will right shift LUT 4*n bits

        int monOrd = (int)( LUT >> rShift ) & 0xF; // shift and mask

        printf("%s: hash = %2d ==> LU1 = %2d, LU2 = %2d, LU3 = %2d\n",
            pm[0],
            LUTidx,
            monOrd,
            // casting to int simulates return: int monthOrd( char *name );
            (int)( LUT >> 4 * ( ( b / 4 & 7 ) ^ ( c * 2 ) & 15 ) & 15 ),
            (int)( LUT>>4*(b/4&7^c*2&15)&15 ) // confident of precedence
        );
    }

    puts( "\nFalse positives:" );
    char *junk[] = { "COVID-19", "junk", "technical", "fopbar", NULL };

    for( char **pj = junk; *pj; pj++ ) {
        char b = pj[0][1];
        char c = pj[0][2];
        printf("%10s: LU = %2d\n", pj[0], (int)( LUT>>4*(b/4&7^c*2&15)&15 ) );
    }

    return 0;
}

Output:

jan: hash = 12 ==> LU1 =  0, LU2 =  0, LU3 =  0
feb: hash =  5 ==> LU1 =  1, LU2 =  1, LU3 =  1
mar: hash =  4 ==> LU1 =  2, LU2 =  2, LU3 =  2
apr: hash =  0 ==> LU1 =  3, LU2 =  3, LU3 =  3
may: hash =  2 ==> LU1 =  4, LU2 =  4, LU3 =  4
jun: hash =  9 ==> LU1 =  5, LU2 =  5, LU3 =  5
JUL: hash = 13 ==> LU1 =  6, LU2 =  6, LU3 =  6
AUG: hash = 11 ==> LU1 =  7, LU2 =  7, LU3 =  7
SEP: hash =  1 ==> LU1 =  8, LU2 =  8, LU3 =  8
OCT: hash =  8 ==> LU1 =  9, LU2 =  9, LU3 =  9
NOV: hash = 15 ==> LU1 = 10, LU2 = 10, LU3 = 10
DEC: hash =  7 ==> LU1 = 11, LU2 = 11, LU3 = 11

False positives:
  COVID-19: LU = 10 // not "November" or any other month name
      junk: LU =  5 // not "June" or any other month name
 technical: LU = 11 // not "December" or any other month name
    fopbar: LU = 12 // outside of range 0-11. definitely not a month name!

UPDATED Update:
Inspired by the clarity of code in the recent answer by @chrqlie, there's this... It compiles to the same amount of assembler, but doesn't have the baggage of additional data. (Requiring a 64bit machine, it's probably not suitable for MCUs.) Again, the size of the LUT is only 8 bytes...

static int monthOrd( char *mn ) {
#define HASH(a,b,c) ((((b + b + c)&15) - ((b & 3) == 3))*4)
    static const uint64_t LUT
        = (uint64_t) 1 << HASH('j','a','n')
        | (uint64_t) 2 << HASH('f','e','b')
        | (uint64_t) 3 << HASH('m','a','r')
        | (uint64_t) 4 << HASH('a','p','r')
        | (uint64_t) 5 << HASH('m','a','y')
        | (uint64_t) 6 << HASH('j','u','n')
        | (uint64_t) 7 << HASH('j','u','l')
        | (uint64_t) 8 << HASH('a','u','g')
        | (uint64_t) 9 << HASH('s','e','p')
        | (uint64_t)10 << HASH('o','c','t')
        | (uint64_t)11 << HASH('n','o','v')
        | (uint64_t)12 << HASH('d','e','c')
        ;

    return -1 + (0xF & (int)(LUT >> HASH('x', mn[1], mn[2])));
#undef HASH
}

Further update: The first version of monthOrd() calculated a 4 bit hash and returned 0 for 4/16 cases where the passed string bore no resemblance to the name of a month. (It returned 1-12 for 12/16 cases that hashed to match the name of a particular month.) This was 'improved' in a later version, returning 0-11 to follow the convention of struct tm. Unfortunately, this meant the caller would have to use strcmp() to discern "Jan(uary)" (0) from the 4 wannabees (also 0.) Not good!

The most recent version (just edited) still returns the desired 0-11 for the 12 hashed month names (including false positives, if those are a possibility). The edited version now returns -1 for the 25% of the cases where the string could not possibly be the name of a month (ie. the string's hash value shows it to be a wannnabee.)
The caller need not invoke strcmp() when monthOrd( str ) returns -1. Shaving cycles...

0
Costantino Grana On

I've checked what happens with gperf passing it all months as "January", "Jan", "JANUARY", "JAN", "january", "jan", and so on.

struct months {
char *name;
int number;
};

#define TOTAL_KEYWORDS 69
#define MIN_WORD_LENGTH 3
#define MAX_WORD_LENGTH 9
#define MIN_HASH_VALUE 3
#define MAX_HASH_VALUE 218
/* maximum key range = 216, duplicates = 0 */

#ifdef __GNUC__
__inline
#else
#ifdef __cplusplus
inline
#endif
#endif
static unsigned int
hash (register const char *str, register size_t len)
{
  static unsigned char asso_values[] =
    {
      219, 219, 219, 219, 219, 219, 219, 219, 219, 219,
      219, 219, 219, 219, 219, 219, 219, 219, 219, 219,
      219, 219, 219, 219, 219, 219, 219, 219, 219, 219,
      219, 219, 219, 219, 219, 219, 219, 219, 219, 219,
      219, 219, 219, 219, 219, 219, 219, 219, 219, 219,
      219, 219, 219, 219, 219, 219, 219, 219, 219, 219,
      219, 219, 219, 219, 219,  10,  80,  75,  95,   5,
      125,  95, 219, 219,   5, 219,  95,  55,  45,  60,
       60, 219,  85,  95,  50,  90,  25, 219, 219,  12,
      219, 219, 219, 219, 219, 219, 219,   0,  40,  35,
       35,  35,  40,  25, 219, 219,   0, 219,  10,  50,
        0,  25,  15, 219,  15,  35,  30,  10,  25, 219,
      219,  25, 219, 219, 219, 219, 219, 219, 219, 219,
      219, 219, 219, 219, 219, 219, 219, 219, 219, 219,
      219, 219, 219, 219, 219, 219, 219, 219, 219, 219,
      219, 219, 219, 219, 219, 219, 219, 219, 219, 219,
      219, 219, 219, 219, 219, 219, 219, 219, 219, 219,
      219, 219, 219, 219, 219, 219, 219, 219, 219, 219,
      219, 219, 219, 219, 219, 219, 219, 219, 219, 219,
      219, 219, 219, 219, 219, 219, 219, 219, 219, 219,
      219, 219, 219, 219, 219, 219, 219, 219, 219, 219,
      219, 219, 219, 219, 219, 219, 219, 219, 219, 219,
      219, 219, 219, 219, 219, 219, 219, 219, 219, 219,
      219, 219, 219, 219, 219, 219, 219, 219, 219, 219,
      219, 219, 219, 219, 219, 219, 219, 219, 219, 219,
      219, 219, 219, 219, 219, 219
    };
  return len + asso_values[(unsigned char)str[2]] + asso_values[(unsigned char)str[1]] + asso_values[(unsigned char)str[0]];
}

struct months *
in_word_set (register const char *str, register size_t len)
{
  static struct months wordlist[] =
    {
      {""}, {""}, {""},
      {"jan",1},
      {""}, {""}, {""},
      {"january",1},
      {"Jan",1},
      {""}, {""}, {""},
      {"January",1},
      {"jun",6},
      {"june",6},
      {""}, {""}, {""},
      {"Jun",6},
      {"June",6},
      {""}, {""}, {""},
      {"jul",7},
      {"july",7},
      {""}, {""}, {""},
      {"Jul",7},
      {"July",7},
      {""}, {""}, {""},
      {"apr",4},
      {""},
      {"april",4},
      {""}, {""},
      {"aug",8},
      {""}, {""},
      {"august",8},
      {""},
      {"Apr",4},
      {""},
      {"April",4},
      {""}, {""},
      {"Aug",8},
      {""}, {""},
      {"August",8},
      {""},
      {"nov",11},
      {""}, {""}, {""}, {""},
      {"november",11},
      {""}, {""}, {""}, {""},
      {"JAN",1},
      {""}, {""}, {""},
      {"JANUARY",1},
      {"mar",3},
      {""},
      {"march",3},
      {""}, {""},
      {"Mar",3},
      {""},
      {"March",3},
      {""}, {""},
      {"may",5},
      {""},
      {"MAY",5},
      {""}, {""},
      {"May",5},
      {""}, {""}, {""}, {""},
      {"sep",9},
      {""}, {""}, {""}, {""},
      {"oct",10},
      {"september",9},
      {""}, {""},
      {"october",10},
      {"Nov",11},
      {""}, {""}, {""}, {""},
      {"November",11},
      {""}, {""}, {""}, {""},
      {"dec",12},
      {""}, {""}, {""}, {""},
      {"december",12},
      {""}, {""}, {""}, {""},
      {"feb",2},
      {""}, {""}, {""}, {""},
      {"february",2},
      {""}, {""}, {""}, {""},
      {"Oct",10},
      {""}, {""}, {""},
      {"October",10},
      {"NOV",11},
      {""}, {""}, {""}, {""},
      {"NOVEMBER",11},
      {""}, {""}, {""}, {""},
      {"JUN",6},
      {"JUNE",6},
      {""}, {""}, {""},
      {"Sep",9},
      {""}, {""}, {""}, {""},
      {"MAR",3},
      {"September",9},
      {"MARCH",3},
      {""}, {""},
      {"APR",4},
      {""},
      {"APRIL",4},
      {""}, {""},
      {"SEP",9},
      {""}, {""}, {""}, {""},
      {"Dec",12},
      {"SEPTEMBER",9},
      {""}, {""}, {""},
      {"December",12},
      {""}, {""}, {""}, {""},
      {"DEC",12},
      {""}, {""}, {""}, {""},
      {"DECEMBER",12},
      {""}, {""}, {""}, {""},
      {"OCT",10},
      {""}, {""}, {""},
      {"OCTOBER",10},
      {"JUL",7},
      {"JULY",7},
      {""}, {""}, {""},
      {"AUG",8},
      {""}, {""},
      {"AUGUST",8},
      {""},
      {"Feb",2},
      {""}, {""}, {""}, {""},
      {"February",2},
      {""}, {""}, {""}, {""},
      {"FEB",2},
      {""}, {""}, {""}, {""},
      {"FEBRUARY",2}
    };

  if (len <= MAX_WORD_LENGTH && len >= MIN_WORD_LENGTH)
    {
      register unsigned int key = hash (str, len);

      if (key <= MAX_HASH_VALUE)
        {
          register const char *s = wordlist[key].name;

          if (*str == *s && !strcmp (str + 1, s + 1))
            return &wordlist[key];
        }
    }
  return 0;
}

I guess this is pretty fast and requires only one strcmp per string. This is exactly what is used in GCC for keyword checking.

A very nice introduction to gperf is available here.

5
chqrlie On

Here is a simple solution that should be more efficient and will check for proper spelling:

#include <string.h>

int get_month_number(const char *s) {
    if (s && s[0] && s[1]) {
        switch (s[2]) {
          case 'n': return !strcmp(s, "Jan") || !strcmp(s, "January")   ? 1 :
                           !strcmp(s, "Jun") || !strcmp(s, "June")      ? 6 : 0;
          case 'b': return !strcmp(s, "Feb") || !strcmp(s, "February")  ? 2 : 0;
          case 'r': return !strcmp(s, "Mar") || !strcmp(s, "March")     ? 3 :
                           !strcmp(s, "Apr") || !strcmp(s, "April")     ? 4 : 0;
          case 'y': return !strcmp(s, "May")                            ? 5 : 0;
          case 'l': return !strcmp(s, "Jul") || !strcmp(s, "July")      ? 7 : 0;
          case 'g': return !strcmp(s, "Aug") || !strcmp(s, "August")    ? 8 : 0;
          case 'p': return !strcmp(s, "Sep") || !strcmp(s, "September") ? 9 : 0;
          case 't': return !strcmp(s, "Oct") || !strcmp(s, "October")   ? 10 : 0;
          case 'v': return !strcmp(s, "Nov") || !strcmp(s, "November")  ? 11 : 0;
          case 'c': return !strcmp(s, "Dec") || !strcmp(s, "December")  ? 12 : 0;

        }
    }
    return 0;
}

If you can assume that s points to a string of at least 3 characters that represents an English month name in ASCII, abbreviated or not, using any case combination, you can simplify the above code while keeping it readable:

// return the month number 0..11 assuming argument is correct.
int get_month_number(const char *s) {
#define HASH(a,b,c)  ((((b) | 0x20) == 'a') + ((c) & 0x1f))
    switch (HASH(s[0],s[1],s[2])) {
      default:
      case HASH('J','a','n'): return 0;
      case HASH('F','e','b'): return 1;
      case HASH('M','a','r'): return 2;
      case HASH('A','p','r'): return 3;
      case HASH('M','a','y'): return 4;
      case HASH('J','u','n'): return 5;
      case HASH('J','u','l'): return 6;
      case HASH('A','u','g'): return 7;
      case HASH('S','e','p'): return 8;
      case HASH('O','c','t'): return 9;
      case HASH('N','o','v'): return 10;
      case HASH('D','e','c'): return 11;
    }
#undef HASH
}

And here is an alternative with a look up table instead of a switch statement:

// return the month number 0..11 assuming argument is correct.
int get_month_number(const char *s) {
#define HASH(a,b,c)  ((((b) | 0x20) == 'a') + ((c) & 0x1f))
    static const unsigned char month_num[33] = {
        [ HASH('J','a','n') ] = 0,
        [ HASH('F','e','b') ] = 1,
        [ HASH('M','a','r') ] = 2,
        [ HASH('A','p','r') ] = 3,
        [ HASH('M','a','y') ] = 4,
        [ HASH('J','u','n') ] = 5,
        [ HASH('J','u','l') ] = 6,
        [ HASH('A','u','g') ] = 7,
        [ HASH('S','e','p') ] = 8,
        [ HASH('O','c','t') ] = 9,
        [ HASH('N','o','v') ] = 10,
        [ HASH('D','e','c') ] = 11,
    };
    return month_num[HASH(s[0],s[1],s[2])];
#undef HASH
}

As can be seen on Godbolt's Compiler Explorer, both functions generate branchless code with gcc and clang. The switch statement also detects hash collisions for free. Both could use your hash function:

#define HASH(a,b,c)  ((b)/4&7 ^ (c)*2&0xF)