What is the fastest way to compare two strings in C?

2.7k views Asked by At

For clarity I'm only talking about null terminated strings.

I'm familiar with the standard way of doing string comparisons in C with the usage of strcmp. But I feel like it's slow and inefficient.

I'm not necessarily looking for the easiest method but the most efficient.

Can the current comparison method (strcmp) be optimized further while the underlying code remains cross platform?

If strcmp can't be optimized further, what is the fastest way which I could perform the string comparison without strcmp?

Current use case:

  • Determine if two arbitrary strings match
  • Strings will not exceed 4096 bytes, nor be less than 1 byte in size
  • Strings are allocated/deallocated and compared within the same code/library
  • Once comparison is complete I do pass the string to another C library which needs the format to be in a standard null terminated format
  • System memory limits are not a huge concern, but I will have tens of thousands of such strings queued up for comparison
  • Strings may contain high-ascii character set or UTF-8 characters but for my purposes I only need to know if they match, content is not a concern
  • Application runs on x86 but should also run on x64

Reference to current strcmp() implementation:

Edit: Clarified the solution does not need to be a modification of strcmp.

Edit 2: Added specific examples for this use case.

1

There are 1 answers

13
chqrlie On BEST ANSWER

I'm afraid your reference imlementation for strcmp() is both inaccurate and irrelevant:

  • it is inaccurate because it compares characters using the char type instead of the unsigned char type as specified in the C11 Standard:

    7.24.4 Comparison functions

    The sign of a nonzero value returned by the comparison functions memcmp, strcmp, and strncmp is determined by the sign of the difference between the values of the first pair of characters (both interpreted as unsigned char) that differ in the objects being compared.

  • It is irrelevant because the actual implementation used by modern compilers is much more sophisticated, expanded inline using hand-coded assembly language.

Any generic implementation is likely to be less optimal, especially if coded to remain portable across platforms.

Here are a few directions to explore if your program's bottleneck is comparing strings.

  • Analyze your algorithms, try and find ways to reduce the number of comparisons: for example if you search for a string in an array, sorting that array and using a binary search with drastically reduce the number of comparisons.
  • If your strings are tokens used in many different places, allocate unique copies of these tokens and use those as scalar values. The strings will be equal if and only if the pointers are equal. I use this trick in compilers and interpreters all the time with a hash table.
  • If your strings have the same known length, you can use memcmp() instead of strcmp(). memcmp() is simpler than strcmp() and can be implemented even more efficiently in places where the strings are known to be properly aligned.

EDIT: with the extra information provided, you could use a structure like this for your strings:

typedef struct string_t {
    size_t len;
    size_t hash;  // optional
    char str[];   // flexible array, use [1] for pre-c99 compilers
} string_t;

You allocate this structure this way:

string_t *create_str(const char *s) {
    size_t len = strlen(s);
    string_t *str = malloc(sizeof(*str) + len + 1;
    str->len = len;
    str->hash = hash_str(s, len);
    memcpy(str->str, s, len + 1);
    return str;
}

If you can use these str things for all your strings, you can greatly improve the efficiency of the matching by first comparing the lengths or the hashes. You can still pass the str member to your library function, it is properly null terminated.