Understanding the return from memcmp

223 views Asked by At

I wrote a function to simulate memcmp, and during testing I realized there is something unusual about memcmp. The source code for the library function is supposed to be:

#include <ansidecl.h>
#include <stddef.h>

int
memcmp(const void *str1, const void *str2, size_t count)
{
  register const unsigned char *s1 = (const unsigned char *)str1;
  register const unsigned char *s2 = (const unsigned char *)str2;

  while (count-- > 0)
    {
      if (*s1++ != *s2++)
        return s1[-1] < s2[-1] ? -1 : 1;
    }
  return 0;
}

mine is

#include <stddef.h>

int ft_memcmp(const void *s1, const void *s2, size_t n)
{
    size_t  i;

    i = 0;
    if (n == 0)
        return (0);
    while(((char *)s1)[i] == ((char *)s2)[i])
    {
        if ((((char *)s1)[i] == '\0') || (i == (n - 1)))
        {
            return (0);
        }
        i++;
    }
    if (((char *)s1)[i] < ((char *)s2)[i])
        return (-1);
    else
        return (1);
}

the unexpected issue is that, depending on the count provided, the memcmp function starts returning not 1 or -1, but the actual difference from the chars, namely s1[-1] - s2[-2]

Anyone knows why or how this happens?

this is the test I ran that showed me the issue

int main() {
    char *test_strings1[] = { "fdjkDKDJFLDkjdfkjdf", "-456", "ALO marciano!!!", "xc42:", "     7894545989828547", "  +99", "abc123", "12abc", "" };

    char *test_strings2[] = { "fdjkDKDJFLDSkjdfkjdf", "-456", "ALO_ALO marciano!!!", "xc42", "   789454598982854752", "  +99", "abc123", "12abc", "" };

    for (int count = 0; count < 50; count++)
        for (int string = 0; string < 9; string++) {
            int ft = ft_memcmp(test_strings1[string], test_strings2[string], count);
            int lib = memcmp(test_strings1[string], test_strings2[string], count);
       
            if (ft != lib) {
                printf("******Wrong!!!  lib  %i  ft  %i  count = %i  string = %i*********\n", lib,  ft, count, string);
            }
        }
    return 0;
}

My GCC version is

gcc --version
Ubuntu clang version 12.0.0-3ubuntu1~20.04.5
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin

My system is Ubuntu 20.04.5 LTS, Intel® Core™ i5-7360U CPU @ 2.30GHz × 4, Mesa Intel® Iris(R) Plus Graphics 640 (Kaby Lake GT3e) (KBL GT3)

this is the output I get, which surprised me. On all the strings that have difference (index 2, 3 and 4) the behaviour of memcmp changes if count is greater than 7 (in this version of the output my custom function was returning the difference, not 1,-1 and 0)

Wrong!!!  lib  -1  ft  -63  count =  4  string =  2
Wrong!!!  lib  -1  ft  -23  count =  4  string =  4
Wrong!!!  lib  -1  ft  -63  count =  5  string =  2
Wrong!!!  lib  +1  ft  +58  count =  5  string =  3
Wrong!!!  lib  -1  ft  -23  count =  5  string =  4
Wrong!!!  lib  -1  ft  -63  count =  6  string =  2
Wrong!!!  lib  +1  ft  +58  count =  6  string =  3
Wrong!!!  lib  -1  ft  -23  count =  6  string =  4
Wrong!!!  lib  -1  ft  -63  count =  7  string =  2
Wrong!!!  lib  +1  ft  +58  count =  7  string =  3
Wrong!!!  lib  -1  ft  -23  count =  7  string =  4
3

There are 3 answers

0
Vlad from Moscow On

From the C Standard (7.24.4.1 The memcmp function)

Returns

3 The memcmp function returns an integer greater than, equal to, or less than zero, accordingly as the object pointed to by s1 is greater than, equal to, or less than the object pointed to by s2.

13
chqrlie On

The C Standard (C2x) specifies this:

7.26.4 Comparison functions

The sign of a nonzero value returned by the comparison functions memcmp, strcmp, and strncmp is determined by the sign of the difference between the values of the first pair of characters (both interpreted as unsigned char) that differ in the objects being compared.

[...]

The memcmp function returns an integer greater than, equal to, or less than zero, accordingly as the object pointed to by s1 is greater than, equal to, or less than the object pointed to by s2.

The posted code for memcmp is just a possible implementation. The C Standard does not specify the exact return value for blocks with different contents, just the sign of the return value. Hence returning s1[-1] - s2[-1] is just as compliant as returning s1[-1] < s2[-1] ? -1 : 1 or 1 - 2 * (s1[-1] < s2[-1])

Note also that your implementation has problems:

  • it makes no sense to test ((char *)s1)[i] == '\0') as memcmp does not stop on any null terminator.
  • comparing the byte values as char values is incorrect: the C Standard specifies that the comparison must be performed using the unsigned char values.

Here is a modified version:

#include <stddef.h>

int ft_memcmp(const void *s1, const void *s2, size_t n)
{
    size_t  i;

    i = 0;
    if (n == 0)
        return (0);

    while (((unsigned char *)s1)[i] == ((unsigned char *)s2)[i])
    {
        i++;
        if (i == n)
        {
            return (0);
        }
    }
    if (((unsigned char *)s1)[i] < ((unsigned char *)s2)[i])
        return (-1);
    else
        return (1);
}

The main function is incorrect:

  • it must only check for zero values and that the signs of ft and lib match if both are non zero.
  • the count passed to both functions should not exceed the size of the objects. This explains why you get inconsistent results when count is greater than 7 for the identical strings: the contents of memory beyond the first 8 or so bytes or the identical strings differ. Reading the bytes beyond the end of the strings (ie: after the null terminator) has undefined behavior anyway.

Here is a modified version:

int main(void) {
    const char *test_strings1[] = {
        "fdjkDKDJFLDkjdfkjdf", "-456", "ALO marciano!!!", "xc42:",
        "     7894545989828547", "  +99", "abc123", "12abc", ""
    };
    const char *test_strings2[] = {
        "fdjkDKDJFLDSkjdfkjdf", "-456", "ALO_ALO marciano!!!", "xc42",
        "   789454598982854752", "  +99", "abc123", "12abc", ""
    };
    int test_count = sizeof(test_strings1) / sizeof(*test_strings1);

    for (int i = 0; i < test_count; i++) {
        size_t size1 = strlen(test_strings1[i]) + 1;
        size_t size2 = strlen(test_strings2[i]) + 1;
        for (size_t count = 0; count <= size1 && count <= size2; count++) {
            int ft = ft_memcmp(test_strings1[i], test_strings2[i], count);
            int lib = memcmp(test_strings1[i], test_strings2[i], count);
       
            if ((ft < 0 && lib >= 0) 
            ||  (ft == 0 && lib != 0)
            ||  (ft > 0 && lib <= 0)) {
                printf("*** Wrong!!!  lib  %i  ft  %i  count = %zu  string = %i ***\n",
                       lib, ft, count, i);
            }
        }
    }
    return 0;
}
5
0___________ On

the original function (very simplified) is rather:

int mymemcmp(const void *s1, const void *s2, const size_t n)
{
    const unsigned char *a = s1;
    const unsigned char *b = s2;
    int result = 0;

    for(size_t index = 0; index < n; index ++)
    {
        if(a[index] != b[index])
        {
            result = a[index] - b[index];
            break;
        }
    }
    return result;
}