Undefined behaviour in a simple version of the Boyer Moore algorithm

175 views Asked by At

I am trying to write the more simple version of the Boyer Moore algorithm without prefix-function. It has to print all positions of symbols which have been compared with a pattern. And locally it passes the tests, but it fails when I commit it to the gitlab. I can't spot undefined behaviour here. I have garbage at the end.

#include <stdio.h>

#define MAX_PATTERN_LEN 16
#define BUF_SIZE 69
#define ALPH_SIZE 128

int read_str(int max_len, unsigned char *str_place) {
    int count_read = 0;
    for (int i = 0, ch; i < max_len; i++) {
        if ((ch = getchar()) == '\n') {
            str_place[i] = '\0';
            break;
        }
        str_place[i] = (char)ch;
        count_read++;
    }
    return count_read;
}

void calculate_shifts(const unsigned char *str, int len_str, int *badchar) {
    for (int i = 0; i < ALPH_SIZE; i++)
        badchar[i] = len_str;
    for (int i = 0; i < len_str - 1; i++)
        badchar[str[i]] = len_str - 1 - i;
}

void search(const unsigned char *str, const unsigned char *patt, int len_patt, int len_str) {
    int badchar[ALPH_SIZE];
    calculate_shifts(patt, len_patt, badchar);
    int shift = 0;
    while (shift <= (len_str - len_patt)) {
        int j = len_patt - 1;
        for (; j >= 0 && patt[j] == str[shift + j]; j--)
            printf("%d ", shift + j + 1);
        if (j < 0) {
            shift += ((shift + len_patt) < len_str) ? badchar[patt[len_patt - 1]] : 1;
        } else {
            printf("%d ", shift + j + 1);
            int shift_addition = badchar[str[shift + j]];
            if ((shift_addition == len_patt) && (j < len_patt - 1) && (patt[len_patt - 1] == patt[0]))
                shift_addition--;
            shift += shift_addition;
        }
    }
}

int main(void) {
    unsigned char str[BUF_SIZE + 1];
    unsigned char patt[MAX_PATTERN_LEN + 1];
    int len_patt = read_str(MAX_PATTERN_LEN + 1, patt);
    int len_str = read_str(BUF_SIZE + 1, str);
    if (!len_patt || !len_str)
        return 0;
    search(str, patt, len_patt, len_str);
    return 0;
} 

the test:

example
this is simple example

the correct output:

7 14 13 12 11 10 20 22 21 20 19 18 17 16

the actual output:

7 14 13 12 11 10 20 22 21 20 19 18 17 16 28 ..
3

There are 3 answers

1
pts On BEST ANSWER

The difference (extra number 28 at the end of the output) can be caused by a missing newline character (\n) at the end of the last line of the input. I was able to reproduce both of your outputs (with and without the 28) locally.

$ gcc -g -O2 -W -Wall -o t t.c
$ printf 'example\nthis is simple example; \n' | ./t; echo
7 14 13 12 11 10 20 22 21 20 19 18 17 16
$ printf 'example\nthis is simple example; ' | ./t; echo
7 14 13 12 11 10 20 22 21 20 19 18 17 16 28 

TL;DR Fix the bug in your code by changing (ch = getchar()) == '\n' to (ch = getchar()) == EOF || ch == '\n', and changing #define ALPH_SIZE 128 to #define ALPH_SIZE 256.

Maybe there is another bug, I haven't checked. If you encounter any other bugs, please ask it in a separate question on StackOverflow.


The rest of this answer explains how I diagnosed this bug.

AddressSanitizer gcc -fsanitize=address) reveals a memory access problem if the newline is missing:

$ gcc -g -O2 -W -Wall -fsanitize=address -o t t.c
$ printf 'example\nthis is simple example; \n' | ./t; echo
7 14 13 12 11 10 20 22 21 20 19 18 17 16 
$ printf 'example\nthis is simple example; ' | ./t; echo
ASAN:DEADLYSIGNAL
=================================================================
==19294==ERROR: AddressSanitizer: SEGV on unknown address 0x7ffe25fbe264 (pc 0x563d6542b204 bp 0x7ffe25fbe264 sp 0x7ffe9a1bbc70 T0)
==19294==The signal is caused by a READ memory access.
    #0 0x563d6542b203 in search /tmp/t.c:33
    #1 0x563d6542acb1 in main /tmp/t.c:54
    #2 0x7f617ff96c86 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x21c86)
    #3 0x563d6542ad99 in _start (/tmp/t+0xd99)

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV /tmp/t.c:33 in search
==19294==ABORTING

Please note that I had to rerun it multiple times to trigger the AddressSanitizer error message.

Most of the time this is caused by a bug in your program, and often it's undefined behavior. The in search /tmp/t.c:33 clause in the output of AddressSanitizer indicates where in your source code the bad read happens (i.e. line 33, function search).

I've added some extra printfs to your code to see what happens in line 33. The output is:

$ printf 'example\nthis is simple example; ' | ./t; echo
patt[6] == 101
str[6] == 115
patt[6] == 101
str[13] == 101
patt[5] == 108
str[12] == 108
patt[4] == 112
str[11] == 112
patt[3] == 109
str[10] == 109
patt[2] == 97
str[9] == 105
patt[6] == 101
str[19] == 112
patt[6] == 101
str[21] == 101
patt[5] == 108
str[20] == 108
patt[4] == 112
str[19] == 112
patt[3] == 109
str[18] == 109
patt[2] == 97
str[17] == 97
patt[1] == 120
str[16] == 120
patt[0] == 101
str[15] == 101
patt[6] == 101
str[27] == 255
patt[6] == 101
str[-369011759] == ??
ASAN:DEADLYSIGNAL
=================================================================
==19906==ERROR: AddressSanitizer: SEGV on unknown address 0x7ffc6e5c0c71 (pc 0x5621e2363094 bp 0x0000ea0153d1 sp 0x7ffc845ab540 T0)
==19906==The signal is caused by a READ memory access.
    #0 0x5621e2363093 in gett /tmp/t.c:33
    #1 0x5621e236349b in search /tmp/t.c:42
    #2 0x5621e2362e61 in main /tmp/t.c:63
    #3 0x7f7063ee8c86 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x21c86)
    #4 0x5621e2362f49 in _start (/tmp/t+0xf49)

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV /tmp/t.c:33 in gett
==19906==ABORTING

The culprit is the last read access: str[-369011759] == ??, this indicates that shift + j becomes -369011759, which is obviously incorrect, because correct values are small nonnegative integers.

Please note that I had to rerun it multiple times to trigger the Address Sanitizer error message, and the number -369011759 was different each time.

It looks like j has a sane value of 6. To figure out the actual bug, the next step is checking where shift gets its large negative value.

At this point the output line str[27] == 255 was suspicious to me. I checked your code whether it handles such large values in str correctly, and then I found a bug.

One way to fix this newfound bug is changing ALPH_SIZE from 128 to 256. (I've verified that it makes the Address Sanitizer errors disappear.) Here is why. Line 39 contains the read of badchar[str[shift + j]]. For that to be valid, we need 0 <= str[shift + j] && str[shift + j] < ALPH_SIZE, because ALPH_SIZE is the size of the badchar array. However, the maximum value for str[...] is 255 (because it's an unsigned char), thus we need ALPH_SIZE >= 256.

Indeed, an out-of-bounds array access (read or write) is undefined behavior in C. To prove that this is actually happening, I've changed line 39 to

        printf("%d ", shift + j + 1); fprintf(stderr, "RBC %d == %d\n", shift + j, str[shift + j]);

, and I've also changed line 28 to use malloc:

    int *badchar = malloc(sizeof(int) * ALPH_SIZE);

, and I've also added free(badchar) later. With all that I've got:

$ gcc -g -O2 -W -Wall -include stdlib.h -fsanitize=address -o t t.c
$ printf 'example\nthis is simple example; \n' | ./t; echo
RBC 6 == 115
RBC 9 == 105
RBC 19 == 112
7 14 13 12 11 10 20 22 21 20 19 18 17 16 
$ printf 'example\nthis is simple example; ' | ./t; echo
RBC 6 == 115
RBC 9 == 105
RBC 19 == 112
RBC 27 == 255
=================================================================
==23463==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x61500000047c at pc 0x563915b3754a bp 0x7fff5f340710 sp 0x7fff5f340700
READ of size 4 at 0x61500000047c thread T0
    #0 0x563915b37549 in search /tmp/t.c:39
    #1 0x563915b36dc1 in main /tmp/t.c:54
    #2 0x7f1a63d6dc86 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x21c86)
    #3 0x563915b36ea9 in _start (/tmp/t+0xea9)

Address 0x61500000047c is a wild pointer.
SUMMARY: AddressSanitizer: heap-buffer-overflow /tmp/t.c:39 in search
Shadow bytes around the buggy address:
  0x0c2a7fff8030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0c2a7fff8040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0c2a7fff8050: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c2a7fff8060: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c2a7fff8070: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
=>0x0c2a7fff8080: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa[fa]
  0x0c2a7fff8090: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c2a7fff80a0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c2a7fff80b0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c2a7fff80c0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c2a7fff80d0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07 
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
==23463==ABORTING

Indeed, the output of AddressSanitizer indicates the bug in line 39, and the output of the extra fprintf calls confirm that it's an out-of-bounds array access: reading of element 255 of the array badchar, which has only 128 elements.

How did we end up with str[27] == 255 (thus reading badchar[255])? That's easy: if there is no terminating \n at the end of the file when reading str, then getchar() returns EOF, which becomes 255 when assigned to an unsigned char. (The obvious fix, as above, is to stop at EOF instead, and change ALPH_SIZE to 256.)

Why hasn't AddressSanitizer reported the read of out-of-bounds array index problem (as a stack-buffer-overflow) in line 39 before badchar was changed to use malloc (reported as heap-buffer-overflow)? I don't quite understand this yet. AddressSanitizer should be able to report stack-buffer-overflow reliably. Probably it depends on the GCC and Clang version, because changing the command gcc to clang made the stack-buffer-overflow error appear reliably. Probably upgrading to the latest GCC and Clang would help. Valgrind is not as reliable for detecting stack-buffer-overflow reads as AddressSanitizer.

0
John Bollinger On

Your read_str() function is flawed in two ways:

  1. It does not recognize or handle end-of-file on the standard input. This situation will not usually be encountered when the program's standard input is connected to a terminal, but it is entirely possible if the input is redirected from a regular file or a pipe (for example). This will likely result in unwanted extra characters at the end of the string.

  2. It will not add a string terminator to maximum-length input.

And those combine, because if read_str() does reach end-of-file then it will always behave as if it received a maximum-length input, and therefore fail to terminate it.

Failure to terminate will cause your program to exhibit undefined behavior elsewhere, but the source of the problem is read_str(). My first recommendation would be to dump your read_str() altogether, and use standard fgets() instead:

    fgets(str, BUF_SIZE + 1, stdin);
    len_str = strcspn(str, "\n");
    str[len_str] = '\0';  // trim a trailing newline, if any

But if you insist on your own read_str(), with parameters having the semantics implied by your code, then you need to correct the two problems described above. Something like this, for example:

int read_str(int max_len, unsigned char *str_place) {
    int count_read = 0;
    int ch = EOF;
    for (int i = 0; i < max_len; i++) {
        ch = getchar();

        // break at newline OR EOF
        if (ch == '\n' || ch == EOF) {
            break;
        }

        str_place[i] = (char)ch;
        count_read++;
    }

    // If max_len characters were read without seeing end-of-line or end-of-file
    // then push the last back and adjust the number of characters accepted
    if (count_read >= max_len) {
        ungetc(ch, stdin);
        count_read--;
    }

    // Terminate the string after the last character accepted
    str_place[count_read] = '\0';
    return count_read;
}
0
chqrlie On

There are multiple problems:

  • ALPH_SIZE is defined as 128, which is not range of type unsigned char. You should use (UCHAR_MAX+1), or 256 for 8-bit bytes, to allow the badchar array to handle characters in the string beyond 127 such as UTF-8 leading and continuation bytes or other non-ASCII contents. This is potential undefined behavior.

  • you pass the length of the destination array to read_str for the argument max_len and you only null terminate this array if you read a newline before reaching max_len characters. This is error prone: it might not cause a problem in the posted code, but is not recommended practice, especially as you are explicitly null terminating the array for shorter strings.

  • the reading loop in read_str does not stop at end of file because you only test for '\n'. If the last line of input does not end with a newline, the loop will keep trying to read from stdin, receiving EOF from getchar() and storing '\377' (ie: 0xff or 255) into the array, later causing undefined behavior as the array badchar is too short to handle these byte values.