I am trying to write the more simple version of the Boyer Moore algorithm without prefix-function. It has to print all positions of symbols which have been compared with a pattern. And locally it passes the tests, but it fails when I commit it to the gitlab. I can't spot undefined behaviour here. I have garbage at the end.
#include <stdio.h>
#define MAX_PATTERN_LEN 16
#define BUF_SIZE 69
#define ALPH_SIZE 128
int read_str(int max_len, unsigned char *str_place) {
int count_read = 0;
for (int i = 0, ch; i < max_len; i++) {
if ((ch = getchar()) == '\n') {
str_place[i] = '\0';
break;
}
str_place[i] = (char)ch;
count_read++;
}
return count_read;
}
void calculate_shifts(const unsigned char *str, int len_str, int *badchar) {
for (int i = 0; i < ALPH_SIZE; i++)
badchar[i] = len_str;
for (int i = 0; i < len_str - 1; i++)
badchar[str[i]] = len_str - 1 - i;
}
void search(const unsigned char *str, const unsigned char *patt, int len_patt, int len_str) {
int badchar[ALPH_SIZE];
calculate_shifts(patt, len_patt, badchar);
int shift = 0;
while (shift <= (len_str - len_patt)) {
int j = len_patt - 1;
for (; j >= 0 && patt[j] == str[shift + j]; j--)
printf("%d ", shift + j + 1);
if (j < 0) {
shift += ((shift + len_patt) < len_str) ? badchar[patt[len_patt - 1]] : 1;
} else {
printf("%d ", shift + j + 1);
int shift_addition = badchar[str[shift + j]];
if ((shift_addition == len_patt) && (j < len_patt - 1) && (patt[len_patt - 1] == patt[0]))
shift_addition--;
shift += shift_addition;
}
}
}
int main(void) {
unsigned char str[BUF_SIZE + 1];
unsigned char patt[MAX_PATTERN_LEN + 1];
int len_patt = read_str(MAX_PATTERN_LEN + 1, patt);
int len_str = read_str(BUF_SIZE + 1, str);
if (!len_patt || !len_str)
return 0;
search(str, patt, len_patt, len_str);
return 0;
}
the test:
example
this is simple example
the correct output:
7 14 13 12 11 10 20 22 21 20 19 18 17 16
the actual output:
7 14 13 12 11 10 20 22 21 20 19 18 17 16 28 ..
The difference (extra number 28 at the end of the output) can be caused by a missing newline character (
\n
) at the end of the last line of the input. I was able to reproduce both of your outputs (with and without the 28) locally.TL;DR Fix the bug in your code by changing
(ch = getchar()) == '\n'
to(ch = getchar()) == EOF || ch == '\n'
, and changing#define ALPH_SIZE 128
to#define ALPH_SIZE 256
.Maybe there is another bug, I haven't checked. If you encounter any other bugs, please ask it in a separate question on StackOverflow.
The rest of this answer explains how I diagnosed this bug.
AddressSanitizer
gcc -fsanitize=address
) reveals a memory access problem if the newline is missing:Please note that I had to rerun it multiple times to trigger the AddressSanitizer error message.
Most of the time this is caused by a bug in your program, and often it's undefined behavior. The
in search /tmp/t.c:33
clause in the output of AddressSanitizer indicates where in your source code the bad read happens (i.e. line 33, functionsearch
).I've added some extra
printf
s to your code to see what happens in line 33. The output is:The culprit is the last read access:
str[-369011759] == ??
, this indicates thatshift + j
becomes -369011759, which is obviously incorrect, because correct values are small nonnegative integers.Please note that I had to rerun it multiple times to trigger the Address Sanitizer error message, and the number -369011759 was different each time.
It looks like
j
has a sane value of 6. To figure out the actual bug, the next step is checking whereshift
gets its large negative value.At this point the output line
str[27] == 255
was suspicious to me. I checked your code whether it handles such large values instr
correctly, and then I found a bug.One way to fix this newfound bug is changing
ALPH_SIZE
from128
to256
. (I've verified that it makes the Address Sanitizer errors disappear.) Here is why. Line 39 contains the read ofbadchar[str[shift + j]]
. For that to be valid, we need0 <= str[shift + j] && str[shift + j] < ALPH_SIZE
, becauseALPH_SIZE
is the size of thebadchar
array. However, the maximum value forstr[...]
is 255 (because it's an unsigned char), thus we needALPH_SIZE >= 256
.Indeed, an out-of-bounds array access (read or write) is undefined behavior in C. To prove that this is actually happening, I've changed line 39 to
, and I've also changed line 28 to use malloc:
, and I've also added
free(badchar)
later. With all that I've got:Indeed, the output of AddressSanitizer indicates the bug in line 39, and the output of the extra
fprintf
calls confirm that it's an out-of-bounds array access: reading of element 255 of the arraybadchar
, which has only 128 elements.How did we end up with
str[27] == 255
(thus readingbadchar[255]
)? That's easy: if there is no terminating\n
at the end of the file when readingstr
, thengetchar()
returns EOF, which becomes 255 when assigned to anunsigned char
. (The obvious fix, as above, is to stop at EOF instead, and changeALPH_SIZE
to 256.)Why hasn't AddressSanitizer reported the read of out-of-bounds array index problem (as a stack-buffer-overflow) in line 39 before
badchar
was changed to usemalloc
(reported as heap-buffer-overflow)? I don't quite understand this yet. AddressSanitizer should be able to report stack-buffer-overflow reliably. Probably it depends on the GCC and Clang version, because changing the commandgcc
toclang
made the stack-buffer-overflow error appear reliably. Probably upgrading to the latest GCC and Clang would help. Valgrind is not as reliable for detecting stack-buffer-overflow reads as AddressSanitizer.