C# Boyer-Moore algorithm where the needle can contain null as a wildcard

I attempted to implement the Boyer-Moore algorithm in C#, with the ability to use null as a wildcard in the needle (pattern).

class BoyerMoore
{
    private readonly int[] _badChar;
    private readonly byte?[] _needle;

    public BoyerMoore(byte?[] needle)
    {
        _needle = needle;
        _badChar = new int[256];

        // Pre-processing for bad character heuristic
        for (int i = 0; i < _badChar.Length; i++)
        {
            _badChar[i] = -1;
        }
        for (int i = 0; i < needle.Length; i++)
        {
            if (needle[i] != null)
                _badChar[needle[i].Value] = i;
        }
    }

    public List<int> Search(byte[] haystack)
    {
        List<int> occurrences = new List<int>();
        int i = 0;
        while (i <= haystack.Length - _needle.Length)
        {
            int j;

            for (j = _needle.Length - 1; j >= 0; j--)
            {
                if (_needle[j] == null) continue;
                if (_needle[j] != haystack[i + j]) break;
            }

            if (j < 0)
            {
                occurrences.Add(i);
                i++;
            }
            else
            {
                i += Math.Max(1, j - _badChar[haystack[i + j]]);
            }
        }
        return occurrences;
    }
}

My code works correctly when the needle does not contain a null, but when it does, for example 0xAA, 0xBB, null, 0xCC, some matches are missed.

Am I overlooking something or is it not possible to implement the Boyer-Moore Bad Character heuristic with a null wildcard?

I searched on Google but could not find any examples, tutorials or explanations that use null values as wildcards, so I am asking here.

There is 1 answer

Best answer, by ricardkelly

The Boyer-Moore bad character rule doesn't work with wildcards as written. In effect, you have to ignore everything to the left of the rightmost wildcard when applying that rule.

The _badChar array tells you how far you can move the search position when the haystack character at the mismatch does not appear in the needle. A wildcard in the needle can match any character, yet your table currently says that characters not explicitly listed are matched by nothing in the needle, because their entry is left at -1.
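To make the failure concrete, here is a minimal sketch that builds the table the way the question's constructor does and prints the shift applied after the first mismatch. The class and method names (NaiveBadCharDemo, BuildNaiveTable, NaiveShift) and the sample haystack are assumptions made up for illustration; they are not part of the original code.

```csharp
using System;

static class NaiveBadCharDemo
{
    // Builds the table exactly as the question's constructor does:
    // wildcard (null) positions are skipped.
    public static int[] BuildNaiveTable(byte?[] needle)
    {
        var table = new int[256];
        for (int i = 0; i < table.Length; i++)
            table[i] = -1;
        for (int i = 0; i < needle.Length; i++)
            if (needle[i].HasValue)
                table[needle[i].Value] = i;
        return table;
    }

    // Shift the question's Search applies after a mismatch at needle index j.
    public static int NaiveShift(int[] table, byte mismatched, int j)
    {
        return Math.Max(1, j - table[mismatched]);
    }

    static void Main()
    {
        byte?[] needle = { 0xAA, 0xBB, null, 0xCC };
        // Hypothetical haystack with a real match at offset 1 (AA BB 11 CC).
        byte[] haystack = { 0x00, 0xAA, 0xBB, 0x11, 0xCC };
        int[] table = BuildNaiveTable(needle);

        // At offset 0 the right-to-left comparison fails at j = 3 (0xCC vs 0x11).
        Console.WriteLine(NaiveShift(table, haystack[3], 3)); // prints 4
    }
}
```

A shift of 4 from offset 0 jumps past offset 1, where 0x11 would have lined up with the wildcard, so that match is never examined.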

So, rather than set the entry in _badChar for each value not explicitly present in the needle to -1, you want to set the entry to the last wildcard position in the needle.

But you also have to make sure the entry for every character that does appear in the needle is no further left than that wildcard position, because the wildcard could match those characters too.

You can do that in the setup:

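// Replaces the second loop in the constructor; _lastNull is a new int field.
_lastNull = -1; // -1 means the needle has no wildcard
for (int i = 0; i < needle.Length; i++)
{
    if (needle[i] != null)
        _badChar[needle[i].Value] = i;
    else
        _lastNull = i; // rightmost wildcard seen so far
}
// No entry may point further left than the rightmost wildcard.
for (int i = 0; i < _badChar.Length; i++)
{
    if (_badChar[i] < _lastNull)
        _badChar[i] = _lastNull;
}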

But now you are effectively running Boyer-Moore on only the rightmost part of the needle, and doing a linear match on the rest of the needle each time that the Boyer-Moore algorithm gives you a match.
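Putting the pieces together, the following is one possible sketch of the question's class with the adjusted setup folded in. The class name WildcardBoyerMoore, the local variable lastNull (used here instead of a field) and the sample data are assumptions for illustration only; the search loop follows the question's logic.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical name: the question's class plus the wildcard-aware setup.
class WildcardBoyerMoore
{
    private readonly int[] _badChar = new int[256];
    private readonly byte?[] _needle;

    public WildcardBoyerMoore(byte?[] needle)
    {
        _needle = needle;
        int lastNull = -1; // stays -1 when the needle has no wildcard

        for (int i = 0; i < _badChar.Length; i++)
            _badChar[i] = -1;
        for (int i = 0; i < needle.Length; i++)
        {
            if (needle[i].HasValue)
                _badChar[needle[i].Value] = i;
            else
                lastNull = i;
        }
        // Clamp every entry so it is never left of the rightmost wildcard.
        for (int i = 0; i < _badChar.Length; i++)
            if (_badChar[i] < lastNull)
                _badChar[i] = lastNull;
    }

    public List<int> Search(byte[] haystack)
    {
        var occurrences = new List<int>();
        int i = 0;
        while (i <= haystack.Length - _needle.Length)
        {
            // Compare right to left; a null needle entry matches any byte.
            int j = _needle.Length - 1;
            while (j >= 0 && (!_needle[j].HasValue || _needle[j].Value == haystack[i + j]))
                j--;

            if (j < 0)
            {
                occurrences.Add(i);
                i++;
            }
            else
            {
                i += Math.Max(1, j - _badChar[haystack[i + j]]);
            }
        }
        return occurrences;
    }
}

static class WildcardDemo
{
    static void Main()
    {
        byte?[] needle = { 0xAA, 0xBB, null, 0xCC };
        byte[] haystack = { 0x00, 0xAA, 0xBB, 0x11, 0xCC };
        List<int> hits = new WildcardBoyerMoore(needle).Search(haystack);
        Console.WriteLine(string.Join(", ", hits)); // prints 1
    }
}
```

With this needle the clamped table never allows a shift larger than 1 after a mismatch at the last position, which illustrates the cost described above: the closer the rightmost wildcard sits to the end of the needle, the less the bad character rule can skip.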