I'm trying to implement a DVBS2 (48408, 48600) BCH decoder and I'm having troubles with finding the roots of the locator polynomial. For the Chien search here the author initialises the registers taking into account the fact that it is shortened subtracting 48600 from (2^16 - 1). Why so? This is the code I have so far:
function [error_locations, errors] = compute_chien_search(n, locator_polynomial, field, alpha_powers, n_max)
       
    t = length(locator_polynomial);
    error_locations = zeros(t, 1);
    errors = 0;
    % Init the registers with the locator polynomial coefficients.
    coefficient_buffer = locator_polynomial;
    alpha_degrees = uint32(1:t)';
    alpha_polynoms = field(alpha_degrees);
    alpha_polynoms = [1; alpha_polynoms];
    for i = 1:n % 
        for j = 2:t
            coefficient_buffer(j) = gf_mul_elements(coefficient_buffer(j), ...
                                                    alpha_polynoms(j), ...
                                                    field, alpha_powers, n_max);
        end
        % Compute locator polynomial at position i
        tmp = 0;
        for j = 2:t
            tmp = bitxor(tmp, coefficient_buffer(j));
        end
        % Signal the error
        if tmp == 1       
            errors = errors + 1;
            error_locations(errors) = n_max - i + 1;
        end
    end
end
It almost gives me the correct result except for some error locations. For example: for errors made in positions
   418       14150       24575       25775       37403
The code above gives me
  48183      47718       34451       24026       22826
which after subtracting from 48600 gives:
  417        882         14149       24574       25774
which is the position minus 1, except for the 37403, which it did not find. What am I missing?
Edit:
The code in question is a DVBS2 12 error correcting 48408, 48600 BCH code. The generator polynomial has degree 192 and is given by multiplying the 12 minimal polynomials given on the standard’s documentation.
 
                        
Update - I created an example C program using Windows | Visual Studio for BCH(48600,48408). On my desktop (Intel 3770K 3.5 ghz, Win 7, VS 2015), encode takes about 30 us, a 12 bit error correction takes about 4.5 ms. On my laptop, (Intel Core i7-10510U up to 4.9 ghz, Win 10, VS 2019), 12 bit error correction takes about 3.0 ms. I used a carryless multiply intrinsic to simplify generating the 192 bit polynomial, but this is a one time generated constant. Encode uses a [256][3] 64 bit unsigned integer polynomial (192 bits) table and decode uses a [256][12] 16 bit unsigned integer syndrome table, to process a byte at a time.
The code includes both Berlekamp Massey and Sugiyama extended Euclid decoders that I copied from existing RS code I have. For BCH (not RS) code, the Berlekamp Massey discrepancy will be zero on odd steps, so for odd steps, the discrepancy is not calculated (the iteration count since last update is incremented, the same as when a calculated discrepancy is zero). I didn't see a significant change in running time, but I left the check in there.
The run times are about the same for BM or Euclid.
https://github.com/jeffareid/misc/blob/master/bch48600.c
I suspect an overflow problem in the case of a failure at bit error index 37403, since it is the only bit index > 2^15-1 (32767). There is this comment on that web site:
The Chien search should be looking for values (1/(2^(0))) to (1/(2^(48599))), then zero - log of those values to get offsets relative to the right most bit of the message, and 48599-offset to get indexes relative to the left most bit of the message.
If the coefficients of the error locator polynomial are reversed, then the Chien search would be looking for values 2^(0) to 2^(48599).
https://en.wikipedia.org/wiki/Reciprocal_polynomial