C#: How to make Sieve of Atkin incremental

3.2k views Asked by At

I don't know if this is possible or not, but I just gotta ask. My mathematical and algorithmic skills are kind of failing me here :P

The thing is I now have this class that generates prime numbers up to a certain limit:

public class Atkin : IEnumerable<ulong>
{
    private readonly List<ulong> primes;
    private readonly ulong limit;

    public Atkin(ulong limit)
    {
        this.limit = limit;
        primes = new List<ulong>();
    }

    private void FindPrimes()
    {
        var isPrime = new bool[limit + 1];
        var sqrt = Math.Sqrt(limit);

        for (ulong x = 1; x <= sqrt; x++)
            for (ulong y = 1; y <= sqrt; y++)
            {
                var n = 4*x*x + y*y;
                if (n <= limit && (n % 12 == 1 || n % 12 == 5))
                    isPrime[n] ^= true;

                n = 3*x*x + y*y;
                if (n <= limit && n % 12 == 7)
                    isPrime[n] ^= true;

                n = 3*x*x - y*y;
                if (x > y && n <= limit && n % 12 == 11)
                    isPrime[n] ^= true;
            }

        for (ulong n = 5; n <= sqrt; n++)
            if (isPrime[n])
            {
                var s = n * n;
                for (ulong k = s; k <= limit; k += s)
                    isPrime[k] = false;
            }

        primes.Add(2);
        primes.Add(3);
        for (ulong n = 5; n <= limit; n++)
            if (isPrime[n])
                primes.Add(n);
    }


    public IEnumerator<ulong> GetEnumerator()
    {
        if (!primes.Any())
            FindPrimes();

        foreach (var p in primes)
            yield return p;
    }


    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}

Now, what I would like is to get rid of the limit so that the sequence would never stop (theoretically).

I am thinking it could go something like this:

  1. Start with some trivial limit
    • Find all the primes up to the limit
    • Yield all the newfound primes
    • Increase the limit (by doubling or squaring the old limit or something like that)
    • Goto step 2

And optimally that step two should only have to work between the old limit and the new one. In other words it shouldn't have to find the lowest primes again and again.

Is there a way this can be done? My main problem is that I don't quite understand what x and y for example is in this algorithm. Like, could I just use the same algorithm kind of but set x and y to oldLimit (initially 1) and run it up to newLimit? Or how would that work? Any bright minds with some light to shed on this?

The point of this is so that I won't have to set that limit. So that I can for example use Linq and just Take() however many primes I need, not worrying about if the limit is high enough, et cetera.

5

There are 5 answers

0
GordonBGood On

The following code does the optimizations as discussed at the bottom of my previous answer and includes the following features:

  1. The usable range has been increased to the 64-but unsigned number range of 18,446,744,073,709,551,615 with the range overflow checks removed since it is unlikely that one would want to run the program for the hundreds of years it would take to process the full range of numbers to that limit. This is at very little cost in processing time as the paging can be done using 32-bit page ranges and only the final prime output needs to be computed as a 64-bit number.
  2. It has increased the wheel factorization from a 2,3,5 wheel to use a 2,3,5,7 prime factor wheel with an additional pre-cull of composite numbers using the the additional primes of 11, 13, and 17, to greatly reduce the redundant composite number culling (now only culling each composite number an average of about 1.5 times). Due to the (DotNet related) computational overheads of doing this (also applies for the 2,3,5 wheel as the previous version) the actual time saving in culling isn't all that great but enumerating the answers is somewhat faster due to many of the "simple" composite numbers being skipped in the packed bit representation.
  3. It still uses the Task Parallel Library (TPL) from DotNet 4 and up for multi-threading from the thread pool on a per page basis.
  4. It now uses a base primes representation that supports automatically growing the array contained by this class as more base primes are required as a thread safe method instead of the fixed pre-computed base primes array used previously.
  5. The base primes representation has been reduced to one byte per base prime for a further reduction in memory footprint; thus, the total memory footprint other than the code is the array to hold this base primes representation for the primes up to the square root of the current range being processed, and the packed bit page buffers which are currently set at under the L2 cache size of 256 Kilobytes (smallest page size of 14,586 bytes times the CHNKSZ of 17 as supplied) each per CPU core plus one extra buffer for the foreground task to process. With this code, about three Megabytes is sufficient to process the prime range up to ten to the fourteenth power. As well as speed due to allowing efficient multiprocessing, this reduce memory requirement is the other advantage of using a paged sieve implementation.

    class UltimatePrimesSoE : IEnumerable<ulong> {
      static readonly uint NUMPRCSPCS = (uint)Environment.ProcessorCount + 1; const uint CHNKSZ = 17;
      const int L1CACHEPOW = 14, L1CACHESZ = (1 << L1CACHEPOW), MXPGSZ = L1CACHESZ / 2; //for buffer ushort[]
      //the 2,3,57 factorial wheel increment pattern, (sum) 48 elements long, starting at prime 19 position
      static readonly byte[] WHLPTRN = { 2,3,1,3,2,1,2,3,3,1,3,2,1,3,2,3,4,2,1,2,1,2,4,3,
                                                                               2,3,1,2,3,1,3,3,2,1,2,3,1,3,2,1,2,1,5,1,5,1,2,1 }; const uint FSTCP = 11;
      static readonly byte[] WHLPOS; static readonly byte[] WHLNDX; //to look up wheel indices from position index
      static readonly byte[] WHLRNDUP; //to look up wheel rounded up index positon values, allow for overfolw
      static readonly uint WCRC = WHLPTRN.Aggregate(0u, (acc, n) => acc + n);
      static readonly uint WHTS = (uint)WHLPTRN.Length; static readonly uint WPC = WHTS >> 4;
      static readonly byte[] BWHLPRMS = { 2, 3, 5, 7, 11, 13, 17 }; const uint FSTBP = 19;
      static readonly uint BWHLWRDS = BWHLPRMS.Aggregate(1u, (acc, p) => acc * p) / 2 / WCRC * WHTS / 16;
      static readonly uint PGSZ = MXPGSZ / BWHLWRDS * BWHLWRDS; static readonly uint PGRNG = PGSZ * 16 / WHTS * WCRC;
      static readonly uint BFSZ = CHNKSZ * PGSZ, BFRNG = CHNKSZ * PGRNG; //number of uints even number of caches in chunk
      static readonly ushort[] MCPY; //a Master Copy page used to hold the lower base primes preculled version of the page
      struct Wst { public ushort msk; public byte mlt; public byte xtr; public ushort nxt; }
      static readonly byte[] PRLUT; /*Wheel Index Look Up Table */ static readonly Wst[] WSLUT; //Wheel State Look Up Table
      static readonly byte[] CLUT; // a Counting Look Up Table for very fast counting of primes
      static int count(uint bitlim, ushort[] buf) { //very fast counting
        if (bitlim < BFRNG) { var addr = (bitlim - 1) / WCRC; var bit = WHLNDX[bitlim - addr * WCRC] - 1; addr *= WPC;
          for (var i = 0; i < 3; ++i) buf[addr++] |= (ushort)((unchecked((ulong)-2) << bit) >> (i << 4)); }
        var acc = 0; for (uint i = 0, w = 0; i < bitlim; i += WCRC)
          acc += CLUT[buf[w++]] + CLUT[buf[w++]] + CLUT[buf[w++]]; return acc; }
      static void cull(ulong lwi, ushort[] b) { ulong nlwi = lwi;
        for (var i = 0u; i < b.Length; nlwi += PGRNG, i += PGSZ) MCPY.CopyTo(b, i); //copy preculled lower base primes.
        for (uint i = 0, pd = 0; ; ++i) { pd += (uint)baseprms[i] >> 6;
          var wi = baseprms[i] & 0x3Fu; var wp = (uint)WHLPOS[wi]; var p = pd * WCRC + PRLUT[wi];
          var pp = (p - FSTBP) >> 1; var k = (ulong)p * (pp + ((FSTBP - 1) >> 1)) + pp;
          if (k >= nlwi) break; if (k < lwi) { k = (lwi - k) % (WCRC * p);
            if (k != 0) { var nwp = wp + (uint)((k + p - 1) / p); k = (WHLRNDUP[nwp] - wp) * p - k;
              if (nwp >= WCRC) wp = 0; else wp = nwp; } }
          else k -= lwi; var kd = k / WCRC; var kn = WHLNDX[k - kd * WCRC];
          for (uint wrd = (uint)kd * WPC + (uint)(kn >> 4), ndx = wi * WHTS + kn; wrd < b.Length; ) {
            var st = WSLUT[ndx]; b[wrd] |= st.msk; wrd += st.mlt * pd + st.xtr; ndx = st.nxt; } } }
      static Task cullbf(ulong lwi, ushort[] b, Action<ushort[]> f) {
        return Task.Factory.StartNew(() => { cull(lwi, b); f(b); }); }
      class Bpa {   //very efficient auto-resizing thread-safe read-only indexer class to hold the base primes array
        byte[] sa = new byte[0]; uint lwi = 0, lpd = 0; object lck = new object();
        public uint this[uint i] { get { if (i >= this.sa.Length) lock (this.lck) {
                var lngth = this.sa.Length; while (i >= lngth) {
                  var bf = (ushort[])MCPY.Clone(); if (lngth == 0) {
                    for (uint bi = 0, wi = 0, w = 0, msk = 0x8000, v = 0; w < bf.Length;
                        bi += WHLPTRN[wi++], wi = (wi >= WHTS) ? 0 : wi) {
                      if (msk >= 0x8000) { msk = 1; v = bf[w++]; } else msk <<= 1;
                      if ((v & msk) == 0) { var p = FSTBP + (bi + bi); var k = (p * p - FSTBP) >> 1;
                        if (k >= PGRNG) break; var pd = p / WCRC; var kd = k / WCRC; var kn = WHLNDX[k - kd * WCRC];
                        for (uint wrd = kd * WPC + (uint)(kn >> 4), ndx = wi * WHTS + kn; wrd < bf.Length; ) {
                          var st = WSLUT[ndx]; bf[wrd] |= st.msk; wrd += st.mlt * pd + st.xtr; ndx = st.nxt; } } } }
                  else { this.lwi += PGRNG; cull(this.lwi, bf); }
                  var c = count(PGRNG, bf); var na = new byte[lngth + c]; sa.CopyTo(na, 0);
                  for (uint p = FSTBP + (this.lwi << 1), wi = 0, w = 0, msk = 0x8000, v = 0;
                      lngth < na.Length; p += (uint)(WHLPTRN[wi++] << 1), wi = (wi >= WHTS) ? 0 : wi) {
                    if (msk >= 0x8000) { msk = 1; v = bf[w++]; } else msk <<= 1; if ((v & msk) == 0) {
                      var pd = p / WCRC; na[lngth++] = (byte)(((pd - this.lpd) << 6) + wi); this.lpd = pd; }
                  } this.sa = na; } } return this.sa[i]; } } }
      static readonly Bpa baseprms = new Bpa();
      static UltimatePrimesSoE() {
        WHLPOS = new byte[WHLPTRN.Length + 1]; //to look up wheel position index from wheel index
        for (byte i = 0, acc = 0; i < WHLPTRN.Length; ++i) { acc += WHLPTRN[i]; WHLPOS[i + 1] = acc; }
        WHLNDX = new byte[WCRC + 1]; for (byte i = 1; i < WHLPOS.Length; ++i) {
          for (byte j = (byte)(WHLPOS[i - 1] + 1); j <= WHLPOS[i]; ++j) WHLNDX[j] = i; }
        WHLRNDUP = new byte[WCRC * 2]; for (byte i = 1; i < WHLRNDUP.Length; ++i) {
          if (i > WCRC) WHLRNDUP[i] = (byte)(WCRC + WHLPOS[WHLNDX[i - WCRC]]); else WHLRNDUP[i] = WHLPOS[WHLNDX[i]]; }
        Func<ushort, int> nmbts = (v) => { var acc = 0; while (v != 0) { acc += (int)v & 1; v >>= 1; } return acc; };
        CLUT = new byte[1 << 16]; for (var i = 0; i < CLUT.Length; ++i) CLUT[i] = (byte)nmbts((ushort)(i ^ -1));
        PRLUT = new byte[WHTS]; for (var i = 0; i < PRLUT.Length; ++i) {
          var t = (uint)(WHLPOS[i] * 2) + FSTBP; if (t >= WCRC) t -= WCRC; if (t >= WCRC) t -= WCRC; PRLUT[i] = (byte)t; }
        WSLUT = new Wst[WHTS * WHTS]; for (var x = 0u; x < WHTS; ++x) {
          var p = FSTBP + 2u * WHLPOS[x]; var pr = p % WCRC;
          for (uint y = 0, pos = (p * p - FSTBP) / 2; y < WHTS; ++y) {
            var m = WHLPTRN[(x + y) % WHTS];
            pos %= WCRC; var posn = WHLNDX[pos]; pos += m * pr; var nposd = pos / WCRC; var nposn = WHLNDX[pos - nposd * WCRC];
            WSLUT[x * WHTS + posn] = new Wst { msk = (ushort)(1 << (int)(posn & 0xF)), mlt = (byte)(m * WPC),
                                               xtr = (byte)(WPC * nposd + (nposn >> 4) - (posn >> 4)),
                                               nxt = (ushort)(WHTS * x + nposn) }; } }
        MCPY = new ushort[PGSZ]; foreach (var lp in BWHLPRMS.SkipWhile(p => p < FSTCP)) { var p = (uint)lp;
          var k = (p * p - FSTBP) >> 1; var pd = p / WCRC; var kd = k / WCRC; var kn = WHLNDX[k - kd * WCRC];
          for (uint w = kd * WPC + (uint)(kn >> 4), ndx = WHLNDX[(2 * WCRC + p - FSTBP) / 2] * WHTS + kn; w < MCPY.Length; ) {
            var st = WSLUT[ndx]; MCPY[w] |= st.msk; w += st.mlt * pd + st.xtr; ndx = st.nxt; } } }
      struct PrcsSpc { public Task tsk; public ushort[] buf; }
      class nmrtr : IEnumerator<ulong>, IEnumerator, IDisposable {
        PrcsSpc[] ps = new PrcsSpc[NUMPRCSPCS]; ushort[] buf;
        public nmrtr() { for (var s = 0u; s < NUMPRCSPCS; ++s) ps[s] = new PrcsSpc { buf = new ushort[BFSZ] };
          for (var s = 1u; s < NUMPRCSPCS; ++s) {
            ps[s].tsk = cullbf((s - 1u) * BFRNG, ps[s].buf, (bfr) => { }); } buf = ps[0].buf; }
        ulong _curr, i = (ulong)-WHLPTRN[WHTS - 1]; int b = -BWHLPRMS.Length - 1; uint wi = WHTS - 1; ushort v, msk = 0;
        public ulong Current { get { return this._curr; } } object IEnumerator.Current { get { return this._curr; } }
        public bool MoveNext() {
          if (b < 0) { if (b == -1) b += buf.Length; //no yield!!! so automatically comes around again
            else { this._curr = (ulong)BWHLPRMS[BWHLPRMS.Length + (++b)]; return true; } }
          do {
            i += WHLPTRN[wi++]; if (wi >= WHTS) wi = 0; if ((this.msk <<= 1) == 0) {
              if (++b >= BFSZ) { b = 0; for (var prc = 0; prc < NUMPRCSPCS - 1; ++prc) ps[prc] = ps[prc + 1];
                ps[NUMPRCSPCS - 1u].buf = buf;
                ps[NUMPRCSPCS - 1u].tsk = cullbf(i + (NUMPRCSPCS - 1u) * BFRNG, buf, (bfr) => { });
                ps[0].tsk.Wait(); buf = ps[0].buf; } v = buf[b]; this.msk = 1; } }
          while ((v & msk) != 0u); _curr = FSTBP + i + i; return true; }
        public void Reset() { throw new Exception("Primes enumeration reset not implemented!!!"); }
        public void Dispose() { } }
      public IEnumerator<ulong> GetEnumerator() { return new nmrtr(); }
      IEnumerator IEnumerable.GetEnumerator() { return new nmrtr(); }
      static void IterateTo(ulong top_number, Action<ulong, uint, ushort[]> actn) {
        PrcsSpc[] ps = new PrcsSpc[NUMPRCSPCS]; for (var s = 0u; s < NUMPRCSPCS; ++s) ps[s] = new PrcsSpc {
          buf = new ushort[BFSZ], tsk = Task.Factory.StartNew(() => { }) };
        var topndx = (top_number - FSTBP) >> 1; for (ulong ndx = 0; ndx <= topndx; ) {
          ps[0].tsk.Wait(); var buf = ps[0].buf; for (var s = 0u; s < NUMPRCSPCS - 1; ++s) ps[s] = ps[s + 1];
          var lowi = ndx; var nxtndx = ndx + BFRNG; var lim = topndx < nxtndx ? (uint)(topndx - ndx + 1) : BFRNG;
          ps[NUMPRCSPCS - 1] = new PrcsSpc { buf = buf, tsk = cullbf(ndx, buf, (b) => actn(lowi, lim, b)) };
          ndx = nxtndx; } for (var s = 0u; s < NUMPRCSPCS; ++s) ps[s].tsk.Wait(); }
      public static long CountTo(ulong top_number) {
        if (top_number < FSTBP) return BWHLPRMS.TakeWhile(p => p <= top_number).Count();
        var cnt = (long)BWHLPRMS.Length;
        IterateTo(top_number, (lowi, lim, b) => { Interlocked.Add(ref cnt, count(lim, b)); }); return cnt; }
      public static ulong SumTo(uint top_number) {
        if (top_number < FSTBP) return (ulong)BWHLPRMS.TakeWhile(p => p <= top_number).Aggregate(0u, (acc, p) => acc += p);
        var sum = (long)BWHLPRMS.Aggregate(0u, (acc, p) => acc += p);
        Func<ulong, uint, ushort[], long> sumbf = (lowi, bitlim, buf) => {
          var acc = 0L; for (uint i = 0, wi = 0, msk = 0x8000, w = 0, v = 0; i < bitlim;
              i += WHLPTRN[wi++], wi = wi >= WHTS ? 0 : wi) {
            if (msk >= 0x8000) { msk = 1; v = buf[w++]; } else msk <<= 1;
            if ((v & msk) == 0) acc += (long)(FSTBP + ((lowi + i) << 1)); } return acc; };
        IterateTo(top_number, (pos, lim, b) => { Interlocked.Add(ref sum, sumbf(pos, lim, b)); }); return (ulong)sum; }
      static void IterateUntil(Func<ulong, ushort[], bool> prdct) {
        PrcsSpc[] ps = new PrcsSpc[NUMPRCSPCS];
        for (var s = 0u; s < NUMPRCSPCS; ++s) { var buf = new ushort[BFSZ];
          ps[s] = new PrcsSpc { buf = buf, tsk = cullbf(s * BFRNG, buf, (bfr) => { }) }; }
        for (var ndx = 0UL; ; ndx += BFRNG) {
          ps[0].tsk.Wait(); var buf = ps[0].buf; var lowi = ndx; if (prdct(lowi, buf)) break;
          for (var s = 0u; s < NUMPRCSPCS - 1; ++s) ps[s] = ps[s + 1];
          ps[NUMPRCSPCS - 1] = new PrcsSpc { buf = buf,
                                             tsk = cullbf(ndx + NUMPRCSPCS * BFRNG, buf, (bfr) => { }) }; } }
      public static ulong ElementAt(long n) {
        if (n < BWHLPRMS.Length) return (ulong)BWHLPRMS.ElementAt((int)n);
        long cnt = BWHLPRMS.Length; var ndx = 0UL; var cycl = 0u; var bit = 0u; IterateUntil((lwi, bfr) => {
          var c = count(BFRNG, bfr); if ((cnt += c) < n) return false; ndx = lwi; cnt -= c; c = 0;
          do { var w = cycl++ * WPC; c = CLUT[bfr[w++]] + CLUT[bfr[w++]] + CLUT[bfr[w]]; cnt += c; } while (cnt < n);
          cnt -= c; var y = (--cycl) * WPC; ulong v = ((ulong)bfr[y + 2] << 32) + ((ulong)bfr[y + 1] << 16) + bfr[y];
          do { if ((v & (1UL << ((int)bit++))) == 0) ++cnt; } while (cnt <= n); --bit; return true;
        }); return FSTBP + ((ndx + cycl * WCRC + WHLPOS[bit]) << 1); } }
    

The above code takes about 59 milliseconds to find the primes to two million (slightly slower than some of the other simpler codes due to initialization overhead), but calculates the primes to one billion and the full number range in 1.55 and 5.95 seconds, respectively. This isn't much faster than the last version due to the DotNet extra overhead of an extra array bound check in the enumeration of found primes compared to the time expended in culling composite numbers which is less than a third of the time spent emumerating, so the saving in culling composites is cancelled out by the extra time (due to an extra array bounds check per prime candidate) in the enumeration. However, for many tasks involving primes, one does not need to enumerate all primes but can just compute the answers without enumeration.

For the above reasons, this class provides the example static methods "CountTo", "SumTo", and "ElementAt" to count or sum the primes to a given upper limit or to output the zero-based nth prime, respectively. The "CountTo" method will produce the number of primes to one billion and in the 32-bit number range in about 0.32 and 1.29 seconds, respectively; the "ElementAt" method will produce the last element in these ranges in about 0.32 and 1.25 seconds, respectively, and the "SumTo" method produces the sum of all the primes in these ranges in about 0.49 and 1.98 seconds respectively. This program calculates the sum of all the prime numbers to four billion plus as here in less time than many naive implementations can sum all the prime numbers to two million as in Euler Problem 10, for over 2000 times the practical range!

This code is only about four times slower than very highly optimized C code used by primesieve, and the reasons it is slower are mostly due to DotNet, as follows (discussing the case of a 256 Kilobyte buffer, which is the size of the L2 cache):

  1. Most of the execution time is spent in the main composite culling loop, which is the last "for loop" in the private static "cull" method" and only contains four statements per loop plus the range check.
  2. In DotNet, this compiles to take about 21.83 CPU clock cycles per loop, including about 5 clock cycles for the two array bounds checks per loop.
  3. The very efficient C compiler converts this loop into only about 8.33 clock cycles for an advantage of about 2.67 times.
  4. Primesieve also uses extreme manual "loop unrolling" optimizations to reduce the average time to perform the work per loop to about 4.17 clock cycles per composite cull loop, for a additional gain of two times and a total gain of about 5.3 times.
  5. Now, the highly optimized C code both doesn't Hyper Thread (HT) as well as the less efficient Just In Time (JIT) compiler produced code and as well, the OpemMP multi-threading used by primesieve doesn't appear to be as well adapted to this problem as use of the Thread Pool threads here, so the final multi-threaded gain in only about four times.
  6. One might consider the use of "unsafe" pointers to eliminate the array bounds checks, and it was tried, but the JIT compiler does not optimize pointers as well as normal array based code, so the gain of not having array bound checks is cancelled by the less efficient code (every pointer access (re)loads the pointer address from memory rather than using a register already pointing to that address as in the optimized array case).
  7. Primesieve is even faster when using smaller buffer sizes as in the size of the available L1 cache (16 Kilobytes when multi-threading for the i3/i5/i7 CPU's) as its more efficient code has more of an advantage reducing the average memory access time to one clock cycle from about four clock cycles, which advantage makes much less of a difference to the DotNet code, which gains more from less processing per reduced number of pages. Thus, primesieve is about five times faster when each use their most efficient buffer size.

This DotNet code will count (CountTo) the number of primes to ten to the power thirteen (ten trillion) in about an hour and a half (tested) and the number of primes to a hundred trillion (ten to the fourteenth) in just over a half day (estimated) as compared to twenty minutes and under four hours for primesieve, respectively. This is interesting historically as until 1985 only the count of primes in the range to ten to the thirteenth was known since it would take too much execution time on the (expensive) supercomputers of that day to find the range ten times larger; now we can easily compute the number of primes in those ranges on a common desktop computer (in this case, an Intel i7-2700K - 3.5 GHz)!

Using this code, it is easy to understand why Professor Atkin and Bernstein thought that the SoA was faster than the SoE - a myth that persists to this day, with the reasoning as follows:

  1. It is easy to have any SoA implementation count the number of state toggles and square free composite number culls (which last can be optimized using the same 2,3,5 wheel optimization as used inherently by the SoA algorithm) to determine that the total number of both of these operations is about 1.4 billion for the 32-bit number range.
  2. Bernstein's "equivalent" SoE implementation to his SoA implementation (neither of which are as optimized as this code), which uses the same 2,3,5 wheel optimization as the SoA, will have a total of about 1.82 billion cull operations with the same cull loop computational complexity.
  3. Therefore, Bernstein's results of about 30% better performance as compared to his implementation of the SoE is about right, just based on the number of equivalent operations. However, his implementation of the SoE did not take wheel factorization "to the max" since the SoA does not much respond to further degrees of wheel factorization as the 2,3,5 wheel is "baked in" to the basic algorithm.
  4. The wheel factorization optimizations used here reduces the number of composite cull operations to about 1.2 billion for the 32-bit number range; therefore, this algorithm using this degree of wheel factorization will run about 16.7% faster than an equivalent version of SoA since the culling loop(s) can be implemented about the same for each algorithm.
  5. The SoE with this level of optimizations is easier to write than an equivalent SoA as there needs only be one state table look up array to cull across the base primes instead of an additional state look up array's for each of the four quadratic equation solution cases that produce valid state toggles.
  6. When written using equivalent implementations as this code, the code will respond equivalently to C compiler optimizations for SoA as for the SoE used in primesieve.
  7. The SoA implementation will also respond to the extreme manual "loop unrolling" optimization used by primesieve just as effectively as for the SoE implementation.
  8. Therefore, if I went though all the work to implement the SoA algorithm using the techniques as for the above SoE code, the resulting SoA would only be a slight amount slower when the output primes were enumerated but would be about 16.7% slower when using the static direct methods.
  9. Memory footprint of the two algorithms is no different as both require the representation of the base primes and the same number of segment page buffers.

EDIT_ADD: Interestingly, this code runs 30% faster in x86 32-bit mode than in x64 64-bit mode, likely due to avoiding the slight extra overhead of extending the uint32 numbers to ulong's. All of the above timings are for 64-bit mode. END_EDIT_ADD

In summary: The segment paged Sieve of Atkin is actually slower than a maximally optimized segment paged Sieve of Eratosthenes with no saving in memory requirements!!!

I say again: "Why use the Sieve of Atkin?".

0
GordonBGood On

This is a more explicit answer to the question(s) raised, as follows:

Is there a way (the incremental Sieve of Atkin - GBG) can be done?

Of course the Sieve of Atkin (SoA) can be implemented as a segmented algorithm and in fact does not require the use of segmenting arrays as suggested at all but rather can be done just using raw sequences as I have done functionally using F#, although this is quite a bit slower than the extra efficiency permitted by using mutable arrays for the culling.

I am thinking it could go something like this: 1. Start with some trivial limit 2. Find all the primes up to the limit 3. Find all the primes up to the limit 4. Yield all the newfound primes 5. Increase the limit (by doubling or squaring the old limit or something like that) 6. Goto step 2

The above algorithm can be implemented in at least two different ways: 1) save the state of 'x' and 'y' for each value of 'x' when the sequences "run off" the current segment and start up again with those values for the next segment, or 2) calculate the applicable pair values of 'x' and 'y' to use for the new segment. Although the first way is simpler, I recommend the second method for two reasons: 1) it doesn't use memory for all the (many) values of x and y that must be saved and only a representation of the base primes must be saved in memory for the "squares free" culling step, and 2) it opens the door to using multi-threading and assigning independent thread operations for each page segment for large savings in time on a multi-processor computer.

And indeed, a better understanding of 'x' and 'y' is necessary:

My main problem is that I don't quite understand what x and y for example is in this algorithm. Like, could I just use the same algorithm kind of but set x and y to oldLimit (initially 1) and run it up to newLimit?

There has been one answer addressing this but perhaps it isn't explicit enough. It may be easier to think of these quadratic equations as a potentially infinite sequence of sequences where one of 'x' or 'y' is fixed starting from their lowest values and the other variable produces all of the elements per sequence. For instance, one could consider the odd expression of the quadratic equation "4*x^2 + y^2" as the sequence of sequences starting with 5, 17, 37, 65, ..., and with each of these sequences have elements as in {5, 13, 29, 53, ...}, {17, 25, 41, 65, ...}, {37, 45, 61, 85, ...}, {65, 73, 89, 125, ...}, ... Obviously, some of these elements are not prime as they are composites of 3 or 5 and that is why these need to be eliminated by either the modulo test, or alternatively as in the Bernstein implementation, these can be skipped automatically by recognizing patterns in the modulo arithmetic for the generators so they never actually even appear.

Implementing the first easier way of producing a segmented version of the SoA then just requires saving the state of each of the sequence of sequences, which is basically what is done in the F# incremental implementation (although using a folded tree structure for efficiency), which could easily be adapted to working over an array page range. For the case where the sequence state is computed at the beginning of every segment page, one just needs to compute how many elements would fit into the space up to the number represented by the lowest element in the new segment page for each "active" sequence, where "active" means a sequence whose starting element is lower than the number represented by the start index of the segment page.

As for pseudo-code on how to implement the segmentation of an array based SoA sieve, I have written something up for a related post, which shows how this can be accomplished.

The point of this is so that I won't have to set that limit. So that I can for example use Linq and just Take() however many primes I need, not worrying about if the limit is high enough, et cetera.

As stated in another answer, you could accomplish this end goal by just setting the maximum "limit" as a constant in your code, but this would be quite inefficient for small ranges of primes as culling would be taking place over a much larger array than necessary. As already stated, other than for better efficiency and reducing memory use by a huge factor, segmentation also has other advantages in permitting the efficient use of multi-processing. However, use of the Take(), TakeWhile(), Where(), Count(), etcetera, methods won't provide very good performance for large ranges of primes as their use involves many stacked method calls per element at many clock cycles per call and return. But you would have the option of either using these or more imperative forms of program flow, so this isn't a real objection.

0
GordonBGood On

Here is a solution to unbounded prime sieving in C#, which can be implemented using the Sieve of Eratosthenes (SoE) or the Sieve of Atkin (SoA) algorithms; however, I maintain that it is hardly worth the extreme complexity of an optimized SoA solution given than the true SoE gives about the same performance without as much complexity. Thus, this is perhaps only a partial answer in that, while it shows how to implement a better SoA algorithm and shows how to implement an unbounded sequence using SoE, it only hints at how to combine these to write a reasonably efficient SoA.

Note that if only discussion on the very fastest implementations of these ideas is desired, jump to the bottom of this answer.

First we should comment on the point of this exercise in producing an unbounded sequence of primes to permit using the IEnumerable extension methods such as Take(), TakeWhile(), Where(), Count(), etcetera, as these methods give away some performance due to the added levels of method calling, adding at least 28 machine cycles for every call and return to enumerate one value and adding several levels of method calls per function; that said, having an (effectively) infinite sequence is still useful even if one uses more imperative filtering techniques on the results of the enumerations for better speed.

Note, in the the simpler examples following, I have limited the range to that of unsigned 32-bit numbers (uint) as much past that range the basic SoE or SoA implementations aren't really appropriate as to efficiency and one needs to switch over to a "bucket" or other form of incremental sieve for part of the sieving for efficiency; however, for a fully optimized page segmented sieve as in the fastest implementation, the range is increased to the 64-bit range although as written one likely would not want to sieve past about a hundred trillion (ten to the fourteenth power) as run time would take up to hundreds of years for the full range.

As the question chooses the SoA for probably the wrong reasons in first mistaking a Trial Division (TD) prime sieve for a true SoE and then using a naive SoA over the TD sieve, let's establish the true bounded SoE which can be implemented in a few lines as a method (which could be converted to a class as per the question's implementation coding style) as follows:

static IEnumerable<uint> primesSoE(uint top_number) {
  if (top_number < 2u) yield break;
  yield return 2u; if (top_number < 3u) yield break;
  var BFLMT = (top_number - 3u) / 2u;
  var SQRTLMT = ((uint)(Math.Sqrt((double)top_number)) - 3u) / 2u;
  var buf = new BitArray((int)BFLMT + 1,true);
  for (var i = 0u; i <= BFLMT; ++i) if (buf[(int)i]) {
      var p = 3u + i + i; if (i <= SQRTLMT) {
        for (var j = (p * p - 3u) / 2u; j <= BFLMT; j += p)
          buf[(int)j] = false; } yield return p; } }

This calculates the primes to 2 million in about 16 milliseconds on an i7-2700K (3.5 GHz) and the 203,280,221 primes up to 4,294,967,291 in the 32-bit number range in about 67 seconds on the same machine (given a spare 256 MegaBytes of RAM memory).

Now, using the version above to compare to the SoA is hardly fair as the true SoA automatically ignores the multiples of 2, 3, and 5 so introducing wheel factorization to do the same for the SoE yields the following code:

static IEnumerable<uint> primesSoE(uint top_number) {
  if (top_number < 2u) yield break;
  yield return 2u; if (top_number < 3u) yield break;
  yield return 3u; if (top_number < 5u) yield break;
  yield return 5u; if (top_number < 7u) yield break;
  var BFLMT = (top_number - 7u) / 2u;
  var SQRTLMT = ((uint)(Math.Sqrt((double)top_number)) - 7u) / 2u;
  var buf = new BitArray((int)BFLMT + 1,true);
  byte[] WHLPTRN = { 2, 1, 2, 1, 2, 3, 1, 3 };
  for (uint i = 0u, w = 0u; i <= BFLMT; i += WHLPTRN[w], w = (w < 7u) ? ++w : 0u)
    if (buf[(int)i]) { var p = 7u + i + i; if (i <= SQRTLMT) {
        var pX2 = p + p; uint[] pa = { p, pX2, pX2 + p };
        for (uint j = (p * p - 7u) / 2u, m = w; j <= BFLMT;
                               j += pa[WHLPTRN[m] - 1u], m = (m < 7u) ? ++m : 0u)
          buf[(int)j] = false; } yield return p; } }

The above code calculates the primes to 2 million in about 10 milliseconds and the primes to the 32-bit number range in about 40 seconds on the same machine as above.

Next, let's establish whether a version of the SoA that we are likely to write here actually has any benefit as compared to the SoE as per the above code as far as execution speed goes. The code below follows the model of the SoE above and optimizes the SoA pseudo-code from the Wikipedia article as to tuning the ranges of the 'x' and 'y' variables using separate loops for the individual quadratic solutions as suggested in that article, solving the quadratic equations (and the square free eliminations) only for odd solutions, combining the "3*x^2" quadratic to solve for both the positive and negative second terms in the same pass, and eliminating the computationally expensive modulo operations, to produce code that is over three times faster than the naive version of the pseudo-code posted there and as used in the question here. The code for the bounded SoA is as then follows:

static IEnumerable<uint> primesSoA(uint top_number) {
  if (top_number < 2u) yield break;
  yield return 2u; if (top_number < 3u) yield break;
  yield return 3u; if (top_number < 5u) yield break;
  var BFLMT = (top_number - 3u) / 2u; var buf = new BitArray((int)BFLMT + 1, false);
  var SQRT = (uint)(Math.Sqrt((double)top_number)); var SQRTLMT = (SQRT - 3u) / 2u;
  for (uint x = 1u, s = 1u, mdx12 = 5u, dmdx12 = 0u; s <= BFLMT; ++x, s += ((x << 1) - 1u) << 1) {
    for (uint y = 1u, n = s, md12 = mdx12, dmd12 = 8u; n <= BFLMT; y += 2, n += (y - 1u) << 1) {
      if ((md12 == 1) || (md12 == 5)) buf[(int)n] = buf[(int)n] ^ true;
      md12 += dmd12; if (md12 >= 12) md12 -= 12; dmd12 += 8u; if (dmd12 >= 12u) dmd12 -= 12u; }
    mdx12 += dmdx12; if (mdx12 >= 12u) mdx12 -= 12u; dmdx12 += 8u; if (dmdx12 >= 12u) dmdx12 -= 12u; }
  for (uint x = 1u, s = 0u, mdx12 = 3u, dmdx12 = 8u; s <= BFLMT; ++x, s += x << 1) {
    int y = 1 - (int)x, n = (int)s, md12 = (int)mdx12, dmd12 = ((-y - 1) << 2) % 12;
    for (; (y < 0) && (uint)n <= BFLMT; y += 2, n += (-y + 1) << 1) {
      if (md12 == 11) buf[(int)n] = buf[(int)n] ^ true;
      md12 += dmd12; if (md12 >= 12) md12 -= 12; dmd12 += 4; if (dmd12 >= 12) dmd12 -= 12; }
    if (y < 1) { y = 2; n += 2; md12 += 4; dmd12 = 0; } else { n += 1; md12 += 2; dmd12 = 8; }
    if (md12 >= 12) md12 -= 12; for (; (uint)n <= BFLMT; y += 2, n += (y - 1) << 1) {
      if (md12 == 7) buf[(int)n] = buf[(int)n] ^ true;
      md12 += dmd12; if (md12 >= 12) md12 -= 12; dmd12 += 8; if (dmd12 >= 12) dmd12 -= 12; }
    mdx12 += dmdx12; if (mdx12 >= 12) mdx12 -= 12; dmdx12 += 4; if (dmdx12 >= 12) dmdx12 -= 12; }
  for (var i = 0u; i<=BFLMT; ++i) if (buf[(int)i]) { var p = 3u+i+i; if (i<=SQRTLMT) { var sqr = p*p;
        for (var j = (sqr - 3ul) / 2ul; j <= BFLMT; j += sqr) buf[(int)j] = false; } yield return p; } }

This is still over twice as slow as the wheel factorization SoE algorithm posted due to the not fully optimized number of operations. Further optimizations can be made to the SoA code as in using modulo 60 operations as for the original (non-pseudo-code) algorithm and using bit packing to only deal with multiples excluding 3 and 5 to get the code fairly close in performance to SoE or even exceed it slightly, but we get closer and closer to the complexity of the Berstein implementation to achieve this performance. My point is "Why SoA, when one works very hard to get about the same performance as SoE?".

Now for the unbounded primes sequence, the very easiest way to do this is to define a const top_number of Uint32.MaxValue and eliminate the argument in the primesSoE or primesSoA methods. This is somewhat inefficient in that the culling is still done over the full number range no matter the actual prime value being processed, which makes the determination for small ranges of primes take much longer than it should. There are also other reasons to go to a segmented version of the primes sieve other than this and extreme memory use: First, algorithms that use data that is primarily within the CPU L1 or L2 data caches process faster because of more efficient memory access, and secondly because segmentation allows the job to be easily split into pages that can be culled in the background using multi-core processors for a speed-up that can be proportional to the number of cores used.

For efficiency, we should choose an array size of the CPU L1 or L2 cache size which is at least 16 Kilobytes for most modern CPU's in order to avoid cache thrashing as we cull composites of primes from the array, meaning that the BitArray can have a size of eight times that large (8 bits per byte) or 128 Kilobits. Since we only need to sieve odd primes, this represents a range of numbers of over a quarter million, meaning that a segmented version will only use eight segments to sieve to 2 million as required by Euler Problem 10. One could save the results from the last segment and continue on from that point, but that would preclude adapting this code to the multi-core case where one relegates culling to the background on multiple threads to take full advantage of multi-core processors. The C# code for an (single thread) unbounded SoE is as follows:

static IEnumerable<uint> primesSoE() { yield return 2u; yield return 3u; yield return 5u;
  const uint L1CACHEPOW = 14u + 3u, L1CACHESZ = (1u << (int)L1CACHEPOW); //for 16K in bits...
  const uint BUFSZ = L1CACHESZ / 15u * 15u; //an even number of wheel rotations
  var buf = new BitArray((int)BUFSZ);
  const uint MAXNDX = (uint.MaxValue - 7u) / 2u; //need maximum for number range
  var SQRTNDX = ((uint)Math.Sqrt(uint.MaxValue) - 7u) / 2u;
  byte[] WHLPTRN = { 2, 1, 2, 1, 2, 3, 1, 3 }; //the 2,3,5 factorial wheel, (sum) 15 elements long
  byte[] WHLPOS = { 0, 2, 3, 5, 6, 8, 11, 12 }; //get wheel position from index
  byte[] WHLNDX = { 0, 0, 1, 2, 2, 3, 4, 4, 5, 5, 5, 6, 7, 7, 7, //get index from position
                    0, 0, 1, 2, 2, 3, 4, 4, 5, 5, 5, 6, 7 }; //allow for overflow
  byte[] WHLRNDUP = { 0, 2, 2, 3, 5, 5, 6, 8, 8, 11, 11, 11, 12, 15, //allow for overflow...
                      15, 15, 17, 17, 18, 20, 20, 21, 23, 23, 26, 26, 26, 27 };
  uint BPLMT = (ushort.MaxValue - 7u) / 2u; var bpbuf = new BitArray((int)BPLMT + 1, true);
  for (var i = 0; i <= 124; ++i) if (bpbuf[i]) { var p = 7 + i + i; //initialize baseprimes array
      for (var j = (p * p - 7) / 2; j <= BPLMT; j += p) bpbuf[j] = false; } var pa = new uint[3];
  for (uint i = 0u, w = 0, si = 0; i <= MAXNDX;
        i += WHLPTRN[w], si += WHLPTRN[w], si = (si >= BUFSZ) ? 0u : si, w = (w < 7u) ? ++w : 0u) {
    if (si == 0) { buf.SetAll(true);
      for (uint j = 0u, bw = 0u; j <= BPLMT; j += WHLPTRN[bw], bw = (bw < 7u) ? ++bw : 0u)
        if (bpbuf[(int)j]) { var p = 7u+j+j; var pX2=p+p; var k = p * (j + 3u) + j;
          if (k >= i + BUFSZ) break; pa[0] = p; pa[1] = pX2; pa[2] = pX2 + p; var sw = bw; if (k < i) {
            k = (i - k) % (15u * p); if (k != 0) { var os = WHLPOS[bw]; sw = os + ((k + p - 1u) / p);
              sw = WHLRNDUP[sw]; k = (sw - os) * p - k; sw = WHLNDX[sw]; } } else k -= i;
          for (; k<BUFSZ; k+=pa[WHLPTRN[sw]-1], sw=(sw<7u) ? ++sw : 0u) buf[(int)k]=false; } }
    if (buf[(int)si]) yield return 7u + i + i; } }

The above code takes about 16 milliseconds to sieve the primes up to two million and about 30 seconds to sieve the full 32-bit number range. This code is faster than the similar version using the factorial wheel without segmentation for large number ranges because, even though the code is more complex as to computations, a great deal of time is saved in not thrashing the CPU caches. Much of the complexity is in the computation of the new starting indexes per base prime per segment, which could have been reduced by saving the state of each per prime per segment but the above code can easily be converted so as to run the culling processes on separate background threads for a further four times speed up for a machine with four cores, even more with eight cores. Not using the BitArray class and addressing the individual bit locations by bit masks would provide a further speed up by about a factor of two, and further increases of the factorial wheel would provide about another reduction in time to about two thirds. Better packing of the bit array in eliminated indexes for values eliminated by the wheel factorization would slightly increase efficiency for large ranges, but would also add somewhat to the complexity of the bit manipulations. However, all of these SoE optimizations would not approach the extreme complexity of the Berstein SoA implementation, yet would run at about the same speed.

To convert the above code from SoE to SoA, we only need to change to the SoA culling code from the bounded case but with the modification that the starting addresses are recalculated for every page segment something like as the starting address is calculated for the SoE case, but even somewhat more complex in operation as the quadratics advance by squares of numbers rather than by simple multiples of primes. I'll leave the required adaptations to the student, although I don't really see the point given that the SoA with reasonable optimizations is not faster than the current version of the SoE and is considerably more complex ;)

EDIT_ADD:

Note: The below code has been corrected, as while the original posted code was correct as to the provided primes sequence, it was half again slower than it needed to be as it culled for all numbers below the square root of the range rather than only the found base primes up to the square root of the range.

The most effective and simplest optimization is relegating the culling operations per segment page to background threads so that, given enough CPU cores, the main limit in enumerating the primes is the enumeration loop itself, in which case all the primes in the 32-bit number range can be enumerated in about ten seconds on the above reference machine (in C#) without other optimizations, with all other optimizations including writing the IEnumerable interface implementations rather than using the built-in iterators reducing this to about six seconds for all the 203,280,221 primes in the 32-bit number range (1.65 seconds to one billion), again with most of the time spent just enumerating the results. The following code has many of those modifications including using background tasks from the Dotnet Framework 4 ThreadPool used by Tasks, using a state machine implemented as a look up table to implement further bit packing to make the primes enumeration quicker, and re-writing the algorithm as an enumerable class using "roll-your-own" iterators for increased efficiency:

class fastprimesSoE : IEnumerable<uint>, IEnumerable {
  struct procspc { public Task tsk; public uint[] buf; }
  struct wst { public byte msk; public byte mlt; public byte xtr; public byte nxt; }
  static readonly uint NUMPROCS = (uint)Environment.ProcessorCount + 1u; const uint CHNKSZ = 1u;
  const uint L1CACHEPOW = 14u, L1CACHESZ = (1u << (int)L1CACHEPOW), PGSZ = L1CACHESZ >> 2; //for 16K in bytes...
  const uint BUFSZ = CHNKSZ * PGSZ; //number of uints even number of caches in chunk
  const uint BUFSZBTS = 15u * BUFSZ << 2; //even in wheel rotations and uints (and chunks)
  static readonly byte[] WHLPTRN = { 2, 1, 2, 1, 2, 3, 1, 3 }; //the 2,3,5 factorial wheel, (sum) 15 elements long
  static readonly byte[] WHLPOS = { 0, 2, 3, 5, 6, 8, 11, 12 }; //get wheel position from index
  static readonly byte[] WHLNDX = { 0, 1, 1, 2, 3, 3, 4, 5, 5, 6, 6, 6, 7, 0, 0, 0 }; //get index from position
  static readonly byte[] WHLRNDUP = { 0, 2, 2, 3, 5, 5, 6, 8, 8, 11, 11, 11, 12, 15, 15, 15, //allow for overflow...
                                      17, 17, 18, 20, 20, 21, 23, 23, 26, 26, 26, 27, 30, 30, 30 }; //round multiplier up
  const uint BPLMT = (ushort.MaxValue - 7u) / 2u; const uint BPSZ = BPLMT / 60u + 1u;
  static readonly uint[] bpbuf = new uint[BPSZ]; static readonly wst[] WHLST = new wst[64];
  static void cullpg(uint i, uint[] b, int strt, int cnt) { Array.Clear(b, strt, cnt);
    for (uint j = 0u, wp = 0, bw = 0x31321212u, bi = 0u, v = 0xc0881000u, bm = 1u; j <= BPLMT;
      j += bw & 0xF, wp = wp < 12 ? wp += bw & 0xF : 0, bw = (bw > 3u) ? bw >>= 4 : 0x31321212u) {
      var p = 7u + j + j; var k = p * (j + 3u) + j; if (k >= (i + (uint)cnt * 60u)) break;
      if ((v & bm) == 0u) { if (k < i) { k = (i - k) % (15u * p); if (k != 0) {
            var sw = wp + ((k + p - 1u) / p); sw = WHLRNDUP[sw]; k = (sw - wp) * p - k; } }
        else k -= i; var pd = p / 15;
        for (uint l = k / 15u + (uint)strt * 4u, lw = ((uint)WHLNDX[wp] << 3) + WHLNDX[k % 15u];
               l < (uint)(strt + cnt) * 4u; ) { var st = WHLST[lw];
          b[l >> 2] |= (uint)st.msk << (int)((l & 3) << 3); l += st.mlt * pd + st.xtr; lw = st.nxt; } }
      if ((bm <<= 1) == 0u) { v = bpbuf[++bi]; bm = 1u; } } }
  static fastprimesSoE() {
    for (var x = 0; x < 8; ++x) { var p = 7 + 2 * WHLPOS[x]; var pr = p % 15;
      for (int y = 0, i = ((p * p - 7) / 2); y < 8; ++y) { var m = WHLPTRN[(x + y) % 8]; i %= 15;
        var n = WHLNDX[i]; i += m * pr; WHLST[x * 8 + n] = new wst { msk = (byte)(1 << n), mlt = m,
                                                                     xtr = (byte)(i / 15),
                                                                     nxt = (byte)(8 * x + WHLNDX[i % 15]) }; }
    } cullpg(0u, bpbuf, 0, bpbuf.Length);  } //init baseprimes
  class nmrtr : IEnumerator<uint>, IEnumerator, IDisposable {
    procspc[] ps = new procspc[NUMPROCS]; uint[] buf;
    Task dlycullpg(uint i, uint[] buf) {  return Task.Factory.StartNew(() => {
        for (var c = 0u; c < CHNKSZ; ++c) cullpg(i + c * PGSZ * 60, buf, (int)(c * PGSZ), (int)PGSZ); }); }
    public nmrtr() {
      for (var i = 0u; i < NUMPROCS; ++i) ps[i] = new procspc { buf = new uint[BUFSZ] };
      for (var i = 1u; i < NUMPROCS; ++i) { ps[i].tsk = dlycullpg((i - 1u) * BUFSZBTS, ps[i].buf); } buf = ps[0].buf;  }
    public uint Current { get { return this._curr; } } object IEnumerator.Current { get { return this._curr; } }
    uint _curr; int b = -4; uint i = 0, w = 0; uint v, msk = 0;
    public bool MoveNext() {
      if (b < 0) if (b == -1) { _curr = 7; b += (int)BUFSZ; }
        else { if (b++ == -4) this._curr = 2u; else this._curr = 7u + ((uint)b << 1); return true; }
      do {  i += w & 0xF; if ((w >>= 4) == 0) w = 0x31321212u; if ((this.msk <<= 1) == 0) {
          if (++b >= BUFSZ) { b = 0; for (var prc = 0; prc < NUMPROCS - 1; ++prc) ps[prc] = ps[prc + 1];
            ps[NUMPROCS - 1u].buf = buf; var low = i + (NUMPROCS - 1u) * BUFSZBTS;
            ps[NUMPROCS - 1u].tsk = dlycullpg(i + (NUMPROCS - 1u) * BUFSZBTS, buf);
            ps[0].tsk.Wait(); buf = ps[0].buf; } v = buf[b]; this.msk = 1; } }
      while ((v & msk) != 0u); if (_curr > (_curr = 7u + i + i)) return false; else return true;  }
    public void Reset() { throw new Exception("Primes enumeration reset not implemented!!!"); }
    public void Dispose() { }
  }
  public IEnumerator<uint> GetEnumerator() { return new nmrtr(); }
  IEnumerator IEnumerable.GetEnumerator() { return new nmrtr(); } }

Note that this code isn't faster than the last version for small ranges of primes as in up to one or two million due to the overhead of setting up and initializing arrays but is considerably faster for larger ranges up to four billion plus. It is about 8 times faster than the question's implementation of the Atkin sieve, but goes to from 20 to 50 times as fast for larger ranges up to four billion plus. There are defined constants in the code setting the base cache size and how many of these to cull per thread (CHNKSZ) that may slightly tweak the performance. Some slight further optimizations could be tried that might reduce the execution time for large primes by up to a factor of two such as further bit packing as for the 2,3,5,7 wheel rather than just the 2,3,5 wheel for a reduction of about 25% in packing (perhaps cutting the execution time to two thirds) and pre-culling the page segments by the wheel factorial up to the factor of 17 for a further reduction of about 20%, but these are about the extent of what can be done in pure C# code as compared to C code which can take advantage of unique native code optimizations.

At any rate, it is hardly worth further optimizations if one intends to use the IEnumberable interface for output as the question desires as about two thirds of the time is used just enumerating the found primes and only about one third of the time is spent in culling the composite numbers. A better approach would be to write methods on the class to implement the desired results such as CountTo, ElementAt, SumTo, etcetera so as to avoid enumeration entirely.

But I did do the extra optimizations as an additional answer which shows additional benefits in spite of additional complexity, and which further shows why one doesn't want to use the SoA because a fully optimized SoE is actually better.

0
oluckyman On

Yes, the sieve of Atkin can be made incremental. See the paper "Two compact incremental prime sieves" by Jonathan Sorenson.

4
jakber On

I can try to explain what x and y does, but I don't think you can do what you ask without restarting the loops from the beginning. This is pretty much the same for any "sieve"-algorithm.

What the sieve does is basically count how many different quadratic equations (even or odd) have each number as a solution. The specific equation checked for each number is different depending on what n % 12 is.

For example, numbers n which have a mod 12 remainder of 1 or 5 are prime if and only if the number of solutions to 4*x^2+y^2=n is odd and the number is square-free. The first loop simply loops through all possible values of x and y that could satisfy these different equations. By flipping the isPrime[n] each time we find a solution for that n, we can keep track of whether the number of solutions is odd or even.

The thing is that we count this for all possible n at the same time, which makes this much more efficient than checking one at the time. Doing it for just some n would take more time (because you would have to make sure that n >= lower_limit in the first loop) and complicate the second loop, since that one requires knowledge of all primes smaller than sqrt.

The second loop checks that the number is square-free (has no factor which is a square of a prime number).

Also, I don't think your implementation of the sieve of Atkin is necessarily faster than a straight-forward sieve of Eratosthenes would be. However, the fastest possible most optimized sieve of Atkin would beat the fastest possible most optimized sieve of Eratosthenes.