haru-prime

There was a light-hearted challenge on the Australian forum website whirlpool.net: "who can make the fastest prime generator?"

I probably took it a bit more seriously than anyone else and found myself drawn into making the sieve faster and faster. I can best describe the experience by comparing it to fishing: just when it seemed that nothing I did could make my sieve any faster, I'd suddenly get a little nibble, see another speed increase, and it would take my interest again.

It has been very educational. I have learnt a lot about optimisation by challenging myself in this way. When I started I didn't even know to compile using optimisation flags (which makes the initial thread rather confusing).

haru_.oO is my name on Whirlpool, so I am calling this haru-prime.


I don't have much free time at all, period. However, I recently bought an i7-5820K and overclocked it to 4.4GHz, so I dusted off my prime project to see if I could make use of any of the new AVX2 instructions. (I was actually pretty disappointed.)

However I did take the opportunity to try a few more things, and then one day on the bus I made a small tweak and suddenly my sieve was running faster than primesieve when calculating primes to 10^9!

primesieve had always been faster in what I considered the "core" generator: calculating primes to 10^9, which only requires sieving primes less than the width of a block. So primesieve is still faster for the higher numbers, and my sieve is not yet multi-threaded. I didn't see the point in exploring those areas while the core generator was still slower.

So I'm pretty excited at the moment as it is a huge milestone for me and thought I'd take the chance to clean up the code enough to put out there on the interwebs.

Some caveats:

Just to clarify again, primesieve is a fully featured project; this is just a hobby project. If you plan on using something, you probably want primesieve. If you want to play with making a faster prime generator then hopefully this is a good place to start.

Comparison with primesieve-5.4.2

First, these are the times that primesieve-5.4.2 gives on my two boxes when built simply by running the configure/make scripts. Note that primesieve takes a while to load, so the time is different when running "time ./count_primes_c" vs timing how long it takes to run everything in "main" (I don't know why). Practically this means that if you need to calculate the primes to 10^9 lots of times you would take the smaller number. If you only want to do it once, take the larger number.

(For primesieve I modified the example count_primes_c to take the start and end times as arguments and run single threaded. I also added some timers around the function call).

Timings (ms):

primesieve-5.4.2
Box            Time to 10^9 (with load)   Time to 10^9   Time to 10^10 (with load)   Time to 10^10
Haswell 4.4G   124                        116            1634                        1624
Sandy laptop   187                        161            2236                        2209

haru-prime
Box            Time to 10^9 (with load)   Time to 10^9   Time to 10^10 (with load)   Time to 10^10
Haswell 4.4G   90                         89             1556                        1559
Sandy laptop   159                        156            2446                        2444

Obviously I am excited by the improvement my sieve shows on my Haswell 5820K (and even my old laptop) when finding primes to 10^9. But it really is slower once you get to 10^10, let alone something like 10^16. Also I haven't multi-threaded my sieve yet.

However, as I stated above, I have been focussing on the base first. I have tons of ideas for moving on to larger primes. Unfortunately I just don't have the time to try any of them out.

How it works - Sieve Basics

Bitmap for marking primes

The basic idea is to have a bitmap to represent all of the target numbers. The bitmap is initially 0 and then if a number is found to be not prime, then that bit is set to 1. Once everything is done the bitmap is read to determine which numbers are prime (still set to 0).

Sieve

The idea of the sieve is to iterate through all primes up to sqrt(MAX). For each prime, iterate through the multiples starting at prime*prime and mark them off on the bitmap.

8/30 wheel

After the number 30, there are only 8 numbers in every 30 that could possibly be prime. Everything else will be a multiple of 2, 3 or 5.

The bitmap is changed so that each byte (8 bits) represents the 8 possibly prime numbers for each set of 30 numbers.

Note that other wheels are possible and I initially explored many of them. However, the fact that there are 8 bits per byte and 8 possible primes in every 30 numbers leads to some really big optimisations, so this wheel is preferred.

Block at a time

Instead of representing all target numbers at once in one bitmap, the problem is broken up into blocks of numbers. The block size is chosen so that the bitmap fits well inside the CPU caches. Each block of numbers is then processed in turn using the bitmap, with the memory reused to calculate the next block.

Optimisations for different groups of sieving primes

Smaller primes have a lot more multiples within any one block. However, there are many more large primes to iterate through than small ones. Needless to say, there are different approaches for different groups of sieving primes.

Specific Optimisations

A simple list of some of the optimisations

0. Slow

For a comparison, the slow sieve iterates over all relevant a * b pairs for each block and marks each result non-prime.

'b' is incremented according to the "possible primes" in the 8/30 wheel. That is, b is never a multiple of 2, 3 or 5. This actually adds a lot of complexity.

Pseudo-code

   for each prime a to sqrt(BLOCK_END)
      get b such that b is possibly prime AND a*b > BLOCK_START AND b > a
      while (a*b < BLOCK_END)
         set bit at a*b
         b = the next possible prime

Timings (ms):

Box            Time to 10^9   Time to 10^10
Haswell 4.4G   955            11210
Sandy laptop   1850           20634

1. Simple

This is called "simple" because it mimics the slow method but introduces one major optimisation, which is probably the main reason for choosing the 8/30 wheel: for a fixed pair of wheel residues, the bit mask to apply never changes, and successive multiples can be marked by just stepping a byte offset.

This is achieved by noting:

   a_byte    = a/30
   a_bitval  = a%30

   a * b = (a_byte * 30 + a_bitval) * (b_byte * 30 + b_bitval)
         =   a_byte   * b_byte   * 30 * 30
           + a_byte   * b_bitval * 30
           + a_bitval * b_byte   * 30
           + a_bitval * b_bitval

From here, only the last term influences which 'bit' to set; i.e., all of the other terms are multiples of 30, meaning whole bytes.

The optimisation is to fix 'b_bit' and then only increment b_byte while marking off possible 'a * b' within the block. This means that the resulting bit mask does not change for that run, as a_bitval and b_bitval are constant. This step is then repeated for each of the 8 possible 'b_bit' values.

We can continue rearranging the above to note:

   a * b = a * (b_byte * 30 + b_bit)
         = a * b_byte * 30 + a * b_bit

So if 'a' is constant and 'b_bit' is constant, then incrementing b_byte means that the next result can be achieved by adding a * 30 to the previous one. A step of 30 is a full byte, so once the first byte offset is found, that offset just needs to be incremented by 'a'.

So finally, the pseudo-code is:

   For each prime 'a' to sqrt(BLOCK_END)
      get b such that b is possibly prime AND a*b > BLOCK_START AND b > a
      b_byte = b/30;

      for each bit (0,1,2,3,4,5,6,7)

         get b = b_byte * 30 + wheel value of bit;
         res = a*b
         res_byte = res/30
         res_bit = wheel bit associated with res % 30

         while res_byte is in the block:
            apply res_bit mask to res_byte
            increment res_byte by 'a'

Timings (ms):

Box            Time to 10^9   Time to 10^10
Haswell 4.4G   322            5000
Sandy laptop   540            8240

2. Store offsets for middle primes

This follows the "Simple" method above, but instead of recalculating the starting offsets for every prime in every block, the eight byte offsets are stored between blocks and simply loaded, used and stored again.

Unlike the previous methods this has some limitations: the stored offsets consume memory (and cache) for every prime, so it is only used for primes below 2^16.

Note that there are 6542 primes below 2^16, which at 20 bytes per prime is 127K. This can fit in the L2 cache but doesn't leave much room. It certainly can't fit in the L1 cache along with our bitmap.

I've constantly been surprised at how fast even the L2 cache is. My guess is that because the access is sequential, the CPU is smart enough to load entries in and evict them from the cache in a good order.

Pseudo-code

   For each prime to sqrt(BLOCK_END)

      load initial offsets

      n = block_size / prime

      while (n--)
         for each bit
            set *offset[bit] |= bitmask[bit]
            offset[bit] += prime

      for each bit
         if (offset[bit] < BLOCK_END)
            set *offset[bit] |= bitmask[bit]
            offset[bit] += prime;

      store new offsets

Timings (ms):

Box            Time to 10^9   Time to 10^10
Haswell 4.4G   152            2519
Sandy laptop   212            3520

3. Store offsets for lower upper primes

This replicates the "simple_middle" method for primes > 2^16.

This is far from a good solution as it requires a lot of space. However, surprisingly, it does improve the speeds a little.

I have put an arbitrary limit on these primes of 400000. With 27318 primes between 2^16 and 400000, and with 36 bytes per prime, this is almost 1MB of memory: much too large for the L2 cache but small enough for the L3.

Again I am surprised that it offers so much benefit over calculating the offsets. Presumably the CPU can use the caches in a smart way because they are accessed sequentially. However, I'm pretty confident I can do better here once I get some more time.

Pseudo-code

   For each prime

      load initial offsets

      for each bit
         if (offset[bit] < BLOCK_END)
            set *offset[bit] |= bitmask[bit]
            offset[bit] += prime;

      store new offsets

Timings (ms):

Box            Time to 10^9   Time to 10^10
Haswell 4.4G   152            2295
Sandy laptop   212            3046

4. Unaligned loads for lower primes

This optimisation targets the lower primes (< 64). It focuses on the repeating cycles produced by smaller primes and works with ymm registers.

Preamble: every prime will set 8 bits in every PRIME bytes, and the next cycle is exactly the same pattern. So for the prime 7, you can store the complete cycle in just 7 bytes, then copy 7 bytes, then 7 more bytes, and so on.

However, the load/or/store operations normally work on 1, 2, 4, 8, 16 or 32 bytes. So instead you can take, say, 8x7=56 bytes; these 8 (7-byte) cycles can then be written out repeatedly as 7 (8-byte) ints.

With 32-byte registers you instead represent 32 (7-byte) cycles as 7 (32-byte) vectors.

There are 16 ymm registers, so this works very well for the primes 7, 11 and 13.

Pseudo-code for prime 7 (Note prime 11 and 13 do an *out++ |= i0 instead)

  v32ui i0 = cycle[0];
  v32ui i1 = cycle[1];
  v32ui i2 = cycle[2];
  v32ui i3 = cycle[3];
  v32ui i4 = cycle[4];
  v32ui i5 = cycle[5];
  v32ui i6 = cycle[6];

  out = output_buffer;
  n = block_size / (32*7)  (but rounded up - note it is ok to overwrite the buffer)
  while (n--)
     *out++ = i0;
     *out++ = i1;
     *out++ = i2;
     *out++ = i3;
     *out++ = i4;
     *out++ = i5;
     *out++ = i6;

For prime 17 onwards there are not enough registers for this approach.

Returning to the idea of a cycle for prime 17 completing in 17 bytes: there are 17 unique byte start positions, so if we repeat the pattern for 32+17 bytes, the correct pattern can be loaded into a 32-byte register using an offset of 0-16. Originally I had an initial buffer and would use "shuffle" to rotate the buffer. Unfortunately there is no native shuffle across all 32 bytes in AVX2. Instead it was advised to use unaligned loads to perform the rotate, and that is essentially what I have done.

Because we only do a single load at a time we are only really using one register, so instead I load the next part for a bunch of primes at a time and 'or' the results together before writing the output vector.

Pseudo-code for primes between 17 and 32

   for each prime
      Copy cycles to a stack-based array (seemed to help)
      Get initial load offset

   n = block_size/32

   while (n--)
      for each prime
         load a ymm register at the load offset
         out |= ymm
         recalc the load offset

      out++

For primes between 32 and 64 the same thing can be done but they need at least 2x32 byte blocks to cover a single cycle.

For primes between 64 and 96 the same thing can be done but they need at least 3x32 byte blocks to cover a single cycle.

Pseudo-code for primes between 32 and 64

   for each prime
      Copy cycles to a stack-based array (seemed to help)
      Get initial load offset

   n = block_size/64

   while (n--)
      for each prime
         load a ymm register at the load offset
         out |= ymm
         load a ymm register at the load offset + 32
         (out + 1) |= ymm
         recalc the load offset for each prime

      out += 2;

Timings (ms):

Box            Time to 10^9   Time to 10^10
Haswell 4.4G   115            1919
Sandy laptop   176            2715

5. Read stored offset but don't write for middle and lower_upper primes

In the simple middle method there are ~128K bytes used to store offset information.

The thing is that this needs to be both read and written every block (or so my theory went).

As with the unaligned loads for lower primes, the same pattern will repeat every PRIME bytes. So instead of updating each of the eight offsets every block, I just wanted to store the relative offsets once and then increment a single base offset.

Initially this method didn't prove to be any faster. However, after a bit of fiddling it jumped ahead. The fiddle was simply to allow the primes to overwrite past the end of the buffer (by up to 32K!) instead of checking whether each write was valid. This wasn't possible in the simple middle method because the offset carried to the next block had to be exact.

Another optimisation was to directly handle all primes that divide into the block size a fixed number of times (10x, for example) in a separate function with an unrolled loop. (For example, there are ~1600 primes that divide once into 32K.)

One thing I'm excited about is that because this data is read-only it can be shared between threads when I go multi-threaded.

Pseudo-code

   For each prime

      char *offs[8];

      load relative offsets
      load adjustment offset

      for each bit
         set offs[bit] using offsets[bit] and the adjustment offset
         if (offs[bit] < BLOCK_START)
            offs[bit] += prime;

      n = block_size/prime (rounded up)

      while (n--)
         for each bit
            set *offs[bit] |= bitmask[bit]
            offs[bit] += prime;

     calc next adjustment offset
     store next adjustment offset;

I also extended this to primes < 400000. It showed no speed improvement but it does mean that if I do go threaded the offsets can be shared.

Pseudo-code for primes > 2^16 and < 400000

   For each prime

      load relative offsets
      load adjustment offset

      for each bit
         off = offsets[bit] adjusting for adjustment offset
         if off is in this block
            set *(block + off) |= bitmask[bit]

     calc next adjustment offset
     store next adjustment offset;

Timings (ms):

Box            Time to 10^9   Time to 10^10
Haswell 4.4G   89             1557
Sandy laptop   150            2397

About the Author

My name is Tristan Verniquet. I have a beautiful Korean wife and a lovely 2yo daughter. I consider myself a strong Christian and try to put God first in everything I do. As such I dedicate this program to God, and to my wife and daughter who have allowed me just a little more time from what little I have to play just a little bit longer.

"Do you see someone skilled in their work?
        They will serve before kings;" Proverbs 22:29