After reviewing the Intel Digital Random Number Generator (DRNG) Software Implementation Guide, I have a few questions about what happens to the internal state of the generator when RDRAND
is invoked. Unfortunately the answers don't seem to be in the guide.
According to the guide, inside the DRNG there are four 128-bit buffers that serve random bits for
RDRAND
to drain.RDRAND
itself will provide either 16, 32, or 64 bits of random data depending on the width of the destination register:rdrand ax ; put 16 random bits in ax rdrand eax ; put 32 random bits in eax rdrand rax ; put 64 random bits in rax
Will the use of larger destination registers empty those 128-bit buffers more quickly? For example, if I need only 2 bits of randomness, should I go through the trouble of using a 16 bit register over a 64 bit register? Will that make any difference on the throughput of the DRNG? I'd like to avoid consuming more randomness than is necessary.
The guide says the carry flag will be set after
RDRAND
executes:CF = 1 Destination register valid. Non-zero random value available at time of execution. Result placed in register. CF = 0 Destination register all zeros. Random value not available at time of execution. May be retried.
What does "not available" mean? Can random data be unavailable because
RDRAND
invocations exhausted those 128-bit buffers too quickly? Or does unavailable mean the DRNG is failing its health checks and cannot generate any new data? Basically, I'm trying to understand if CF=0 can occur just because the buffers happen to be (transiently) empty whenRDRAND
is invoked.
Note: I have reviewed the answers to this question on throughput and latency of RDRAND, but I'm seeking different information.
Thanks!
Part 1. Does it make a difference pulling 16, 32 or 64 bits?
No.
On Ivy Bridge, the CPU cores pull 64 bits over the internal communication links to the DRNG, regardless of the size of the destination register. So if you read 32 bits, it pulls 64 bits and throws away the top half. If you read 16 bits, it pulls 64 and throws away the top 3/4.
This is not described in the instruction documentation because it may not continue to be true in future products. A chip might be designed which stashes and uses the unused parts of the 64 bit word. However there isn't a significant performance imperative to do this today.
For the highest throughput, the most effective strategy is to pull from parallel threads. This is because there is parallelism in the bus hierarchy on chip. Most of the time for the instruction is transit time across the buses. Performing that transit in parallel is going to yield a linear increase in throughput with the number of threads, up to the maximum of 800MBytes/s. The second thing is to use 64-bit RdRands, because they get more data per instruction.
Part 2. What does CF=0 mean really?
It means 'random data not available'. This is because the details of why it can't get a number are not available to the CPU core without it going off and reading more registers, which it isn't going to do because there is nothing it can do with the information.
If you sucked the output buffer of the DRNG dry, you would get an underflow (CF=0) but you could expect the next RdRand to succeed, because the DRNG is fast.
If the DRNG failed (e.g. a transistor popped in the entropy source and it no longer was random) then the online health tests would detect this and shut down the DRNG. Then all your RdRand invocations would yield CF=0.
However on Ivy Bridge, you will not be able to underflow the buffer. The DRNG is a little faster than the bus to which it is attached. The effect of pulling more data per unit time (with parallel threads) will be to increase the execution time of each individual RdRand as contention on the bus causes the instructions to have to wait in line at the DRNG's local bus. You can never pull so fast the the DRNG will underflow. You will asymptotically reach 800 MBytes/s.
This also is not described in the documentation because it may not continue to be true in future products. We can envisage products where the buses are faster and the cores faster and the DRNG would be able to be underflowed. These things are not known yet, so we can't make claims about them.
What will remain true is that the basic loop (try up to 10 times, then report a failure up the stack) given in the software implementors guide will continue to work in future products, because we've made the claim that it will and so we will engineer all future products to meet this.
So no, CF=0 cannot occur because "the buffers happen to be (transiently) empty when RDRAND is invoked" on Ivy Bridge, but it might occur on future silicon, so design your software to cope.