I need to calculate the speed difference between performing a Montgomery Multiplication page 602-603 with a word-size/register of size 32 vs. 64.
So far, this is what I understand:
- x and y are represented by multiple-word arrays of length n where n = m/w and w is the register size (either 32 or 64).
- The total number of single-digit multiplications in Montgomery multiplication is n*(2 + 2*n), where n represents the number length of the word-arrays.
- I will assume that the multiplication of two single-digit takes 1 clock cycle on each of the computers.
How can I put all this together to represent the number of clock cycles needed in Montgomery multiplication on a computer with a 32-bit register or 64-bit register?
The number of cycles for a multiple-precision Montgomery multiplication would indeed be
n(2+2*n)if all the intermediate single-precision multiplication operands and results were available in registers. For cryptographic operations this is hardly possible sincemis usually 1024 or larger. Assuming 32-bit registers (xyR^-1 mod m) would require 192 registers only to store the operands (3*(1024/32)). In fact you need to take into account memory accesses to answer this question.A rewrite of the algorithm with memory accesses (assuming multiplications can be done in parallel with loads/stores):
ifrom 0 to n:a_i <- 0ifrom 0 to (n − 1) do the following:a_0y_0x_iu_i <- (a_0 + x_i*y_0)m' mod b. Storeu_iin a registerc = 0(ComputingA <- (A + x_i*y + u_i*m)/b)jfrom 0 to (n-1):a_jy_j(cv) = a_j + x_i*y_j + c, Fetchm_j(cv) = (cv) + u_i*m_j, if j>0 Storea_{j-1} <- va_n <- canda_{n-1} <- vA >= mthenA <- A − m.Hope this helps.