Montgomery REDC using the positive inverse (mod R)

This is Part 1 in a planned three part series on Montgomery arithmetic. A few days ago, Part 0 (the prequel?) showed how best to calculate the multiplicative inverse. If you would like an excellent implementation of Montgomery arithmetic and modular arithmetic in general for up to 128 bit integer types, please see Clockwork, a high performance and easy to use C++ header-only library. The following post has significant performance benefit for native (CPU register size) integer types and lesser impact for multiprecision integers.

Traditionally the Montgomery REDC algorithm uses the negative inverse (mod R) of the modulus – see Peter Montgomery’s famous paper, “Modular multiplication without trial division”, which has been cited over 3000 times. However, there exists an alternative form of this algorithm that uses the positive inverse, with better performance in practice and simpler code, and no apparent disadvantages. The obvious (once you see it) improvement makes it remarkable that this version is almost unknown in the literature.

I’ve had limited luck tracing its origins – I first encountered it in Ernst W Mayer’s paper “Efficient long division via Montgomery multiply”, where he uses it in introductory material without remark. Via email, Ernst gave credit to Peter Montgomery for teaching him Montgomery multiplication using this approach in the 1990s. After I scanned through all of Montgomery’s publications, I found that Montgomery used this approach in one paper he co-authored, “Montgomery Multiplication Using Vector Instructions”. The method is treated as an incidental minor detail without citation; Mayer’s paper predates this paper. Otherwise Montgomery used the original 1985 method in all his papers and in his 1992 dissertation. Aside from one web site’s description of Montgomery multiplication (it’s a translation of a Russian source that I haven’t tracked down) and the aforementioned papers by Mayer and by Montgomery, every book, paper, and website I have scanned uses the traditional 1985 approach. This includes the books Handbook of Applied Cryptography, Prime Numbers: A Computational Perspective, Modern Computer Arithmetic, Wikipedia, and 20+ papers on Montgomery multiplication/reduction (a partial list: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24)

Background

It’s perhaps plausible that Peter Montgomery discovered the 1985 algorithm after asking himself the question, “could I add some multiple of N to an integer T, to make the sum divisible by R (assuming R is a power of 2 and N is odd)?” The answer is yes, and his 1985 paper shows how to calculate that multiple m and use it to in turn calculate TR^-1 (mod N), which is the basis for Montgomery modular multiplication. Wikipedia gives a fairly easy and thorough description of the algorithm and proof.

[Note: the choice of T for an integer variable name is unfortunate since it conflicts with the idiomatic use in C++ of T for a template parameter. We simply follow the historical conventions of Montgomery REDC presentations for all names here.]

The Alternative Version

We’ll get a slightly different algorithm if we ask a slightly different question – “could we subtract some multiple of N from an integer T, to make the difference divisible by R?” We can show that the answer is yes. Some integer ‘m’ always exists such that T – mN ≡ 0 (mod R). And we can provide a way to calculate m, and TR^-1 (mod N).

Given that R is a power of 2 and N is odd, we know the greatest common divisor of R and N is 1, and thus by Bézout’s identity there exist integers x and y such that xN + yR = 1. Taking this mod R, we have xN ≡ 1 (mod R), and so x is the modular multiplicative inverse of N (mod R), which we denote as x ≡ N^-1 (mod R). Multiplying both sides by T, we get xTN ≡ T (mod R), which after rearranging is T – xTN ≡ 0 (mod R). Therefore our desired integer ‘m’ equals xT. More precisely,
m ≡ N^-1T (mod R), and T – mN ≡ 0 (mod R).

A different way to get the same result is to follow a line of reasoning similar to that used in the Wikipedia REDC article. Let m = ((T mod R)N^-1) mod R. Then m ≡ N^-1T (mod R) and
T – mN ≡ T – (TN^-1)N ≡ T – T ≡ 0 (mod R).

Continuing to follow reasoning similar to the Wikipedia article, we can see that the above result proves that T – mN is divisible by R, and therefore t = (T – mN)/R must be an integer. This means that
t ≡ (T – mN)R^-1 ≡ TR^-1 – (mR^-1)N ≡ TR^-1 (mod N).

So we have a way to calculate both m and TR^-1 (mod N).

Finally, let’s deduce bounds for t, assuming we have bounds on the inputs of 0 < N < R, and 0 <= T < RN. Since m is calculated mod R, we know 0 <= m < R, and so -R < -m <= 0 and -RN < -mN <= 0. Thus, -RN <= T – RN < T – mN <= T < RN. This gives us -RN < T – mN < RN, and since all parts are divisible by R, we have -N < (T – mN)/R < N. And thus we have the bounds for t:
-N < t < N.

We can now write the alternative REDC algorithm:

function REDC2 is
    input: Integers R and N with gcd(R, N) = 1, and N in [0, R − 1],
           Integer N⁻¹ in [0, R − 1] such that NN⁻¹ ≡ 1 (mod R),
           Integer T in the range [0, RN − 1]
    output: Integer S in the range [0, N − 1] such that S ≡ TR⁻¹ (mod N)

    m ⇐ ((T mod R)N⁻¹) mod R
    t ⇐ (T − mN) / R
    if t < 0 then
        return t + N
    else
        return t
    end if
end function

Why It Matters

On the surface, the alternative REDC algorithm and the traditional 1985 version look very similar. However, they differ in important ways when implemented, with the result that the alternative version is simpler, more efficient, and easier to implement well in standard C. Primarily the alternative version’s implementation advantage arises from its use of (T − mN) verses the traditional version’s use of (T + mN). Calculating the low and high words of the subtraction is easier and simpler than the addition: for the alternative version, T − mN ≡ 0 (mod R), and for the traditional version, T + mN ≡ 0 (mod R). So for the former implementation, the low half bits of the subtraction will always equal 0 and will never generate a borrow/carry. For the latter implementation, the low half bits of the addition will always equal 0, and will always generate a carry unless both T and mN equal 0. This exception for T and mN equaling 0 means that the carry must always be calculated for the traditional version (usually by performing the low part addition), and then an add-with-carry instruction (or its equivalent) must be used to sum the high part of the addition. In contrast, the alternative algorithm’s subtraction of the low part of T − mN can be completely ignored and omitted from its implementation, because we know it never has any effect on the rest of the subtraction – the result of the low part subtraction is always zero and it never generates a borrow/carry. A further difference between the two algorithm versions’ implementations is that the traditional version usually has two conditionals that must be checked: possible overflow on t ⇐ T + mN, and t ≥ N; in contrast, the alternative version has only one conditional check: t < 0.

It may be easiest to see the differences by comparing assembly code. We’ll look at two x86_64 inline assembly C functions, implementing the traditional REDC and the alternative REDC for a 64 bit unsigned integer type.

x86_64 Assembly Implementations Comparison

We’ll first present an implementation of the traditional REDC (using x86_64 assembly):

// On Intel Skylake: ~11 cycles latency, 11 fused uops.
inline uint64_t REDC_traditional(uint64_t T_hi,
                                 uint64_t T_lo,
                                 uint64_t N,
                                 uint64_t negInvN)
{
   assert(N % 2 == 1); // REDC requires odd modulus
   assert(T_hi < N);   // REDC requires T < NR, and this enforces it.
   uint64_t rrax = T_lo;
   uint64_t rrdx, tmp;
   __asm__ (
      "movq %%rax, %[tmp] \n\t"
      "imulq %[inv], %%rax \n\t"    /* m = T_lo * negInvN */
      "mulq %[N] \n\t"              /* mN = m * N */
      "addq %[tmp], %%rax \n\t"     /* rax = T_lo+mN_lo. Sets carry, rax==0 */
      "adcq %[Thi], %%rdx \n\t"     /* t_hi = addcarry(T_hi, mN_hi) */
      "cmovaeq %[N], %%rax \n\t"    /* rax = (t_hi >= T_hi) ? N : 0 */
      "xorl %k[tmp], %k[tmp] \n\t"  /* tmp = 0 */
      "subq %[N], %%rdx \n\t"       /* rdx = t_hi - N */
      "cmovaeq %[tmp], %%rax \n\t"  /* rax = (t_hi >= N) ? 0 : rax */
      : "+&a"(rrax), "=&d"(rrdx), [tmp]"=&r"(tmp)
      : [Thi]"r"(T_hi), [N]"r"(N), [inv]"r"(negInvN)
      : "cc");
   uint64_t result = rrax + rrdx;   // let compiler choose between add/lea
   assert(result < N);
   return result;
}

Traditional REDC

The code above is decent but not completely ideal. We can see that it incorporates an add-carry and two conditional moves, as we discussed. We will see below that the alternative REDC does not need an add-carry (or a subtract-borrow), and it also needs only one conditional move instead of two.

Another argument against this code (as given) is simply that it is inline asm. Our absolute ideal would be code written in only standard C (using no inline asm at all) that compiles to assembly that is nearly as optimal as our inline asm functions. This is easy to wish for, though not always easy to achieve. If we write a standard C version of the traditional REDC, we can see from our inline asm’s use of conditional move and add-carry instructions that we may have trouble coaxing the C compiler into generating optimal assembly; those instructions don’t exist in C. We could use C compiler idioms to help the compiler, such as the ternary operator (which typically translates into a conditional move), but we still will find none of the major compilers (gcc, clang, MSVC, icc) produce optimal assembly from a standard C version of the traditional REDC. This isn’t necessarily a major problem, since we can of course use the given inline asm, but we should note that some compilers don’t support inline asm (MSVC 64bit) and also that inline asm code tends to be unusually bug prone. I can give you comfortable assurances that I’m skilled at writing high quality inline asm, but you would be wise to be skeptical of this and of any inline asm you see – including our inline asm functions here. Unit tests of inline asm are far less helpful than you might think, potentially producing false negatives caused by the particular register choices your compiler happens to make for code surrounding your inline asm under test. There are reasonable arguments to be made for banning inline asm from your projects.

We’ll briefly note that we could rewrite the inline asm above to improve the traditional REDC. However, in nearly all respects the alternative REDC will provide an even greater improvement, so this is mentioned for curiosity sake, or for the case that you happen to want a drop-in replacement for an existing traditional REDC function.

Below we have the alternative REDC algorithm inline asm C function, which comes close to our ideal. Note that it uses the positive inverse as a parameter, whereas the traditional algorithm uses the negative inverse as a parameter. The function consists of a relatively straightforward translation of the alternate algorithm into inline asm, and thus it’s fairly easy to understand. It can be implemented with standard C code (given below) almost as effectively as with inline asm; clang produces close to optimal assembly from the standard C version of it (other major compilers produce assembly that is acceptable but not as good). For x86_64, it provides us with the best performance we’ve seen, with both lowest latency and fewest used instructions/uops/registers. I would expect the alternate REDC to provide similar benefits for ARM and other architectures.

// On Intel Skylake: 9 cycles latency, 7 fused uops.
inline uint64_t REDC_alternative(uint64_t T_hi,
                                 uint64_t T_lo,
                                 uint64_t N,
                                 uint64_t invN)
{
   assert(N % 2 == 1); // REDC requires odd modulus
   assert(T_hi < N);   // REDC requires T < NR, and this enforces it.
   uint64_t rrax = T_lo;
   uint64_t Thi = T_hi;
   __asm__ (
      "imulq %[inv], %%rax \n\t"        /* m = T_lo * invN */
      "mulq %[N] \n\t"                  /* mN = m * N */
      "leaq (%[Thi], %[N]), %%rax \n\t" /* rax = T_hi + N */
      "subq %%rdx, %%rax \n\t"          /* rax = rax - mN_hi */
      "subq %%rdx, %[Thi] \n\t"         /* t_hi = T_hi - mN_hi */
      "cmovbq %%rax, %[Thi] \n\t"       /* t_hi = (T_hi<mN_hi) ? rax : t_hi */
      : [Thi]"+&bcSD"(Thi), "+&a"(rrax)
      : [N]"r"(N), [inv]"r"(invN)
      : "rdx", "cc");
   uint64_t result = Thi;
   assert(result < N);
   return result;
}

Alternative REDC (Positive Inverse), Inline-Asm

For convenience, here is the alternative REDC with no inline-asm and just standard C (with the exception of the __uint128_t compiler extension):

inline uint64_t REDC_alternative_no_asm(uint64_t T_hi,
                                        uint64_t T_lo,
                                        uint64_t N,
                                        uint64_t invN)
{
   assert(N % 2 == 1); // REDC requires odd modulus
   assert(T_hi < N);   // REDC requires T < NR, and this enforces it.
   uint64_t m = T_lo * invN;
   __uint128_t mN = (__uint128_t)m * N;
   uint64_t mN_hi = (uint64_t)(mN >> 64);
   uint64_t tmp = T_hi + N;
   tmp = tmp - mN_hi;
   uint64_t result = T_hi - mN_hi;
   result = (T_hi < mN_hi) ? tmp : result;
   assert(result < N);
   return result;
}

Alternative REDC (Positive Inverse), No Inline-Asm

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.