The Montgomery Fused Multiply-Add

This is Part 3 in a planned four-part series on Montgomery arithmetic and an application in factoring, starting with Part 1. If you would like an excellent implementation of Montgomery arithmetic (and modular arithmetic in general) for up to 128-bit integer types, please see Clockwork, a high-performance and easy-to-use C++ header-only library. The following post has the greatest benefit for native (CPU register size) integer types and has lesser impact for multiprecision integers.

Real-world use of Montgomery multiplication can involve long dependency chains – sometimes every instruction in a loop depends upon the result of its preceding instruction. If we can move an operation off of this chain so that it is no longer a dependency of the next instruction, we typically improve performance via decreased latency and increased throughput. The processor becomes able to execute that operation at the same time as instructions that are still directly on the dependency chain, through use of pipelining and/or superscalar execution units. One way we can realize this benefit in Montgomery domain arithmetic is by writing fused-multiply-add and fused-multiply-subtract functions. We will fuse an addition (or a subtraction) into the Montgomery multiplication algorithm, so that the add is no longer an immediate part of the dependency chain.

An Example of the Dependency Chain Problem

As an example, let’s look at Pollard-Rho sequence generation in the Montgomery domain modulo N. We have a Montgomery-domain constant c, which we restrict to 0 < c < N, and a Montgomery-domain variable x which we will use for our sequence. We define the sequence as
x_{i+1} = x_i^2 + c

Implicitly, this sequence will be calculated modulo N since we are working in the Montgomery domain modulo N. To advance this sequence by j steps we might write a function similar to the following:

uint64_t advance(uint64_t x, uint64_t c, uint64_t N, uint64_t inverseN, int j)
{
   for (int i=0; i<j; ++i) {
      __uint128_t z = static_cast<__uint128_t>(x)*x;
      x = montgomery_REDC(z, N, inverseN);
      x = (x < N-c) ? x+c : x+c-N;
   }
   return x;
}

Figure 1, Example: Advancing a Pollard-Rho Sequence

Obviously the line for the modular addition (via the ternary operator) cannot begin until the Montgomery REDC completes, and in the next loop iteration neither the standard multiply nor the Montgomery REDC can begin until the preceding iteration’s modular addition completes. It would be nice if we could move the modular addition off of the dependency chain!
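For reference, Figure 1 calls a montgomery_REDC helper that is not shown above. A minimal sketch of one possible 64-bit implementation follows, assuming R = 2^64 and that inverseN holds the multiplicative inverse of N modulo 2^64 – the same convention as the explicitly inlined REDC that appears in Figure 4 later in this post:

//Minimal sketch of a 64-bit montgomery_REDC, assuming R = 2^64 and that
//inverseN satisfies N*inverseN == 1 (mod 2^64).
static inline uint64_t montgomery_REDC(__uint128_t z, uint64_t N, uint64_t inverseN)
{
    uint64_t m = (uint64_t)z * inverseN;     //m = (z % R) * N^-1 % R
    __uint128_t mN = (__uint128_t)m * N;
    uint64_t z_hi  = (uint64_t)(z >> 64);    //high half of z
    uint64_t mN_hi = (uint64_t)(mN >> 64);   //high half of m*N
    //(z - m*N) is an exact multiple of R; its quotient is z_hi - mN_hi,
    //adjusted into [0, N) by a conditional addition of N.
    return (z_hi < mN_hi) ? z_hi - mN_hi + N : z_hi - mN_hi;
}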

Unoptimized Montgomery Multiplication with Modular Addition

Expressed in the form of a basic unoptimized algorithm (that still has the dependency chain issue), Montgomery multiplication followed by a modular addition is

function Montgomery_multiply_then_add is
    input: Integers R and N with gcd(R, N) = 1, and N in [0, R - 1],
           Integer N^-1 in [0, R - 1] such that NN^-1 ≡ 1 (mod R),
           Integers x and y, such that 0 <= xy < NR,
           Integer c in the range [0, N - 1]
    output: Integer S in the range [0, N - 1] such that S ≡ xyR^-1 + c (mod N)

    z ⇐ xy
    t ⇐ REDC(z, N, N^-1, R)
    if t < N - c then
        return t + c
    else
        return t + c - N
    end if
end function

Figure 2, Basic Montgomery Multiply Followed by Modular Add

Note that this algorithm calls a Montgomery REDC function. REDC requires that its input z satisfies 0 <= z < NR, and guarantees that its return value t satisfies t ≡ zR^-1 (mod N). Since this algorithm specifies as an input precondition that 0 <= xy < NR, we know 0 <= z < NR, which satisfies REDC’s precondition.

When used in a loop, this unoptimized algorithm has the problem we saw in Figure 1 with the modular addition being part of one long dependency chain.
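For concreteness, Figure 2 might look something like the following as a 64-bit C function – a sketch only, assuming R = 2^64 and the montgomery_REDC helper sketched earlier:

//Sketch of Figure 2 for 64-bit operands, assuming R = 2^64.  The modular
//addition of c still sits directly on the dependency chain after REDC.
uint64_t monty_multiply_then_add(uint64_t x, uint64_t y, uint64_t c,
                                 uint64_t N, uint64_t inverseN)
{
    __uint128_t z = (__uint128_t)x * y;            //requires 0 <= x*y < N*R
    uint64_t t = montgomery_REDC(z, N, inverseN);  //t ≡ x*y*R^-1 (mod N), 0 <= t < N
    return (t < N - c) ? t + c : t + c - N;        //modular add of c
}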

Finding A Better Way

We can ease the dependency chain issue. Figure 2 calculates REDC(xy) and then performs a modular addition with c, and these operations must execute serially, very often without any benefit of pipelining or superscalar execution. We’ll show how to instead perform the modular addition prior to the REDC, with a final result congruent to t + c (mod N), which is congruent to the result of Figure 2. The advantage of this alternate method is that the modular addition will usually execute in parallel with the first two multiplies inside REDC(), because REDC’s internal multiplies will not depend on the result of the modular addition. We’ll call this a “Montgomery fused multiply-add” because the modular addition will be located in program order in the middle of the Montgomery multiply, and because we’ll normally expect the modular add to execute at the same time as the Montgomery multiply. We’ll analyze the benefits of this method in the next section, but first let’s present the basic explanation/proof.

Since we will modify the Figure 2 algorithm, the variables z, c, t, N, N^-1, R, and R^-1 below refer to Figure 2.

As an identity,
z = (z - z%R) + z%R
Since (z - z%R) is divisible by R, let
u = (z - z%R)/R.
Let v = z%R.
z = uR + v

Let w = (u + c)%N. We can easily compute this value:
We already showed that 0 <= z < NR, and since 0 <= v < R,
uR <= uR + v == z < NR, and thus
u < N. Furthermore,
0 <= z == uR + v < uR + R == (u + 1)R, and so
0 < u + 1. Thus,
0 <= u < N.
Figure 2 specified that 0 <= c < N.
Thus we can compute the modular addition w = (u + c)%N  in C or C++ by using
w = (u < N - c) ? u + c : u + c - N.

REDC(z, N, N^-1, R) guarantees that its return value t satisfies t ≡ zR^-1 (mod N). And therefore,
t ≡ (uR + v)R^-1 (mod N)
t + c ≡ (uR + v)R^-1 + c (mod N)
t + c ≡ (uR + v)R^-1 + cRR^-1 (mod N)     [since RR^-1 ≡ 1 (mod N)]
t + c ≡ ((u + c)R + v)R^-1 (mod N)
t + c ≡ (((u + c)%N)R + v)R^-1 (mod N)     [since (u + c) ≡ (u + c)%N (mod N)]
t + c ≡ (wR + v)R^-1 (mod N)

Since w = (u + c)%N, we trivially know that 0 <= w <= N - 1, and therefore
0 <= wR <= NR - R
Since v = z%R, we trivially know that 0 <= v < R, and therefore
0 <= wR <= wR + v <= NR - R + v < NR - R + R. Thus,
0 <= wR + v < NR

We can see that wR + v is a valid input for REDC, because it satisfies REDC’s requirement on its input value that 0 <= value < NR. Therefore we can call REDC(wR + v, N, N^-1, R), and the REDC algorithm’s postcondition guarantees it will return a value p such that
p ≡ (wR + v)R^-1 (mod N). And thus,
p ≡ t + c (mod N).

Therefore p is congruent to the return value of Figure 2’s algorithm, as desired.
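As a small sanity check, here is a toy worked example (the numbers are chosen purely for illustration). Let N = 7, R = 16, and N^-1 = 7 (since 7*7 = 49 ≡ 1 (mod 16)), with x = 3, y = 5, and c = 2. Then z = xy = 15, so u = 0 and v = 15. Figure 2 computes t = REDC(15) ≡ 15*R^-1 ≡ 15*4 ≡ 4 (mod 7) and returns t + c = 6. The alternate method computes w = (u + c)%N = 2 and then REDC(wR + v) = REDC(47) ≡ 47*4 ≡ 6 (mod 7), returning 6 as well – the same result.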

Side note: calculating u and v is typically a zero-cost operation, since z is usually already represented on the computer via two registers, one containing the high-half bits of z and another containing the low-half bits of z. This occurs because in practice we choose R to be equal to 2 to the power of the CPU’s integer register bit width (e.g. R = 2^64). Thus the value u is usually the same as the high bits of z, and normally a no-op to calculate. Likewise, calculating v is usually free, since it is the same as the low bits of z. In any event, even if these operations are not free, in practice Figure 2 would have needed the same calculations during its call of REDC, and so it would be unusual for there to be any extra cost in calculating them here.

Montgomery Fused Multiply-Add

Let’s make the modifications to Figure 2 we just discussed, so that we have an algorithm with a modular addition before the REDC instead of after the REDC.

function Montgomery_fused_multiply_add is
    input: Integers R and N with gcd(R, N) = 1, and N in [0, R - 1],
           Integer N^-1 in [0, R - 1] such that NN^-1 ≡ 1 (mod R),
           Integers x and y, such that 0 <= xy < NR,
           Integer c in the range [0, N - 1]
    output: Integer S in the range [0, N - 1] such that S ≡ xyR^-1 + c (mod N)

    z ⇐ xy
    u ⇐ (z - z%R)/R
    v ⇐ z%R
    if u < N - c then
        w ⇐ u + c
    else
        w ⇐ u + c - N
    end if
    s ⇐ wR + v
    p ⇐ REDC(s, N, N^-1, R)
    return p
end function

Figure 3, Montgomery Fused Multiply-Add Algorithm

The modular addition in Figure 3 to calculate w occurs before REDC, and as we will see, the modular addition is no longer an immediate part of the dependency chain. As we discussed in the side note at the end of the previous section, u and v generally have no cost to calculate. For similar reasons s is usually free to calculate too.

Let’s use a real-life implementation as an example. We’ll write a C function to carry out the fused Montgomery multiply-add, and for the sake of demonstration, we will explicitly inline its internal REDC call.

uint64_t monty_fma(uint64_t x, uint64_t y, uint64_t c,
                   uint64_t N, uint64_t inverseN)
{
    __uint128_t z = (__uint128_t)x * y;
    assert(z < ((__uint128_t)N << 64));  //asserts z < NR
    uint64_t u = (uint64_t)(z >> 64);
    uint64_t v = (uint64_t)z;
    uint64_t w = (u < N-c) ? u+c : u+c-N;  //modular add of c, prior to REDC

    //__uint128_t s = ((__uint128_t)w << 64) + v;
    //return REDC(s, N, inverseN);

    //the remaining lines are the explicitly inlined REDC(s, N, inverseN):
    uint64_t m = v * inverseN;
    __uint128_t mN = (__uint128_t)m * N;
    uint64_t mN_hi = (uint64_t)(mN >> 64);
    return (w < mN_hi) ? w - mN_hi + N : w - mN_hi;
}

Figure 4, Example 64-bit C Function for Montgomery Fused Multiply-Add

If we had used the basic algorithm from Figure 2, an extra modular addition would have occurred at the end of the function (note: the conditional addition of N in the final return statement is not extra – it is an internal modular addition belonging to REDC, visible only because we explicitly inlined REDC). However, since we used the algorithm from Figure 3, we have no extra modular addition at the end of the function. Instead, the extra modular addition is now the line that computes w, prior to the REDC. We can see that the multiplications that compute m and mN do not depend on w (the result of that modular addition), and so they can execute in parallel with it. This modular addition no longer contributes to the latency of the dependency chain during loops that compute Montgomery-multiplications-with-adds (such as Figure 1).
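Tying this back to Figure 1, the Pollard-Rho advance function can be rewritten in terms of the fused multiply-add. The sketch below simply calls the monty_fma function of Figure 4 with y = x; the name advance_fused is purely illustrative:

//Sketch: Figure 1's loop, rewritten to use the fused multiply-add of Figure 4.
//The modular addition of c now happens inside monty_fma, before its REDC, so
//it is no longer an immediate link in the loop's dependency chain.
uint64_t advance_fused(uint64_t x, uint64_t c, uint64_t N, uint64_t inverseN, int j)
{
   for (int i=0; i<j; ++i)
      x = monty_fma(x, x, c, N, inverseN);   //x = x*x*R^-1 + c (mod N)
   return x;
}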

The new location in Figure 3 for its modular addition practically always removes the modular addition as an immediate part of the dependency chain. It was easy to see this happened in the example of Figure 4, but in general the reason this happens is somewhat subtle. Inside REDC, the first operation of REDC (to calculate m) always uses REDC’s input value modulo R. Since Figure 3’s input to REDC is wR + v, and wR + v ≡ v (mod R), the first operation in REDC does not actually need to have the value of wR or w; it only needs to have the value of v in order to begin executing. Therefore the first operation in REDC in principle is able to execute in parallel with the modular addition that calculates w. In practice, we usually write and call a REDC function that uses two parameters for the input value (the first parameter would have the input value’s low-half bits, and the second parameter would have the high-half bits), making it nearly certain that the modular add and the first operation within REDC will be able to execute in parallel. Although inlining REDC is necessary to achieve the parallel execution, we can guarantee inlining REDC via compiler extensions such as the always_inline attribute (GCC/Clang) or the __forceinline keyword (MSVC). There is rarely any need to manually inline REDC as we did in Figure 4.
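To illustrate that point, here is a hedged sketch of such a two-parameter REDC together with a monty_fma that calls it rather than manually inlining it. The function names and the use of the GCC/Clang always_inline attribute are illustrative assumptions, not taken from anything above:

//Sketch of a REDC that takes its input value as two parameters, low half first
//and high half second.  Its two multiplies need only the low half.
__attribute__((always_inline))
static inline uint64_t REDC_parts(uint64_t input_lo, uint64_t input_hi,
                                  uint64_t N, uint64_t inverseN)
{
    uint64_t m = input_lo * inverseN;                        //needs only the low half
    uint64_t mN_hi = (uint64_t)(((__uint128_t)m * N) >> 64);
    return (input_hi < mN_hi) ? input_hi - mN_hi + N : input_hi - mN_hi;
}

//Sketch of monty_fma written in terms of REDC_parts: the modular addition that
//computes w can overlap with REDC_parts' multiplies, which depend only on v.
uint64_t monty_fma2(uint64_t x, uint64_t y, uint64_t c,
                    uint64_t N, uint64_t inverseN)
{
    __uint128_t z = (__uint128_t)x * y;
    uint64_t u = (uint64_t)(z >> 64);
    uint64_t v = (uint64_t)z;
    uint64_t w = (u < N-c) ? u+c : u+c-N;
    return REDC_parts(v, w, N, inverseN);
}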

Conclusion

We created a Montgomery Fused Multiply-Add algorithm in Figure 3, and demonstrated an example C function implementation of it in Figure 4. The fused multiply-add has the advantage that the modular addition can generally execute in parallel with the Montgomery multiply, thus reducing latency when there are long dependency chains and increasing throughput. We could use the same approach to create a Montgomery Fused Multiply-Subtract algorithm and implementation.
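For completeness, a fused multiply-subtract follows the same reasoning, except that it performs a modular subtraction w = (u - c)%N before the REDC. The following is a sketch only (the function is illustrative and, like Figure 4, manually inlines its REDC):

//Sketch of a Montgomery fused multiply-subtract: result ≡ x*y*R^-1 - c (mod N).
//The modular subtraction of c happens before REDC, mirroring Figure 4.
uint64_t monty_fms(uint64_t x, uint64_t y, uint64_t c,
                   uint64_t N, uint64_t inverseN)
{
    __uint128_t z = (__uint128_t)x * y;        //requires 0 <= x*y < N*R
    uint64_t u = (uint64_t)(z >> 64);
    uint64_t v = (uint64_t)z;
    uint64_t w = (u >= c) ? u - c : u - c + N; //w = (u - c) mod N
    uint64_t m = v * inverseN;                 //explicitly inlined REDC, as in Figure 4
    __uint128_t mN = (__uint128_t)m * N;
    uint64_t mN_hi = (uint64_t)(mN >> 64);
    return (w < mN_hi) ? w - mN_hi + N : w - mN_hi;
}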
