The Montgomery Fused Multiply-Add

This is Part 3 in a planned four-part series on Montgomery arithmetic and an application in factoring, starting with Part 1. If you would like an excellent implementation of Montgomery arithmetic (and modular arithmetic in general) for up to 128-bit integer types, please see Clockwork, a high-performance and easy-to-use C++ header-only library. The following post has the greatest benefit for native (CPU register size) integer types and has lesser impact for multiprecision integers.

Real-world use of Montgomery multiplication can involve long dependency chains – sometimes every instruction in a loop depends upon the result of its preceding instruction. If we can move an operation off of this chain so that it is no longer a dependency of the next instruction, we typically improve performance via decreased latency and increased throughput. The processor becomes able to execute that operation at the same time as instructions that are still directly on the dependency chain, through use of pipelining and/or superscalar execution units. One way we can realize this benefit in Montgomery domain arithmetic is by writing fused-multiply-add and fused-multiply-subtract functions. We will fuse an addition (or a subtraction) into the Montgomery multiplication algorithm, so that the add is no longer an immediate part of the dependency chain.

An Example of the Dependency Chain Problem

As an example, let’s look at Pollard-Rho sequence generation in the Montgomery domain modulo N. We have a Montgomery-domain constant c, which we restrict to 0 < c < N, and a Montgomery-domain variable x which we will use for our sequence. We define the sequence as
x_{i+1} = x_i^2 + c

Implicitly, this sequence will be calculated modulo N since we are working in the Montgomery domain modulo N. To advance this sequence by j steps we might write a function similar to the following:

uint64_t advance(uint64_t x, uint64_t c, uint64_t N, uint64_t inverseN, int j)
{
   for (int i=0; i<j; ++i) {
      __uint128_t z = static_cast<__uint128_t>(x)*x;
      x = montgomery_REDC(z, N, inverseN);
      x = (x < N-c) ? x+c : x+c-N;
   }
   return x;
}

Figure 1, Example: Advancing a Pollard-Rho Sequence

Obviously the line for the modular addition (via the ternary operator) cannot begin until the Montgomery REDC completes, and in the next loop iteration neither the standard multiply nor the Montgomery REDC can begin until the preceding iteration’s modular addition completes. It would be nice if we could move the modular addition off of the dependency chain!
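For reference, Figure 1 calls a montgomery_REDC helper that is not shown above. A minimal sketch of one possible 64-bit implementation follows, assuming R = 2^64 and that inverseN holds the multiplicative inverse of N modulo 2^64 – the same convention as the explicitly inlined REDC that appears in Figure 4 later in this post:

//Minimal sketch of a 64-bit montgomery_REDC, assuming R = 2^64 and that
//inverseN satisfies N*inverseN == 1 (mod 2^64).
static inline uint64_t montgomery_REDC(__uint128_t z, uint64_t N, uint64_t inverseN)
{
    uint64_t m = (uint64_t)z * inverseN;     //m = (z % R) * N^-1 % R
    __uint128_t mN = (__uint128_t)m * N;
    uint64_t z_hi  = (uint64_t)(z >> 64);    //high half of z
    uint64_t mN_hi = (uint64_t)(mN >> 64);   //high half of m*N
    //(z - m*N) is an exact multiple of R; its quotient is z_hi - mN_hi,
    //adjusted into [0, N) by a conditional addition of N.
    return (z_hi < mN_hi) ? z_hi - mN_hi + N : z_hi - mN_hi;
}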

Unoptimized Montgomery Multiplication with Modular Addition

Expressed in the form of a basic unoptimized algorithm (that still has the dependency chain issue), Montgomery multiplication followed by a modular addition is

function Montgomery_multiply_then_add is
    input: Integers R and N with gcd(R, N) = 1, and N in [0, R - 1],
           Integer N^-1 in [0, R - 1] such that NN^-1 ≡ 1 (mod R),
           Integers x and y, such that 0 <= xy < NR,
           Integer c in the range [0, N - 1]
    output: Integer S in the range [0, N - 1] such that S ≡ xyR^-1 + c (mod N)

    z ⇐ xy
    t ⇐ REDC(z, N, N^-1, R)
    if t < N - c then
        return t + c
    else
        return t + c - N
    end if
end function

Figure 2, Basic Montgomery Multiply Followed by Modular Add

Note that this algorithm calls a Montgomery REDC function. REDC requires that its input z satisfies 0 <= z < NR, and guarantees that its return value t satisfies t ≡ zR^-1 (mod N). Since this algorithm specifies as an input precondition that 0 <= xy < NR, we know 0 <= z < NR, which satisfies REDC’s precondition.

When used in a loop, this unoptimized algorithm has the problem we saw in Figure 1 with the modular addition being part of one long dependency chain.
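For concreteness, Figure 2 might look something like the following as a 64-bit C function – a sketch only, assuming R = 2^64 and the montgomery_REDC helper sketched earlier:

//Sketch of Figure 2 for 64-bit operands, assuming R = 2^64.  The modular
//addition of c still sits directly on the dependency chain after REDC.
uint64_t monty_multiply_then_add(uint64_t x, uint64_t y, uint64_t c,
                                 uint64_t N, uint64_t inverseN)
{
    __uint128_t z = (__uint128_t)x * y;            //requires 0 <= x*y < N*R
    uint64_t t = montgomery_REDC(z, N, inverseN);  //t ≡ x*y*R^-1 (mod N), 0 <= t < N
    return (t < N - c) ? t + c : t + c - N;        //modular add of c
}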

Finding A Better Way

We can ease the dependency chain issue. Figure 2 calculates REDC(xy) and then performs a modular addition with c, and these operations must execute serially, very often without any benefit of pipelining or superscalar execution. We’ll show how to instead perform the modular addition prior to the REDC, with a final result congruent to t + c (mod N), which is congruent to the result of Figure 2. The advantage of this alternate method is that the modular addition will usually execute in parallel with the first two multiplies inside REDC(), because REDC’s internal multiplies will not depend on the result of the modular addition. We’ll call this a “Montgomery fused multiply-add” because the modular addition will be located in program order in the middle of the Montgomery multiply, and because we’ll normally expect the modular add to execute at the same time as the Montgomery multiply. We’ll analyze the benefits of this method in the next section, but first let’s present the basic explanation/proof.

Since we will modify the Figure 2 algorithm, the variables z, c, t, N, N^-1, R, and R^-1 below refer to Figure 2.

As an identity,
z = (z - z%R) + z%R
Since (z - z%R) is divisible by R, let
u = (z - z%R)/R.
Let v = z%R.
z = uR + v

Let w = (u + c)%N. We can easily compute this value:
We already showed that 0 <= z < NR, and since 0 <= v < R,
uR <= uR + v == z < NR, and thus
u < N. Furthermore,
0 <= z == uR + v < uR + R == (u + 1)R, and so
0 < u + 1. Thus,
0 <= u < N.
Figure 2 specified that 0 <= c < N.
Thus we can compute the modular addition w = (u + c)%N  in C or C++ by using
w = (u < N - c) ? u + c : u + c - N.

REDC(z, N, N^-1, R) guarantees that its return value t satisfies t ≡ zR^-1 (mod N). And therefore,
t ≡ (uR + v)R^-1 (mod N)
t + c ≡ (uR + v)R^-1 + c (mod N)
t + c ≡ (uR + v)R^-1 + cRR^-1 (mod N)     [since RR^-1 ≡ 1 (mod N)]
t + c ≡ ((u + c)R + v)R^-1 (mod N)
t + c ≡ (((u + c)%N)R + v)R^-1 (mod N)     [since (u + c) ≡ (u + c)%N (mod N)]
t + c ≡ (wR + v)R^-1 (mod N)

Since w = (u + c)%N, we trivially know that 0 <= w <= N - 1, and therefore
0 <= wR <= NR - R
Since v = z%R, we trivially know that 0 <= v < R, and therefore
0 <= wR <= wR + v <= NR - R + v < NR - R + R. Thus,
0 <= wR + v < NR

We can see that wR + v is a valid input for REDC, because it satisfies REDC’s requirement on its input value that 0 <= value < NR. Therefore we can call REDC(wR + v, N, N^-1, R), and the REDC algorithm’s postcondition guarantees it will return a value p such that
p ≡ (wR + v)R^-1 (mod N). And thus,
p ≡ t + c (mod N).

Therefore p is congruent to the return value of Figure 2’s algorithm, as desired.
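As a small sanity check, here is a toy worked example (the numbers are chosen purely for illustration). Let N = 7, R = 16, and N^-1 = 7 (since 7*7 = 49 ≡ 1 (mod 16)), with x = 3, y = 5, and c = 2. Then z = xy = 15, so u = 0 and v = 15. Figure 2 computes t = REDC(15) ≡ 15*R^-1 ≡ 15*4 ≡ 4 (mod 7) and returns t + c = 6. The alternate method computes w = (u + c)%N = 2 and then REDC(wR + v) = REDC(47) ≡ 47*4 ≡ 6 (mod 7), returning 6 as well – the same result.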

Side note: calculating u and v is typically a zero-cost operation, since z is usually already represented on the computer via two registers, one containing the high-half bits of z and another containing the low-half bits of z. This occurs because in practice we choose R to be equal to 2 to the power of the CPU’s integer register bit width (e.g. R = 2^64). Thus the value u is usually the same as the high bits of z, and normally a no-op to calculate. Likewise, calculating v is usually free, since it is the same as the low bits of z. In any event, even if these operations are not free, in practice Figure 2 would have needed the same calculations during its call of REDC, and so it would be unusual for there to be any extra cost in calculating them here.

Montgomery Fused Multiply-Add

Let’s make the modifications to Figure 2 we just discussed, so that we have an algorithm with a modular addition before the REDC instead of after the REDC.

function Montgomery_fused_multiply_add is
    input: Integers R and N with gcd(R, N) = 1, and N in [0, R - 1],
           Integer N^-1 in [0, R - 1] such that NN^-1 ≡ 1 (mod R),
           Integers x and y, such that 0 <= xy < NR,
           Integer c in the range [0, N - 1]
    output: Integer S in the range [0, N - 1] such that S ≡ xyR^-1 + c (mod N)

    z ⇐ xy
    u ⇐ (z - z%R)/R
    v ⇐ z%R
    if u < N - c then
        w ⇐ u + c
    else
        w ⇐ u + c - N
    end if
    s ⇐ wR + v
    p ⇐ REDC(s, N, N^-1, R)
    return p
end function

Figure 3, Montgomery Fused Multiply-Add Algorithm

The modular addition in Figure 3 to calculate w occurs before REDC, and as we will see, the modular addition is no longer an immediate part of the dependency chain. As we discussed in the side note at the end of the previous section, u and v generally have no cost to calculate. For similar reasons s is usually free to calculate too.

Let’s use a real-life implementation as an example. We’ll write a C function to carry out the fused Montgomery multiply-add, and for the sake of demonstration, we will explicitly inline its internal REDC call.

uint64_t monty_fma(uint64_t x, uint64_t y, uint64_t c,
                   uint64_t N, uint64_t inverseN)
{
    __uint128_t z = (__uint128_t)x * y;
    assert(z < ((__uint128_t)N << 64));  //asserts z < NR
    uint64_t u = (uint64_t)(z >> 64);
    uint64_t v = (uint64_t)z;
    uint64_t w = (u < N-c) ? u+c : u+c-N;  //modular add of c, prior to REDC

    //__uint128_t s = ((__uint128_t)w << 64) + v;
    //return REDC(s, N, inverseN);

    //the remaining lines are the explicitly inlined REDC(s, N, inverseN):
    uint64_t m = v * inverseN;
    __uint128_t mN = (__uint128_t)m * N;
    uint64_t mN_hi = (uint64_t)(mN >> 64);
    return (w < mN_hi) ? w - mN_hi + N : w - mN_hi;
}

Figure 4, Example 64-bit C Function for Montgomery Fused Multiply-Add

If we had used the basic algorithm from Figure 2, an extra modular addition would have occurred at the end of the function (note: the conditional addition of N in the final return statement is not extra – it is an internal modular addition belonging to REDC, visible only because we explicitly inlined REDC). However, since we used the algorithm from Figure 3, we have no extra modular addition at the end of the function. Instead, the extra modular addition is now the line that computes w, prior to the REDC. We can see that the multiplications that compute m and mN do not depend on w (the result of that modular addition), and so they can execute in parallel with it. This modular addition no longer contributes to the latency of the dependency chain during loops that compute Montgomery-multiplications-with-adds (such as Figure 1).
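Tying this back to Figure 1, the Pollard-Rho advance function can be rewritten in terms of the fused multiply-add. The sketch below simply calls the monty_fma function of Figure 4 with y = x; the name advance_fused is purely illustrative:

//Sketch: Figure 1's loop, rewritten to use the fused multiply-add of Figure 4.
//The modular addition of c now happens inside monty_fma, before its REDC, so
//it is no longer an immediate link in the loop's dependency chain.
uint64_t advance_fused(uint64_t x, uint64_t c, uint64_t N, uint64_t inverseN, int j)
{
   for (int i=0; i<j; ++i)
      x = monty_fma(x, x, c, N, inverseN);   //x = x*x*R^-1 + c (mod N)
   return x;
}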

The new location in Figure 3 for its modular addition practically always removes the modular addition as an immediate part of the dependency chain. It was easy to see this happened in the example of Figure 4, but in general the reason this happens is somewhat subtle. Inside REDC, the first operation of REDC (to calculate m) always uses REDC’s input value modulo R. Since Figure 3’s input to REDC is wR + v, and wR + v ≡ v (mod R), the first operation in REDC does not actually need to have the value of wR or w; it only needs to have the value of v in order to begin executing. Therefore the first operation in REDC in principle is able to execute in parallel with the modular addition that calculates w. In practice, we usually write and call a REDC function that uses two parameters for the input value (the first parameter would have the input value’s low-half bits, and the second parameter would have the high-half bits), making it nearly certain that the modular add and the first operation within REDC will be able to execute in parallel. Although inlining REDC is necessary to achieve the parallel execution, we can guarantee inlining REDC via compiler extensions such as the always_inline attribute (GCC/Clang) or the __forceinline keyword (MSVC). There is rarely any need to manually inline REDC as we did in Figure 4.
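To illustrate that point, here is a hedged sketch of such a two-parameter REDC together with a monty_fma that calls it rather than manually inlining it. The function names and the use of the GCC/Clang always_inline attribute are illustrative assumptions, not taken from anything above:

//Sketch of a REDC that takes its input value as two parameters, low half first
//and high half second.  Its two multiplies need only the low half.
__attribute__((always_inline))
static inline uint64_t REDC_parts(uint64_t input_lo, uint64_t input_hi,
                                  uint64_t N, uint64_t inverseN)
{
    uint64_t m = input_lo * inverseN;                        //needs only the low half
    uint64_t mN_hi = (uint64_t)(((__uint128_t)m * N) >> 64);
    return (input_hi < mN_hi) ? input_hi - mN_hi + N : input_hi - mN_hi;
}

//Sketch of monty_fma written in terms of REDC_parts: the modular addition that
//computes w can overlap with REDC_parts' multiplies, which depend only on v.
uint64_t monty_fma2(uint64_t x, uint64_t y, uint64_t c,
                    uint64_t N, uint64_t inverseN)
{
    __uint128_t z = (__uint128_t)x * y;
    uint64_t u = (uint64_t)(z >> 64);
    uint64_t v = (uint64_t)z;
    uint64_t w = (u < N-c) ? u+c : u+c-N;
    return REDC_parts(v, w, N, inverseN);
}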

Conclusion

We created a Montgomery Fused Multiply-Add algorithm in Figure 3, and demonstrated an example C function implementation of it in Figure 4. The fused multiply-add has the advantage that the modular addition can generally execute in parallel with the Montgomery multiply, thus reducing latency when there are long dependency chains and increasing throughput. We could use the same approach to create a Montgomery Fused Multiply-Subtract algorithm and implementation.
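For completeness, a fused multiply-subtract follows the same reasoning, except that it performs a modular subtraction w = (u - c)%N before the REDC. The following is a sketch only (the function is illustrative and, like Figure 4, manually inlines its REDC):

//Sketch of a Montgomery fused multiply-subtract: result ≡ x*y*R^-1 - c (mod N).
//The modular subtraction of c happens before REDC, mirroring Figure 4.
uint64_t monty_fms(uint64_t x, uint64_t y, uint64_t c,
                   uint64_t N, uint64_t inverseN)
{
    __uint128_t z = (__uint128_t)x * y;        //requires 0 <= x*y < N*R
    uint64_t u = (uint64_t)(z >> 64);
    uint64_t v = (uint64_t)z;
    uint64_t w = (u >= c) ? u - c : u - c + N; //w = (u - c) mod N
    uint64_t m = v * inverseN;                 //explicitly inlined REDC, as in Figure 4
    __uint128_t mN = (__uint128_t)m * N;
    uint64_t mN_hi = (uint64_t)(mN >> 64);
    return (w < mN_hi) ? w - mN_hi + N : w - mN_hi;
}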
