Optimized Montgomery Multiplication with Smaller Modulus Sizes

This is Part 2 in a planned three part series on Montgomery arithmetic. Yesterday, Part 1 showed how to use the positive inverse to improve the traditional Montgomery REDC algorithm. If you would like an excellent implementation of Montgomery arithmetic and modular arithmetic in general for up to 128 bit integer types, please see Clockwork, a high performance and easy to use C++ header-only library. The following post has greatest benefit for native (CPU register size) integer types and has lesser impact for multiprecision integers.

To perform Montgomery multiplication we first specify a constant R, which we set in practice to some power of 2. Most commonly the power is a multiple of our computer register width (e.g. using R = 2⁶⁴), so that operations mod R and division by R are trivial. Second, we must use an odd modulus N < R. Given integers a and b, we convert them into Montgomery form variables x and y respectively, satisfying x ≡ aR (mod N) and y ≡ bR (mod N). We wish to calculate the Montgomery form variable t, satisfying t ≡ abR (mod N). To do so we perform a standard multiplication of x and y (calculating the low and high words of the product), and use this product xy as input to REDC. The output of REDC is our desired value t.

REDC has the crucial precondition requirement that its input (xy above) must satisfy 0 <= input < NR, and so we must ensure 0 <= xy < NR. However, even though it is commonplace to ensure 0 <= x < N and 0 <= y < N, there is no actual requirement to do so and we will find it can be advantageous to specify different requirements. So long as we satisfy REDC’s precondition, x and/or y can be greater than N, or negative. This insight allows us to achieve greater efficiency for Montgomery multiplication when we know the modulus is small in relation to R, such as N < R/4, or N < R/2.

Modulus N < R/4

This special case originates from Section 5 of “Montgomery’s Multiplication Technique: How to Make It Smaller and Faster”. We will use relaxed invariant requirements of
0 <= x < 2N and 0 <= y < 2N.
Since we know N < R/4, we have
0 <= x < 2N < R/2, and thus
0 <= xy < NR.

Thus under our specified requirements for x and y, the product xy will always satisfy REDC’s precondition for an input. REDC produces an output t that satisfies 0 <= t < N, which we can now see is more strict than we need if we will use t as an input for further Montgomery multiplications: we specified that our inputs for Montgomery multiplication need only satisfy 0 <= value < 2N. This means we could use a more efficient REDC that no longer guarantees 0 <= t < N, but instead guarantees 0 <= t < 2N.

Modifying our preferred Alternate REDC algorithm (which uses the positive inverse) to produce an output t that only satisfies 0 <= t < 2N allows us to remove a conditional branch or conditional move from the alternate REDC, giving us

function AlternateREDC_Quarter is
    input: Integers R and N with gcd(R, N) = 1, and N in [0, R/4 − 1],
           Integer N⁻¹ in [0, R − 1] such that NN⁻¹ ≡ 1 (mod R),
           Integer T in the range [0, RN − 1]
    output: Integer S in the range [0, 2N − 1] such that S ≡ TR⁻¹ (mod N)

    m ⇐ ((T mod R)N⁻¹) mod R
    t ⇐ (T − mN) / R
    return t + N
end function

Similarly we could modify the Traditional REDC algorithm to remove the conditional branch (or conditional move) and the subtraction from the end of traditional REDC. The REDC only needs to ensure 0 <= t < 2N.

Modulus N < R/2

This case allows us to create a novel algorithm with similar performance benefits to what we achieved above. We will use the relaxed invariant requirements of -N <= x < N and -N <= y < N.

First we can note that if we need to square x (or y), then we have
0 <= x*x <= NN
and since N < R,
0 <= x*x < NR
This satisfies REDC’s precondition of 0 <= input < NR.

If we multiply x and y, then we have
–NN < xy <= NN
If xy >= 0, then we have 0 <= xy <= NN < NR, which satisfies REDC’s precondition.
If xy < 0, then the product does not satisfy the precondition as-is. However we can use any value congruent (mod N) to the product as an input to REDC, so long as it satisfies the precondition. Let us use xy + NR.
Since xy < 0, we get
–NN + NR < xy + NR < 0 + NR.
Since 0 < N < R, we can see NN < NR, and thus 0 < NR – NN. Therefore we have
0 < xy + NR < NR
which satisfies REDC’s precondition.

To summarize, we must do a signed multiply of x and y (calculating the low and high words of the product as always), and if xy < 0, we must add NR to the product. We use the result as input to REDC and continue as normal.

REDC produces an output t that satisfies 0 <= t < N, which we can now see is more strict than we need if we will use t as an input for further Montgomery multiplications: we specified for this case that our inputs for Montgomery multiplication need only satisfy -N <= value < N. This means we could use a more efficient REDC that no longer guarantees 0 <= t < N, but instead guarantees -N <= t < N.

Modifying our preferred Alternate REDC algorithm (which uses the positive inverse) to produce an output t that only satisfies -N <= t < N allows us to remove a conditional branch or conditional move from the alternate REDC, giving us

function AlternateREDC_Half is
    input: Integers R and N with gcd(R, N) = 1, and N in [0, R/2 − 1],
           Integer N⁻¹ in [0, R − 1] such that NN⁻¹ ≡ 1 (mod R),
           Integer T in the range [0, RN − 1]
    output: Integer S in the range [-N, N − 1] such that S ≡ TR⁻¹ (mod N)

    m ⇐ ((T mod R)N⁻¹) mod R
    t ⇐ (T − mN) / R
    return t
end function

Similarly we could modify the Traditional REDC algorithm to remove the conditional branch (or conditional move) from the end of traditional REDC. Note that for the traditional REDC we would always need to subtract N when producing the final result, to ensure the modified traditional REDC’s output satisfies -N <= t < N.

Ultimately the modifications presented for this case remove a conditional-branch/move from REDC and introduce a conditional-branch/move after the signed multiply that we use at the start of this Montgomery multiplication. This change tends to be quite beneficial. It’s fairly common to have a long dependency chain of Montgomery multiplications in functions needing Montgomery arithmetic, for example with modular exponentiation. Without modification, the old conditional move in the REDC was part of that long dependency chain. The new conditional move after the signed multiply is not needed immediately; multiplications that are on the dependency chain can be calculated at the same time as the conditional move. The reason for this is a little bit subtle. When we conditionally add NR to the signed multiply’s product, it does not change the result modulo R. The first operation in REDC (to calculate m) always uses that result modulo R, rather than the complete result. Therefore REDC can begin its first operation at the same time as the conditional add executes – it does not need to wait for the conditional add to complete.

When the Montgomery multiplication involves a squaring of some value x, rather than a multiplication of x and y, there is no conditional add needed at all, as we saw at the start of this section. The changes presented here become more beneficial.