*I’d originally just intended to write up a blog entry on this, but the proofs were a better fit and easier to write in the form of a paper. The paper presents what should generally be the fastest multiplicative inverse modulo a power of 2, when the power is less than or equal to the CPU’s native bit width. This is a short introduction. See the pdf for the algorithm*,* the proof, and analysis.*

Yesterday we saw how to implement the modular multiplicative inverse. However it’s quite common that we only need to find the inverse for the special case of a modulus that is a power of 2 (or more specifically, 2^{32} or 2^{64}). The last post used the Extended Euclidean algorithm, and the fastest methods for this special case are based on Newton’s Method, which has much better computational complexity.

For this special case I made slight improvements to the fastest known algorithm (by Jean-Guillaume Dumas) for modern hardware and wrote up the algorithm variant in a paper. I learned about Dumas’ work through Marc Reynolds, who pointed out the very nice instruction level parallelism advantage it has over other methods. The following code for the 32 bit and 64 bit inverses is taken from the new paper, and we generally can expect it to complete in the fewest cycles currently known for modern microprocessors:

```
uint32_t multiplicative_inverse(uint32_t a)
{
assert(a%2 == 1); // the inverse (mod 2<<32) only exists for odd values
uint32_t x0 = (3*a)^2;
uint32_t y = 1 - a*x0;
uint32_t x1 = x0*(1 + y);
y *= y;
uint32_t x2 = x1*(1 + y);
y *= y;
uint32_t x3 = x2*(1 + y);
return x3;
}
```

*Figure 1: C code for the 32 bit integer inverse.*

```
uint64_t multiplicative_inverse(uint64_t a)
{
assert(a%2 == 1); // the inverse (mod 2<<64) only exists for odd values
uint64_t x0 = (3*a)^2;
uint64_t y = 1 - a*x0;
uint64_t x1 = x0*(1 + y);
y *= y;
uint64_t x2 = x1*(1 + y);
y *= y;
uint64_t x3 = x2*(1 + y);
y *= y;
uint64_t x4 = x3*(1 + y);
return x4;
}
```

*Figure 2: C code for the 64 bit integer inverse.*

For details, see the paper for a full description of the algorithm, the proof, and latency analysis.

## Achieving High Throughput

We can generally achieve highest throughput by using Montgomery’s trick, since it lets us calculate more than one inverse at the cost of only one inverse and some extra multiplications. The easiest way to explain the trick is to demonstrate it as a way to calculate two inverse results. We’ll use two identities:

**a**^{-1} **≡** (**a** * **b**)** ^{-1}** *

**b**(mod

**n**)

**b**

^{-1}

**≡**(

**a***

**b**)

*****

^{-1}**a**(mod

**n**)

The identities let us inexpensively find the inverses of both **a** and **b** (mod **n**). We first calculate the modular multiplication **c** **≡** **a*****b** (mod **n**), and then calculate just a single modular inverse **c**^{-1}, which we use as a modular multiplier to get our desired inverses **a**^{-1} and **b**^{-1}. The total cost is one inverse and three modular multiplications.

In practice, the modular multiplications above will often be equivalent to plain multiplications on a CPU. Since for this discussion we are interested in finding the inverse modulo some power of 2, we know our modulus **n** is a power of 2. For the common case that we set the power equal to the CPU’s native bit width (e.g. resulting in **n** = 2^{32} or 2^{64}), each modular multiplication above could be performed by a single basic CPU multiplication instruction. Calculating the single inverse is usually the only “hard” part.

The following code demonstrates Montgomery’s trick to calculate two 64 bit inverses:

```
void dual_inverse(uint64_t a, uint64_t b, uint64_t* inv_a, uint64_t* inv_b)
{
assert(a%2 == 1); // the inverse (mod 2<<64) only exists for odd values
assert(b%2 == 1);
uint64_t c = a*b;
uint64_t c_inv = multiplicative_inverse(c);
*inv_a = c_inv * b;
*inv_b = c_inv * a;
}
```

*Figure 3: C code to efficiently calculate two 64 bit integer inverses.*

This trick can be extended to obtain a large number of modular inverses. The cost remains just a single inverse calculation, accompanied by a large number of modular multiplications.

Copyright (c) 2022 by Jeffrey Hurchalla.

All code in this file is licensed under the MIT Open Source License:

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Pingback: Montgomery REDC using the positive inverse (mod R) | Jeffrey Hurchalla