*I’d originally just intended to write up a blog entry on this, but the proofs were a better fit and easier to write in the form of a paper. The paper presents what should generally be the fastest multiplicative inverse modulo a power of 2, when the power is less than or equal to the CPU’s native bit width. This is a short introduction. See the pdf for the algorithm*,* the proof, and analysis.*

Yesterday we saw how to implement the modular multiplicative inverse. However it’s quite common that we only need to find the inverse for the special case of a modulus that is a power of 2 (or more specifically, 2^{32} or 2^{64}). The last post used the Extended Euclidean algorithm, and the fastest methods for this special case are based on Newton’s Method, which has much better computational complexity.

For this special case I made slight improvements to the fastest known algorithm (by Jean-Guillaume Dumas) for modern hardware and wrote up the algorithm variant in a paper. I learned about Dumas’ work through Marc Reynolds, who pointed out the very nice instruction level parallelism advantage it has over other methods. The following code for the 32 bit and 64 bit inverses is taken from the new paper, and we generally can expect it to complete in the fewest cycles currently known for modern microprocessors:

```
uint32_t multiplicative_inverse(uint32_t a)
{
assert(a%2 == 1); // the inverse (mod 2<<32) only exists for odd values
uint32_t x0 = (3*a)^2;
uint32_t y = 1 - a*x0;
uint32_t x1 = x0*(1 + y);
y *= y;
uint32_t x2 = x1*(1 + y);
y *= y;
uint32_t x3 = x2*(1 + y);
return x3;
}
```

*Figure 1: C code for the 32 bit integer inverse.*

```
uint64_t multiplicative_inverse(uint64_t a)
{
assert(a%2 == 1); // the inverse (mod 2<<64) only exists for odd values
uint64_t x0 = (3*a)^2;
uint64_t y = 1 - a*x0;
uint64_t x1 = x0*(1 + y);
y *= y;
uint64_t x2 = x1*(1 + y);
y *= y;
uint64_t x3 = x2*(1 + y);
y *= y;
uint64_t x4 = x3*(1 + y);
return x4;
}
```

*Figure 2: C code for the 64 bit integer inverse.*

For details, see the paper for a full description of the algorithm, the proof, and latency analysis.

All code in this file is licensed under the MIT Open Source License:

Copyright (c) 2022 by Jeffrey Hurchalla.

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Pingback: Montgomery REDC using the positive inverse (mod R) | Jeffrey Hurchalla