This is Part 3 in a three part series on Montgomery arithmetic, starting with Part 1. If you would like an excellent implementation of Montgomery arithmetic and modular arithmetic in general for up to 128 bit integer types, please see Clockwork, a high performance and easy to use C++ header-only library. The following post has greatest benefit for native (CPU register size) integer types and has lesser impact for multiprecision integers.

Performance for Montgomery multiplication can often get bottlenecked by long data dependency chains, and we’ll see here how to reduce this problem when it involves a Montgomery multiply followed by a modular add or subtract. A dependency chain is a series of operations in which every operation can begin only after it has access to the result of the preceding operation in the chain. It’s a fundamental problem for good performance. Dependency chains prevent a CPU from executing multiple instructions within the chain at the same time (via pipelining and/or superscalar execution units), because the instructions must inherently execute one after another. If the CPU can find no other instructions to interleave with the chain via out of order execution, then pipelining and superscalar capabilities go unused, and the CPU’s throughput goes down.

For Montgomery multiplication inside a loop, we may find most or all instructions in the loop depend upon their preceding instruction, and we may have no other useful extra work (instructions) to supply to the CPU to interleave with the dependency chain. If we could somehow move an instruction off of the dependency chain, we would improve performance, since the processor would be able to execute the newly independent instruction at the same time as the dependency chain. For Montgomery multiplication followed by an add or subtract, we’ll show how to achieve this. If you’d just like to skip ahead to the 64 bit C functions that implement this Montgomery Fused Multiply-Add and Fused Multiply-Subtract, scroll down to Figures 4 and 5 below.

A Dependency Chain Example

Let’s look at Pollard-Rho sequence generation in the Montgomery domain modulo N, which has the dependency chain problem. We have a Montgomery constant value c which we restrict to 0 < c < N, and a Montgomery variable x which we will use for our sequence, and we define the sequence as
xi+1 = xi2 + c

This is a Montgomery multiplication (squaring) followed by a modular addition. Implicitly, this sequence will be calculated modulo N since we are working in the Montgomery domain modulo N. To advance this sequence by j steps we might write a function similar to the following:

``````uint64_t advance(uint64_t x, uint64_t c, uint64_t N, uint64_t inverseN, int j)
{
for (int i=0; i<j; ++i) {
__uint128_t z = x*static_cast<__uint128_t>(x);  // standard multiply.
x = montgomery_REDC(z, N, inverseN);            // completes Mont Mult.
x = (x < N-c) ? x+c : x+c-N;                    // modular addition.
}
return x;
}``````

Figure 1, Example: Advancing a Pollard-Rho Sequence

The line for modular addition (via ternary operator) can not begin until the Montgomery REDC completes, and in the next loop iteration neither the standard multiply nor the Montgomery REDC can begin until the preceding iteration’s modular addition completes. Thus when the CPU computes this modular addition, it generally won’t be able to execute anything else at the same time.

Unoptimized Montgomery Multiplication with Modular Addition

Expressed in the form of a basic unoptimized algorithm (that still has the dependency chain issue), Montgomery multiplication followed by a modular addition is

``````function Montgomery_multiply_then_add is
input: Integers R and N with gcd(R, N) = 1, and N in [0, R - 1],
Integer N−1 in [0, R − 1] such that NN−1 ≡ 1 (mod R),
Integers x and y, such that 0 <= xy < NR,
Integer c in the range [0, N − 1]
output: Integer S in the range [0, N − 1] such that S ≡ xyR−1 + c (mod N)

z ⇐ xy
t ⇐ REDC(z, N, N−1, R)
if t < N − c then
return t + c
else
return t + c - N
end if
end function``````

Figure 2, Basic Montgomery Multiply Followed by Modular Add

The Montgomery REDC function requires that its input z satisfies 0 <= z < NR, and guarantees that its return value t satisfies: tzR-1 (mod N). Since this algorithm specifies as an input precondition that 0 <= xy < NR, we know 0 <= z < NR, which satisfies REDC’s precondition.

When used in a loop, this unoptimized algorithm has the problem we saw in Figure 1 with the modular addition being part of one long loop-carried dependency chain.

An Improved Algorithm: A Fused Montgomery Multiply-Add

Figure 2 calculates REDC(xy) and then performs a modular addition with c, and these operations must execute serially, very often without any benefit of pipelining or superscalar execution. We’ll show how to instead perform the modular addition prior to the REDC, with a final result congruent to t + c (mod N), which is congruent to the result of Figure 2. The advantage of this alternate method is that the modular addition will usually execute in parallel with the first two multiplies inside REDC(), because REDC’s internal multiplies will not depend on the result of the modular addition. We’ll call this a “Montgomery fused multiply-add” because the modular addition will be located in program-order in the middle of the Montgomery multiply, and because we’ll normally expect the modular add to execute at the same time as the Montgomery multiply. We’ll look at the benefits of this method in the next section, but first let’s present the basic explanation/proof.

Since we will modify the Figure 2 algorithm, the variables z, c, t, N, N-1, R, and R-1 below refer to Figure 2.

As an identity,
z = (zz%R) + z%R
Since (zz%R) is divisible by R, let
u = (zz%R)/R.
Let v = z%R.
z = uR + v

Let w = (u + c)%N. We can easily compute this value:
We already showed that 0 <= z < NR, and since 0 <= v < R,
uR <= uR + v == zNR, and thus
u < N. Furthermore,
0 <= z == uR + v < uR + R == (u + 1)R, and so
0 < u + 1. Thus,
0 <= u < N.
Figure 2 specified that 0 <= c < N.
Thus we can compute the modular addition w = (u + c)%N  in C or C++ by using
w = (u < Nc) ? u + c : u + cN.

REDC(z, N, N-1, R) guarantees that its return value t satisfies tzR-1 (mod N). And therefore,
t ≡ (uR + v)R-1 (mod N)
t + c ≡ (uR + v)R-1 + c (mod N)
t + c ≡ (uR + v)R-1 + cRR-1 (mod N)
t + c ≡ ((u + c)R + v)R-1 (mod N)
t + c ≡ (((u + c)%N)R + v)R-1 (mod N)
t + c ≡ (wR + v)R-1 (mod N)

Since w = (u + c)%N, we trivially know that 0 <= w <= N1, and therefore
0 <= wR <= NRR
Since v = z%R, we trivially know that 0 <= vR, and therefore
0 <= wR <= wR + v <= NRR + v < NRR + R. Thus,
0 <= wR + v < NR

We can see that wR + v is a valid input for REDC, because it satisfies REDC’s requirement on its input value that 0 <= value < NR. Therefore we can call REDC(wR + v, N, N-1, R), and the REDC algorithm’s postcondition guarantees it will return a value p such that
p ≡ (wR + v)R-1 (mod N). And thus,
pt + c (mod N).

Therefore p is congruent to the return value of Figure 2’s algorithm, as desired.

Side note: calculating u and v are typically zero cost operations, since z is usually already represented on the computer via two registers, one containing the high-half bits of z and another containing the low-half bits of z. This occurs because in practice we choose R to be equal to 2 to the power of the CPU’s integer register bit width (e.g. R = 264). Thus the value u is usually the same as the high bits of z, and normally a no-op to calculate. Likewise, calculating v is usually free, since it is the same as the low bits of z. In any event, even if these operations are not free, in Figure 2 in practice these calculations would have been required during the call of REDC, and so it would be unusual for there to be any extra cost in calculating them here.

Let’s modify Figure 2 as we just discussed, so that we have an algorithm with a modular addition before the REDC instead of after the REDC.

``````function Montgomery_fused_multiply_add is
input: Integers R and N with gcd(R, N) = 1, and N in [0, R - 1],
Integer N−1 in [0, R − 1] such that NN−1 ≡ 1 (mod R),
Integers x and y, such that 0 <= xy < NR,
Integer c in the range [0, N − 1]
output: Integer S in the range [0, N − 1] such that S ≡ xyR−1 + c (mod N)

z ⇐ xy
u ⇐ (z - z%R)/R
v ⇐ z%R
if u < N − c then
w ⇐ u + c
else
w ⇐ u + c - N
end if
s ⇐ wR + v
p ⇐ REDC(s, N, N−1, R)
return p
end function``````

Figure 3, Montgomery Fused Multiply-Add Algorithm

The modular addition in Figure 3 to calculate w occurs before REDC, and as we will see, the modular addition is no longer an immediate part of the dependency chain. As discussed in the side note above, u and v generally have no cost to calculate. For similar reasons s is usually free to calculate too.

Let’s implement the Figure 3 algorithm. We’ll write a C function to carry out the fused Montgomery multiply-add, and for the sake of demonstration we will provide an unused #else section in which we explicitly inline its internal REDC (a version using the positive inverse).

```uint64_t monty_fma(uint64_t x, uint64_t y, uint64_t c,
uint64_t N, uint64_t inverseN)
{
__uint128_t z = (__uint128_t)x * y;
assert(z < ((__uint128_t)N << 64));  //asserts z < NR
uint64_t u = (uint64_t)(z >> 64);
uint64_t v = (uint64_t)z;
uint64_t w = (u < N-c) ? u+c : u+c-N;  // modular add
#if 1
uint64_t s_highbits = w;
uint64_t s_lowbits = v;
return REDC(s_highbits, s_lowbits, N, inverseN);
#else
// this unused section explicitly inlines REDC,
// only for demonstration purposes.
uint64_t m = v * inverseN;
__uint128_t mN = (__uint128_t)m * N;
uint64_t mN_hi = (uint64_t)(mN >> 64);
return (w < mN_hi) ? w - mN_hi + N : w - mN_hi;
#endif
}
```

Figure 4, Example 64 bit C function for Montgomery Fused Multiply-Addition

If we had used the basic algorithm from Figure 2, an extra modular addition would have occurred at the end of the function (note that the modular addition on line 19 is not extra – it is an internal modular addition belonging to REDC that we can see due to explicitly inlining REDC). However, since we used the algorithm from Figure 3, we have no extra modular addition at the end of the function. Instead, the extra modular addition is now located on line 8, prior to the REDC. We can see that the multiplications on lines 16 and 17 don’t depend on w (the result of line 8’s modular addition), and so they can execute in parallel with the computation of w. This modular addition to compute w no longer contributes to the latency of a dependency chain during loops that compute Montgomery-multiplications-with-adds (such as Figure 1).

The new location in Figure 3 for its modular addition practically always removes the modular addition as an immediate part of the dependency chain. It was easy to see this happened in the example of Figure 4, but in general the reason this happens is somewhat subtle. Inside REDC, the first operation of REDC (to calculate m) always uses REDC’s input value modulo R. Since Figure 3’s input to REDC is wR + v, and wR + vv (mod R), the first operation in REDC does not actually need to have the value of wR or w; it only needs to have the value of v in order to begin executing. Therefore the first operation in REDC in principle is able to execute in parallel with the modular addition that calculates w. In practice, we usually write and call a REDC function that uses two parameters for the input value, as in Figure 4 (first parameter has the input value’s high-half bits, and the second parameter has the low-half bits), making it nearly certain that the modular add and the first operation within REDC will be able to execute in parallel. Although inlining REDC is necessary to achieve the parallel execution, we can guarantee inlining REDC via compiler extensions like the __always_inline attribute or __force_inline keyword. There is rarely any need to manually inline REDC as done in the #else of Figure 4.

We can trivially modify Figure 3 to create a Montgomery Fused Multiply-Subtract algorithm, by performing a modular subtraction in place of the modular addition. We’ll skip explicitly writing out this modification, and instead provide the resulting 64 bit C implementation:

```uint64_t monty_fmsub(uint64_t x, uint64_t y, uint64_t c,
uint64_t N, uint64_t inverseN)
{
__uint128_t z = (__uint128_t)x * y;
assert(z < ((__uint128_t)N << 64));  //asserts z < NR
uint64_t u = (uint64_t)(z >> 64);
uint64_t v = (uint64_t)z;
uint64_t w = (u >= c) ? u-c : u-c+N;  // modular sub

uint64_t s_highbits = w;
uint64_t s_lowbits = v;
return REDC(s_highbits, s_lowbits, N, inverseN);
}
```

Figure 5, Example 64 bit C function for Montgomery Fused Multiply-Subtraction

Copyright (c) 2022 by Jeffrey Hurchalla.

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Optimized Montgomery Multiplication with Smaller Modulus Sizes

This is Part 2 in a planned three part series on Montgomery arithmetic. Yesterday, Part 1 showed how to use the positive inverse to improve the traditional Montgomery REDC algorithm. If you would like an excellent implementation of Montgomery arithmetic and modular arithmetic in general for up to 128 bit integer types, please see Clockwork, a high performance and easy to use C++ header-only library. The following post has greatest benefit for native (CPU register size) integer types and has lesser impact for multiprecision integers.

To perform Montgomery multiplication we first specify a constant R, which we set in practice to some power of 2. Most commonly the power is a multiple of our computer register width (e.g. using R = 264), so that operations mod R and division by R are trivial. Second, we must use an odd modulus N < R. Given integers a and b, we convert them into Montgomery form variables x and y respectively, satisfying x ≡ aR (mod N) and ybR (mod N). We wish to calculate the Montgomery form variable t, satisfying t ≡ abR (mod N). To do so we perform a standard multiplication of x and y (calculating the low and high words of the product), and use this product xy as input to REDC. The output of REDC is our desired value t.

REDC has the crucial precondition requirement that its input (xy above) must satisfy 0 <= input < NR, and so we must ensure 0 <= xy < NR. However, even though it is commonplace to ensure 0 <= x < N and 0 <= y < N, there is no actual requirement to do so and we will find it can be advantageous to specify different requirements. So long as we satisfy REDC’s precondition, x and/or y can be greater than N, or negative. This insight allows us to achieve greater efficiency for Montgomery multiplication when we know the modulus is small in relation to R, such as N < R/4, or N < R/2.

Modulus N < R/4

This special case originates from Section 5 of “Montgomery’s Multiplication Technique: How to Make It Smaller and Faster”. We will use relaxed invariant requirements of
0 <= x < 2N and 0 <= y < 2N.
Since we know N < R/4, we have
0 <= x < 2N < R/2, and thus
0 <= xy < NR.

Thus under our specified requirements for x and y, the product xy will always satisfy REDC’s precondition for an input. REDC produces an output t that satisfies 0 <= t < N, which we can now see is more strict than we need if we will use t as an input for further Montgomery multiplications: we specified that our inputs for Montgomery multiplication need only satisfy 0 <= value < 2N. This means we could use a more efficient REDC that no longer guarantees 0 <= t < N, but instead guarantees 0 <= t < 2N.

Modifying our preferred Alternate REDC algorithm (which uses the positive inverse) to produce an output t that only satisfies 0 <= t < 2N allows us to remove a conditional branch or conditional move from the alternate REDC, giving us

``````function AlternateREDC_Quarter is
input: Integers R and N with gcd(R, N) = 1, and N in [0, R/4 − 1],
Integer N−1 in [0, R − 1] such that NN−1 ≡ 1 (mod R),
Integer T in the range [0, RN − 1]
output: Integer S in the range [0, 2N − 1] such that S ≡ TR−1 (mod N)

m ⇐ ((T mod R)N−1) mod R
t ⇐ (T − mN) / R
return t + N
end function``````

Similarly we could modify the Traditional REDC algorithm to remove the conditional branch (or conditional move) and the subtraction from the end of traditional REDC. The REDC only needs to ensure 0 <= t < 2N.

Modulus N < R/2

This case allows us to create a novel algorithm with similar performance benefits to what we achieved above. We will use the relaxed invariant requirements of -N <= x < N and -N <= y < N.

First we can note that if we need to square x (or y), then we have
0 <= x*x <= NN
and since N < R,
0 <= x*x < NR
This satisfies REDC’s precondition of 0 <= input < NR.

If we multiply x and y, then we have
NN < xy <= NN
If xy >= 0, then we have 0 <= xy <= NN < NR, which satisfies REDC’s precondition.
If xy < 0, then the product does not satisfy the precondition as-is. However we can use any value congruent (mod N) to the product as an input to REDC, so long as it satisfies the precondition. Let us use xy + NR.
Since xy < 0, we get
NN + NR < xy + NR < 0 + NR.
Since 0 < N < R, we can see NN < NR, and thus 0 < NRNN. Therefore we have
0 < xy + NR < NR
which satisfies REDC’s precondition.

To summarize, we must do a signed multiply of x and y (calculating the low and high words of the product as always), and if xy < 0, we must add NR to the product. We use the result as input to REDC and continue as normal.

REDC produces an output t that satisfies 0 <= t < N, which we can now see is more strict than we need if we will use t as an input for further Montgomery multiplications: we specified for this case that our inputs for Montgomery multiplication need only satisfy -N <= value < N. This means we could use a more efficient REDC that no longer guarantees 0 <= t < N, but instead guarantees  -N <= t < N.

Modifying our preferred Alternate REDC algorithm (which uses the positive inverse) to produce an output t that only satisfies -N <= t < N allows us to remove a conditional branch or conditional move from the alternate REDC, giving us

``````function AlternateREDC_Half is
input: Integers R and N with gcd(R, N) = 1, and N in [0, R/2 − 1],
Integer N−1 in [0, R − 1] such that NN−1 ≡ 1 (mod R),
Integer T in the range [0, RN − 1]
output: Integer S in the range [-N, N − 1] such that S ≡ TR−1 (mod N)

m ⇐ ((T mod R)N−1) mod R
t ⇐ (T − mN) / R
return t
end function``````

Similarly we could modify the Traditional REDC algorithm to remove the conditional branch (or conditional move) from the end of traditional REDC. Note that for the traditional REDC we would always need to subtract N when producing the final result, to ensure the modified traditional REDC’s output satisfies -N <= t < N.

Ultimately the modifications presented for this case remove a conditional-branch/move from REDC and introduce a conditional-branch/move after the signed multiply that we use at the start of this Montgomery multiplication. This change tends to be quite beneficial. It’s fairly common to have a long dependency chain of Montgomery multiplications in functions needing Montgomery arithmetic, for example with modular exponentiation. Without modification, the old conditional move in the REDC was part of that long dependency chain. The new conditional move after the signed multiply is not needed immediately; multiplications that are on the dependency chain can be calculated at the same time as the conditional move. The reason for this is a little bit subtle. When we conditionally add NR to the signed multiply’s product, it does not change the result modulo R. The first operation in REDC (to calculate m) always uses that result modulo R, rather than the complete result. Therefore REDC can begin its first operation at the same time as the conditional add executes – it does not need to wait for the conditional add to complete.

When the Montgomery multiplication involves a squaring of some value x, rather than a multiplication of x and y, there is no conditional add needed at all, as we saw at the start of this section. The changes presented here become more beneficial.

Montgomery REDC using the positive inverse (mod R)

This is Part 1 in a planned three part series on Montgomery arithmetic. A few days ago, Part 0 (the prequel?) showed how best to calculate the multiplicative inverse. If you would like an excellent implementation of Montgomery arithmetic and modular arithmetic in general for up to 128 bit integer types, please see Clockwork, a high performance and easy to use C++ header-only library. The following post has significant performance benefit for native (CPU register size) integer types and lesser impact for multiprecision integers.

Traditionally the Montgomery REDC algorithm uses the negative inverse (mod R) of the modulus – see Peter Montgomery’s famous paper, “Modular multiplication without trial division”, which has been cited over 3000 times. However, there exists an alternative form of this algorithm that uses the positive inverse, with better performance in practice and simpler code, and no apparent disadvantages. The obvious (once you see it) improvement makes it remarkable that this version is almost unknown in the literature.

I’ve had limited luck tracing its origins – I first encountered it in Ernst W Mayer’s paper “Efficient long division via Montgomery multiply”, where he uses it in introductory material without remark. Via email, Ernst gave credit to Peter Montgomery for teaching him Montgomery multiplication using this approach in the 1990s. After I scanned through all of Montgomery’s publications, I found that Montgomery used this approach in one paper he co-authored, “Montgomery Multiplication Using Vector Instructions”. The method is treated as an incidental minor detail without citation; Mayer’s paper predates this paper. Otherwise Montgomery used the original 1985 method in all his papers and in his 1992 dissertation. Aside from one web site’s description of Montgomery multiplication (it’s a translation of a Russian source that I haven’t tracked down) and the aforementioned papers by Mayer and by Montgomery, every book, paper, and website I have scanned uses the traditional 1985 approach. This includes the books Handbook of Applied Cryptography, Prime Numbers: A Computational Perspective, Modern Computer Arithmetic, Wikipedia, and 20+ papers on Montgomery multiplication/reduction (a partial list: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24)

Background

It’s perhaps plausible that Peter Montgomery discovered the 1985 algorithm after asking himself the question, “could I add some multiple of N to an integer T, to make the sum divisible by R (assuming R is a power of 2 and N is odd)?” The answer is yes, and his 1985 paper shows how to calculate that multiple m and use it to in turn calculate TR-1 (mod N), which is the basis for Montgomery modular multiplication. Wikipedia gives a fairly easy and thorough description of the algorithm and proof.

[Note: the choice of T for an integer variable name is unfortunate since it conflicts with the idiomatic use in C++ of T for a template parameter. We simply follow the historical conventions of Montgomery REDC presentations for all names here.]

The Alternative Version

We’ll get a slightly different algorithm if we ask a slightly different question – “could we subtract some multiple of N from an integer T, to make the difference divisible by R?” We can show that the answer is yes. Some integer ‘m’ always exists such that T – mN ≡ 0 (mod R). And we can provide a way to calculate m, and TR-1 (mod N).

Given that R is a power of 2 and N is odd, we know the greatest common divisor of R and N is 1, and thus by Bézout’s identity there exist integers x and y such that xN + yR = 1. Taking this mod R, we have xN ≡ 1 (mod R), and so x is the modular multiplicative inverse of N (mod R), which we denote as x ≡ N-1 (mod R). Multiplying both sides by T, we get xTN ≡ T (mod R), which after rearranging is T – xTN ≡ 0 (mod R). Therefore our desired integer ‘m’ equals xT. More precisely,
m ≡ N-1T (mod R), and T – mN ≡ 0 (mod R).

A different way to get the same result is to follow a line of reasoning similar to that used in the Wikipedia REDC article. Let m = ((T mod R)N-1) mod R. Then m ≡ N-1T (mod R) and
T – mN ≡ T – (TN-1)N ≡ T – T ≡ 0 (mod R).

Continuing to follow reasoning similar to the Wikipedia article, we can see that the above result proves that T – mN is divisible by R, and therefore t = (T – mN)/R must be an integer. This means that
t ≡ (T – mN)R-1 ≡ TR-1 – (mR-1)N ≡ TR-1 (mod N).

So we have a way to calculate both m and TR-1 (mod N).

Finally, let’s deduce bounds for t, assuming we have bounds on the inputs of 0 < N < R, and 0 <= T < RN. Since m is calculated mod R, we know 0 <= m < R, and so -R < -m <= 0 and -RN < -mN <= 0. Thus, -RN <= T – RN < T – mN <= T < RN. This gives us -RN < T – mN < RN, and since all parts are divisible by R, we have -N < (T – mN)/R < N. And thus we have the bounds for t:
-N < t < N.

We can now write the alternative REDC algorithm:

``````function REDC2 is
input: Integers R and N with gcd(R, N) = 1, and N in [0, R − 1],
Integer N−1 in [0, R − 1] such that NN−1 ≡ 1 (mod R),
Integer T in the range [0, RN − 1]
output: Integer S in the range [0, N − 1] such that S ≡ TR−1 (mod N)

m ⇐ ((T mod R)N−1) mod R
t ⇐ (T − mN) / R
if t < 0 then
return t + N
else
return t
end if
end function``````

Why It Matters

It may be easiest to see the differences by comparing assembly code. We’ll look at two x86_64 inline assembly C functions, implementing the traditional REDC and the alternative REDC for a 64 bit unsigned integer type.

x86_64 Assembly Implementations Comparison

We’ll first present an implementation of the traditional REDC (using x86_64 assembly):

``````// On Intel Skylake: ~11 cycles latency, 11 fused uops.
uint64_t T_lo,
uint64_t N,
uint64_t negInvN)
{
assert(N % 2 == 1); // REDC requires odd modulus
assert(T_hi < N);   // REDC requires T < NR, and this enforces it.
uint64_t rrax = T_lo;
uint64_t rrdx, tmp;
__asm__ (
"movq %%rax, %[tmp] \n\t"
"imulq %[inv], %%rax \n\t"    /* m = T_lo * negInvN */
"mulq %[N] \n\t"              /* mN = m * N */
"addq %[tmp], %%rax \n\t"     /* rax = T_lo+mN_lo. Sets carry, rax==0 */
"cmovaeq %[N], %%rax \n\t"    /* rax = (t_hi >= T_hi) ? N : 0 */
"xorl %k[tmp], %k[tmp] \n\t"  /* tmp = 0 */
"subq %[N], %%rdx \n\t"       /* rdx = t_hi - N */
"cmovaeq %[tmp], %%rax \n\t"  /* rax = (t_hi >= N) ? 0 : rax */
: "+&a"(rrax), "=&d"(rrdx), [tmp]"=&r"(tmp)
: [Thi]"r"(T_hi), [N]"r"(N), [inv]"r"(negInvN)
: "cc");
uint64_t result = rrax + rrdx;   // let compiler choose between add/lea
assert(result < N);
return result;
}``````

The code above is decent but not completely ideal. We can see that it incorporates an add-carry and two conditional moves, as we discussed. We will see below that the alternative REDC does not need an add-carry (or a subtract-borrow), and it also needs only one conditional move instead of two.

Another argument against this code (as given) is simply that it is inline asm. Our absolute ideal would be code written in only standard C (using no inline asm at all) that compiles to assembly that is nearly as optimal as our inline asm functions. This is easy to wish for, though not always easy to achieve. If we write a standard C version of the traditional REDC, we can see from our inline asm’s use of conditional move and add-carry instructions that we may have trouble coaxing the C compiler into generating optimal assembly; those instructions don’t exist in C. We could use C compiler idioms to help the compiler, such as the ternary operator (which typically translates into a conditional move), but we still will find none of the major compilers (gcc, clang, MSVC, icc) produce optimal assembly from a standard C version of the traditional REDC. This isn’t necessarily a major problem, since we can of course use the given inline asm, but we should note that some compilers don’t support inline asm (MSVC 64bit) and also that inline asm code tends to be unusually bug prone. I can give you comfortable assurances that I’m skilled at writing high quality inline asm, but you would be wise to be skeptical of this and of any inline asm you see – including our inline asm functions here. Unit tests of inline asm are far less helpful than you might think, potentially producing false negatives caused by the particular register choices your compiler happens to make for code surrounding your inline asm under test. There are reasonable arguments to be made for banning inline asm from your projects.

We’ll briefly note that we could rewrite the inline asm above to improve the traditional REDC. However, in nearly all respects the alternative REDC will provide an even greater improvement, so this is mentioned for curiosity sake, or for the case that you happen to want a drop-in replacement for an existing traditional REDC function.

Below we have the alternative REDC algorithm inline asm C function, which comes close to our ideal. Note that it uses the positive inverse as a parameter, whereas the traditional algorithm uses the negative inverse as a parameter. The function consists of a relatively straightforward translation of the alternate algorithm into inline asm, and thus it’s fairly easy to understand. It can be implemented with standard C code (given below) almost as effectively as with inline asm; clang produces close to optimal assembly from the standard C version of it (other major compilers produce assembly that is acceptable but not as good). For x86_64, it provides us with the best performance we’ve seen, with both lowest latency and fewest used instructions/uops/registers. I would expect the alternate REDC to provide similar benefits for ARM and other architectures.

``````// On Intel Skylake: 9 cycles latency, 7 fused uops.
inline uint64_t REDC_alternative(uint64_t T_hi,
uint64_t T_lo,
uint64_t N,
uint64_t invN)
{
assert(N % 2 == 1); // REDC requires odd modulus
assert(T_hi < N);   // REDC requires T < NR, and this enforces it.
uint64_t rrax = T_lo;
uint64_t Thi = T_hi;
__asm__ (
"imulq %[inv], %%rax \n\t"        /* m = T_lo * invN */
"mulq %[N] \n\t"                  /* mN = m * N */
"leaq (%[Thi], %[N]), %%rax \n\t" /* rax = T_hi + N */
"subq %%rdx, %%rax \n\t"          /* rax = rax - mN_hi */
"subq %%rdx, %[Thi] \n\t"         /* t_hi = T_hi - mN_hi */
"cmovbq %%rax, %[Thi] \n\t"       /* t_hi = (T_hi<mN_hi) ? rax : t_hi */
: [Thi]"+&bcSD"(Thi), "+&a"(rrax)
: [N]"r"(N), [inv]"r"(invN)
: "rdx", "cc");
uint64_t result = Thi;
assert(result < N);
return result;
}``````

Alternative REDC (Positive Inverse), Inline-Asm

For convenience, here is the alternative REDC with no inline-asm and just standard C (with the exception of the __uint128_t compiler extension):

``````inline uint64_t REDC_alternative_no_asm(uint64_t T_hi,
uint64_t T_lo,
uint64_t N,
uint64_t invN)
{
assert(N % 2 == 1); // REDC requires odd modulus
assert(T_hi < N);   // REDC requires T < NR, and this enforces it.
uint64_t m = T_lo * invN;
__uint128_t mN = (__uint128_t)m * N;
uint64_t mN_hi = (uint64_t)(mN >> 64);
uint64_t tmp = T_hi + N;
tmp = tmp - mN_hi;
uint64_t result = T_hi - mN_hi;
result = (T_hi < mN_hi) ? tmp : result;
assert(result < N);
return result;
}``````

Alternative REDC (Positive Inverse), No Inline-Asm

Copyright (c) 2022 by Jeffrey Hurchalla.

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Posted in Uncategorized | 2 Comments

A Faster Multiplicative Inverse (Mod A Power Of 2)

I’d originally just intended to write up a blog entry on this, but the proofs were a better fit and easier to write in the form of a paper. The paper presents what should generally be the fastest multiplicative inverse modulo a power of 2, when the power is less than or equal to the CPU’s native bit width. This is a short introduction. See the pdf for the algorithm, the proof, and analysis.

Yesterday we saw how to implement the modular multiplicative inverse. However it’s quite common that we only need to find the inverse for the special case of a modulus that is a power of 2 (or more specifically, 232 or 264). The last post used the Extended Euclidean algorithm, and the fastest methods for this special case are based on Newton’s Method, which has much better computational complexity.

For this special case I made slight improvements to the fastest known algorithm (by Jean-Guillaume Dumas) for modern hardware and wrote up the algorithm variant in a paper. I learned about Dumas’ work through Marc Reynolds, who pointed out the very nice instruction level parallelism advantage it has over other methods. The following code for the 32 bit and 64 bit inverses is taken from the new paper, and we generally can expect it to complete in the fewest cycles currently known for modern microprocessors:

``````uint32_t multiplicative_inverse(uint32_t a)
{
assert(a%2 == 1);  // the inverse (mod 2<<32) only exists for odd values
uint32_t x0 = (3*a)^2;
uint32_t y = 1 - a*x0;
uint32_t x1 = x0*(1 + y);
y *= y;
uint32_t x2 = x1*(1 + y);
y *= y;
uint32_t x3 = x2*(1 + y);
return x3;
}``````

Figure 1: C code for the 32 bit integer inverse.

``````uint64_t multiplicative_inverse(uint64_t a)
{
assert(a%2 == 1);  // the inverse (mod 2<<64) only exists for odd values
uint64_t x0 = (3*a)^2;
uint64_t y = 1 - a*x0;
uint64_t x1 = x0*(1 + y);
y *= y;
uint64_t x2 = x1*(1 + y);
y *= y;
uint64_t x3 = x2*(1 + y);
y *= y;
uint64_t x4 = x3*(1 + y);
return x4;
}``````

Figure 2: C code for the 64 bit integer inverse.

For details, see the paper for a full description of the algorithm, the proof, and latency analysis.

Achieving High Throughput

We can generally achieve highest throughput by using Montgomery’s trick, since it lets us calculate more than one inverse at the cost of only one inverse and some extra multiplications. The easiest way to explain the trick is to demonstrate it as a way to calculate two inverse results. We’ll use two identities:

a-1 (a * b)-1 * b (mod n)
b-1 (a * b)-1 * a (mod n)

The identities let us inexpensively find the inverses of both a and b (mod n). We first calculate the modular multiplication c a*b (mod n), and then calculate just a single modular inverse c-1, which we use as a modular multiplier to get our desired inverses a-1 and b-1. The total cost is one inverse and three modular multiplications.

In practice, the modular multiplications above will often be equivalent to plain multiplications on a CPU. Since for this discussion we are interested in finding the inverse modulo some power of 2, we know our modulus n is a power of 2. For the common case that we set the power equal to the CPU’s native bit width (e.g. resulting in n = 232 or 264), each modular multiplication above could be performed by a single basic CPU multiplication instruction. Calculating the single inverse is usually the only “hard” part.

The following code demonstrates Montgomery’s trick to calculate two 64 bit inverses:

``````void dual_inverse(uint64_t a, uint64_t b, uint64_t* inv_a, uint64_t* inv_b)
{
assert(a%2 == 1);  // the inverse (mod 2<<64) only exists for odd values
assert(b%2 == 1);
uint64_t c = a*b;
uint64_t c_inv = multiplicative_inverse(c);
*inv_a = c_inv * b;
*inv_b = c_inv * a;
}``````

Figure 3: C code to efficiently calculate two 64 bit integer inverses.

This trick can be extended to obtain a large number of modular inverses. The cost remains just a single inverse calculation, accompanied by a large number of modular multiplications.

Copyright (c) 2022 by Jeffrey Hurchalla.

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Posted in Uncategorized | 1 Comment

Computing the Modular Multiplicative Inverse

We’ve previously explored the Extended Euclidean algorithm, and it’s easy to use a special case of it to implement the modular multiplicative inverse. We’ll start by reproducing the final function from an older post that derived a correct and efficient implementation for the Extended Euclidean algorithm, using inputs with an unsigned integral type:

``````template <class S, class U>
void extended_euclidean(U a, U b, U& gcd, S& x, S& y)
{
static_assert(std::numeric_limits<S>::is_integer);
static_assert(std::numeric_limits<S>::is_signed);
static_assert(std::numeric_limits<U>::is_integer);
static_assert(!(std::numeric_limits<U>::is_signed));
static_assert(std::is_same<typename std::make_signed<U>::type, S>::value);
S x1=1, y1=0, x0=0, y0=1;
U a1=a, a2=b, q=0;

while (a2 != 0) {
S x2 = x0 - static_cast<S>(q)*x1;
S y2 = y0 - static_cast<S>(q)*y1;
x0=x1; y0=y1;
U a0=a1;
x1=x2; y1=y2; a1=a2;

q = a0/a1;
a2 = a0 - q*a1;
}
x = x1;
y = y1;
gcd = a1;
}``````

Figure 1: Extended Euclidean Algorithm, Unsigned Integer Inputs.

There’s something particularly interesting about Figure 1, given that it computes the greatest common divisor d and the Bézout coefficients x and y. By starting from the definition of the Bézout coefficients and then taking the remainder of both sides modulo a (which requires a > 0), we can see that
a*x + b*y == gcd(a,b) == d
a*x + b*y == d
a*x + b*y ≡ d    (mod a)
b*y ≡ d   (mod a)

If a and b are coprime, then gcd(a,b) == d == 1, and continuing from above,
b*y ≡ 1   (mod a)
So under this condition, y is the modular multiplicative inverse (modulo a) of the input b.

Without knowing whether a and b are coprime, let’s assume some integer z exists such that
b*z ≡ 1   (mod a)
By the definition of congruence, this would mean that some integer k exists such that
b*z + a*k = 1
As above, let d be the greatest common divisor of a and b. By definition d divides both a and b, and so d divides b*z + a*k = 1, and thus d divides 1. Since d divides 1, d must be either -1 or 1. And since the greatest common divisor can not be negative, this means d is 1. Thus, if an integer z exists that is the modular multiplicative inverse (modulo a) of b, then a and b are coprime. Putting this together with the paragraph above, we know that the modular multiplicative inverse (modulo a) of b exists if and only if a and b are coprime.

So let us assume that a and b are coprime, and that therefore the inverse exists. We just showed for this case that y is the modular multiplicative inverse (modulo a) of the input b.  And we can see that Figure 1 calculates and assigns the value of y upon completion.  We can also see in Figure 1 that x and all the calculations of x0, x1, x2 go completely unused when we are finding the modular multiplicative inverse. We noted that we must require a > 0 since we are using a for the modulo operation, but we will take the requirement further and assert that a > 1, since it makes our final work easier and a modulus of 1 would be pointless anyway (everything would be always congruent to 0). We can use these insights to change Figure 1 into a function to calculate the modular multiplicative inverse:

``````template <class S, class U>
void modular_multiplicative_inverse(U a, U b, U& gcd, S& y)
{
static_assert(std::numeric_limits<S>::is_integer);
static_assert(std::numeric_limits<S>::is_signed);
static_assert(std::numeric_limits<U>::is_integer);
static_assert(!(std::numeric_limits<U>::is_signed));
static_assert(std::is_same<typename std::make_signed<U>::type, S>::value);
assert(a > 1);
S y1=0, y0=1;
U a1=a, a2=b, q=0;

while (a2 != 0) {
S y2 = y0 - static_cast<S>(q)*y1;
y0=y1;
U a0=a1;
y1=y2;
a1=a2;

q = a0/a1;
a2 = a0 - q*a1;
}
y = y1;
gcd = a1;
}``````

Figure 2: Modular Multiplicative Inverse (Can Be Improved).

We can improve the function above by returning the modular multiplicative inverse rather than setting y to its value.  We can mostly remove the gcd as well; its main value to a caller is that it lets a caller know if the input a and the modulus are coprime (if not, the modular multiplicative inverse of a would not exist).  Instead, as a convenience to a client of the function we can return zero if the inverse does not exist, since a valid modular multiplicative inverse can never be zero.  After making these changes, we have the following:

``````template <class S, class U>
S modular_multiplicative_inverse(U a, U b)
{
static_assert(std::numeric_limits<S>::is_integer);
static_assert(std::numeric_limits<S>::is_signed);
static_assert(std::numeric_limits<U>::is_integer);
static_assert(!(std::numeric_limits<U>::is_signed));
static_assert(std::is_same<typename std::make_signed<U>::type, S>::value);
assert(a > 1);
S y1=0, y0=1;
U a1=a, a2=b, q=0;

while (a2 != 0) {
S y2 = y0 - static_cast<S>(q)*y1;
y0=y1;
U a0=a1;
y1=y2;
a1=a2;

q = a0/a1;
a2 = a0 - q*a1;
}
S y = y1;
U gcd = a1;
if (gcd == 1)
return y;
else
return 0;
}``````

Figure 3: Modular Multiplicative Inverse (Almost Done).

This still can be improved. We might as well rename a to modulus, since that’s what it is. We will also rename b to val in the function interface. And returning a signed value is a bit strange, since it means we might be returning a negative value. We can note in passing that a negative return value can be correct despite its strangeness – so long as the returned value belongs to the congruence class of the answer, modulo the modulus, it is technically correct. Nevertheless, it would be more convenient for a client of the function if we return the smallest non-negative integer belonging to that congruence class and return it as type U. And since the function interface won’t use template type S anymore, we can drop it from the template parameters. These changes give us the final implementation:

``````template <class U>
U modular_multiplicative_inverse(U modulus, U val)
{
static_assert(std::numeric_limits<U>::is_integer);
static_assert(!(std::numeric_limits<U>::is_signed));
using S = typename std::make_signed<U>::type;
assert(modulus > 1);
S y1=0, y0=1;
U a1=modulus, a2=val, q=0;

while (a2 != 0) {
S y2 = y0 - static_cast<S>(q)*y1;
y0=y1;
U a0=a1;
y1=y2;
a1=a2;

q = a0/a1;
a2 = a0 - q*a1;
}
S y = y1;
U gcd = a1;
if (gcd == 1)
return (y >= 0) ? static_cast<U>(y) : static_cast<U>(y + modulus);
else
return 0;
}``````

Figure 4: Final Modular Multiplicative Inverse.

We do need to explain the conditional at the end, which returns either y or y+modulus depending on whether y is negative or not. How do we know that y+modulus will be non-negative? And how do we know that the returned value will be the smallest non-negative integer that belongs to the answer’s congruence class? An earlier blog post on bounds for the Extended Euclidean Algorithm had a section called Final Bounds, which referred to combined essential assertion results. One of those assertion results wasn’t mentioned in the blog entry since it was only a step toward the final bounds discussed by the blog entry, but for our purposes now that assertion result gives us the information we need. It showed that at the end of the Extended Euclidean function,
abs(y0) <= max(1,a/2)
Figure 4 is derived from the unsigned version of the Extended Euclidean algorithm, which in turn is derived from the normal Extended Euclidean implementation. So this bound applies to it, but we need to account for changes in variable names.  y0 from the essential asserts file corresponds to y in Figure 4, and a corresponds to modulus, and so we have in Figure 4 the final bounds
abs(y) <= max(1,modulus/2)
Figure 4 asserts at its start that modulus > 1, which means modulus/2 will be greater than or equal to 1, and thus we can omit the max() portion, giving us our final bounds as
abs(y) <= modulus/2
Another way to state this is
modulus/2 <= y <= modulus/2
It is easy to see that if y is negative, adding modulus to it will provide a non-negative result that is less than modulus. And if y is non-negative, then y is obviously less than modulus. Thus our returned result will be the smallest non-negative integer that belongs to the answer’s congruence class.

We’ve now got a nice implementation of the modular multiplicative inverse, derived from the unsigned version of the Extended Euclidean algorithm.

Posted in Uncategorized | 1 Comment

[C/C++] Surprises and Undefined Behavior From Unsigned Integer Promotion

“There are far too many integer types, there are far too lenient rules for mixing them together, and it’s a major bug source, which is why I’m saying stay as simple as you can, use [signed] integers til you really really need something else.” -Bjarne Stroustrup, (Q&A at 43:00)

“Use [signed] ints unless you need something different, then still use something signed until you really need something different, then resort to unsigned.” -Herb Sutter, (same Q&A)

This is good and easy advice. Though if you’re curious, you probably also want to know why it’s good advice. And if you really need an unsigned type, you probably want to know if there’s anything you can or should do to avoid bugs; for solutions, just skip ahead to the Recommendations.

Surprises

You could run into a few problems with unsigned integer types. The following code might be quite surprising:

```#include <limits>
#include <iostream>

int main()
{
// assume this static assert passes
static_assert(sizeof(unsigned short) < sizeof(int));
unsigned short one = 1;
unsigned short max = std::numeric_limits<unsigned short>::max();

unsigned short sum = one + max;
if (sum == one + max)
std::cout << "sum = one + max, and sum == one + max\n";
else
std::cout << "sum = one + max, but sum != one + max\n";
return 0;
}
```

Figure 1

If you run it you’ll get the output

```sum = one + max,  but sum != one + max
```

Here’s a link to it on wandbox if you want to try it out. There’s no undefined behavior in the program or compiler bugs.

The surprising result occurs due to “integral promotion”. You can see the provided link for the precise integer promotion rules, but in practice, the rules mean that during a math operation or comparison, any integer types smaller (in bit-width) than type int will be implicitly converted by the compiler to type int.

This means that in Figure 1, if the static_assert passes, the assignment

```   unsigned short sum = one + max;
```

will be translated by the compiler into

```   unsigned short sum = (unsigned short)((int)one + (int)max);
```

Let’s work with concrete numbers and assume your compiler uses a 16 bit unsigned short type and a 32 bit int type – this is very common, though not universal. In Figure 1, the unsigned short variable max will be assigned the value 65535, and will retain this value when converted to type int. The variable one will be assigned the value 1, and will retain this value after being converted to type int. The addition of these two (converted/promoted) type int values results in the value 65536, which is easily representable in a 32 bit int type, and so there won’t be any overflow or undefined behavior from the addition. The compiler will cast that result from type int to type unsigned short in order to assign it to variable sum. The value 65536 isn’t representable in a 16 bit unsigned short (sum‘s type), but the conversion is well-defined in C and C++; the conversion gets performed modulo 2N, where N is the bit width of type unsigned short. In this example, N=16 and thus the conversion of 65536 will result in the value 0, which will be assigned to sum.

A similar process occurs on the line

```if (sum == one + max)
```

except that there isn’t any final narrowing conversion back to unsigned short. Here’s what happens: As before, one and max are promoted to type int prior to addition, resulting in a type int summation value of 65536. When evaluating the conditional, the left hand side (sum) is promoted to type int, and the right hand side (the summation 65536) is already type int. 65536 compares as unequal to sum, since sum was assigned the value 0 earlier. A narrowing conversion to unsigned short took place when sum was assigned. However, the conditional operator works with operands of type int, and so the right hand side summation never gets a similar conversion down to unsigned short. It stays as type int. We end up with the unexpected output “`sum = one + max, but sum != one + max"`

Hidden integral promotions and narrowing conversions are subtle, and the results can be surprising, which is usually a very bad thing. For *signed* integral types, there generally isn’t any problem with promotion. It’s the promotion of *unsigned* integral types that’s problematic and bug-prone.

Let’s look at a second surprise from unsigned integer promotion:

```#include <limits>
#include <iostream>

int main()
{
unsigned short one = 1;
unsigned short max = std::numeric_limits<unsigned short>::max();
unsigned int sum = one + max;
std::cout << "sum == " << sum << "\n";
return 0;
}
```

Figure 2

If you run Figure 2 on a system where unsigned short and int are both 16bit types, the program will output “sum == 0”. Since unsigned short and int are the same size, the operands one and max will not be promoted, and the addition will overflow in a well defined manner resulting in 0. If on the other hand you run Figure 2 on a system where unsigned short is a 16bit type and int is a 32 bit type, the operands one and max will be promoted to type int prior to the addition and no overflow will occur; the program will output “sum == 65536”. The integral promotion results in non-portable code.

Undefined Behavior

Now that we’re familiar with integral promotion, let’s look at a simple function:

```unsigned short multiply(unsigned short x, unsigned short y)
{
// assume this static assert passes
static_assert(sizeof(unsigned short) * 2 == sizeof(int));

unsigned short result = x * y;
return result;
}
```

Figure 3

Despite all lines seeming to involve only type unsigned short, there is a potential for undefined behavior in Figure 3 on line 6 due to possible signed integer overflow on type int. The compiler will implicitly perform integral promotion on line 6, so that the multiplication will involve two (promoted/converted) operands of type int, not of type unsigned short. If for our compiler unsigned short is 16 bit and int is 32 bit, then any product of x and y larger than 2^31 will overflow the signed type int. And unfortunately, signed integral overflow is undefined behavior. It doesn’t matter that overflow of unsigned integral types is well-defined behavior in C and C++. No multiplication of values of type unsigned short ever occurs in this function.

Let’s finally look at a contrived toy function:

```unsigned short toy_shift(unsigned short x, unsigned short y)
{
// assume this static assert passes
static_assert(sizeof(unsigned short) < sizeof(int));

unsigned short result = (x-y) << 1;
if (x >= y)
return 0;
return result;
}
```

Figure 4

The subtraction operator in Figure 4 has two unsigned short operands x and y, both of which will be promoted to type int. If x is less than y then the result of the subtraction will be a negative number, and left shifting a negative number is undefined behavior.  Keep in mind that if the subtraction had involved unsigned integral types (as it would appear on the surface), the result would have underflowed in a well-defined manner and wrapped around to become a large positive number, and the left shift would have been well-defined. But since integral promotion occurs, the result of a left shift when x is less than y would be undefined behavior.

An interesting consequence of the potential for undefined behavior in Figure 4 is that any compiler would be within its rights to generate “optimized” object code for the function (if the static_assert succeeds) that is very fast and almost certainly unintended by the programmer, equivalent to

```unsigned short toy_shift(unsigned short x, unsigned short y)
{
return 0;
}
```

To see why, we need to understand how modern compilers can use undefined behavior. For better or worse, modern C/C++ compilers commonly use undefined behavior to optimize, by taking advantage of the fact that undefined behavior is impossible in any valid code. It’s somewhat controversial whether compilers really ought to ever do this, but the reality is that in the present day it’s an extremely common optimization technique, and nothing in the C/C++ standards forbids it. With regard to Figure 4, this means a compiler could assume the conditional (x >= y) in toy_shift() will always succeed – because the alternative would be that the function had undefined behavior from left shifting a negative number, and the compiler knows that undefined behavior is impossible for valid code. The compiler always assumes that we have written valid code unless it can prove otherwise (in which case we’d get a compiler error message). We might incorrectly think that the compiler can’t make any assumptions about the arguments to the toy_shift() function because it can’t predict what arbitrary calling code might do, but the compiler can make some limited predictions. It can assume that calling code will never use any arguments that result in undefined behavior, because getting undefined behavior would be impossible from valid calling code. The compiler can therefore conclude that with valid code, there is no scenario in which the conditional could possibly fail, and it could use this knowledge to “optimize” the function, producing object code that simply returns 0. [For scant reassurance, I haven’t seen a compiler do this (yet) for Figure 4.]

The Integral Types Which May be Promoted

Integral promotion involves some implementation-defined behavior.  It’s up to the compiler to define the exact sizes for the types char, unsigned char, signed char, short, unsigned shortint, unsigned int, long, unsigned long, long long, and unsigned long long.  The only way to know if one of these types has a larger bit-width than another is to check your compiler’s documentation, or to compile/run a program that outputs the sizeof() result for the types.  Thus it’s implementation defined whether int has a larger bit width than unsigned short, and by extension it’s implementation defined whether unsigned short will be promoted to type int.  The standard does effectively guarantee that types int, unsigned int, long, unsigned long, long long, and unsigned long long will never be promoted.  Floating point types of course are never subjected to integral promotion.

But this leaves far more integral types than you might expect which may (at least in principle) be promoted. A non-exhaustive list of types that might be promoted is

char, unsigned char, signed char, short, unsigned short, int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, int64_t, uint64_t, int_fast8_t, uint_fast8_t, int_least8_t, uint_least8_t, int_fast16_t, uint_fast16_t, int_least16_t, uint_least16_t, int_fast32_t, uint_fast32_t, int_least32_t, uint_least32_t, int_fast64_t, uint_fast64_t, int_least64_t, uint_least64_t

Surprisingly, all the sized integral types (int32_t, uint64_t, etc) are open to possible integral promotion, dependent upon the implementation-defined size of int.  For example, it’s plausible that there could someday be a compiler that defines int as a 64 bit type, and if so, int32_t and uint32_t will be subject to promotion to that larger int type.  Compiler writers are likely to understand this could break older code that implicitly assumes uint32_t won’t be promoted, but there’s no guarantee. In theory there’s nothing in the standard that would prevent a future compiler from defining int as even a 128 bit type, and so we have to include int64_t and uint64_t in the list of types that could at least in theory be promoted, all dependent on how the compiler defines type int.

Very realistically in code today, unsigned char, unsigned short, uint8_t and uint16_t (and also uint_least8_t, uint_least16_t, uint_fast8_t, uint_fast16_t) should be considered a minefield for programmers and maintainers.  On most compilers (defining int as at least 32 bit), these types don’t behave as expected.  They will usually be promoted to type int during operations and comparisons, and so they will be vulnerable to all the undefined behavior of the signed type int. They won’t be protected by any well-defined behavior of the original unsigned type, since after promotion the types are no longer unsigned.

Recommendations

Sometimes you really do need unsigned integers. Unsigned int, unsigned long, and unsigned long long are all more or less safe since they’re never promoted. But if you use an unsigned type from the last section, or if you use generic code that expects an unsigned integer type of unknown size, that type can be dangerous to use due to promotion.

For C, there’s a workable solution to the unsigned integer promotion problem. Whenever you use a variable with unsigned type smaller than unsigned int, add 0u to it within parentheses. For example, when multiplying two unsigned short variables a and b, you can write (a+0u)*(b+0u). Since 0u has type unsigned int, adding it to a or b will effectively manually “promote” the operand to unsigned int if it has type smaller than unsigned int. This is useful both for mathematical operations and comparisons to ensure that operands won’t get unexpectedly promoted to type int. A disadvantage is maintainers may not understand its meaning when seeing it. An alternative you can use is to explicitly cast an operand to unsigned int, which works fine for unsigned char, unsigned short, uint8_t and uint16_t. Avoid using this particular solution on any operand of type uint32_t (or any even larger fixed width type), since unsigned int has an implementation defined size (of at least 16 bits) and this size could be smaller than uint32_t on some systems – potentially resulting in an undesired narrowing cast.

For C++, there’s a fairly good solution. You can use the following helper class to get a safe type that you can use as a safe destination type for explicit casts on your unsigned (or generic) integer types during mathematical operations and comparisons. The explicit casting prevents implicit promotion of unsigned types to int. The effect is that unsigned types smaller than unsigned int will be (manually) promoted to unsigned int, and unsigned types larger than unsigned int will be unchanged. This helper class provides a safe and relatively easy way to achieve well-defined behavior with all unsigned integer types, as we’ll see by example.

```#include <limits>

template <class T>
struct safely_promote_unsigned {
static_assert(std::numeric_limits<T>::is_integer, "");
static_assert(!std::numeric_limits<T>::is_signed, "");
public:
// If T can be promoted, then 'type' will be the
// unsigned version of T's promoted type.
// Otherwise 'type' will be the same as T.
using type = decltype(0u + static_cast<T>(0));
};
template <class T>
using safely_promote_unsigned_t =
typename safely_promote_unsigned<T>::type;
```

To illustrate the use of safely_promote_t, let’s write a template function version of Figure 3 that is free from any undefined behavior when T is an unsigned integer type:

```template <class T>
T unsigned_multiply(T x, T y)
{
static_assert(std::numeric_limits<T>::is_integer, "");
static_assert(!std::numeric_limits<T>::is_signed, "");
using U = safely_promote_unsigned_t<T>;
T result = static_cast<U>(x) * static_cast<U>(y);
return result;
}
```

Of course the best solution of all came from the introductory advice: use a signed integral type instead of unsigned types whenever you can.

Reference

The C++17 standard has multiple sections that involve integral promotion. For reference, here are the excerpts/summaries from the relevant parts of the C++17 standard draft:

7.6 Integral promotions [conv.prom]
1 A prvalue of an integer type other than bool, char16_t, char32_t, or wchar_t whose integer conversion rank (7.15) is less than the rank of int can be converted to a prvalue of type int if int can represent all the values of the source type; otherwise, the source prvalue can be converted to a prvalue of type unsigned int.

8 Expressions [expr]
11 Many binary operators that expect operands of arithmetic or enumeration type cause conversions […These are] called the usual arithmetic conversions.
[… If neither operand has scoped enumeration type, type long double, double, or float,] the integral promotions (7.6) shall be performed on both operands.

8.3.1 Unary operators [expr.unary.op] (parts 7, 8, 10)
[For the unary operators +, -, ~, the operands are subject to integral promotion.]

8.6 Multiplicative operators [expr.mul]
[Binary operators *, /, %]
2 The usual arithmetic conversions are performed on the operands and determine the type of the result.

1 The additive [binary] operators + and – group left-to-right. The usual arithmetic conversions are performed for operands of arithmetic or enumeration type.

8.8 Shift operators [expr.shift]
[For the binary operators << and >>, the operands are subject to integral promotion.]

8.9 Relational operators [expr.rel]
[<, <=, >, >=]
2 The usual arithmetic conversions are performed on operands of arithmetic or enumeration type

8.10 Equality operators [expr.eq]
[==, !=]
6 If both operands are of arithmetic or enumeration type, the usual arithmetic conversions are performed on both operands

8.11 Bitwise AND operator [expr.bit.and]
1 The usual arithmetic conversions are performed;

8.12 Bitwise exclusive OR operator [expr.xor]
1 The usual arithmetic conversions are performed;

8.13 Bitwise inclusive OR operator [expr.or]
1 The usual arithmetic conversions are performed;

Posted in Uncategorized | 3 Comments

Implementing the Extended Euclidean Algorithm with Unsigned Inputs

As the previous post showed, it’s possible to correctly implement the Extended Euclidean Algorithm using one signed integral type for all input parameters, intermediate variables, and output variables. None of the calculations will overflow. The implementation was given as follows:

``````template <class T>
void extended_euclidean(const T a, const T b, T* pGcd, T* pX, T* pY)
{
static_assert(std::numeric_limits<T>::is_integer, "");
static_assert(std::numeric_limits<T>::is_signed, "");
assert(a >= 0 && b >= 0);   // precondition
T x0=1, y0=0, a0=a;
T x1=0, y1=1, a1=b;

while (a1 != 0) {
T q = a0/a1;
T a2 = a0 - q*a1;
T x2 = x0 - q*x1;
T y2 = y0 - q*y1;
x0=x1; y0=y1; a0=a1;
x1=x2; y1=y2; a1=a2;
}
*pX = x0;
*pY = y0;
*pGcd = a0;
}``````

Figure 1: C++ Extended Euclidean Function, Signed Integer Inputs.

The precondition a >= 0 && b >= 0 is required because the proof from the last post didn’t consider any cases where a or b was negative. I haven’t investigated whether the Extended Euclidean algorithm might be valid for any and all cases of negative inputs, so I won’t speculate about what might be possible for those cases. Instead let’s focus on the proven case a >= 0 && b >= 0.

It seems a bit wasteful to use signed integers for inputs a and b when the function precondition disallows them from ever being negative. Why not just use unsigned integer inputs that have the same precision as the signed integer inputs would have had, and double the permissible input range? If we need more persuasion, pretty likely our primary application for the Extended Euclidean Algorithm is to implement the modular multiplicative inverse. It’s common to use unsigned integral types for modular arithmetic. Why not increase compatibility by keeping to that convention?

It turns out it’s very much possible to correctly implement a modified version of the algorithm using unsigned inputs. You can skip directly to the implementation code or to the optimized version of that code if you like.

Adapting the Algorithm to Use Unsigned Inputs

Changing the inputs in Figure 1 to unsigned types seems like it might be dangerous – we might get an overflow in one of the variables or calculations inside the function. The last post proved bounds for those variables and calculations, and for an all-signed implementation it nicely ruled out overflow. Those bounds will guide us now in changing Figure 1 to safely use unsigned inputs.

From the bounds proven in the last post, we know that for every loop iteration in Figure 1,
`0 <= q <= max(a,b)`
`0 <= a2 < max(a,b)`

Assuming we go ahead and change the type for inputs a and b to be unsigned, the bounds above make it clear that in order to be certain to avoid overflow we need to change q and a2 to also use that same unsigned type. Looking at Figure 1, we can also see that the contents of a2 propagate to a1, a0, and *pGcd. As a result, we need to change all those variables to use that same unsigned integer type.

The more problematic variables are x2 and y2 (and likewise x1, y1, x0, y0, *pX, and *pY, since the values of x2 and y2 propagate to them).
Displaying more of the bounds proven in the last post, throughout every loop iteration in Figure 1 we know that
`abs(x2) <= b`
`abs(y2) <= max(1,a)`

On the one hand, we can see that x2 and y2 must remain signed types, since the bounds above don’t prevent them from taking on negative values. But on the other hand, the bounds above allow them to become as large as a or b, and because a and b are now an unsigned type, the largest possible value for a or b would overflow the corresponding signed type (which we determined x2 and y2 must use). Fortunately we can resolve this conflict, by using less generally applicable but tighter bounds for these variables. Let’s refer to the second set of proven bounds from the last post. For every loop iteration other than the final iteration (when a2 == 0), we know the following tighter bounds hold:
`abs(x2) <= max(1,b/2)`
`abs(y2) <= max(1,a/2)`

Superficially, it seems like these bounds should be enough to guarantee we can use signed types for x2 and y2 without any problems, since b/2 and a/2 will always fit in their corresponding signed type. The only loop iteration not covered by these bounds is the final iteration, and in Figure 1 we can see that the calculated values of x2 and y2 on that last iteration are never used anywhere. So seemingly overflow (if it happened) would be fine at that point. But in truth it’s unacceptable, because overflow of a signed integral type is undefined behavior in C and C++. Upon overflow on the last iteration, a reasonable possibly is everything would be okay, but another reasonable possibility is a trap would occur that crashes the program. Anything is allowed after invoking undefined behavior, so in the end it’s hopeless to try to reason about whether the program would be okay. The only safe thing to do is to view any undefined behavior as a fatal bug that must be fixed.

Since we can not allow undefined behavior, we must ensure signed integer overflow is completely impossible in the function. Fortunately this is easy – we can simply test if we are in the final loop iteration, through an early check of the loop conditional, and break out if so. Most of the work in the final iteration is irrelevant since it’s never used later. The only exception is the reassignments (x0=x1; y0=y1; a0=a1), which still need to be done before causing a break.

We’re not quite finished, but let’s incorporate all the changes discussed so far, so that we have an in-progress implementation:

``````template <class S, class U>
void extended_euclidean(const U a, const U b, U* pGcd, S* pX, S* pY)
{
static_assert(std::numeric_limits<S>::is_integer, "");
static_assert(std::numeric_limits<S>::is_signed, "");
static_assert(std::numeric_limits<U>::is_integer, "");
static_assert(!(std::numeric_limits<U>::is_signed), "");
static_assert(std::is_same<typename std::make_signed<U>::type, S>::value, "");
S x0=1, y0=0;
U a0=a;
S x1=0, y1=1;
U a1=b;

while (a1 != 0) {
U q = a0/a1;
U a2 = a0 - q*a1;
if (a2 == 0) {
x0=x1; y0=y1; a0=a1;
break;
}
S x2 = x0 - q*x1;
S y2 = y0 - q*y1;
x0=x1; y0=y1; a0=a1;
x1=x2; y1=y2; a1=a2;
}
*pX = x0;
*pY = y0;
*pGcd = a0;
}``````

Figure 2: In-Progress (And Broken) Implementation, Unsigned Integer Inputs.

Figure 2 has a hidden problem that we’ll need to fix. We determined above that q must be unsigned and x1 must be signed, and indeed Figure 2 reflects this. As a result, during the calculation of (q*x1), the compiler will implicitly convert x1 to the unsigned type of q in order to carry out the calculation. Quoting from the C++17 standard section 8 [expr] (11.5.3):
“Many binary operators that expect operands of arithmetic type cause conversions… The following rules shall be applied… if the operand that has unsigned integer type has rank equal to the rank of the type of the other operand, the operand with signed integer type shall be converted to the type of the operand with unsigned integer type.”
[The C standard specifies very similar behavior.]
This is a problem – x1 may legitimately hold a negative value, and yet the compiler will be implicitly converting it to an unsigned integer type! Converting a negative integer value to an unsigned type is well defined in the C and C++ standards, but it rarely does what a programmer wants when it’s unintentional. This conversion certainly wasn’t planned in Figure 2. For the multiplication (q*x1), what we really wanted was to convert the unsigned q variable to signed, and to leave x1 signed. Fortunately we can make this happen without much trouble by explicitly casting q to signed: for example, static_cast<S>(q). This explicit cast will also ensure that the compiler will perform no implicit conversions on x1, leaving it as signed, since the types of q (after the cast) and x1 will match.

The exact same problem we just discussed exists in Figure 2 for the interim calculation of (q*y1) too, and can be solved again with an explicit cast. No casts or changes are needed for the interim calculation (q*a1) – it has no problem with conversions since q and a1 both have the same unsigned type.

Explicitly casting q to its corresponding signed integral type works well in the code of Figure 2. We know from the bounds proven in the last post that on every loop iteration except the final iteration,
`0 <= q <= max(a,b)/2`
It follows that for all loop iterations other than the final iteration, q will be less than or equal to the largest possible unsigned value divided by two. Thus, for those iterations, the value of q can be represented in the corresponding signed type, which means the explicit cast of q to that signed type will not overflow. On the final iteration we can see that Figure 2 has a break which occurs before it gets to the only interim calculations (q*x1 and q*y1) which require the explicit cast of q. And so no casts of q will take place on the final loop iteration. Thus none of the casts of q will ever overflow the signed result type on any loop iteration.

Let’s incorporate the explicit casts of q, which will fix the code from Figure 2.

A Correct Implementation

``````template <class S, class U>
void extended_euclidean(const U a, const U b, U* pGcd, S* pX, S* pY)
{
static_assert(std::numeric_limits<S>::is_integer, "");
static_assert(std::numeric_limits<S>::is_signed, "");
static_assert(std::numeric_limits<U>::is_integer, "");
static_assert(!(std::numeric_limits<U>::is_signed), "");
static_assert(std::is_same<typename std::make_signed<U>::type, S>::value, "");
S x0=1, y0=0;
U a0=a;
S x1=0, y1=1;
U a1=b;

while (a1 != 0) {
U q = a0/a1;
U a2 = a0 - q*a1;
if (a2 == 0) {
x0=x1; y0=y1; a0=a1;
break;
}
S x2 = x0 - static_cast<S>(q)*x1;
S y2 = y0 - static_cast<S>(q)*y1;
x0=x1; y0=y1; a0=a1;
x1=x2; y1=y2; a1=a2;
}
*pX = x0;
*pY = y0;
*pGcd = a0;
}``````

Figure 3: Correct Implementation, Unsigned Integer Inputs.

At this point, we have correct code in Figure 3. The remainder of this section proves that none of the interim calculations in Figure 3 can overflow, which finishes the code analysis and lets us optimize.

The interim calculations are q*a1, q*x1, and q*y1. Using the proven bounds from the last post, we can see that for every loop iteration the following bound holds:
`0 <= (q*a1) <= max(a,b)`
And for every loop iteration other than the final iteration (when a2 == 0), the following bounds hold:
`abs(q*x1) <= max(1,b/2)`
`abs(q*y1) <= max(1,a/2)`

If the result of q*a1 is representable in unsigned type U, then we know it does not overflow. The bound 0 <= (q*a1) <= max(a,b) shows that this requirement is always satisfied, because max(a,b) is less than or equal to std::numeric_limits<U>::max().

Similarly, if the result of static_cast<S>(q)*x1 is representable in signed type S, then it does not overflow. We’ll need to analyze two cases: the case where this result is non-negative, and the case where it is negative. Let’s first define the constants u_max = std::numeric_limits<U>::max(), s_max = std::numeric_limits<S>::max(), and s_min = std::numeric_limits<S>::min(). The C/C++ standards effectively mandate that u_max/2 <= s_max and -u_max/2 >= s_min.

Let’s consider first the case where 0 <= static_cast<S>(q)*x1. Repeating the bound given above, we know that for all loop iterations other than the final,
`abs(q*x1) <= max(1,b/2)`
For this case we can remove the absolute value from the bound, so that we have
`0 <= static_cast<S>(q)*x1 <= max(1,b/2)`
The input parameter b has unsigned type U, so we can expand this to
`0 <= static_cast<S>(q)*x1 <= max(1,b/2) <= u_max/2 <= s_max`
or in simplified form,
`0 <= static_cast<S>(q)*x1 <= s_max`
which shows us that for this case, static_cast<S>(q)*x1 will always be representable in type S, provided that we are not in the final loop iteration.

Let’s now consider the other case, where 0 > static_cast<S>(q)*x1. We can remove the absolute value from the bound, giving us for this case:
`-static_cast<S>(q)*x1 <= max(1,b/2)`
and thus
`0 > static_cast<S>(q)*x1 >= -max(1,b/2)`
The input parameter b has unsigned type U, so we can expand this to
`0 > static_cast<S>(q)*x1 >= -max(1,b/2) >= -u_max/2 >= s_min`
which shows us that for this case, static_cast<S>(q)*x1 will always be representable in type S, provided that we are not in the final loop iteration.

Since in both cases static_cast<S>(q)*x1 is always representable in type S, we know that this multiplication will not overflow, so long as we are not in the final loop iteration. If we are in the final loop iteration, Figure 3 shows that we will break out of the loop before reaching this calculation, and so no problems will be possible on the final loop iteration. The same reasoning as above can be applied to show that no overflow will occur in Figure 3 for the multiplication of static_cast<S>(q)*y1.

We’ve shown that none of the calculations or conversions in Figure 3 will overflow, and we can conclude that the implementation in Figure 3 is probably correct. You can see the Conclusion for a test project if you wish to run it.

The one fairly minor complaint we might have with Figure 3 is that the conditional branch inside the loop isn’t ideal from a performance standpoint. It would be nice if we could remove the conditional branch. Indeed we can, by re-arranging the code.

Optimizing the Code

We can remove the conditional branch inside the loop of Figure 3 by rotating statements from the bottom of the loop to the top of the loop. Each rotation will require changing the code that precedes and follows the loop (and sometimes the loop condition too), so that the meaning of the function stays the same. Aside from the rotation, nothing inside the loop should change. The end goal of the rotations is to get the position of the nested conditional branch to be at the end of the loop, so that we can merge it into the loop condition.

Let’s first rotate the reassignments from the end of the loop in Figure 3 to the top of the loop. [If this description is hard to follow, it may help to compare Figure 3 with Figure 4.] The rotation affects the code after the loop, since x0, y0, and a0 won’t be reassigned upon the end of the final loop iteration. So to compensate, *pX, *pY, and *pGcd will now need to be set to the values of x1, y1, and a1. This change to *pX, *pY, and *pGcd will need to also be valid if the loop is never taken at all, so we’ll need to change the initial values of x1, y1, and a1 to the initial values in Figure 3 given to x0, y0, and a0. Fortunately, since we will have already done the rotation that places the loop reassignments at the top of the loop, this last change also ensures the correct values will get reassigned upon the first entry to the loop. At least in part. We haven’t discussed a second part of the effect of having the reassignments at the top of the loop: the reassignments use x2, y2, and a2, which are variables that haven’t been declared or initialized yet. We’ll fix this by adding declarations and initializations to the start of the function for x2, y2, and a2, setting them to the initial values that had formerly been given to x1, y1, and a1 in Figure 3. As a result, upon first entry of the loop, the reassignments at the top of the loop will re-set the correct initial values, leaving the meaning of the function unchanged.

There’s one last detail. We’ll need to change the loop condition to refer to a2 instead of a1, again due to the rotation of the reassignment statements. Since we’ll have already declared and initialized a2 at the top of the function (to the initial value formerly used in Figure 3 by a1), this preserve the meaning of the function perfectly.

It’s easiest to see all this in actual code. By rotating the loop’s reassignments in Figure 3, we’ll have changed the implementation to the following code:

``````template <class S, class U>
void extended_euclidean(const U a, const U b, U* pGcd, S* pX, S* pY)
{
static_assert(std::numeric_limits<S>::is_integer, "");
static_assert(std::numeric_limits<S>::is_signed, "");
static_assert(std::numeric_limits<U>::is_integer, "");
static_assert(!(std::numeric_limits<U>::is_signed), "");
static_assert(std::is_same<typename std::make_signed<U>::type, S>::value, "");
S x1=1, y1=0;
U a1=a;
S x2=0, y2=1;
U a2=b;

while (a2 != 0) {
S x0=x1;
S y0=y1;
U a0=a1;
x1=x2; y1=y2; a1=a2;

U q = a0/a1;
a2 = a0 - q*a1;
if (a2 == 0)
break;
x2 = x0 - static_cast<S>(q)*x1;
y2 = y0 - static_cast<S>(q)*y1;
}
*pX = x1;
*pY = y1;
*pGcd = a1;
}``````

Figure 4: Optimization Step 1, Unsigned Integer Inputs.

We haven’t yet removed the conditional branch inside the loop, but we’re closer to the goal. We still need to rotate the calculations of x2 and y2 from the bottom of the loop to the top, and again fix the code before and after the loop to preserve the meaning of the function. As it turns out, after rotating the calculations of x2 and y2 there’s no need to change any code after the function. But now that the calculations are at the top of the loop, the variable q gets used before it has been declared or initialized. In order to preserve the meaning of the function, we’ll need to add a declaration and initialization of q at the start of the function. Initializing q to zero works quite well. On the first entry of the loop, the calculations of x2 and y2 will thereby contain a multiplication by zero (since q == 0), effectively reducing those calculations on the first iteration to reassignments from x0 and y0. This means we’ll need to add to the start of the function declarations and initializations for x0 and y0, setting them to the initial values that Figure 4 had used for x2 and y2.

After making these changes to Figure 4, we have the following code:

``````template <class S, class U>
void extended_euclidean(const U a, const U b, U* pGcd, S* pX, S* pY)
{
static_assert(std::numeric_limits<S>::is_integer, "");
static_assert(std::numeric_limits<S>::is_signed, "");
static_assert(std::numeric_limits<U>::is_integer, "");
static_assert(!(std::numeric_limits<U>::is_signed), "");
static_assert(std::is_same<typename std::make_signed<U>::type, S>::value, "");
S x1=1, y1=0;
U a1=a;
S x0=0, y0=1;
U a2=b;
U q=0;

while (a2 != 0) {
S x2 = x0 - static_cast<S>(q)*x1;
S y2 = y0 - static_cast<S>(q)*y1;
x0=x1;
y0=y1;
U a0=a1;
x1=x2; y1=y2; a1=a2;

q = a0/a1;
a2 = a0 - q*a1;
if (a2 == 0)
break;
}
*pX = x1;
*pY = y1;
*pGcd = a1;
}``````

Figure 5: Optimization Step 2, Unsigned Integer Inputs.

We can now get rid of the conditional branch at the end of the function, since it’s redundant. Making this change and cleaning up the code a little, we get the following optimized function for the Extended Euclidean Algorithm with unsigned integer inputs:

An Optimized Implementation

``````template <class S, class U>
void extended_euclidean(const U a, const U b, U* pGcd, S* pX, S* pY)
{
static_assert(std::numeric_limits<S>::is_integer, "");
static_assert(std::numeric_limits<S>::is_signed, "");
static_assert(std::numeric_limits<U>::is_integer, "");
static_assert(!(std::numeric_limits<U>::is_signed), "");
static_assert(std::is_same<typename std::make_signed<U>::type, S>::value, "");
S x1=1, y1=0;
U a1=a;
S x0=0, y0=1;
U a2=b, q=0;

while (a2 != 0) {
S x2 = x0 - static_cast<S>(q)*x1;
S y2 = y0 - static_cast<S>(q)*y1;
x0=x1; y0=y1;
U a0=a1;
x1=x2; y1=y2; a1=a2;

q = a0/a1;
a2 = a0 - q*a1;
}
*pX = x1;
*pY = y1;
*pGcd = a1;
}``````

Figure 6: Optimized Implementation, Unsigned Integer Inputs.

Conclusion

The Extended Euclidean algorithm can be correctly and very efficiently implemented using unsigned integer inputs and signed integer outputs, so long as all the function parameters have the same precision. Figure 3 provides the basic implementation and Figure 6 provides the optimized implementation. There is a header file with the complete implementation, and a CMake file that will create a test project for the implementation if you wish to run it.

Posted in Uncategorized | 11 Comments

Proof of Bounds for the Extended Euclidean Algorithm

Euclid’s method is a classic algorithm for finding the greatest common divisor (gcd) of two integers. Donald Knuth referred to it as “the granddaddy of all algorithms, because it is the oldest nontrivial algorithm that has survived to the present day.”[1] There exists a more generalized form for Euclid’s method, which is known as the Extended Euclidean Algorithm. Given two integers a and b, this algorithm finds the integer d = gcd(a,b) and two integers x and y, such that a*x + b*y = d. The usefulness of that last result, a*x + b*y = d, may seem a bit questionable at first, but without going into any details it provides a way to compute the modular multiplicative inverse.

Knuth provides an excellent description and proof for this algorithm [2]. It’s a fascinating study in using induction and invariants. Translated into C++, the algorithm he describes takes this form:

``````template <class T>
void extended_euclidean(const T a, const T b, T* pGcd, T* pX, T* pY)
{
static_assert(std::numeric_limits<T>::is_integer, "");
static_assert(std::numeric_limits<T>::is_signed, "");
T x0=1, y0=0, a0=a;
T x1=0, y1=1, a1=b;

while (a1 != 0) {
T q = a0/a1;
T a2 = a0 - q*a1;
T x2 = x0 - q*x1;
T y2 = y0 - q*y1;
x0=x1; y0=y1; a0=a1;
x1=x2; y1=y2; a1=a2;
}
*pX = x0;
*pY = y0;
*pGcd = a0;
}``````

Figure 1: C++ Extended Euclidean Function.

This is the best we can do for program code, but there seems to be a serious problem. The algorithm itself presents all variables as mathematical integers, and that’s an infinite set. We have to question whether the variables in Figure 1 are bounded to any finite maximum and minimum values. Particularly for x2 and y2, it’s non-obvious from Figure 1 that those variables can’t grow without bound. Until proven otherwise, we must assume that any of the variables in Figure 1 could overflow any type T we choose. (If you’d like to know the answer to this question of overflow without proof, skip ahead to the Conclusion.)

It’s fairly easy to establish bounds for a2 at least. Informally, if a and b are non-negative, we can recognize that within the loop
`a2 = a0 - q*a1`
is equivalent to
`a2 = a0 % a1`
For this case it’s easy to see that on every iteration of the loop a2 will decrease, until it reaches zero, at which point the loop terminates.

Other areas of potential overflow are harder to analyze. Potential overflow could occur for x2 and y2, and also the interim calculations q*a1, q*x1, and q*y1.

In his second edition book[3] Knuth provides an exercise for the reader in which he states two bounds, which he attributes to George E. Collins. Most likely this refers to Collins’ paper Computing Multiplicative Inverses in GF(p)[4]. Collins’ paper is the best source for bounds I’ve seen for the Extended Euclidean algorithm. It presents most of the invariants and observations needed to bound the algorithm in a full proof, and states bounds for the variables (though no interim calculation bounds). It’s restricted to 0 < b < a, and doesn’t explicitly prove its invariants, but identifying the invariants is the hardest part of this kind of proof. Collins observes that the following equalities remain invariant throughout all of the algorithm:

1.  `gcd(x0, y0) == 1`
2.  `gcd(x1, y1) == 1`
3.  `abs(x0*y1 - y0*x1) == 1`
4.  `x0*y1 - y0*x1 == -(x1*y2 - y1*x2)`
5.  `a*x0 + b*y0 == a0`
6.  `a*x1 + b*y1 == a1`

With the precondition 0 <= b < a, you can read a proof that these invariants are maintained in extended_euclidean_collins.h.

Collins also observes that x0, x1, and x2 alternate in sign, as do y0, y1, and y2. We’ll need to restate his observation as an invariant in order to use it. We might try something like
`(x0 >= 0 ? x1 < 0 : x1 <= 0)`
initially, but a conditional invariant complicates a proof – plus it just feels a bit ugly. We’ll achieve better success by introducing a sign variable, which we can call s0. Initializing s0 to 1 before the loop and negating it at the end of each loop iteration, will produce the following change to the C++ code of Figure 1 (solely for the purpose of a proof):

``````template <class T>
void extended_euclidean(const T a, const T b, T* pGcd, T* pX, T* pY)
{
T x0=1, y0=0, a0=a;
T x1=0, y1=1, a1=b;
T s0 = 1;

while (a1 != 0) {
T q = a0/a1;
T a2 = a0 - q*a1;
T x2 = x0 - q*x1;
T y2 = y0 - q*y1;
x0=x1; y0=y1; a0=a1;
x1=x2; y1=y2; a1=a2;
s0 = -s0;
}
*pX = x0;
*pY = y0;
*pGcd = a0;
}``````

Figure 2: C++ Extended Euclidean Function With Sign Variable.

Since s0 switches between the values 1 and -1 on each iteration, we can now write invariants that express the alternating sign of x and y variables on each iteration:

1.  `abs(s0) == 1`
2.  `x0 == s0*abs(x0)`
3.  `x1 == -s0*abs(x1)`
4.  `y0 == -s0*abs(y0)`
5.  `y1 == s0*abs(y1)`

A sign variable lends itself well to a proof. It’s relatively simple, and it can be manipulated with the distributive property. For example, in extended_euclidean_collins.h (at annotation 57), the proof takes this assignment from Figure 2:
`y2 = y0 - q*y1`
and substitutes the expressions of invariants 10 and 11:
`y2 == -s0*abs(y0) - q*(s0*abs(y1))`
and then uses the distributive property:
`y2 == -s0*(abs(y0) + q*abs(y1))`

The (Nearly) Full Proof

So long as we require a >= 0 and b >= 0, we can prove bounds for the Extended Euclidean algorithm. We’ll do it in three parts, and we’ve already discussed some of the first part – the case of a > b. The second part is the case a < b, and the third part is the case a == b. It would be nice to also consider cases where a or b could be negative, but I expect the proof’s size would grow by a factor of two or three. It’s convenient instead to restrict the input range to non-negative values, given that even this restriction may be more general than applications need. For example, the very restricted case of 0 < b < a is sufficient to apply the Extended Euclidean algorithm to calculating the modular multiplicative inverse.

In case a reader wishes to explore the cases where a and/or b are negative, it’s worth noting that a proof that includes the negative cases will still need restrictions. If we define the constant min = std::numeric_limits<T>::min(), such a proof would at the least need to forbid the specific case where both a == min and b == min. A simple explanation is that given our definition of min, we can see that gcd(min, min) == -min, yet the value of -min is too large to represent with type T when T is represented with two’s complement.

Proof Part 1: a > b

For the case a > b, extended_euclidean_collins.h proves that all variables and interim calculations have specific bounds. This header file proves a few helper theorems at the top of the header, and then provides the Extended Euclidean function with annotations on every statement and assertion, to form the proof of bounds. Every bound is given either as a constant or in terms of a variable which is itself bounded. C asserts are used as a convenience in stating each proof step, since the asserts are testable. However the proof is analytic. There is a CMake file in the parent directory that will create the full proof’s test project if you wish to run it.

Due to the proof’s length, it’s convenient to prune everything from the header file but the algorithm statements and the essential assertion results, leaving only the results of the proof for Part 1. These essential results are the proven bounds assertions and the proven assertions that the outputs are indeed the gcd and the Bézout coefficients as required by the Extended Euclidean algorithm.

Proof Part 2: a < b

For the case a < b, a second header file proves that all variables and interim calculations have specific bounds. The logic of this proof is extremely similar to Part 1, with one large difference: the first iteration of the loop is peeled-out and placed before the loop. Peeling out the first iteration is required to state some loop invariants that only become valid for case a < b after the second loop iteration begins.

Similarly to Part 1, it helps to have a pruned header file leaving only the results of the proof. However for this case of a < b, the proof modified the original Extended Euclidean function by peeling-out the first loop iteration, so this simple pruning also uses that modified function. Ideally at this point we shouldn’t be exposed to the modification, since it’s just an artifact of the proof. Since we only care about the results of the proof and not the proof itself, we can rejoin the first iteration back into the loop. This leaves us the much more useful original function with proof results for Part 2.

Proof Part 3: a == b

The only tricky aspect to case a == b is how to define gcd(0,0). Otherwise, this part consists just of two very short proofs.

One way d = gcd(a,b) can be defined is that d is the integer that divides both a and b, such that for any integer c that divides both a and b, d >= c. Under this definition, gcd(0,0) would be undefined. A second way d = gcd(a,b) can be defined is that d is the non-negative integer that divides both a and b, such that for any integer c that divides both a and b, c divides d. Under this definition, gcd(0,0) == 0. [5] If a caller of the Extended Euclidean function expects the first definition, calling with a == 0 and b == 0 would be an implicit precondition violation, since it would be a request for the undefined gcd(0,0). We will assume that callers will never violate their implicit preconditions, and therefore we assume a caller must expect the second definition if calling with arguments a == 0 and b == 0. If the last reasoning feels unsatisfactory or too fragile, we could simply say that this Extended Euclidean function defines gcd(0,0) == 0 (or we could as a precondition explicitly disallow the special case 0 == a == b).

There is a very short header file proof for special case 0 == a == b, along with a pruned version containing only the essential proof results. The more general case 0 < a == b also has a short proof, along with a trimmed version containing only essential proof results.

Combined Proof Results

Combining the proof results for all three cases, we have general proof results for the Extended Euclidean algorithm, reproduced below. As before, the bounds are given in terms of constants and other bounded variables (the next section expresses all bounds in terms of the input parameters a and b).

``````template <class T>
void extended_euclidean_results(const T a, const T b, T* pGcd, T* pX, T* pY)
{
static_assert(std::numeric_limits<T>::is_integer, "");
static_assert(std::numeric_limits<T>::is_signed, "");
assert(a >= 0 && b >= 0);   // preconditions
T x0 = 1, y0 = 0, a0 = a;
T x1 = 0, y1 = 1, a1 = b;

while (a1 != 0) {
T q = a0/a1;
assert(0 <= q && q <= a0);
T a2 = a0 - q*a1;
assert(0 <= (q*a1) && (q*a1) <= a0);
assert(0 <= a2 && a2 < a1);
if (a2 != 0) assert(q <= a0/2);
T x2 = x0 - q*x1;
assert(abs(q*x1) <= abs(x2));
assert(abs(x1) <= abs(x2));
T y2 = y0 - q*y1;
assert(abs(q*y1) <= abs(y2));
assert(y1 == 1 || abs(y1) <= abs(y2));
x0=x1; y0=y1; a0=a1;
x1=x2; y1=y2; a1=a2;
}
if (a == 0 && b == 0) {
assert(x1 == 0);
assert(y1 == 1);
assert(x0 == 1);
assert(y0 == 0);
} else {
assert(gcd(a,b) >= 1);
assert(abs(x1) == b/gcd(a,b));
assert(abs(y1) == a/gcd(a,b));
assert(x0 == 1 || abs(x0) <= (b/gcd(a,b))/2);
assert(y0 == 1 || abs(y0) <= (a/gcd(a,b))/2);
}
assert(a0 == gcd(a,b));
assert(a*x0 + b*y0 == gcd(a,b));
*pX = x0;
*pY = y0;
*pGcd = a0;
}``````

Figure 3: Combined Proof Results, C++ Extended Euclidean Function.

It’s worthwhile noting that the combined results show that at the end of the algorithm, all three parts of the proof achieved a0 == gcd(a,b), and a*x0 + b*y0 == gcd(a,b) (showing that the final values of x0 and y0 are Bézout coefficients). These two assertions may be incidental to the proof of bounds, but they are important because they show that the algorithm remained valid for all three cases we considered.

Final Bounds

What we’d like best is if we could bound all variables and calculations in terms of constant values. We can can come very close to doing this. We can bound all variables and calculations in terms of the two input parameters a and b, which remain constant throughout the algorithm.

We’ll use two steps to get the desired bounds, starting with the results from Figure 3.

For the first step we replace the somewhat complicated if/else clause at the end of Figure 3 with simple assertions, and improve the bounds for a2, q, and (q*a1), showing that all throughout the algorithm they remain less than or equal to max(a,b). We can form the proof with these improvements, and then prune it as usual to leave only the essential assertion results.

For the second step we continue on from the assertion results of the first step, though by using a different proof technique from any used until now. Our method will be to start at the very end of the function and work our way backward to form new logical assertions. We can observe that at the end of the first step’s results, the final values for x0, x1, y0, and y1 are all bounded in terms of the input parameters a and b. We can propagate those bounds backward for all variables and calculations in the algorithm, by working our way backward through the loop iterations.

For example, consider x1 at the end of the first step results. We can see that at the function end,
`abs(x1) <= b`
Reasoning backward to the final loop iteration and working from the end of that iteration, we see the reassignment x1=x2, so we can reason that prior to that reassignment, it must have been true that
`abs(x2) <= b`
A few lines earlier in the loop, we can see an already established assertion that
`abs(x1) <= abs(x2)`
So we know that at that point in the code
`abs(x1) <= abs(x2) <= b`
or more simply,
`abs(x1) <= b`
Given this knowledge, we can easily reason that at the start of the final loop iteration, the absolute value of x1 was bounded by b.

Continuing to work backward to the end of the second to last loop iteration (if the loop is taken at least twice), we can use the same reasoning to again bound x2 and x1 for that loop iteration. Continuing to work backward through all remaining iterations bounds x2 and x1 for all iterations.

This method is used in the final proof to bound all variables and calculations in terms of the input parameters a and b. Pruning this proof to leave only the basic algorithm and bounds assertions, we have our final desired result, reproduced here:

``````template <class T>
void extended_euclidean_final(const T a, const T b, T* pGcd, T* pX, T* pY)
{
static_assert(std::numeric_limits<T>::is_integer, "");
static_assert(std::numeric_limits<T>::is_signed, "");
assert(a >= 0 && b >= 0);   // precondition
T x0 = 1, y0 = 0, a0 = a;
T x1 = 0, y1 = 1, a1 = b;

while (a1 != 0) {
T q = a0/a1;
assert(0 <= q && q <= max(a,b));
T a2 = a0 - q*a1;
assert(0 <= (q*a1) && (q*a1) <= max(a,b));
assert(0 <= a2 && a2 < max(a,b));
if (a2 != 0) assert(q <= max(a,b)/2);
T x2 = x0 - q*x1;
assert(abs(q*x1) <= b);
assert(abs(x2) <= b);
if (a2 != 0) assert(abs(q*x1) <= max(1,b/2));
if (a2 != 0) assert(abs(x2) <= max(1,b/2));
T y2 = y0 - q*y1;
assert(abs(q*y1) <= max(1,a));
assert(abs(y2) <= max(1,a));
if (a2 != 0) assert(abs(q*y1) <= max(1,a/2));
if (a2 != 0) assert(abs(y2) <= max(1,a/2));

x0=x1; y0=y1; a0=a1;
x1=x2; y1=y2; a1=a2;
}
assert(a0 == gcd(a,b));
assert(a*x0 + b*y0 == gcd(a,b));
*pX = x0;
*pY = y0;
*pGcd = a0;
}``````

Figure 4: Final Bounds in Terms of the Inputs, C++ Extended Euclidean Function.

Conclusion

Distilling the final proof results from Figure 4, we can see that all the variables and calculations in the Extended Euclidean algorithm are bounded to finite maximum and minimum values. More specifically, throughout every loop iteration, the variables and calculations of the algorithm (identically in Figure 1 and 4) have these following bounds:

`0 <= q <= max(a,b)`
`0 <= (q*a1) <= max(a,b)`
`0 <= a2 < max(a,b)`
`abs(q*x1) <= b`
`abs(x2) <= b`
`abs(q*y1) <= max(1,a)`
`abs(y2) <= max(1,a)`

And for every loop iteration other than the final iteration (when a2 == 0), a subset of those variables and calculations have these following even tighter bounds:

`0 <= q <= max(a,b)/2`
`abs(q*x1) <= max(1,b/2)`
`abs(x2) <= max(1,b/2)`
`abs(q*y1) <= max(1,a/2)`
`abs(y2) <= max(1,a/2)`

Sources

1. Donald Knuth, The Art of Computer Programming, Volume 2 (3rd ed.): Seminumerical Algorithms (Boston: Addison-Wesley Professional, 1997), p.335.
2. Donald Knuth, The Art of Computer Programming, Volume 1 (3rd ed.): Fundamental Algorithms (Redwood City: Addison-Wesley Professional, 1997), pp.13-15.
3. Donald Knuth, The Art of Computer Programming, Volume 2 (2nd ed.): Seminumerical Algorithms (Reading: Addison-Wesley, 1981), p.595 (answer 14).
4. George E. Collins, Computing multiplicative inverses in GF(p), Mathematics of Computation Vol. 23 (1969), pp. 197-200.
5. https://math.stackexchange.com/a/495457

Posted in Uncategorized | 4 Comments

Asymptotic Certainty (Good Enough)

This is something of a philosophical note, imagining we have unlimited time to “ensure” that we produce perfectly correct software that fulfills all requirements.  In practice, we work under real world constraints, with evolving requirements and schedules and specifications, and we’re accustomed to producing results that deliver best value, where perfect is the enemy of the good (value).  We work on a curve of time vs quality, where the curve asymptotically approaches 100% quality with enough time, and we don’t want to be too far to the right on that curve.  We get diminishing returns for something we’ll never reach.  I’ll argue that even in theory with unlimited time, we will always be working on an asymptotic curve, and thus provably correct results and absolute certainty will always be out of reach.

Tests to the rescue (or not)

Let’s imagine a super simple function called “increment” that takes as input an unsigned 8 bit integer value, and returns the value plus one (rolling over to zero if the input value is 255).  There’s nothing special or interesting about this hypothetical function, other than that the range of possible inputs is small enough that we can trivially test all possible inputs to verify that the function produces the correct return value in every case.  We can do exhaustive testing, knowing that we will leave absolutely no corner cases untested.

Unfortunately, an obvious problem we face is that our testing could have bugs.  We could test our tests (which isn’t a bad idea), but then to approach greater certainty we could I suppose also test our tests for our tests, and then… test the tests for our tests for our tests…  We’re back to working with asymptotic certainty again.  Let’s ignore a potential need to test our tests (etc etc).   Significant bugs can exist in exhaustively tested code due to undefined behavior, at least in C and C++ and perhaps other languages.  When a program invokes undefined behavior, one possible result is that the program executes in exactly the way the author planned and hoped, and all tests of course pass.  This is not a good thing, since weeks or months or years later, upon changing the compiler, or the hardware used for testing, or the time of day, the executable could begin to fail.  But let’s ignore undefined behavior – for some languages it’s not an issue.  We still will need to consider that a bug in the compiler or a fault in the hardware might allow a test to pass when it should fail.  As a real world story, one machine learning program found the optimal solution on a (defective) development/test board was to use a short circuit that existed in the substrate of the board.  All tests passed, so long as it ran on the one-of-a-kind board.  As for compiler bugs and operating system bugs and bugs in any software we rely upon, we again get pulled into working with asymptotic certainty and trust, because those tools are built by developers that fundamentally face the same issues we do.

Maybe formal proofs of correctness to the rescue?  By now it’s obviously hopeless, but nevertheless, proofs have the same sorts of problems as software.  A proof itself could contain an error (aka bug). Even if we use automated verification tools for proofs, unfortunately, the tool used for verification could have a bug.  We’ll always be dealing with levels of trust.  Complete certainty is out of reach, and “good enough” remains.  In the end, we get back to trying to deliver the best value.  Most all the details of development are a means to that end.

Google Test Projects in Visual Studio

This is a continuation of the blog entry on a quick start guide to setting up Google Test, this time specific to Visual Studio.

First Things First

If you used the quick start guide to set up Google Test, you may need to be aware of a bug in Visual Studio that occurs when using tests in static libraries or DLLs with Google Test. I’d recommend not using libraries for your tests, but if you do, Google provides some simple information on what you’ll need to know and how to work around the bug. If you don’t place your tests in libraries, then you won’t need to worry about any of this. I personally started out by putting my tests in libraries, since it would let me compose larger tests out of small test groups. However, I ultimately found it unnecessary since I could do the same thing with multiple test executables and CTest. Just keep in mind if you plan to create any tests in a static library (or DLL), the extra steps documented by google will probably be required in order for your tests to run.

Here’s how to add a Google Test project to the Visual Studio solution that you created when you used CMake in the quick start guide.

In Visual Studio, open the solution and add a new project of type Visual C++ > Windows Desktop > Windows Console Application or Static Library. In the new project’s properties, change the following settings:

1. Configuration Properties > C/C++ > General > “Additional Include Directories”:  Add
2. Configuration Properties > C/C++ > General > “Warning Level”:  Choose
Level4 (/W4)
3. Configuration Properties > C/C++ > General > “Treat Warnings as Errors”:  Choose
Yes
4. Configuration Properties > C/C++ > Code Generation > “Runtime Library”:  Depending on release/debug configuration, choose either
5. Configuration Properties > C/C++ > Precompiled Headers > “Precompiled Headers”:  Choose
6. Configuration Properties > C/C++ > Precompiled Headers > “Precompiled Header File”: Remove any filename, so that this setting is blank.

If the project you added is a Windows console application (executable):

Expand the project name in the Solution Explorer window, and right click on References, and choose “Add Reference…” and select gtest, gtest_main, and any other static library projects from the solution that this project will depend upon. [At the time of this writing, the Google Test CMake creates a “gtest” static library/project that contains all the essential google test library functions, and a “gtest_main” static library/project that contains the main() function.]

In this project’s properties, set the following:

1. Configuration Properties > Linker > General > “Link Library Dependencies”:  Choose
Yes
2. Configuration Properties > Linker > Optimization > “References”:  Choose
No (/OPT:NOREF)