As the previous post showed, it’s possible to correctly implement the Extended Euclidean Algorithm using one signed integral type for all input parameters, intermediate variables, and output variables. None of the calculations will overflow. The implementation was given as follows:

```
template <class T>
void extended_euclidean(const T a, const T b, T* pGcd, T* pX, T* pY)
{
static_assert(std::numeric_limits<T>::is_integer, "");
static_assert(std::numeric_limits<T>::is_signed, "");
assert(a >= 0 && b >= 0); // precondition
T x0=1, y0=0, a0=a;
T x1=0, y1=1, a1=b;
while (a1 != 0) {
T q = a0/a1;
T a2 = a0 - q*a1;
T x2 = x0 - q*x1;
T y2 = y0 - q*y1;
x0=x1; y0=y1; a0=a1;
x1=x2; y1=y2; a1=a2;
}
*pX = x0;
*pY = y0;
*pGcd = a0;
}
```

*Figure 1: C++ Extended Euclidean Function, Signed Integer Inputs.*

The precondition **a >= 0 && b >= 0** is required because the proof from the last post didn’t consider any cases where **a** or **b** was negative. I haven’t investigated whether the Extended Euclidean algorithm might be valid for any and all cases of negative inputs, so I won’t speculate about what might be possible for those cases. Instead let’s focus on the proven case **a >= 0 && b >= 0**.

It seems a bit wasteful to use signed integers for inputs **a** and **b** when the function precondition disallows them from ever being negative. Why not just use unsigned integer inputs that have the same precision as the signed integer inputs would have had, and double the permissible input range? If we need more persuasion, pretty likely our primary application for the Extended Euclidean Algorithm is to implement the modular multiplicative inverse. It’s common to use unsigned integral types for modular arithmetic. Why not increase compatibility by keeping to that convention?

It turns out it’s very much possible to correctly implement a modified version of the algorithm using unsigned inputs. You can skip directly to the implementation code or to the optimized version of that code if you like.

## Adapting the Algorithm to Use Unsigned Inputs

Changing the inputs in Figure 1 to unsigned types seems like it might be dangerous – we might get an overflow in one of the variables or calculations inside the function. The last post proved bounds for those variables and calculations, and for an all-signed implementation it nicely ruled out overflow. Those bounds will guide us now in changing Figure 1 to safely use unsigned inputs.

From the bounds proven in the last post, we know that for every loop iteration in Figure 1,

`0 <= q <= max(a,b)`

`0 <= a2 < max(a,b)`

Assuming we go ahead and change the type for inputs **a** and **b** to be unsigned, the bounds above make it clear that in order to be certain to avoid overflow we need to change **q** and **a2** to also use that same unsigned type. Looking at Figure 1, we can also see that the contents of **a2** propagate to **a1**, **a0**, and ***pGcd**. As a result, we need to change all those variables to use that same unsigned integer type.

The more problematic variables are **x2** and **y2** (and likewise **x1**, **y1**, **x0**, **y0**, ***pX**, and ***pY**, since the values of **x2** and **y2** propagate to them).

Displaying more of the bounds proven in the last post, throughout every loop iteration in Figure 1 we know that

`abs(x2) <= b`

`abs(y2) <= max(1,a)`

On the one hand, we can see that **x2** and **y2** must remain signed types, since the bounds above don’t prevent them from taking on negative values. But on the other hand, the bounds above allow them to become as large as **a** or **b**, and because **a** and **b** are now an unsigned type, the largest possible value for **a** or **b** would overflow the corresponding signed type (which we determined **x2** and **y2** must use). Fortunately we can resolve this conflict, by using less generally applicable but tighter bounds for these variables. Let’s refer to the second set of proven bounds from the last post. For every loop iteration other than the final iteration (when **a2** == 0), we know the following tighter bounds hold:

`abs(x2) <= max(1,b/2)`

`abs(y2) <= max(1,a/2)`

Superficially, it seems like these bounds should be enough to guarantee we can use signed types for **x2** and **y2** without any problems, since **b/2** and **a/2** will always fit in their corresponding signed type. The only loop iteration not covered by these bounds is the final iteration, and in Figure 1 we can see that the calculated values of **x2** and **y2** on that last iteration are never used anywhere. So seemingly overflow (if it happened) would be fine at that point. But in truth it’s unacceptable, because overflow of a signed integral type is undefined behavior in C and C++. Upon overflow on the last iteration, a reasonable possibly is everything would be okay, but another reasonable possibility is a trap would occur that crashes the program. Anything is allowed after invoking undefined behavior, so in the end it’s hopeless to try to reason about whether the program would be okay. The only safe thing to do is to view any undefined behavior as a fatal bug that must be fixed.

Since we can not allow undefined behavior, we must ensure signed integer overflow is completely impossible in the function. Fortunately this is easy – we can simply test if we are in the final loop iteration, through an early check of the loop conditional, and break out if so. Most of the work in the final iteration is irrelevant since it’s never used later. The only exception is the reassignments (**x0=x1; y0=y1; a0=a1**), which still need to be done before causing a break.

We’re not quite finished, but let’s incorporate all the changes discussed so far, so that we have an *in-progress* implementation:

```
template <class S, class U>
void extended_euclidean(const U a, const U b, U* pGcd, S* pX, S* pY)
{
static_assert(std::numeric_limits<S>::is_integer, "");
static_assert(std::numeric_limits<S>::is_signed, "");
static_assert(std::numeric_limits<U>::is_integer, "");
static_assert(!(std::numeric_limits<U>::is_signed), "");
static_assert(std::is_same<std::make_signed<U>::type, S>::value, "");
S x0=1, y0=0;
U a0=a;
S x1=0, y1=1;
U a1=b;
while (a1 != 0) {
U q = a0/a1;
U a2 = a0 - q*a1;
if (a2 == 0) {
x0=x1; y0=y1; a0=a1;
break;
}
S x2 = x0 - q*x1;
S y2 = y0 - q*y1;
x0=x1; y0=y1; a0=a1;
x1=x2; y1=y2; a1=a2;
}
*pX = x0;
*pY = y0;
*pGcd = a0;
}
```

*Figure 2: In-Progress (And Broken) Implementation, Unsigned Integer Inputs.*

Figure 2 has a hidden problem that we’ll need to fix. We determined above that **q** must be unsigned and **x1** must be signed, and indeed Figure 2 reflects this. As a result, during the calculation of **(q*x1)**, the compiler will implicitly convert **x1** to the unsigned type of **q** in order to carry out the calculation. Quoting from the C++17 standard section 8 [expr] (11.5.3):

*“Many binary operators that expect operands of arithmetic type cause conversions… The following rules shall be applied… if the operand that has unsigned integer type has rank equal to the rank of the type of the other operand, the operand with signed integer type shall be converted to the type of the operand with unsigned integer type.”*

[The C standard specifies very similar behavior.]

This is a problem – **x1** may legitimately hold a negative value, and yet the compiler will be implicitly converting it to an unsigned integer type! Converting a negative integer value to an unsigned type is well defined in the C and C++ standards, but it rarely does what a programmer wants when it’s unintentional. This conversion certainly wasn’t planned in Figure 2. For the multiplication **(q*x1)**, what we really wanted was to convert the unsigned **q** variable to signed, and to leave **x1** signed. Fortunately we can make this happen without much trouble by explicitly casting **q** to signed: for example, **static_cast<S>(q)**. This explicit cast will also ensure that the compiler will perform no implicit conversions on **x1**, leaving it as signed, since the types of **q** (after the cast) and **x1** will match.

The exact same problem we just discussed exists in Figure 2 for the interim calculation of **(q*y1)** too, and can be solved again with an explicit cast. No casts or changes are needed for the interim calculation **(q*a1)** – it has no problem with conversions since **q** and **a1** both have the same unsigned type.

Explicitly casting **q** to its corresponding signed integral type works well in the code of Figure 2. We know from the bounds proven in the last post that on every loop iteration except the final iteration,

`0 <= q <= max(a,b)/2`

It follows that for all loop iterations other than the final iteration, **q** will be less than or equal to the largest possible unsigned value divided by two. Thus, for those iterations, the value of **q** can be represented in the corresponding signed type, which means the explicit cast of **q** to that signed type will not overflow. On the final iteration we can see that Figure 2 has a break which occurs before it gets to the only interim calculations (**q*x1** and **q*y1**) which require the explicit cast of **q**. And so no casts of **q** will take place on the final loop iteration. Thus none of the casts of **q** will ever overflow the signed result type on any loop iteration.

Let’s incorporate the explicit casts of **q**, which will fix the code from Figure 2.

## A Correct Implementation

```
template <class S, class U>
void extended_euclidean(const U a, const U b, U* pGcd, S* pX, S* pY)
{
static_assert(std::numeric_limits<S>::is_integer, "");
static_assert(std::numeric_limits<S>::is_signed, "");
static_assert(std::numeric_limits<U>::is_integer, "");
static_assert(!(std::numeric_limits<U>::is_signed), "");
static_assert(std::is_same<std::make_signed<U>::type, S>::value, "");
S x0=1, y0=0;
U a0=a;
S x1=0, y1=1;
U a1=b;
while (a1 != 0) {
U q = a0/a1;
U a2 = a0 - q*a1;
if (a2 == 0) {
x0=x1; y0=y1; a0=a1;
break;
}
S x2 = x0 - static_cast<S>(q)*x1;
S y2 = y0 - static_cast<S>(q)*y1;
x0=x1; y0=y1; a0=a1;
x1=x2; y1=y2; a1=a2;
}
*pX = x0;
*pY = y0;
*pGcd = a0;
}
```

*Figure 3: Correct Implementation, Unsigned Integer Inputs.*

At this point, we have correct code in Figure 3. The remainder of this section proves that none of the interim calculations in Figure 3 can overflow, which finishes the code analysis and lets us optimize.

The interim calculations are **q*a1**, **q*x1**, and **q*y1**. Using the proven bounds from the last post, we can see that for every loop iteration the following bound holds:

`0 <= (q*a1) <= max(a,b)`

And for every loop iteration other than the final iteration (when **a2** == 0), the following bounds hold:

`abs(q*x1) <= max(1,b/2)`

`abs(q*y1) <= max(1,a/2)`

If the result of **q*a1** is representable in unsigned type U, then we know it does not overflow. The bound **0 <= (q*a1) <= max(a,b)** shows that this requirement is always satisfied, because **max(a,b)** is less than or equal to **std::numeric_limits<U>::max()**.

Similarly, if the result of **static_cast<S>(q)*x1** is representable in signed type S, then it does not overflow. We’ll need to analyze two cases: the case where this result is non-negative, and the case where it is negative. Let’s first define the constants **u_max = std::numeric_limits<U>::max()**, **s_max = std::numeric_limits<S>::max()**, and **s_min = std::numeric_limits<S>::min()**. The C/C++ standards effectively mandate that **u_max/2 <= s_max** and **-u_max/2 >= s_min**.

Let’s consider first the case where **0 <= static_cast<S>(q)*x1**. Repeating the bound given above, we know that for all loop iterations other than the final,

`abs(q*x1) <= max(1,b/2)`

For this case we can remove the absolute value from the bound, so that we have

`0 <= static_cast<S>(q)*x1 <= max(1,b/2)`

The input parameter **b** has unsigned type U, so we can expand this to

`0 <= static_cast<S>(q)*x1 <= max(1,b/2) <= u_max/2 <= s_max`

or in simplified form,

`0 <= static_cast<S>(q)*x1 <= s_max`

which shows us that for this case, **static_cast<S>(q)*x1** will always be representable in type S, provided that we are not in the final loop iteration.

Let’s now consider the other case, where **0 > static_cast<S>(q)*x1**. We can remove the absolute value from the bound, giving us for this case:

`-static_cast<S>(q)*x1 <= max(1,b/2)`

and thus

`0 > static_cast<S>(q)*x1 >= -max(1,b/2)`

The input parameter **b** has unsigned type U, so we can expand this to

`0 > static_cast<S>(q)*x1 >= -max(1,b/2) >= -u_max/2 >= s_min`

which shows us that for this case, **static_cast<S>(q)*x1** will always be representable in type S, provided that we are not in the final loop iteration.

Since in both cases **static_cast<S>(q)*x1** is always representable in type S, we know that this multiplication will not overflow, so long as we are not in the final loop iteration. If we are in the final loop iteration, Figure 3 shows that we will break out of the loop before reaching this calculation, and so no problems will be possible on the final loop iteration. The same reasoning as above can be applied to show that no overflow will occur in Figure 3 for the multiplication of **static_cast<S>(q)*y1**.

We’ve shown that none of the calculations or conversions in Figure 3 will overflow, and we can conclude that the implementation in Figure 3 is probably correct. You can see the Conclusion for a test project if you wish to run it.

The one fairly minor complaint we might have with Figure 3 is that the conditional branch inside the loop isn’t ideal from a performance standpoint. It would be nice if we could remove the conditional branch. Indeed we can, by re-arranging the code.

## Optimizing the Code

We can remove the conditional branch inside the loop of Figure 3 by rotating statements from the bottom of the loop to the top of the loop. Each rotation will require changing the code that precedes and follows the loop (and sometimes the loop condition too), so that the meaning of the function stays the same. Aside from the rotation, nothing inside the loop should change. The end goal of the rotations is to get the position of the nested conditional branch to be at the end of the loop, so that we can merge it into the loop condition.

Let’s first rotate the reassignments from the end of the loop in Figure 3 to the top of the loop. [If this description is hard to follow, it may help to compare Figure 3 with Figure 4.] The rotation affects the code after the loop, since **x0**, **y0**, and **a0** won’t be reassigned upon the end of the final loop iteration. So to compensate, ***pX**, ***pY**, and ***pGcd** will now need to be set to the values of **x1**, **y1**, and **a1**. This change to ***pX**, ***pY**, and ***pGcd** will need to also be valid if the loop is never taken at all, so we’ll need to change the initial values of **x1**, **y1**, and **a1** to the initial values in Figure 3 given to **x0**, **y0**, and **a0**. Fortunately, since we will have already done the rotation that places the loop reassignments at the top of the loop, this last change also ensures the correct values will get reassigned upon the first entry to the loop. At least in part. We haven’t discussed a second part of the effect of having the reassignments at the top of the loop: the reassignments use **x2**, **y2**, and **a2**, which are variables that haven’t been declared or initialized yet. We’ll fix this by adding declarations and initializations to the start of the function for **x2**, **y2**, and **a2**, setting them to the initial values that had formerly been given to **x1**, **y1**, and **a1** in Figure 3. As a result, upon first entry of the loop, the reassignments at the top of the loop will re-set the correct initial values, leaving the meaning of the function unchanged.

There’s one last detail. We’ll need to change the loop condition to refer to **a2** instead of **a1**, again due to the rotation of the reassignment statements. Since we’ll have already declared and initialized **a2** at the top of the function (to the initial value formerly used in Figure 3 by **a1**), this preserve the meaning of the function perfectly.

It’s easiest to see all this in actual code. By rotating the loop’s reassignments in Figure 3, we’ll have changed the implementation to the following code:

```
template <class S, class U>
void extended_euclidean(const U a, const U b, U* pGcd, S* pX, S* pY)
{
static_assert(std::numeric_limits<S>::is_integer, "");
static_assert(std::numeric_limits<S>::is_signed, "");
static_assert(std::numeric_limits<U>::is_integer, "");
static_assert(!(std::numeric_limits<U>::is_signed), "");
static_assert(std::is_same<std::make_signed<U>::type, S>::value, "");
S x1=1, y1=0;
U a1=a;
S x2=0, y2=1;
U a2=b;
while (a2 != 0) {
S x0=x1;
S y0=y1;
U a0=a1;
x1=x2; y1=y2; a1=a2;
U q = a0/a1;
a2 = a0 - q*a1;
if (a2 == 0)
break;
x2 = x0 - static_cast<S>(q)*x1;
y2 = y0 - static_cast<S>(q)*y1;
}
*pX = x1;
*pY = y1;
*pGcd = a1;
}
```

*Figure 4: Optimization Step 1, Unsigned Integer Inputs.*

We haven’t yet removed the conditional branch inside the loop, but we’re closer to the goal. We still need to rotate the calculations of **x2** and **y2** from the bottom of the loop to the top, and again fix the code before and after the loop to preserve the meaning of the function. As it turns out, after rotating the calculations of **x2** and **y2** there’s no need to change any code after the function. But now that the calculations are at the top of the loop, the variable **q** gets used before it has been declared or initialized. In order to preserve the meaning of the function, we’ll need to add a declaration and initialization of **q** at the start of the function. Initializing **q** to zero works quite well. On the first entry of the loop, the calculations of **x2** and **y2** will thereby contain a multiplication by zero (since **q** == 0), effectively reducing those calculations on the first iteration to reassignments from **x0** and **y0**. This means we’ll need to add to the start of the function declarations and initializations for **x0** and **y0**, setting them to the initial values that Figure 4 had used for **x2** and **y2**.

After making these changes to Figure 4, we have the following code:

```
template <class S, class U>
void extended_euclidean(const U a, const U b, U* pGcd, S* pX, S* pY)
{
static_assert(std::numeric_limits<S>::is_integer, "");
static_assert(std::numeric_limits<S>::is_signed, "");
static_assert(std::numeric_limits<U>::is_integer, "");
static_assert(!(std::numeric_limits<U>::is_signed), "");
static_assert(std::is_same<std::make_signed<U>::type, S>::value, "");
S x1=1, y1=0;
U a1=a;
S x0=0, y0=1;
U a2=b;
U q=0;
while (a2 != 0) {
S x2 = x0 - static_cast<S>(q)*x1;
S y2 = y0 - static_cast<S>(q)*y1;
x0=x1;
y0=y1;
U a0=a1;
x1=x2; y1=y2; a1=a2;
q = a0/a1;
a2 = a0 - q*a1;
if (a2 == 0)
break;
}
*pX = x1;
*pY = y1;
*pGcd = a1;
}
```

*Figure 5: Optimization Step 2, Unsigned Integer Inputs.*

We can now get rid of the conditional branch at the end of the function, since it’s redundant. Making this change and cleaning up the code a little, we get the following optimized function for the Extended Euclidean Algorithm with unsigned integer inputs:

## An Optimized Implementation

```
template <class S, class U>
void extended_euclidean(const U a, const U b, U* pGcd, S* pX, S* pY)
{
static_assert(std::numeric_limits<S>::is_integer, "");
static_assert(std::numeric_limits<S>::is_signed, "");
static_assert(std::numeric_limits<U>::is_integer, "");
static_assert(!(std::numeric_limits<U>::is_signed), "");
static_assert(std::is_same<std::make_signed<U>::type, S>::value, "");
S x1=1, y1=0;
U a1=a;
S x0=0, y0=1;
U a2=b, q=0;
while (a2 != 0) {
S x2 = x0 - static_cast<S>(q)*x1;
S y2 = y0 - static_cast<S>(q)*y1;
x0=x1; y0=y1;
U a0=a1;
x1=x2; y1=y2; a1=a2;
q = a0/a1;
a2 = a0 - q*a1;
}
*pX = x1;
*pY = y1;
*pGcd = a1;
}
```

*Figure 6: Optimized Implementation, Unsigned Integer Inputs.*

## Conclusion

The Extended Euclidean algorithm can be correctly and very efficiently implemented using unsigned integer inputs and signed integer outputs, so long as all the function parameters have the same precision. Figure 3 provides the basic implementation and Figure 6 provides the optimized implementation. There is a header file with the complete implementation, and a CMake file that will create a test project for the implementation if you wish to run it.