As the previous post showed, it’s possible to correctly implement the Extended Euclidean Algorithm using one signed integral type for all input parameters, intermediate variables, and output variables. None of the calculations will overflow. The implementation was given as follows:

```
#include <cassert>
#include <limits>

template <class T>
void extended_euclidean(const T a, const T b, T* pGcd, T* pX, T* pY)
{
    static_assert(std::numeric_limits<T>::is_integer, "");
    static_assert(std::numeric_limits<T>::is_signed, "");
    assert(a >= 0 && b >= 0);   // precondition
    T x0=1, y0=0, a0=a;
    T x1=0, y1=1, a1=b;
    while (a1 != 0) {
        T q = a0/a1;
        T a2 = a0 - q*a1;
        T x2 = x0 - q*x1;
        T y2 = y0 - q*y1;
        x0=x1; y0=y1; a0=a1;
        x1=x2; y1=y2; a1=a2;
    }
    *pX = x0;
    *pY = y0;
    *pGcd = a0;
}
```

*Figure 1: C++ Extended Euclidean Function, Signed Integer Inputs.*

The precondition **a >= 0 && b >= 0** is required because the proof from the last post didn’t consider any cases where **a** or **b** was negative. I haven’t investigated whether the Extended Euclidean algorithm might be valid for any and all cases of negative inputs, so I won’t speculate about what might be possible for those cases. Instead let’s focus on the proven case **a >= 0 && b >= 0**.

It seems a bit wasteful to use signed integers for inputs **a** and **b** when the function precondition disallows **a** and **b** from ever being negative. Why not just use unsigned integer inputs that have the same precision as the signed integer inputs would have had? The one potential problem with doing this is that when we look at the bounds, it’s not quite as obvious as the all-signed case that overflow can’t occur. Displaying the bounds proven in the last post, throughout every loop iteration in Figure 1 we know the following:

`0 <= q <= max(a,b)`

`0 <= (q*a1) <= max(a,b)`

`0 <= a2 < max(a,b)`

`abs(q*x1) <= b`

`abs(x2) <= b`

`abs(q*y1) <= max(1,a)`

`abs(y2) <= max(1,a)`

Assuming we go ahead and change the type for inputs **a** and **b** to be unsigned, it should be immediately clear from the bounds and Figure 1 that in order to avoid overflow, **q**, **a2**, **a1**, **a0**, and ***pGcd** will all need to be changed to use the same unsigned integer type as the input parameters.

More problematic are **x2** and **y2** (and likewise **x1**, **y1**, **x0**, **y0**, ***pX**, and ***pY**). On the one hand, these variables need to remain signed types since the bounds above don’t restrict them from taking on negative values. But on the other hand, the bounds above allow them to be as large as **a** or **b**, and since **a** and **b** are now unsigned types, the largest possible value for **a** or **b** would overflow the corresponding signed type. Fortunately we can resolve this conflict, by using less generally applicable but tighter bounds for these variables. Let’s review the second set of proven bounds from the last post. For every loop iteration other than the final iteration (when **a2** == 0), we know the following tighter bounds hold:

`0 <= q <= max(a,b)/2`

`abs(q*x1) <= max(1,b/2)`

`abs(x2) <= max(1,b/2)`

`abs(q*y1) <= max(1,a/2)`

`abs(y2) <= max(1,a/2)`

Superficially, it seems like these bounds should be enough to guarantee we can use signed types for **x2** and **y2** without any problems. The only loop iteration where they could overflow is the final iteration, and their values don’t affect anything on the last iteration, so seemingly overflow (if it happened) would be fine at that point. But in truth it’s unacceptable, because overflow of a signed integral type is undefined behavior in C and C++. Upon overflow on the last iteration, a reasonable possibility is that everything would be okay, but another reasonable possibility is that a trap would occur and crash the program. Anything is allowed after invoking undefined behavior, so in the end it’s hopeless to try to reason about whether the program would be okay. The only safe thing to do is to view any undefined behavior as a fatal bug that must be fixed.
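To make the final-iteration problem concrete, here is a minimal sketch that runs the Figure 1 recurrence in a wide type and reports the **y2** value computed on the last iteration. The helper names `final_iteration_y2` and `fits_in_int8` are hypothetical and not part of the original code. The 8-bit types are used only to keep the numbers small; with real 8-bit operands C++ would promote the arithmetic to `int`, so the undefined behavior discussed above strictly applies to `int` and wider types.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helper (not from the original post): runs the Figure 1
// recurrence for y in a wide type and returns the y2 computed in the
// final loop iteration, i.e. the iteration in which a2 becomes 0.
// Returns 0 if the loop body never runs (b == 0).
std::int64_t final_iteration_y2(std::uint8_t a, std::uint8_t b)
{
    std::int64_t y0 = 0, y1 = 1;
    std::int64_t a0 = a, a1 = b;
    std::int64_t last_y2 = 0;
    while (a1 != 0) {
        std::int64_t q  = a0 / a1;
        std::int64_t a2 = a0 - q * a1;
        std::int64_t y2 = y0 - q * y1;
        last_y2 = y2;            // overwritten until the final iteration
        y0 = y1; a0 = a1;
        y1 = y2; a1 = a2;
    }
    return last_y2;
}

// True if v is representable in the 8-bit signed type.
bool fits_in_int8(std::int64_t v)
{
    return v >= std::numeric_limits<std::int8_t>::min()
        && v <= std::numeric_limits<std::int8_t>::max();
}
```

For `a = 255, b = 1` the only iteration is also the final one, with `q = 255`, so `y2 = 0 - 255*1 = -255`, which does not fit in an 8-bit signed type even though the returned coefficients (`x = 0, y = 1`) do.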

Since we cannot allow undefined behavior, we must ensure signed integer overflow is completely impossible in the function. Fortunately this is easy. We can simply test if we are in the final loop iteration (by checking if **a2** == 0), and break out if so. Most of the work in the final iteration is irrelevant since it’s never used later. The only exception is the reassignments (**x0=x1; y0=y1; a0=a1**), which still need to be done before breaking out of the loop.

Incorporating all the discussed changes, we get the following *potential* implementation:

```
#include <limits>
#include <type_traits>

template <class S, class U>
void extended_euclidean(const U a, const U b, U* pGcd, S* pX, S* pY)
{
    static_assert(std::numeric_limits<S>::is_integer, "");
    static_assert(std::numeric_limits<S>::is_signed, "");
    static_assert(std::numeric_limits<U>::is_integer, "");
    static_assert(!(std::numeric_limits<U>::is_signed), "");
    static_assert(std::is_same<typename std::make_signed<U>::type, S>::value, "");
    S x0=1, y0=0;
    U a0=a;
    S x1=0, y1=1;
    U a1=b;
    while (a1 != 0) {
        U q = a0/a1;
        U a2 = a0 - q*a1;
        if (a2 == 0) {          // final iteration: skip x2 and y2
            x0=x1; y0=y1; a0=a1;
            break;
        }
        S x2 = x0 - q*x1;
        S y2 = y0 - q*y1;
        x0=x1; y0=y1; a0=a1;
        x1=x2; y1=y2; a1=a2;
    }
    *pX = x0;
    *pY = y0;
    *pGcd = a0;
}
```

*Figure 2: Potential (But Broken) Implementation, Unsigned Integer Inputs.*

But there’s still a problem. We determined that **q** must be unsigned and **x1** must be signed, and Figure 2 reflects this. Therefore during the calculation in Figure 2 of **(q*x1)**, the compiler will implicitly convert **x1** to the unsigned type of **q** in order to carry out the calculation. Quoting from the C++17 standard section 8 [expr] (11.5.3):

*“Many binary operators that expect operands of arithmetic type cause conversions… The following rules shall be applied… if the operand that has unsigned integer type has rank equal to the rank of the type of the other operand, the operand with signed integer type shall be converted to the type of the operand with unsigned integer type.”*

[The C standard specifies very similar behavior.] The problem is we didn’t want to convert **x1** to unsigned. And even putting that aside, the conversion causes further conversions: the unsigned result has to be converted back to the signed type when it is assigned to **x2**, and (prior to C++20) that conversion is implementation-defined whenever the value doesn’t fit in the signed type. What we really wanted and intended is to convert **q** from unsigned to signed, which we can do with an explicit cast.
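The effect of the implicit conversion is easy to see in isolation. The following is a small sketch using plain `unsigned int` and `int` (types of equal rank); the function names `product_implicit` and `product_explicit` are made up for illustration and do not appear in the figures.

```cpp
#include <cassert>
#include <limits>
#include <type_traits>

// With operands of equal rank, the signed operand is converted to the
// unsigned type, so the product itself has unsigned type.
static_assert(std::is_same<decltype(2u * -1), unsigned int>::value,
              "unsigned int * int yields unsigned int");

// Mirrors (q*x1) in Figure 2: x1 is implicitly converted to unsigned.
unsigned int product_implicit(unsigned int q, int x1)
{
    return q * x1;
}

// Mirrors the intended calculation: q is explicitly converted to signed.
// Assumes q is small enough to fit in int.
int product_explicit(unsigned int q, int x1)
{
    return static_cast<int>(q) * x1;
}
```

With `q = 2` and `x1 = -1`, the implicit version produces the largest `unsigned int` value minus one, whereas the explicit version produces `-2`, which is the value the algorithm actually needs.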

Explicitly casting **q** to signed integral type works well in the code of Figure 2. We know from the tight bounds given above that on every loop iteration except the final iteration,

`0 <= q <= max(a,b)/2`

Figure 2 has a break on the last loop iteration before it gets to the calculations (**q*x1** and **q*y1**) which would require type conversion of **q** to signed integral. So the conversions will never be performed on the final loop iteration, and therefore no conversion of **q** will ever overflow the signed result type.

Incorporating the explicit casts of **q**, we get the following implementation:

```
#include <limits>
#include <type_traits>

template <class S, class U>
void extended_euclidean(const U a, const U b, U* pGcd, S* pX, S* pY)
{
    static_assert(std::numeric_limits<S>::is_integer, "");
    static_assert(std::numeric_limits<S>::is_signed, "");
    static_assert(std::numeric_limits<U>::is_integer, "");
    static_assert(!(std::numeric_limits<U>::is_signed), "");
    static_assert(std::is_same<typename std::make_signed<U>::type, S>::value, "");
    S x0=1, y0=0;
    U a0=a;
    S x1=0, y1=1;
    U a1=b;
    while (a1 != 0) {
        U q = a0/a1;
        U a2 = a0 - q*a1;
        if (a2 == 0) {          // final iteration: skip x2 and y2
            x0=x1; y0=y1; a0=a1;
            break;
        }
        // q <= max(a,b)/2 here, so the cast to S cannot overflow
        S x2 = x0 - static_cast<S>(q)*x1;
        S y2 = y0 - static_cast<S>(q)*y1;
        x0=x1; y0=y1; a0=a1;
        x1=x2; y1=y2; a1=a2;
    }
    *pX = x0;
    *pY = y0;
    *pGcd = a0;
}
```

*Figure 3: Correct Implementation, Unsigned Integer Inputs.*

This implementation in Figure 3 should be correct, but the conditional branch inside the loop isn’t ideal from a performance standpoint. It would be nice if we could remove the conditional branch. Indeed we can, by re-arranging the code.

## Optimized Implementation

We’ll remove the conditional branch inside the loop of Figure 3 by rotating statements from the bottom of the loop to the top of the loop. Each rotation will require changing the code that precedes and follows the loop (and sometimes the loop condition too), so that the meaning of the function stays the same. Aside from the rotation, nothing inside the loop should change. The end goal of the rotations is to get the position of the nested conditional branch to be at the end of the loop, so that we can merge it into the loop condition.

Let’s first rotate the reassignments from the end of the loop in Figure 3 to the top of the loop. [If this description is hard to follow, it may help to compare Figure 3 with Figure 4.] The rotation affects the code after the loop, since **x0**, **y0**, and **a0** won’t be reassigned upon the end of the final loop iteration. So to compensate, ***pX**, ***pY**, and ***pGcd** will now need to be set to the values of **x1**, **y1**, and **a1**. This change to ***pX**, ***pY**, and ***pGcd** will need to also be valid if the loop is never taken at all, so we’ll need to change the initial values of **x1**, **y1**, and **a1** to the initial values in Figure 3 given to **x0**, **y0**, and **a0**. Fortunately, since we will have already done the rotation that places the loop reassignments at the top of the loop, this last change also ensures the correct values will get reassigned upon the first entry to the loop. At least in part. We haven’t discussed a second part of the effect of having the reassignments at the top of the loop: the reassignments use **x2**, **y2**, and **a2**, which are variables that haven’t been declared or initialized yet. We’ll fix this by adding declarations and initializations to the start of the function for **x2**, **y2**, and **a2**, setting them to the initial values that had formerly been given to **x1**, **y1**, and **a1** in Figure 3. As a result, upon first entry of the loop, the reassignments at the top of the loop will re-set the correct initial values, leaving the meaning of the function unchanged.

There’s one last detail. We’ll need to change the loop condition to refer to **a2** instead of **a1**, again due to the rotation of the reassignment statements. Since we’ll have already declared and initialized **a2** at the top of the function (to the initial value formerly used in Figure 3 by **a1**), this preserves the meaning of the function perfectly.

It’s easiest to see all this in actual code. By rotating the loop’s reassignments in Figure 3, we’ll have changed the implementation to the following code:

```
#include <limits>
#include <type_traits>

template <class S, class U>
void extended_euclidean(const U a, const U b, U* pGcd, S* pX, S* pY)
{
    static_assert(std::numeric_limits<S>::is_integer, "");
    static_assert(std::numeric_limits<S>::is_signed, "");
    static_assert(std::numeric_limits<U>::is_integer, "");
    static_assert(!(std::numeric_limits<U>::is_signed), "");
    static_assert(std::is_same<typename std::make_signed<U>::type, S>::value, "");
    S x1=1, y1=0;
    U a1=a;
    S x2=0, y2=1;
    U a2=b;
    while (a2 != 0) {
        // reassignments, rotated from the bottom of the loop
        S x0=x1;
        S y0=y1;
        U a0=a1;
        x1=x2; y1=y2; a1=a2;
        U q = a0/a1;
        a2 = a0 - q*a1;
        if (a2 == 0)
            break;
        x2 = x0 - static_cast<S>(q)*x1;
        y2 = y0 - static_cast<S>(q)*y1;
    }
    *pX = x1;
    *pY = y1;
    *pGcd = a1;
}
```

*Figure 4: Optimization Step 1, Unsigned Integer Inputs.*

We haven’t yet removed the conditional branch inside the loop, but we’re closer to the goal. We still need to rotate the calculations of **x2** and **y2** from the bottom of the loop to the top, and again fix the code before and after the loop to preserve the meaning of the function. As it turns out, there’s no need to change any code after the loop. But now that the calculations are at the top of the loop, the variable **q** gets used before it has been declared or initialized. In order to preserve the meaning of the function, we’ll need to add a declaration and initialization of **q** at the start of the function. Initializing **q** to zero works quite well. On the first entry of the loop, the calculations of **x2** and **y2** will thereby contain a multiplication by zero (since **q** == 0), effectively reducing those calculations on the first iteration to reassignments from **x0** and **y0**. This means we’ll need to add to the start of the function declarations and initializations for **x0** and **y0**, setting them to the initial values that Figure 4 had used for **x2** and **y2**.

After making these changes to Figure 4, we have the following code:

```
#include <limits>
#include <type_traits>

template <class S, class U>
void extended_euclidean(const U a, const U b, U* pGcd, S* pX, S* pY)
{
    static_assert(std::numeric_limits<S>::is_integer, "");
    static_assert(std::numeric_limits<S>::is_signed, "");
    static_assert(std::numeric_limits<U>::is_integer, "");
    static_assert(!(std::numeric_limits<U>::is_signed), "");
    static_assert(std::is_same<typename std::make_signed<U>::type, S>::value, "");
    S x1=1, y1=0;
    U a1=a;
    S x0=0, y0=1;
    U a2=b;
    U q=0;
    while (a2 != 0) {
        // calculations, rotated from the bottom of the loop
        S x2 = x0 - static_cast<S>(q)*x1;
        S y2 = y0 - static_cast<S>(q)*y1;
        x0=x1;
        y0=y1;
        U a0=a1;
        x1=x2; y1=y2; a1=a2;
        q = a0/a1;
        a2 = a0 - q*a1;
        if (a2 == 0)
            break;
    }
    *pX = x1;
    *pY = y1;
    *pGcd = a1;
}
```

*Figure 5: Optimization Step 2, Unsigned Integer Inputs.*

We can now get rid of the conditional branch at the end of the loop, since it’s redundant: the loop condition tests **a2 != 0** immediately afterward anyway. Making this change and cleaning up the code a little, we get the following optimized function for the Extended Euclidean Algorithm with unsigned integer inputs:

```
#include <limits>
#include <type_traits>

template <class S, class U>
void extended_euclidean(const U a, const U b, U* pGcd, S* pX, S* pY)
{
    static_assert(std::numeric_limits<S>::is_integer, "");
    static_assert(std::numeric_limits<S>::is_signed, "");
    static_assert(std::numeric_limits<U>::is_integer, "");
    static_assert(!(std::numeric_limits<U>::is_signed), "");
    static_assert(std::is_same<typename std::make_signed<U>::type, S>::value, "");
    S x1=1, y1=0;
    U a1=a;
    S x0=0, y0=1;
    U a2=b, q=0;
    while (a2 != 0) {
        // q is only cast when a2 != 0, i.e. never for the final quotient
        S x2 = x0 - static_cast<S>(q)*x1;
        S y2 = y0 - static_cast<S>(q)*y1;
        x0=x1; y0=y1;
        U a0=a1;
        x1=x2; y1=y2; a1=a2;
        q = a0/a1;
        a2 = a0 - q*a1;
    }
    *pX = x1;
    *pY = y1;
    *pGcd = a1;
}
```

*Figure 6: Optimized Implementation, Unsigned Integer Inputs.*
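As a sanity check on the loop rotations, the sketch below places the Figure 3 form and the Figure 6 form side by side and compares them. The names `extended_euclidean_branchy`, `extended_euclidean_rotated`, and `forms_agree` are hypothetical, chosen here only so both versions can coexist in one file; the check also evaluates the identity `a*x + b*y == gcd` in 64-bit arithmetic, assuming 32-bit inputs.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Figure 3 form: conditional branch inside the loop.
template <class S, class U>
void extended_euclidean_branchy(const U a, const U b, U* pGcd, S* pX, S* pY)
{
    static_assert(std::is_same<typename std::make_signed<U>::type, S>::value, "");
    S x0=1, y0=0; U a0=a;
    S x1=0, y1=1; U a1=b;
    while (a1 != 0) {
        U q = a0/a1;
        U a2 = a0 - q*a1;
        if (a2 == 0) { x0=x1; y0=y1; a0=a1; break; }
        S x2 = x0 - static_cast<S>(q)*x1;
        S y2 = y0 - static_cast<S>(q)*y1;
        x0=x1; y0=y1; a0=a1;
        x1=x2; y1=y2; a1=a2;
    }
    *pX = x0; *pY = y0; *pGcd = a0;
}

// Figure 6 form: branch merged into the loop condition.
template <class S, class U>
void extended_euclidean_rotated(const U a, const U b, U* pGcd, S* pX, S* pY)
{
    static_assert(std::is_same<typename std::make_signed<U>::type, S>::value, "");
    S x1=1, y1=0; U a1=a;
    S x0=0, y0=1; U a2=b, q=0;
    while (a2 != 0) {
        S x2 = x0 - static_cast<S>(q)*x1;
        S y2 = y0 - static_cast<S>(q)*y1;
        x0=x1; y0=y1;
        U a0=a1;
        x1=x2; y1=y2; a1=a2;
        q = a0/a1;
        a2 = a0 - q*a1;
    }
    *pX = x1; *pY = y1; *pGcd = a1;
}

// Hypothetical check: both forms return identical results, and the
// results satisfy a*x + b*y == gcd when evaluated in 64-bit arithmetic.
bool forms_agree(std::uint32_t a, std::uint32_t b)
{
    std::uint32_t g3, g6;
    std::int32_t x3, y3, x6, y6;
    extended_euclidean_branchy(a, b, &g3, &x3, &y3);
    extended_euclidean_rotated(a, b, &g6, &x6, &y6);
    std::int64_t lhs = std::int64_t(a)*x6 + std::int64_t(b)*y6;
    return g3 == g6 && x3 == x6 && y3 == y6 && lhs == std::int64_t(g6);
}
```

For example, with `a = 240` and `b = 46` both forms produce `gcd = 2`, `x = -9`, `y = 47`, and indeed `240*(-9) + 46*47 = 2`. Inputs at the top of the unsigned range, such as `a = 0xFFFFFFFF, b = 1`, are the interesting cases, since the final quotient there is too large for the signed type and must never be cast.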

## Conclusion

The Extended Euclidean algorithm can be correctly and very efficiently implemented using unsigned integer inputs and signed integer outputs, so long as all the function parameters have the same precision. Figure 6 provides the optimized implementation. There is a header file with the complete implementation, and a CMake file that will create a test project for the implementation if you wish to run it.