Overflow and underflow in unsigned integers

2.6k views Asked by At

Suppose, I'm trying to subtract 2 unsigned integers:

247 = 1111 0111
135 = 1000 0111

If we subtract these 2 binary numbers we get = 0111 0000

Is this a underflow, since we only need 7 bits now?? Or how does that work??

3

There are 3 answers

0
Eric Postpischil On

Generally, “underflow” means the ideal mathematical result of a calculation is below what the type can represent. If 7 is subtracted from 5 in unsigned arithmetic, the ideal mathematical result would be −2, but an unsigned type cannot represent −2, so the operation underflows. Or, in an eight-bit signed type that can represent numbers from −128 to +127, subtracting 100 from −100 would ideally produce −200, but this cannot be represented in the type, so the operation underflows.

In C, unsigned arithmetic is said not to underflow or overflow because the C standard defines the operations to be performed using modulo arithmetic instead of real-number arithmetic. For example, with 32-bit unsigned arithmetic, subtracting 7 from 5 would produce 4,294,967,294 (in hexadecimal, FFFFFFFE16), because it has wrapped modulo 232 = 4,294,967,296. People may nonetheless use the terms “underflow” or “overflow” when discussing these operations, intended to refer to the mathematical issues rather than the defined C behavior.

In other words, for whatever type you are using for arithmetic there is some lower limit L and some upper limit U that the type can represent. If the ideal mathematical result of an operation is less than L, the operation underflows. If the ideal mathematical result of an operation is greater than U, the operation overflows. “Underflow” and “overflow” mean the operation has gone out of the bounds of the type. “Overflow” may also be used to refer to any exceeding of the bounds of the type, including in the low direction.

It does not mean that fewer bits are needed to represent the result. When 100001112 is subtracted from 111101112, the result, 011100002 = 11100002, is within bounds, so there is no overflow or underflow. The fact that it needs fewer bits to represent is irrelevant.

(Note: For integer arithmetic, “underflow” or “overflow” is defined relative to the absolute bounds L and U. For floating-point arithmetic, these terms have somewhat different meanings. They may be defined relative to the magnitude of the result, neglecting the sign, and they are defined relative to the finite non-zero range of the format. A floating-point format may be able to represent 0, then various finite non-zero numbers, then infinity. Certain results between 0 and the smallest non-zero number the format can represent are said to underflow even though they are technically inside the range of representable numbers, which is from 0 to infinity in magnitude. Similarly, certain results above the greatest representable finite number are said to overflow even though they are inside the representable range, since they are less than infinity.)

0
Kaz On

Underflow in unsigned subtraction c = a - b occurs whenever b is larger than a.

However, that's somewhat of a circular definition, because how many kinds of machines perform the a < b comparison is by subtracting the operands using wraparound arithmetic, and then detecting the overflow based on the two operands and the result.

Note also that in C we don't speak about "overflow", because there is no error condition: C unsigned integers provide that wraparound arithmetic that is commonly found in hardware.

So, given that we have the wraparound arithmetic, we can detect whether wraparound (or overflow, depending on point of view) has taken place in a subtraction.

What we need is the most significant bits from a, b and c. Let's call them A, B and C. From these, the overflow V is calculated like this:

A B C | V
------+--
0 0 0 | 0
0 0 1 | 1
0 1 0 | 1
0 1 1 | 1
1 0 0 | 0
1 0 1 | 0
1 1 0 | 0
1 1 1 | 1

This simplifies to

A'B + A'C + BC

In other words, overflow in the unsigned subtraction c = a - b happens whenever:

  1. the msb of a is 0 and that of b is 1;
  2. or the msb of a is 0 and that of c is 1;
  3. or the msb of b is 1 and that of c is also 1.

Subtracting 247 - 135 = 112 is clearly not overflow, since 247 is larger than 135. Applying the rules above, A = 1, B = 0 and C = 0. The 1 1 0 row of the table has a 0 in the V column: no overflow.

0
Dragoslav Marinica On

Long story short, this is what happens when you have:

unsigned char n = 255; /* highest possible value for an unsigned char */

n = n + 1; /* now n is "overflowing" (although the terminology is not correct) to 0 */
printf("overflow: 255 + 1 = %u\n", n);
   
n = n - 1; /* n will now "underflow" from 0 to 255; */
printf("underflow: 0 - 1 = %u\n", n);
   
n *= 2; /* n will now be (255 * 2) % 256 = 254; 
        /* when the result is too high, modulo with 2 to the power of 8 is used */ 
        /* for an 8 bit variable such as unsigned char; */
printf("large overflow: 255 * 2 = %u\n", n);   
   
n = n * (-2) + 100; /* n should now be -408 which is 104 in terms of unsigned char. */
                       /* (Logic is this: 408 % 256 = 152; 256 - 152 = 104) */
printf("large underflow: 255 * 2 = %u\n", n);

The result of that is (compiled with gcc 11.1, flags -Wall -Wextra -std=c99):

overflow: 255 + 1 = 0
underflow: 0 - 1 = 255
large overflow: 255 * 2 = 254
large underflow: 255 * 2 = 104

Now the scientific version: The comments above represent just a mathematical model of what is going on. To better understand what is actually happening, the following rules apply:

  1. Integer promotion:

Integer types smaller than int are promoted when an operation is performed on them. If all values of the original type can be represented as an int, the value of the smaller type is converted to an int; otherwise, it is converted to an unsigned int.

So what actually happens in memory when the computer does n = 255; n = n + 1; is this: First, the right side is evaluated as an int (signed), because the result fits in a signed int according to the rule of integer promotion. So the right side of the expression becomes in binary: 0b00000000000000000000000011111111 + 0b00000000000000000000000000000001 = 0b00000000000000000000000100000000 (a 32 bit int).

  1. Truncation

The 32-bit int loses the most significant 24 bits when being assigned back to an 8-bit number.

So, when 0b00000000000000000000000100000000 is assigned to variable n, which is an unsigned char, the 32-bit value is truncated to an 8-bit value (only the right-most 8 bits are copied) => n becomes 0b00000000.

The same thing happens for each operation. The expression on the right side evaluates to a signed int, than it is truncated to 8 bits.