Shifting with floats

float a = 1.0 + ((float) (1 << 25));
float b = 1.0 + ((float) (1 << 26));
float c = 1.0 + ((float) (1 << 27));

What are the float values of a, b, and c after running this code? Explain why the bit layout of a, b, and c causes each value to be what it is.

chux - Reinstate Monica

What are the float values of a, b, and c after running this code?

When int is 32 bits wide, the integer shifts below are well defined and exact. The code is not shifting a float @EOF.

// OK with 32-bit int
1 << 25
1 << 26
1 << 27

Casting the above power-of-2 values to float is also well defined, with no loss of precision.

// OK and exact
(float) (1 << 25)
(float) (1 << 26)
(float) (1 << 27)

Adding those float values to the double constant 1.0 converts them to double, and the sums are well defined and exact. A typical double has a 53-bit significand and can represent 0x8000001.0p0 exactly, e.g. DBL_MANT_DIG == 53.

// Let us use hexadecimal FP notation
1.0 + ((float) (1 << 25))  // 0x2000001.0p0 or 0x1.0000008p+25
1.0 + ((float) (1 << 26))  // 0x4000001.0p0 or 0x1.0000004p+26
1.0 + ((float) (1 << 27))  // 0x8000001.0p0 or 0x1.0000002p+27

Finally, the code assigns these double values to a float. The values are within the range of a typical float encoding, but a float cannot represent them exactly.

A typical float has a 24 bit significand. e.g.: FLT_MANT_DIG == 24

If the value being converted is in the range of values that can be represented but cannot be represented exactly, the result is either the nearest higher or nearest lower representable value, chosen in an implementation-defined manner. C17dr § 6.3.1.4 2.

A typical implementation rounds to nearest, with ties going to even.

#include <stdio.h>

int main(void) {
  float s = 0x0800001.0p0; printf("%a\n", s); // exactly representable as a float
  float t = 0x1000001.0p0; printf("%a\n", t); // 0x1000001.0p0 is halfway between two floats
  float a = 0x2000001.0p0; printf("%a\n", a);
  float b = 0x4000001.0p0; printf("%a\n", b);
  float c = 0x8000001.0p0; printf("%a\n", c);
}

Output

0x1.000002p+23   // exact conversion double to float
0x1p+24          
0x1p+25
0x1p+26
0x1p+27

Explain why the bit layout of a, b, and c causes each value to be what it is.

The bit layout is not the issue. What matters is a property of float: with FLT_MANT_DIG == 24 it has a 24-bit significand, and the implementation-defined rounding maps each double value to a nearby float. Any float layout with FLT_MANT_DIG == 24 would give like results.