Casting between int, float and double in C


I don't really understand casting in C. Can anyone help me with a question in the book Computer Systems: A Programmer's Perspective:


We generate arbitrary integer values x, y, and z, and convert them to values of type double as follows:

int x = random();
int y = random();
int z = random();

double dx = (double) x;
double dy = (double) y;
double dz = (double) z;

For each of the following C expressions, you are to indicate whether or not the expression always yields 1. If it always yields 1, describe the underlying mathematical principles. Otherwise, give an example of arguments that make it yield 0.

A. (float) x == (float) dx
B. dx - dy == (double) (x-y)
C. (dx + dy) + dz == dx + (dy + dz)
D. (dx * dy) * dz == dx * (dy * dz)
E. dx / dx == dz / dz

There are 3 answers

Answer by sopho-saksham

Since you are just converting x, y, z to their double form, dx, dy, dz are nothing but x, y, z written in floating-point notation. On the usual platforms (32-bit int, 64-bit IEEE double) the conversion is exact, because double's 53-bit significand can hold every 32-bit integer: if x = 5 then dx is exactly 5.000000, and it will never become 5.000001. The precision problems these questions probe come from elsewhere: float's smaller significand, products that need more bits than the significand has, integer overflow, and 0/0. The general advice to be careful when testing floating-point numbers for equality still stands.

Answer by old_timer

What casting does is convert from one thing to another. It might be an 8-bit integer to a 16-bit or 32-bit one:

unsigned int x;
unsigned char y;

y = 7;
x = (unsigned int) y;

The cast is implied, of course, if you just wrote x = y;. Assuming 8- and 32-bit types here, the value goes from the bit pattern 0x07 to 0x00000007.

This is actually a really fun exercise. Let's make up a format and then ponder the questions.

Floating-point formats generally work like this, and we can think purely in positive numbers for most of this exercise. Take the number 1234 in base 10: in the scientific notation we learn in grade school, that is 1.234 times 10 to the power 3. Base-two computer formats work the same way; there is a reason it is called FLOATING point, since the point moves around in a virtual sense. So for the number nine, 0b1001, you find the most significant one and put the point after it: 1.001 times 2 to the power 3. The number seventeen is 1.0001 times 2 to the power 4.

A floating-point format has to cover the sign; it needs some number of bits for the "fraction" or mantissa, since we are forcing a fraction even on whole integer numbers; and it needs some number of bits for the exponent, the power of two that is applied. We don't necessarily need to store the one before the point; it can be implied in some formats. Zero is a special problem, of course, and the format needs a special exception or pattern for it.

For starters, let's assume only positive integers are being generated and our integers are 8 bits, so 0 to 0xFF is our entire world of integers. Say our made-up double format has 12 fraction bits and our single has 5, for the sake of argument.

So what is our worst-case number? 0xFF, which we can easily represent with 12 fraction bits: 1.111111100000 times 2 to the power 7. Let's assume our format has more than enough exponent bits to cover this entire exercise.

So our double can hold every one of our integers. In C,

dx = (double) x;

just means convert formats. If we started with the bit pattern 00001001, which is the number 9, then in our made-up double that would be 1.001000000000 times 2 to the power 3, the power being some more bits we store in the format but not relevant to this question.

And in our made-up single, that number 9 is 1.00100 times 2 to the power 3.

But the number 0xFF in our single precision is 1.11111 times 2 to the power 7. When we convert that back to an integer it is 0xFC, not 0xFF: we lost some bits in the conversion, assuming no rounding.

In base 10, 1/27 is 0.037037... If we cut that off at four decimal places we get 0.0370, which is a little low. Round 2/3 = 0.6666... to three places instead and we get 0.667, a little high. The same trade-off exists in base 2. If we look at 0xFF in our single, it is 1.11111 with the next two bits being tossed; those bits are 11, which is more than a half. So what if we rounded up? 10.00000 times 2 to the 7th normalizes to 1.00000 times 2 to the 8th power: 0xFF rounded to 0x100, basically. We could go with 0x100 representing 0xFF, or 0xFC representing 0xFF; a little high or a little low, but not exact. When we convert back we are either a little high or a little low. That is exactly what is going on when you do these integer-to-float-and-back conversions.

So look at the first case, A: (float)x vs (float)((double)x), which in our formats is 1.00100 times 2 to the power 3 vs 1.001000000000 times 2 to the power 3. Covering float x vs double x, the double then has to be converted down. Is the same clipping and rounding used between floating-point formats as from integer to float? That depends on the hardware, but one would hope that 11111111 converts the same way 1.111111100000 does. In an ideal world it would.

C is an interesting case, and I will put it this way: how many bits does it take to represent a two-bit number plus a two-bit number? Worst case for positive numbers, 3 + 3 = 6, which takes three bits. Scale that up: 0xFF plus 0xFF, how many bits? Is it more than the 12 in our double format? Take 0xFF plus 0xFF plus 0xFF: how many bits does that take, and is it more than 12? Does re-arranging the grouping change that?

D: for two-bit numbers, 3 * 3 = 9, so two bits in for each operand, four bits out. What about 0xFF times 0xFF? Then 0xFF times 0xFF times 0xFF: does that take more than 12 bits? That is the first question. Second question: if so, how does the clipping and rounding work, and is it affected by the grouping? That is the brain bender so far; I may have to write a simple program to understand it.

And E is not as nasty as I first thought, re-reading it. I am dividing a bit pattern by the exact same bit pattern. What is our biggest problem with division, not just on computers but in general? What integer gives us a problem, and do any others? What if we allow signed numbers in here: positive x divided by positive x, and negative z divided by negative z?

So search for double-precision floating point at Wikipedia, then search for single precision. How many "fraction" bits are there for double? Assuming a 32-bit int, or 31 bits and a sign, do all the significant digits fit? Thinking in positive numbers, we would have up to a power of 2 to the 31st; how many exponent bits does it take to represent the number 31, and is there enough? What about single: can you hold 30 or 31 significant bits in the fraction? Can you represent an exponent of plus or minus 31?

EDIT

So, thinking about case D, I wrote a program. I used 8 bits of fraction:

//5 9 f 020140 030120 0301E0 090151 090152
(5 * 9) * 15 vs 5 * (9 * 15)

So 5 is 1.01000000 or 0x1.40, 9 is 1.00100000 or 0x1.20, and 0xF is 1.11100000 or 0x1.E0.

Multiplication in float is not what you might expect; it makes your brain hurt a little, because we are not multiplying 5 times 9 directly. We have shifted everything so that each operand is 1.something, so it is instead

0x120 * 0x140 = 0x16800

and because of that normalization you chop off 8 bits (this is where I rounded, as we will see). In this case the result is 0x168 and no normalization is needed:

5*9 = 0x2D = 101101
1.01101000 0x1.68

I don't need to care about the exponents here because they just add: 5 has an exponent of 2 and 9 an exponent of 3, so the result is 0x168 with an exponent of 5. Next, 0x168 times 0x1E0 = 0x2A300, and we instantly chop off 8 bits, giving 0x2A3, due to the nature of multiplication. Now we normalize: we need 1.something, not 2.something, so we shift right and increase the exponent. The exponent was 5 + 3 = 8, and we give it one more, making 9. But notice something: 0x2A3 is 0b1010100011, so we are about to throw away a bit of precision. Instead of 0x1.51 and a half of a bit, we have 0x1.51; that is the nature of base-2 floating point. Is that supposed to round up the answer? Perhaps. If so, the answer is 0x1.52.

Take it the other way 5*(9*15)
0x120*0x1E0 = 0x21C00
or 0x21C  1000011100
0x10E
0x140*0x10E = 0x15180
we are going to lose a bit here
is it 0x151 or 0x152?

And are these equivalent rounding questions? Do both paths result in 0x152, making them equal, or is one view of chopping off bits different than the other? If we don't round at all and just clip, both answers are 0x151.

3 11 1f 010180 040110 0401F0 0A018B 0A018A
(3*17)*31 vs 3*(17*31)  no rounding just clipping
(3*17)*31
0x180*0x110 = 0x19800
0x198
0x198*0x1F0 = 0x31680  0x316 0x1100010110
0x18B
3*(17*31)
0x110*0x1F0 = 0x20f00
0x107
0x180*0x107 = 0x18A80
0x18A
0x18B != 0x18A

On one path we clipped off two bits, on the other path just one. Was that fair? Take the 0x31680 as a whole:

110001011010000000
110001011 010000000 with discarded bits on the right

So looking at it that way, 01 or 010 or 0100 is less than half; that doesn't round up, any more than a 3 or a 2 rounds up. 1.33 does not round to 1.4 in base 10.

But 0x20F00:

100000111100000000
100000111 100000000

That is right at the halfway point: a discarded 0.1, 0.10, or 0.100 in binary is exactly a half.

Should that have been rounded up, to 0x108?

0x180*0x108 = 0x18C00
0x18C
0x18C != 0x18B

So viewing rounding that way, it still does not match: ordering made a difference.

Maybe you think I am doing the rounding wrong, and that is fair. If so, does that make all the possible integer patterns work? Assuming an int is 32 bits and double is IEEE 754 with 52 bits of mantissa, we are going to overflow that and have to chop off bits, so chopping and rounding will happen. Does ordering matter?

Answer by chux - Reinstate Monica

C is a very portable language that targets many platforms: all sorts of processors and computers. The results of the five equality tests vary from platform to platform. Let us look to the C specification.


int x = random(); amounts to asking, for each of the following tests: "How do the results behave for every int?" Let us look closer.

random() is not a standard C library function. The name "random" hints that the function result ranges over the entire int range, [INT_MIN...INT_MAX], or over [0...INT_MAX]. Either assumption works here.

=====================================================================

int
C specifies that the minimum range of int is -(2^15-1) to 2^15-1, or -32,767 to 32,767.
Commonly, though certainly not always, platforms use the range -2^31 to 2^31-1, or -2,147,483,648 to 2,147,483,647.
Some use the range -2^63 to 2^63-1, or about -9.2*10^18 to 9.2*10^18.
C does not specify the maximum size of the range.

double
C specifies that the minimum range of double is -10^37 to 10^37.
Commonly, though certainly not always, platforms use a range of about -1.8*10^308 to 1.8*10^308.
C does not specify the maximum size of the range.

double is floating point. It represents numbers only up to a certain precision. Large whole numbers in the range are not all representable.

C specifies that the minimum range of whole numbers exactly representable in a double is -10^9 to 10^9.
Commonly, this range of exactly representable whole numbers is about -9*10^15 to 9*10^15.

The IEEE tag hints that binary64 is used. We could go with that, but the exact FP format does not change the results much: the key issue is the width of the int range versus the range of the double's exactly representable whole numbers. Either one may be wider than the other; int is commonly narrower.


With double dx = (double) x, the conversion on many platforms is exact. It is not exact on rare platforms where the range of int is wider than the range of a double's exactly representable whole numbers.

With (float) x, the conversion is commonly inexact over the range of int, as float has no more precision than double and is commonly narrower.

When an inexact conversion from int to float or double occurs, the result is normally the closest representable value.


(float) x == (float) dx can be false on platforms where the exactly representable whole-number ranges of float and double differ and both are a sub-range of int. Likely INT_MAX will fail there. With binary64 and a 32-bit int (OP's case?), this equality is always true.

dx - dy == (double) (x-y) can be false for the same reason as above, and for an additional one: in C, signed int overflow is undefined, so a subtraction such as INT_MIN - 1 is undefined. Any result is possible.

(dx + dy) + dz == dx + (dy + dz) can be false when the int range is wider than double's exact range (uncommon) and dx is a rounded value whose next higher representable value is 2 greater. (dx + 1) may round back to dx, yet dx + 2 is representable, so (dx + 1) + 1 != dx + (1 + 1).

(dx * dy) * dz == dx * (dy * dz) is easily false when the sub-products and/or products are not exact, which is common when the mathematical product exceeds the exactly representable whole numbers of double.

dx / dx == dz / dz can simply be false when x is 0: the quotient 0.0/0.0 is NaN, and a NaN compares unequal to everything, so the expression yields 0.