The same variable obtained using the round function gives different results when printed twice


My complete code is as follows; the Python version is 3.8.12.

import xgboost as xgb

from sklearn.model_selection import train_test_split

import pandas as pd
import numpy as np

# Obtaining the data set, method 1 (load_boston was removed in scikit-learn 1.2; use method 2 on newer versions)
from sklearn.datasets import load_boston
boston = load_boston()
X = boston.data
y = boston.target

# Obtaining the data set, method 2
# data_url = "http://lib.stat.cmu.edu/datasets/boston"
# raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
# X  = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
# y = raw_df.values[1::2, 2]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

model = xgb.train({'objective': 'reg:squarederror'}, dtrain)

y_pred = model.predict(dtest)

mean_col1 = round(y_test.mean(), 4)
mean_col2 = round(y_pred.mean(), 4)

# first print
print(mean_col1, mean_col2)

# second print
print(f"real price avg: {mean_col1}, predict price avg: {mean_col2}")

The output is:

21.4882 20.5224
real price avg: 21.4882, predict price avg: 20.52239990234375

My question is why the last number is not retained to four decimal places.

The first time I printed mean_col2 it was shown with four decimal places, but the second time, although it is clearly the same variable, I did not get the same result.

I have tried it in Jupyter Notebook, IPython, and .py files, and the results are the same. This happens only with mean_col2.

Please tell me the reason and how to solve it. Thank you in advance.


There are 2 answers

NotAName

This appears to be a difference in how the number is converted to text in the second print statement, not a change in the value itself.

If you take your code and print hex values instead of the decimal representation, you'll see that in both cases we're printing exactly the same bytes:

---snip---

# first print
print(float(mean_col1).hex(), float(mean_col2).hex())

# second print
print(f"real price avg: {float(mean_col1).hex()}, predict price avg: {float(mean_col2).hex()}")
Out:
0x1.57cfaacd9e83ep+4 0x1.53edfa0000000p+4
real price avg: 0x1.57cfaacd9e83ep+4, predict price avg: 0x1.53edfa0000000p+4
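The trailing zeros in the second hex value hint that the prediction average only carries single-precision data, which would be consistent with XGBoost returning float32 predictions. The following is a minimal sketch of that check using only the standard library; `fits_in_float32` is a hypothetical helper name introduced here for illustration, and the hex strings are copied from the output above.

```python
import struct


def fits_in_float32(value: float) -> bool:
    """Return True if value survives a round trip through 4-byte IEEE 754 storage."""
    return struct.unpack("f", struct.pack("f", value))[0] == value


# Hex strings copied from the output above.
real_avg = float.fromhex("0x1.57cfaacd9e83ep+4")
pred_avg = float.fromhex("0x1.53edfa0000000p+4")

print(fits_in_float32(real_avg))  # False: the value needs more than 23 mantissa bits
print(fits_in_float32(pred_avg))  # True: the value is exactly representable as float32
```

If mean_col2 is indeed a NumPy float32 scalar rather than a Python float, then print and the f-string may simply be taking different conversion paths for the same stored value.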

I'm not exactly sure why the longer representation only shows up in the second case, but to have the second statement print the result to 4 decimal places, just add a format specifier to your f-string:

print(f"real price avg: {mean_col1:.4f}, predict price avg: {mean_col2:.4f}")

Andrew

This discrepancy arises from the difference between how print converts a number to text (via str) and how a formatted string literal (f-string) converts it (via format).

round(number, 4) returns a new number of the same type, rounded to the nearest value that type can represent; it does not attach a display precision to the variable. Here y_pred.mean() is most likely a NumPy float32, because XGBoost predictions are float32, so mean_col2 is a float32 as well. Printing it directly shows the shortest decimal string that identifies that float32 value, which is 20.5224.

However, when you include the variable in an f-string without a format specifier, the value appears to be converted to a double-precision Python float before it is turned into text. The float32 closest to 20.5224 is exactly 20.52239990234375, so all of those digits are shown. mean_col1 is not affected because y_test is already double precision.

Here's how you can modify your f-string to ensure the number is always displayed with four decimal places:

print(f"real price avg: {mean_col1:.4f}, predict price avg: {mean_col2:.4f}")
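The two usual fixes can be sketched without XGBoost or NumPy. This is an illustration under an assumption: the `as_float32` helper is hypothetical and uses the standard `struct` module to stand in for a float32 value that has been widened to a Python float, which is what the f-string output in the question suggests is happening.

```python
import struct


def as_float32(x: float) -> float:
    """Store x in 4 bytes (float32) and read it back as a Python float."""
    return struct.unpack("f", struct.pack("f", x))[0]


# Stand-in for the rounded float32 mean after it has been widened to a double.
widened = as_float32(20.5224)

print(repr(widened))      # 20.52239990234375, the digits seen in the question
print(f"{widened:.4f}")   # 20.5224, fix 1: explicit format specifier
print(round(widened, 4))  # 20.5224, fix 2: round a Python float, not a float32
```

In the original script, the second fix would presumably look like `mean_col2 = round(float(y_pred.mean()), 4)`, so that the rounding is done on a Python float and both print statements show the same text.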