I am having a problem fitting data with sm.GLM, regressing Y on a spline basis. I am using the following code:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from patsy import dmatrix

size = 5000
rounded_data = np.loadtxt('test.txt')
c1 = np.repeat(np.arange(rounded_data.shape[0]), rounded_data.shape[0])
c2 = np.tile(np.arange(rounded_data.shape[0]), rounded_data.shape[0])
c3 = rounded_data.flatten()
dist = np.abs(c1 - c2) * size
dfs = pd.DataFrame({'c1': c1, 'c2': c2, 'c3': c3, 'distance': dist})
train_x = dfs["distance"]
train_y = dfs["c3"]
# Using natural cubic spline with degree 3
n_knots = 12
knots = np.percentile(train_x, np.linspace(0, 100, n_knots + 2)[1:-1])
transformed_x = dmatrix("cr(train_x, df=n_knots + 2)", {"train_x": train_x})
y_mean = np.mean(train_y)
y_var = np.var(train_y, ddof = 1)
alpha = y_mean ** 2 / (y_var - y_mean)
fit1 = sm.GLM(dfs["c3"], transformed_x, family=sm.families.NegativeBinomial(alpha=alpha))
result = fit1.fit()
print(result.summary())
When I run it in sequential mode, it shows the following result:
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                      c3   No. Observations:              131627
Model:                             GLM   Df Residuals:                  131613
Model Family:         NegativeBinomial   Df Model:                          13
Link Function:                     Log   Scale:                         1.0000
Method:                           IRLS   Log-Likelihood:           -4.9962e+05
Date:                 Thu, 11 May 2023   Deviance:                  1.4181e+05
Time:                         17:31:42   Pearson chi2:                1.50e+05
No. Iterations:                    100   Pseudo R-squ. (CS):            0.9999
Covariance Type:             nonrobust
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const        8.07e+10   3.09e+10      2.616      0.009    2.02e+10    1.41e+11
x1          -8.07e+10   3.09e+10     -2.616      0.009   -1.41e+11   -2.02e+10
x2          -8.07e+10   3.09e+10     -2.616      0.009   -1.41e+11   -2.02e+10
x3          -8.07e+10   3.09e+10     -2.616      0.009   -1.41e+11   -2.02e+10
x4          -8.07e+10   3.09e+10     -2.616      0.009   -1.41e+11   -2.02e+10
...
But if I run it as a Slurm batch job, it produces the following:
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                      c3   No. Observations:              133083
Model:                             GLM   Df Residuals:                  133069
Model Family:         NegativeBinomial   Df Model:                          13
Link Function:                     Log   Scale:                         1.0000
Method:                           IRLS   Log-Likelihood:           -4.2261e+05
Date:                 Thu, 11 May 2023   Deviance:                  1.4775e+05
Time:                         17:12:12   Pearson chi2:                1.55e+05
No. Iterations:                    100   Pseudo R-squ. (CS):            0.9998
Covariance Type:             nonrobust
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          1.9531      0.004    435.520      0.000       1.944       1.962
x1             3.7154      0.007    547.396      0.000       3.702       3.729
x2             1.4755      0.005    268.910      0.000       1.465       1.486
x3             1.1598      0.006    209.027      0.000       1.149       1.171
x4             0.6174      0.006    105.013      0.000       0.606       0.629
...
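In the sequential run, every coefficient is ±8.07e+10 with the same standard error, and both runs stop at 100 iterations. That pattern may point to a rank-deficient design matrix: `dmatrix` adds an intercept column by default, and if the `cr` spline columns sum to one in every row, the intercept is redundant. The following is a minimal sketch of a rank check, not a confirmed diagnosis. The helper name `design_report` and the synthetic matrix are assumptions made for illustration; only NumPy is used.

```python
import numpy as np


def design_report(X):
    """Return column count, numerical rank, and condition number of X."""
    X = np.asarray(X, dtype=float)
    return {
        "n_columns": X.shape[1],
        "rank": int(np.linalg.matrix_rank(X)),
        "condition_number": float(np.linalg.cond(X)),
    }


# Synthetic stand-in for the real design matrix (an assumption, not the
# data from the question): an intercept plus three indicator columns
# that sum to one in every row, so the intercept is redundant.
groups = np.arange(200) % 3
basis = np.eye(3)[groups]
X_demo = np.column_stack([np.ones(200), basis])

report = design_report(X_demo)
print(report)  # rank is smaller than n_columns for this matrix
```

With the real data, `design_report(np.asarray(transformed_x))` could be called in both runs. If the rank is below the number of columns, the fitted coefficients are not uniquely determined, and removing the intercept from the formula (for example by appending `- 1`) is one option to try.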
Note that apart from the batch submission, everything is the same: the variables, the conda environment, the data, and the libraries. I am not sure why this is happening. Any help would be appreciated. Thank you in advance.
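The two summaries report different numbers of observations (131627 vs 133083), so it may be worth verifying that both runs really read the same file with the same package versions. Below is a rough sketch that records a fingerprint of each run for side-by-side comparison. The function name `run_fingerprint` and the choice of packages and environment variables are assumptions for illustration; the code uses only the standard library.

```python
import hashlib
import json
import os
import platform
import sys
from importlib import metadata


def run_fingerprint(data_path,
                    packages=("numpy", "pandas", "statsmodels", "patsy", "scipy")):
    """Collect facts about the input file and runtime for comparing two runs."""
    with open(data_path, "rb") as fh:
        raw = fh.read()

    # Installed versions; None if a package is not found in this environment.
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None

    # Variables that commonly differ between a login shell and a batch job.
    env_names = ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS",
                 "SLURM_JOB_ID", "SLURM_CPUS_PER_TASK", "CONDA_DEFAULT_ENV")

    return {
        "data_path": os.path.abspath(data_path),
        "data_bytes": len(raw),
        "data_sha256": hashlib.sha256(raw).hexdigest(),
        "python": sys.version.split()[0],
        "executable": sys.executable,
        "machine": platform.machine(),
        "cwd": os.getcwd(),
        "versions": versions,
        "env": {name: os.environ.get(name) for name in env_names},
    }


# Print the fingerprint only if the file from the question is present.
if __name__ == "__main__" and os.path.exists("test.txt"):
    print(json.dumps(run_fingerprint("test.txt"), indent=2))
```

Running this once interactively and once inside the batch script, then comparing the two outputs, should show whether the file contents, the Python executable, the package versions, or the thread settings differ between the two modes.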