# Multicollinearity in Statsmodels GLM in sequential mode but correct output in batch submission


I am having a problem fitting `Y` against a spline basis with `sm.GLM`. I am using the following code:

``````
import numpy as np
import pandas as pd
import statsmodels.api as sm
from patsy import dmatrix

# rounded_data is defined earlier (not shown); the indexing below
# assumes it is a square 2-D array.
size = 5000
c1 = np.repeat(np.arange(rounded_data.shape[0]), rounded_data.shape[0])  # row index
c2 = np.tile(np.arange(rounded_data.shape[0]), rounded_data.shape[0])    # column index
c3 = rounded_data.flatten()
dist = np.abs(c1 - c2) * size
dfs = pd.DataFrame({'c1': c1, 'c2': c2, 'c3': c3, 'distance': dist})
train_x = dfs["distance"]
train_y = dfs["c3"]

# Natural cubic spline basis
n_knots = 12

# Note: these knots are computed but not passed to cr() below,
# which chooses its own knots from df.
knots = np.percentile(train_x, np.linspace(0, 100, n_knots + 2)[1:-1])

transformed_x = dmatrix("cr(train_x, df=n_knots + 2)", {"train_x": train_x})

# Negative binomial parameter estimated from the sample mean and variance
y_mean = np.mean(train_y)
y_var = np.var(train_y, ddof=1)
alpha = y_mean ** 2 / (y_var - y_mean)

fit1 = sm.GLM(train_y, transformed_x, family=sm.families.NegativeBinomial(alpha=alpha))
result = fit1.fit()
print(result.summary())
``````

When I run it in sequential mode, it produces the following result:

``````
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                  c3   No. Observations:               131627
Model:                            GLM   Df Residuals:                   131613
Model Family:        NegativeBinomial   Df Model:                           13
Link Function:                    Log   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:            -4.9962e+05
Date:                Thu, 11 May 2023   Deviance:                   1.4181e+05
Time:                        17:31:42   Pearson chi2:                 1.50e+05
No. Iterations:                   100   Pseudo R-squ. (CS):             0.9999
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const        8.07e+10   3.09e+10      2.616      0.009    2.02e+10    1.41e+11
x1          -8.07e+10   3.09e+10     -2.616      0.009   -1.41e+11   -2.02e+10
x2          -8.07e+10   3.09e+10     -2.616      0.009   -1.41e+11   -2.02e+10
x3          -8.07e+10   3.09e+10     -2.616      0.009   -1.41e+11   -2.02e+10
x4          -8.07e+10   3.09e+10     -2.616      0.009   -1.41e+11   -2.02e+10
...
``````
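The coefficients in this summary all have the same magnitude with opposite signs, a pattern that is consistent with a rank-deficient design matrix. A minimal sketch of how that could be checked is shown here. It uses a toy matrix, not the actual `transformed_x`: the helper `design_diagnostics` and the toy basis are hypothetical, and whether the `cr(...)` columns really sum to one in every row (and so duplicate the intercept column) is an assumption to verify on the real data.

```python
import numpy as np


def design_diagnostics(X):
    """Return the column count, numerical rank, and condition number of X."""
    X = np.asarray(X, dtype=float)
    return {
        "n_columns": X.shape[1],
        "rank": int(np.linalg.matrix_rank(X)),
        "condition_number": float(np.linalg.cond(X)),
    }


# Toy basis (assumption): three non-negative columns whose rows sum to one.
rng = np.random.default_rng(0)
weights = rng.random((50, 3))
basis = weights / weights.sum(axis=1, keepdims=True)

# Adding a constant column makes the matrix rank-deficient, because the
# constant is exactly the sum of the three basis columns.
with_intercept = np.column_stack([np.ones(50), basis])

print(design_diagnostics(with_intercept))  # rank is lower than n_columns
print(design_diagnostics(basis))           # rank equals n_columns
```

Calling `design_diagnostics(np.asarray(transformed_x))` on the real design matrix in both runs would show whether the rank falls short of the 14 columns in either case.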

When I run the same script as a Slurm batch job, however, it produces the following:

``````
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                  c3   No. Observations:               133083
Model:                            GLM   Df Residuals:                   133069
Model Family:        NegativeBinomial   Df Model:                           13
Link Function:                    Log   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:            -4.2261e+05
Date:                Thu, 11 May 2023   Deviance:                   1.4775e+05
Time:                        17:12:12   Pearson chi2:                 1.55e+05
No. Iterations:                   100   Pseudo R-squ. (CS):             0.9998
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          1.9531      0.004    435.520      0.000       1.944       1.962
x1             3.7154      0.007    547.396      0.000       3.702       3.729
x2             1.4755      0.005    268.910      0.000       1.465       1.486
x3             1.1598      0.006    209.027      0.000       1.149       1.171
x4             0.6174      0.006    105.013      0.000       0.606       0.629
...
``````

Please note that, apart from the batch submission, everything is the same: the variables, the conda environment, the data, and the libraries. I am not sure why this is happening. Any help would be appreciated. Thank you in advance.
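Since the two summaries report different observation counts (131627 and 133083), it may be worth confirming that both runs really see the same environment. The following is a rough sketch, not a definitive diagnostic: the helper `environment_report` is hypothetical, and the choice of packages and environment variables to record is an assumption about what might differ between an interactive session and a Slurm job.

```python
import os
import platform
import sys
from importlib import metadata


def environment_report(
    packages=("numpy", "pandas", "patsy", "statsmodels"),
    env_vars=(
        "CONDA_DEFAULT_ENV",
        "SLURM_JOB_ID",
        "SLURM_CPUS_PER_TASK",
        "OMP_NUM_THREADS",
        "MKL_NUM_THREADS",
        "OPENBLAS_NUM_THREADS",
    ),
):
    """Collect interpreter, package-version, and environment-variable details."""
    report = {
        "python": sys.version.split()[0],
        "executable": sys.executable,
        "platform": platform.platform(),
    }
    for name in packages:
        try:
            report[f"pkg:{name}"] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[f"pkg:{name}"] = "not installed"
    for var in env_vars:
        report[f"env:{var}"] = os.environ.get(var, "<unset>")
    return report


# Print one "key = value" line per entry so two runs can be diffed.
for key, value in sorted(environment_report().items()):
    print(f"{key} = {value}")
```

Running this at the top of the script in both modes and comparing the two outputs line by line would show any difference in interpreter, library versions, or thread settings.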