Multicollinearity in Statsmodels GLM in sequential mode but correct output in batch submission

80 views. Asked by user3901294 on 12 May 2023 at 00:45

I am having a problem fitting data with sm.GLM, regressing Y on a spline basis. I am using the following code:

import numpy as np
import pandas as pd
import statsmodels.api as sm
from patsy import dmatrix

size = 5000
rounded_data = np.loadtxt('test.txt')
# Row and column index of every cell (assumes rounded_data is a square matrix)
c1 = np.repeat(np.arange(rounded_data.shape[0]), rounded_data.shape[0])
c2 = np.tile(np.arange(rounded_data.shape[0]), rounded_data.shape[0])
c3 = rounded_data.flatten()
dist = np.abs(c1 - c2) * size
dfs = pd.DataFrame({'c1': c1, 'c2': c2, 'c3': c3, 'distance': dist})
train_x = dfs["distance"]
train_y = dfs["c3"]
# Using natural cubic spline with degree 3
n_knots = 12

# Interior knots at evenly spaced percentiles (computed here but not passed to cr() below)
knots = np.percentile(train_x, np.linspace(0, 100, n_knots + 2)[1:-1])

transformed_x = dmatrix("cr(train_x, df=n_knots + 2)", {"train_x": train_x})

# Moment-based estimate from the sample mean and variance of the response
y_mean = np.mean(train_y)
y_var = np.var(train_y, ddof=1)
alpha = y_mean ** 2 / (y_var - y_mean)

fit1 = sm.GLM(dfs["c3"], transformed_x, family=sm.families.NegativeBinomial(alpha=alpha))
result = fit1.fit()
print(result.summary())

Now, when I run it in sequential mode, it shows the following result:

     Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:                  c3   No. Observations:               131627
Model:                            GLM   Df Residuals:                   131613
Model Family:        NegativeBinomial   Df Model:                           13
Link Function:                    Log   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:            -4.9962e+05
Date:                Thu, 11 May 2023   Deviance:                   1.4181e+05
Time:                        17:31:42   Pearson chi2:                 1.50e+05
No. Iterations:                   100   Pseudo R-squ. (CS):             0.9999
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const        8.07e+10   3.09e+10      2.616      0.009    2.02e+10    1.41e+11
x1          -8.07e+10   3.09e+10     -2.616      0.009   -1.41e+11   -2.02e+10
x2          -8.07e+10   3.09e+10     -2.616      0.009   -1.41e+11   -2.02e+10
x3          -8.07e+10   3.09e+10     -2.616      0.009   -1.41e+11   -2.02e+10
x4          -8.07e+10   3.09e+10     -2.616      0.009   -1.41e+11   -2.02e+10
...
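
The coefficients above share one magnitude (about 8.07e+10) with opposite signs between const and the spline terms, which is the pattern a linearly dependent design matrix tends to produce. A minimal sketch of a rank check follows. It is not a diagnosis of this particular run: `rank_report` is a hypothetical helper, and the matrix it is shown on is a synthetic stand-in (an indicator basis whose rows sum to 1, plus a constant column), not the data from the question.

```python
import numpy as np


def rank_report(design):
    """Return (n_columns, rank, condition_number) for a 2-D design matrix."""
    X = np.asarray(design, dtype=float)
    n_cols = X.shape[1]
    rank = int(np.linalg.matrix_rank(X))
    cond = float(np.linalg.cond(X))
    return n_cols, rank, cond


# Synthetic stand-in: four indicator columns that sum to 1 in every row.
x = np.linspace(0.0, 1.0, 200)
bins = np.minimum((x * 4).astype(int), 3)
basis = np.eye(4)[bins]

# Adding a constant column makes the columns linearly dependent,
# because the constant equals the sum of the four basis columns.
with_const = np.column_stack([np.ones_like(x), basis])

n_cols, rank, cond = rank_report(with_const)
print(f"with constant:    columns={n_cols} rank={rank} cond={cond:.3g}")

n_cols, rank, cond = rank_report(basis)
print(f"without constant: columns={n_cols} rank={rank} cond={cond:.3g}")
```

Calling `rank_report(np.asarray(transformed_x))` in both the sequential and the batch run would show whether the design matrix has fewer independent columns than total columns. If it does, one option to try is dropping the intercept in the patsy formula, for example `"cr(train_x, df=n_knots + 2) - 1"`, and checking whether the two runs then agree.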

But if I run it through a Slurm batch submission, it produces the following:

Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:                  c3   No. Observations:               133083
Model:                            GLM   Df Residuals:                   133069
Model Family:        NegativeBinomial   Df Model:                           13
Link Function:                    Log   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:            -4.2261e+05
Date:                Thu, 11 May 2023   Deviance:                   1.4775e+05
Time:                        17:12:12   Pearson chi2:                 1.55e+05
No. Iterations:                   100   Pseudo R-squ. (CS):             0.9998
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          1.9531      0.004    435.520      0.000       1.944       1.962
x1             3.7154      0.007    547.396      0.000       3.702       3.729
x2             1.4755      0.005    268.910      0.000       1.465       1.486
x3             1.1598      0.006    209.027      0.000       1.149       1.171
x4             0.6174      0.006    105.013      0.000       0.606       0.629
...

Please note that, apart from the batch submission, everything is the same: variables, conda environment, data, and libraries. I am not sure why this is happening. Please help me with this. Thank you in advance.
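
The two summaries report different observation counts (131627 in the sequential run, 133083 in the batch run), so it may be worth confirming that both runs really see the same input file and package versions. The sketch below records a small fingerprint of the run using only the standard library. The helper names (`package_version`, `file_sha256`, `run_fingerprint`) and the package list are assumptions made for illustration.

```python
import hashlib
import platform
import sys
from importlib import metadata


def package_version(name):
    """Installed version of a distribution, or None if it is not installed."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def file_sha256(path, chunk_size=1 << 20):
    """SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def run_fingerprint(data_path,
                    packages=("numpy", "scipy", "pandas", "patsy", "statsmodels")):
    """Collect interpreter, platform, package versions, and the data file hash."""
    return {
        "python": sys.version.split()[0],
        "executable": sys.executable,
        "platform": platform.platform(),
        "packages": {name: package_version(name) for name in packages},
        "data_sha256": file_sha256(data_path),
    }
```

Printing `run_fingerprint('test.txt')` at the top of the script in both the sequential session and the Slurm job, then comparing the two outputs, would show whether the interpreter, the libraries, or the data file differ between the runs.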
