How do I reproduce an SGDClassifier with modified_huber loss?


I have a model defined like so:

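from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

rng = 42  # fixed seed, passed to SGDClassifier as random_state
model = Pipeline([
    ('scaler', RobustScaler()),
    ('feature', SelectKBest(k=42)),
    ('model', SGDClassifier(loss='modified_huber', shuffle=True, random_state=rng))
])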

When I train and predict in two separate program executions (one ad hoc, the other from a cron job) with exactly the same inputs, I get different model weights and therefore different predictions.

I noticed that 'hinge' is the only loss that reproduces exactly the same weights. What is it about the other loss functions that prevents them from being reproduced?

I've checked and double-checked that the inputs are the same, and verified with other loss functions.


There is 1 answer

Answered by kekekekyle

OK, I've tracked it down. There were tiny differences between the X datasets: values like 12.799 vs 12.8, but only a handful of them (<10 instances in >3k rows, >200 columns). I didn't expect this to have such a large domino effect on the resulting models.
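For anyone hitting the same thing, here is a rough sketch of how differences like these could be located. It assumes both runs can dump their feature matrix to a NumPy array. The helper name `describe_differences` and the variables `X_adhoc` and `X_cron` are made up for illustration, and the tiny arrays stand in for the real data.

```python
import numpy as np


def describe_differences(X_a, X_b):
    """Return (row, col, value_a, value_b) for every cell where the arrays differ.

    Hypothetical helper: the comparison is exact, so 12.799 vs 12.8 is reported.
    Cells that are NaN in both arrays are treated as equal.
    """
    X_a = np.asarray(X_a, dtype=float)
    X_b = np.asarray(X_b, dtype=float)
    if X_a.shape != X_b.shape:
        raise ValueError(f"shape mismatch: {X_a.shape} vs {X_b.shape}")

    both_nan = np.isnan(X_a) & np.isnan(X_b)
    mismatch = ~((X_a == X_b) | both_nan)
    rows, cols = np.nonzero(mismatch)
    return [
        (int(r), int(c), float(X_a[r, c]), float(X_b[r, c]))
        for r, c in zip(rows, cols)
    ]


# Toy stand-ins for the matrices seen by the ad hoc run and the cron run.
X_adhoc = np.array([[12.8, 1.0], [3.5, 7.25]])
X_cron = np.array([[12.799, 1.0], [3.5, 7.25]])

diffs = describe_differences(X_adhoc, X_cron)
for row, col, a, b in diffs:
    print(f"row {row}, col {col}: {a!r} vs {b!r}")
```

An exact comparison is deliberate here: a tolerance-based check such as `np.allclose` would likely have reported these two matrices as equal and hidden the problem.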

Rounding all data to 2 decimal places resulted in the exact same models being produced.
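The rounding check can be sketched roughly like this. Everything here is synthetic and assumed for illustration: the data, the perturbed cells, the helper `fit_coefficients`, and `k=5` (chosen only because the toy matrix has 20 columns). It is not the original dataset or script.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler


def fit_coefficients(X, y, seed=42):
    """Fit the pipeline from the question (with a smaller k) and return the SGD weights."""
    model = Pipeline([
        ("scaler", RobustScaler()),
        ("feature", SelectKBest(k=5)),  # k=5 is an assumption for this toy data
        ("model", SGDClassifier(loss="modified_huber", shuffle=True, random_state=seed)),
    ])
    model.fit(X, y)
    return model.named_steps["model"].coef_.copy()


# Synthetic data with one decimal place, standing in for the real inputs.
data_rng = np.random.default_rng(0)
X_adhoc = np.round(data_rng.normal(loc=10.0, scale=2.0, size=(300, 20)), 1)
y = (X_adhoc[:, 0] + X_adhoc[:, 1] > 20.0).astype(int)

# Second copy with a handful of tiny differences, e.g. 12.8 becoming 12.799.
X_cron = X_adhoc.copy()
X_cron[[3, 57, 120], [0, 4, 11]] -= 0.001

# Round both inputs to 2 decimal places before fitting, then compare the weights exactly.
coef_adhoc = fit_coefficients(np.round(X_adhoc, 2), y)
coef_cron = fit_coefficients(np.round(X_cron, 2), y)
same_after_rounding = np.array_equal(coef_adhoc, coef_cron)
print("identical weights after rounding:", same_after_rounding)
```

Rounding only helps when the paired values land on the same rounded number. Two values that straddle a rounding boundary (say 12.7949 and 12.7951) would still differ after rounding, so finding the source of the discrepancy upstream is probably the more robust fix.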