I have come across a peculiar situation when preprocessing data.

Let's say I have a dataset A. I split it into A_train and A_test, fit a scaler (any of the scikit-learn scalers) on A_train, and transform A_test with that scaler. Training the neural network on A_train and validating on A_test works well: no overfitting, and performance is good.
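The workflow above can be sketched as follows. This is a minimal illustration assuming a StandardScaler and made-up synthetic data standing in for dataset A:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
A = rng.normal(loc=50.0, scale=10.0, size=(200, 3))  # hypothetical dataset A

A_train, A_test = train_test_split(A, test_size=0.25, random_state=0)

scaler = StandardScaler()
A_train_scaled = scaler.fit_transform(A_train)  # fit statistics on A_train only
A_test_scaled = scaler.transform(A_test)        # reuse A_train's mean/std on A_test
```

The key point is that `transform` on A_test reuses the mean and standard deviation learned from A_train, so train and validation data pass through the exact same mapping.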
Let's say I have a dataset B with the same features as A, but with different ranges of values for those features. A simple example: A and B could be the Boston and Paris housing datasets respectively (this is just an analogy to say that feature ranges like cost, crime rate, etc. vary significantly). To test the performance of the above trained model on B, we transform B according to the scaling attributes of A_train and then validate. This usually degrades performance, as the model has never been shown data from B's feature ranges.
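The degradation is easy to see numerically. In this sketch (made-up data, StandardScaler assumed), B's features sit in a much higher range than A_train's, so applying A_train's statistics leaves B far from the roughly standard-normal inputs the network was trained on:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
A_train = rng.normal(loc=50.0, scale=10.0, size=(150, 3))  # hypothetical A_train
B = rng.normal(loc=200.0, scale=40.0, size=(150, 3))       # same features, shifted ranges

scaler = StandardScaler().fit(A_train)
B_scaled = scaler.transform(B)  # uses A_train's mean/std, not B's

# B's scaled values land around (200 - 50) / 10 = 15, i.e. ~15 standard
# deviations away from anything the network saw during training.
print(B_scaled.mean(axis=0))
```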
The peculiar thing is that if I fit and transform on B directly, instead of using the scaling attributes of A_train, the performance is a lot better. Fitting a scaler on the evaluation data like this usually reduces performance (it does when I try it on A_test), yet in this scenario it seems to work, although it's not right.
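One plausible reason it "works": fitting directly on B recentres B to roughly zero mean and unit variance, which matches the input distribution the network saw during training, even though B's own statistics leak into the transform. A minimal sketch with made-up data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
B = rng.normal(loc=200.0, scale=40.0, size=(150, 3))  # hypothetical dataset B

# Fitting on B itself maps B back into the ~N(0, 1) range the network
# was trained on, regardless of B's original feature ranges.
B_scaled = StandardScaler().fit_transform(B)
```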
Since I work mostly on climate datasets, training on every dataset is not feasible. I would therefore like to know the best way to scale such different datasets with the same features to get better performance. Any ideas, please?
PS: I know training my model with more data can improve performance, but I am more interested in the right way of scaling. I tried removing outliers from the datasets and applying a QuantileTransformer; it improved performance, but it could be better.