Using Python to find correlation pairs

NAME        PRICE    SALES   VIEWS   AVG_RATING   VOTES   COMMENTS
Module 1    $12.00      69   12048            5       3         26
Module 2    $24.99      12   52858            5       1         14
Module 3    $10.00       1    1381           -1       0          0
Module 4    $22.99      46   57841            5       8         24
.................

Let's say I have sales statistics like the table above. I would like to find out:

  1. How do price and the other factors impact sales?
  2. Which features have the biggest impact?
  3. What price should I set to maximize sales?

Please advise which Python libraries could help here. Any example would be great!

Answer by Jianxun Li (accepted):

The Python machine-learning library scikit-learn is the most appropriate choice here. Its feature_selection sub-module fits your needs exactly. Here is an example.

from sklearn.datasets import make_regression

# simulate a dataset with 500 factors, of which only 5 are truly
# informative; the remaining 495 are noise. Assume y is your response
# variable 'Sales' and X holds your candidate factors
X, y = make_regression(n_samples=1000, n_features=500, n_informative=5, noise=5)

X.shape
Out[273]: (1000, 500)
y.shape
Out[274]: (1000,)

from sklearn.feature_selection import f_regression
# regress Sales on each factor individually and collect the p-values
_, p_values = f_regression(X, y)
# keep only the significant factors (p < 0.05)
mask = p_values < 0.05
X_informative = X[:, mask]

X_informative.shape
Out[286]: (1000, 38)

Now we see that only 38 of the 500 features were selected (the exact count varies from run to run, since the data are randomly simulated).
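
On your real data you will also want the names of the surviving factors, not just their count. Here is a minimal sketch of mapping the mask back to column names, assuming the table from the question has been loaded into a hypothetical pandas DataFrame df (with only four rows the p-values are purely illustrative):

import pandas as pd
from sklearn.feature_selection import f_regression

# hypothetical DataFrame built from the four rows shown in the question;
# PRICE is assumed to have been parsed from '$12.00' etc. into floats
df = pd.DataFrame({
    'PRICE':      [12.00, 24.99, 10.00, 22.99],
    'SALES':      [69, 12, 1, 46],
    'VIEWS':      [12048, 52858, 1381, 57841],
    'AVG_RATING': [5, 5, -1, 5],
    'VOTES':      [3, 1, 0, 8],
    'COMMENTS':   [26, 14, 0, 24],
})

factors = ['PRICE', 'VIEWS', 'AVG_RATING', 'VOTES', 'COMMENTS']
_, p_values = f_regression(df[factors].values, df['SALES'].values)

# rank the factors by p-value, most significant first
for name, p in sorted(zip(factors, p_values), key=lambda t: t[1]):
    print('%s: p = %.3f' % (name, p))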

To build a predictive model on top of this, we could use the popular GradientBoostingRegressor.

from sklearn.ensemble import GradientBoostingRegressor

gbr = GradientBoostingRegressor(n_estimators=100)
# fit our model
gbr.fit(X_informative, y)
# generate predictions
gbr_preds = gbr.predict(X_informative)

# calculate the errors and plot them
gbr_error = y - gbr_preds

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.hist(y, label='y', alpha=0.5)
ax.hist(gbr_error, label='errors in predictions', alpha=0.4)
ax.legend(loc='best')
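
One caveat: the errors above are computed on the same rows the model was trained on, so the plot can flatter the fit. A quick sketch of checking generalization on held-out data with scikit-learn's train_test_split:

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# hold out 25% of the rows that the model never sees during fitting
X_train, X_test, y_train, y_test = train_test_split(
    X_informative, y, test_size=0.25, random_state=0)

gbr = GradientBoostingRegressor(n_estimators=100)
gbr.fit(X_train, y_train)

# R^2 on the held-out rows is a fairer measure of predictive power
print(gbr.score(X_test, y_test))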

(plot: overlaid histograms of y and of the prediction errors)

From the graph we can see the model did a pretty good job: most of the variation in 'Sales' has been captured by our model.
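
Finally, since the title asks for correlation pairs: pandas computes the full pairwise correlation matrix in one call, which is often the quickest first look at questions 1 and 2. A minimal sketch, reusing the hypothetical df built from the question's table above (four rows are far too few for reliable correlations; this just shows the mechanics):

# pairwise Pearson correlations between all numeric columns
corr = df.corr()

# how strongly each factor moves with SALES
print(corr['SALES'].drop('SALES').sort_values(ascending=False))

# every off-diagonal pair, ordered by correlation strength
pairs = corr.unstack()
pairs = pairs[pairs.index.get_level_values(0) != pairs.index.get_level_values(1)]
print(pairs.reindex(pairs.abs().sort_values(ascending=False).index))

And for question 3 (the sales-maximizing price), one simple sketch once a model is trained on the real factors: scan a grid of candidate prices, hold the other factors fixed, and predict sales at each point. The grid bounds and the median-fixing below are assumptions, not part of the original answer.

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

model = GradientBoostingRegressor(n_estimators=100)
model.fit(df[factors].values, df['SALES'].values)

# vary PRICE over a grid while holding the other factors at their medians
grid = np.linspace(df['PRICE'].min(), df['PRICE'].max(), 50)
candidates = pd.DataFrame([df[factors].median()] * len(grid))
candidates['PRICE'] = grid

predicted = model.predict(candidates[factors].values)
print('price maximizing predicted sales: $%.2f' % grid[predicted.argmax()])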