I posted this question on the stats Stack Exchange but unfortunately have had no answer so far, so I'm reposting it here in the hope that someone can help.
I'm a newbie in machine learning. I've recently been trying to learn about it and ran into the following problem:
I have products grouped into categories. I also have users with gender and device model information.
First, I ran a chi-square test to check whether category and gender + device are associated. For example, my p-value is 0.000012, so I concluded that the user profile (gender + device) is associated with category.
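To make the setup concrete, here is a minimal sketch of a chi-square test of independence on a contingency table of user segments against categories. The segment names, category names, and all counts are made up purely for illustration, and `chi_square_independence` is a hypothetical helper, not part of any library. It only computes the test statistic and the degrees of freedom; a library routine such as `scipy.stats.chi2_contingency` would also return the p-value.

```python
# Rows: hypothetical user segments (gender + device).
# Columns: product categories. All counts are invented for illustration.
segments = ["Female/iPhone", "Female/Android", "Male/iPhone", "Male/Android"]
categories = ["Fashion", "Mobile devices", "Bikes"]
observed = [
    [120, 60, 20],
    [90, 70, 40],
    [40, 80, 80],
    [30, 90, 80],
]


def chi_square_independence(table):
    """Return (statistic, degrees_of_freedom) for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            # Count expected in this cell if segment and category were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (count - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


statistic, dof = chi_square_independence(observed)
print(f"chi2 = {statistic:.2f}, dof = {dof}")
```

A large statistic relative to the degrees of freedom only says that some association exists somewhere in the table; it does not say which categories to show to which segment.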
So if a new user comes in with a gender (Female) and a device (iPhone):
According to the chi-square test result, there should be an association between gender + device and category. So I select the top 10 categories consumed by females who use an iPhone. I get a list, e.g. [1. Fashion, 2. Mobile devices, 3. Cameras, 4. Home furniture, 5. Bikes, etc.]
I also ran a z-test on the categories (without any user information) and got another list (higher z-scores on top), e.g. [1. Mobile devices, 2. Bikes, 3. Fashion, 4. Laptops, etc.]
So in this case, which list should I give to that user? Is there any way to combine them? Or did I do something wrong?
Thanks in advance :-)
Strictly speaking, neither test is appropriate. In both tests you have a null hypothesis (that gender and model are not related to category), and you are checking whether the data give enough evidence to reject it. However, these two tests are parametric tests; that is, for the results to be correct, you have to know that the underlying distribution has a specific form (chi-square and normal, respectively). In your case you can make no such assumption, so the tests are not suitable. If you want to use significance tests, you should use non-parametric tests, the Wilcoxon and Friedman tests being the most common. However, significance tests are usually used after the problem has been solved, to check whether the results achieved can be attributed to chance. They are not used to solve the problem.
If you want to measure the correlation between gender, model, and category, you should use a correlation coefficient, such as the Pearson correlation or the intraclass correlation. However, you have not described your data in detail, so I'm not sure what you are trying to achieve. Based on gender and model only, probably the safest and simplest thing you can do is return the most visited categories (by number of occurrences) among women who use an iPhone.
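The frequency-based suggestion could be sketched roughly as follows. This assumes the data is available as a list of (gender, device, category) visit records; the `visits` rows are invented for illustration and `top_categories` is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical visit log: one (gender, device, category) tuple per visit.
# These rows are made up; real data would come from your own logs.
visits = [
    ("Female", "iPhone", "Fashion"),
    ("Female", "iPhone", "Fashion"),
    ("Female", "iPhone", "Fashion"),
    ("Female", "iPhone", "Mobile devices"),
    ("Female", "iPhone", "Mobile devices"),
    ("Female", "iPhone", "Cameras"),
    ("Female", "Android", "Fashion"),
    ("Male", "Android", "Bikes"),
    ("Male", "Android", "Bikes"),
]


def top_categories(visits, gender, device, n=10):
    """Return up to n categories, most visited first, for one user segment."""
    counts = Counter(
        category
        for g, d, category in visits
        if g == gender and d == device
    )
    return [category for category, _ in counts.most_common(n)]


print(top_categories(visits, "Female", "iPhone"))
```

If a segment has no visits at all, the function returns an empty list, so in practice you would need to decide what to show in that case.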