I have a dataframe of the type:
    userId | distrib1 | distrib2 | distrib3
    ________________________________________
    125      21.2       20.6       1.1
    143      19.7       16.2       3.2
    426      23.5       22.1       9.4
    ...
I want to find and compute a similarity measure between the columns distrib1, distrib2 and distrib3. I would provide more detail or working code here, but I don't know where to start.
I know there exist distance metrics for probability distributions, but I don't know how to apply them to pandas columns.
One thing that would be useful is to split these values into buckets and compare the overlap of the buckets between two of the columns.
For that, I would first need to count the number of users taking values in the interval [0,5] according to distrib1 and then in the same interval according to distrib2, then move on to the interval [5, 10] and do the same. Is there a simpler way of doing this?
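
To make the bucketing concrete, here is a rough sketch of what I have in mind, not a definitive approach. It assumes buckets of width 5 covering 0 to 25 and uses histogram intersection as the overlap score. Both are my own choices for illustration, and the helper names `bucket_counts` and `overlap` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data copied from the table above; the real frame has many more rows.
df = pd.DataFrame({
    "userId": [125, 143, 426],
    "distrib1": [21.2, 19.7, 23.5],
    "distrib2": [20.6, 16.2, 22.1],
    "distrib3": [1.1, 3.2, 9.4],
})

# Assumed bucket edges: width 5, from 0 to 25 (an illustrative choice).
edges = np.arange(0, 30, 5)  # 0, 5, 10, 15, 20, 25


def bucket_counts(frame, columns, edges):
    """Count the rows falling into each bucket, one column of counts per input column.

    np.histogram uses half-open buckets [lo, hi), except the last one,
    which also includes its right edge. Values outside the edges are ignored.
    """
    data = {col: np.histogram(frame[col], bins=edges)[0] for col in columns}
    return pd.DataFrame(data, index=pd.Index(edges[:-1], name="bucket_start"))


def overlap(counts, a, b):
    """Histogram intersection of columns a and b, as a share of the rows counted for a."""
    return np.minimum(counts[a], counts[b]).sum() / counts[a].sum()


counts = bucket_counts(df, ["distrib1", "distrib2", "distrib3"], edges)
print(counts)
print(overlap(counts, "distrib1", "distrib2"))
print(overlap(counts, "distrib1", "distrib3"))
```

On the three sample rows this gives an overlap of 1.0 between distrib1 and distrib2 (same bucket counts) and 0.0 between distrib1 and distrib3 (no shared buckets), but I am not sure this is a sensible similarity measure in general.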