I have a dataframe of the type:

userId | distrib1 | distrib2 | distrib3
125    | 21.2     | 20.6     | 1.1
143    | 19.7     | 16.2     | 3.2
426    | 23.5     | 22.1     | 9.4

I want to compute a similarity measure between the columns distrib1, distrib2 and distrib3. I would include more detail or working code here, but I don't know where to start. I know that distance metrics for probability distributions exist, but I don't know how to apply them to pandas columns.
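For illustration, here is a minimal sketch of how one such metric might be applied to every pair of columns. It assumes SciPy is available and that each column can be treated as an empirical sample of values, so the per-user pairing is ignored. The choice of `wasserstein_distance` is only an example; the names `cols` and `dist` are made up for this sketch.

```python
from itertools import combinations

import pandas as pd
from scipy.stats import wasserstein_distance

# Sample data copied from the table in the question.
df = pd.DataFrame({
    "userId": [125, 143, 426],
    "distrib1": [21.2, 19.7, 23.5],
    "distrib2": [20.6, 16.2, 22.1],
    "distrib3": [1.1, 3.2, 9.4],
})

cols = ["distrib1", "distrib2", "distrib3"]

# Assumption: each column is an unweighted sample from some distribution.
# A smaller distance means the two columns are distributed more similarly.
dist = {
    (a, b): wasserstein_distance(df[a], df[b])
    for a, b in combinations(cols, 2)
}

for pair, value in dist.items():
    print(pair, round(value, 3))
```

Other two-sample measures, such as `scipy.stats.ks_2samp`, could be swapped in the same way, since they also take two arrays of values.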

One thing that would be useful is to split these values into buckets and compare the overlap of the buckets between two of the columns.

To do that, I would first need to count the number of users whose values fall in the interval [0, 5] according to distrib1, then in the same interval according to distrib2, then move on to the interval [5, 10] and do the same. Is there a simpler way of doing this?
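The bucket counting described above might be sketched as follows. The bucket width of 5 and the range 0 to 25 are assumptions chosen to fit the sample rows, and the `overlap` helper (histogram intersection) is just one hypothetical way to score how much two columns share buckets.

```python
import numpy as np
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    "userId": [125, 143, 426],
    "distrib1": [21.2, 19.7, 23.5],
    "distrib2": [20.6, 16.2, 22.1],
    "distrib3": [1.1, 3.2, 9.4],
})

cols = ["distrib1", "distrib2", "distrib3"]

# Assumed bucket edges: 0, 5, 10, 15, 20, 25. Adjust to the real data range.
edges = np.arange(0, 30, 5)
labels = [f"{lo}-{hi}" for lo, hi in zip(edges[:-1], edges[1:])]

# np.histogram counts the values in each bucket in one call per column.
# Buckets are half-open [lo, hi), except the last one, which includes 25.
counts = pd.DataFrame(
    {c: np.histogram(df[c], bins=edges)[0] for c in cols},
    index=labels,
)
print(counts)


def overlap(a, b):
    """Histogram intersection of two columns: 1.0 = identical buckets, 0.0 = disjoint."""
    pa = counts[a] / counts[a].sum()
    pb = counts[b] / counts[b].sum()
    return float(np.minimum(pa, pb).sum())


print(overlap("distrib1", "distrib2"))
print(overlap("distrib1", "distrib3"))
```

Using shared edges for all columns keeps the buckets comparable, and half-open buckets avoid counting a value of exactly 5 in two intervals.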
