I have a large dataset of clusters, each with a value for a parameter. Multiple clusters can have the same value.
I want to make a cumulative percent frequency distribution plot, with the cumulative percentage of the number of clusters on the y-axis and the parameter values (which range from 0 to 1) on the x-axis.
I have sorted the data by value, but I am not sure how to process it from there to get the cumulative plot using R (ecdf) or matplotlib. How can I approach this? Any help would be greatly appreciated.
My data looks like this:
Cluster_20637 0.020
Cluster_20919 0.020
Cluster_9642 0.147
Cluster_10141 0.148
Cluster_21451 0.148
Cluster_30198 0.148
Cluster_55982 0.498
Cluster_10883 0.500
Cluster_16641 0.500
Cluster_20143 0.500
Cluster_57942 0.867
Cluster_32878 0.868
Cluster_26249 0.870
Cluster_46928 0.870
Cluster_41908 0.871
Cluster_28603 0.872
Cluster_1419 0.873
Here's a dump of the data as a data.frame called test:
Which looks like:
Generate a cumulative percentage variable
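For readers working in Python rather than R, this step could be sketched roughly as follows. It is only an illustration: the function name cumulative_percent and the five-row sample (taken from the question's data) are assumptions, and the cumulative percentage of row i in sorted order is taken to be 100 * i / n.

```python
# A few (cluster, value) rows copied from the question's data.
rows = [
    ("Cluster_20637", 0.020),
    ("Cluster_20919", 0.020),
    ("Cluster_9642", 0.147),
    ("Cluster_10141", 0.148),
    ("Cluster_55982", 0.498),
]


def cumulative_percent(rows):
    """Return (cluster, value, cumulative %) tuples, sorted by value.

    Hypothetical helper: the i-th row (1-based) in sorted order gets
    100 * i / n, so the last row always reaches 100%.
    """
    ordered = sorted(rows, key=lambda r: r[1])
    n = len(ordered)
    return [
        (name, value, 100.0 * (i + 1) / n)
        for i, (name, value) in enumerate(ordered)
    ]


result = cumulative_percent(rows)
for name, value, pct in result:
    print(f"{name}\t{value:.3f}\t{pct:.1f}")
```

With this definition, clusters that share a value get different percentages (one per row), which is what produces the vertical jumps in a step plot.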
Then plot the data
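A matplotlib version of the plotting step might look like the sketch below. The names ecdf_points and plot_cumulative, the axis labels, and the output file name are all assumptions made for illustration; the values are the first eight from the question.

```python
def ecdf_points(values):
    """Pair each sorted value with the cumulative % of observations <= it."""
    xs = sorted(values)
    n = len(xs)
    ys = [100.0 * (i + 1) / n for i in range(n)]
    return xs, ys


def plot_cumulative(values, path="cumulative.png"):
    """Draw the cumulative percent curve as a step plot and save it to path."""
    # Imported inside the function so ecdf_points can be used without matplotlib.
    import matplotlib.pyplot as plt

    xs, ys = ecdf_points(values)
    fig, ax = plt.subplots()
    ax.step(xs, ys, where="post")  # hold each level until the next value
    ax.set_xlim(0, 1)              # the parameter ranges from 0 to 1
    ax.set_ylim(0, 100)
    ax.set_xlabel("Parameter value")
    ax.set_ylabel("Cumulative % of clusters")
    fig.savefig(path)
    return path


# First eight parameter values from the question's data.
values = [0.020, 0.020, 0.147, 0.148, 0.148, 0.148, 0.498, 0.500]
xs, ys = ecdf_points(values)
# plot_cumulative(values)  # uncomment to write cumulative.png
```

A step plot is used here because an empirical cumulative distribution is constant between observed values; a plain line plot would interpolate between them instead.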
Edit to address comment below:
Try this to group the clusters first:
Which gives this plot:
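The grouping step could be sketched in Python as follows, assuming that "group" means counting how many clusters share each distinct value before accumulating. The function name grouped_cumulative_percent is hypothetical; the values are the 17 listed in the question.

```python
from collections import Counter
from itertools import accumulate

# The 17 parameter values listed in the question.
values = [0.020, 0.020, 0.147, 0.148, 0.148, 0.148, 0.498,
          0.500, 0.500, 0.500, 0.867, 0.868, 0.870, 0.870,
          0.871, 0.872, 0.873]


def grouped_cumulative_percent(values):
    """Return (value, cluster count, cumulative %) per distinct value."""
    counts = Counter(values)      # number of clusters sharing each value
    distinct = sorted(counts)     # distinct values in ascending order
    running = list(accumulate(counts[v] for v in distinct))
    total = running[-1]
    return [
        (v, counts[v], 100.0 * c / total)
        for v, c in zip(distinct, running)
    ]


table = grouped_cumulative_percent(values)
for value, count, pct in table:
    print(f"{value:.3f}\t{count}\t{pct:.1f}")
```

Grouping first yields one point per distinct value, so tied clusters contribute a single jump whose height is proportional to how many clusters share that value.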