My aim is to use a pivot table to check whether there is a link between the presence of one particular shop and the population density of the areas where these shops are found. For that, I have a CSV file with 600 examples of areas where the shop is or is not present. The file has 600 lines and two columns: 1/ a number representing the population density of the area, and 2/ the number of these shops in that area (0, 1 or 2).
To build the pivot table, I need to group the densities into 10 groups of 60 lines each (the first group holding the 60 highest densities, down to the last group with the 60 lowest). Then I'll easily be able to see how many shops are built depending on whether the density is low or high. Am I making sense (I hope)? :)
Nothing really difficult, I suppose. But there are so many ways (and packages) that could work for this that I'm a little bit lost.
My main issue: what is the simplest way to group my variable into ten groups of 60 lines each? I've tried cut()/cut2() and hist() without success, and I've heard about bin_var() and reshape(), but I don't understand how they could help in this case.
For example (as Justin asked), here is what I tried with cut():
data <- read.csv("data.csv", sep = ";")
groups <- cut(as.numeric(data$densit_pop2), breaks=10)
summary(groups)
(0.492,51.4] (51.4,102] (102,153] (153,204] (204,255] (255,306]
53 53 52 52 52 54
(306,357] (357,408] (408,459] (459,510]
52 59 53 54
OK, good: 'groups' does contain 10 groups with almost the same number of lines. But some of the values shown in the intervals make no sense to me. Here are the first lines of the density column (sorted in increasing order):
> head(data$densit_pop2)
[1] 14,9 16,7 17,3 18,3 20,2 20,5
509 Levels: 100 1013,2 102,4 102,6 10328 103,6 10375 10396,8 104,2 ... 99,9
I mean, look at the first group. Why 0.492 when 14.9 is my smallest value? And if I manually count the lines between the first one and the value 51.4, I find 76. Why does it say 53 lines? Note that the data frame is correctly sorted from lowest to highest.
I'm certainly missing something... but what?
I think you'll be happy with cut2() once you have a numeric variable to work with. Right now densit_pop2 is a factor (note the "509 Levels: ..." line), because the commas used as decimal separators keep R from reading the column as numeric; calling as.numeric() on a factor then returns the internal level codes rather than the actual densities, which is why the intervals start at 0.492 and the counts don't match what you count by hand. When commas are your decimal separator, use read.csv2 or use the argument dec = "," when reading in the dataset. You can do the same thing with cut, but you would need to set your breaks at the appropriate quantiles to get equal-sized groups, which takes a bit more work.
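A minimal sketch of both approaches, assuming the file is semicolon-separated with comma decimals, and assuming the shop-count column is called "shops" (the question only names densit_pop2, so adjust that name to your real one):
library(Hmisc)  # for cut2()
# read.csv2() defaults to sep = ";" and dec = ",", so densit_pop2 comes in as numeric
data <- read.csv2("data.csv")
# 10 groups of (roughly) equal size:
data$dens_group <- cut2(data$densit_pop2, g = 10)
table(data$dens_group)
# Base-R alternative: compute decile breaks yourself and pass them to cut()
breaks <- quantile(data$densit_pop2, probs = seq(0, 1, 0.1))
data$dens_group2 <- cut(data$densit_pop2, breaks = breaks, include.lowest = TRUE)
# Cross-tabulate density group against shop count ("shops" is an assumed column name)
table(data$dens_group, data$shops)
With 600 rows, cut2(..., g = 10) gives groups of about 60 rows each (exactly 60 only if there are no ties at the decile boundaries), and the final table() call is essentially the pivot table described in the question.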