Group a continuous variable in R


My aim is to compare, in a pivot table, whether there is a link between the presence of one particular shop and the population density of the areas where these shops can be found. For that, I have a CSV file with 600 examples of areas where the shop is or is not present. It is a file with 600 lines and two columns: 1/ a number representing the population density of the area, and 2/ the quantity of this particular shop in that area (0, 1 or 2).

In order to build the pivot table, I need to group the densities into 10 groups of 60 lines each (the first group with the 60 highest densities, down to the last group with the 60 lowest densities). Then I will easily be able to see how many shops are built where the density is low or high. Is that understandable (I hope)? :)
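To make it more concrete, here is roughly the result I am trying to end up with (a rough sketch only; densit_pop2 is my density column, shops is just a stand-in name for my shop column, and I assume the density column is already numeric):

data <- data[order(data$densit_pop2, decreasing = TRUE), ]  # sort areas from densest to least dense
data$group <- rep(1:10, each = 60)                          # label 10 groups of 60 lines each
table(data$group, data$shops)                               # shop counts per density group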

Nothing really difficult, I suppose. But there are so many ways (and packages) that could work for this... that I'm a little bit lost.

My main issue: what is the simplest way to group my variable into ten groups of 60 lines each? I've tried cut()/cut2() and hist() without success, and I've heard about bin_var() and reshape(), but I don't understand how they could help in this case.


For example (as Justin asked), with cut():

data <- read.csv("data.csv", sep = ";")                  # semicolon-separated file
groups <- cut(as.numeric(data$densit_pop2), breaks=10)   # cut into 10 equal-width intervals
summary(groups)
(0.492,51.4]   (51.4,102]    (102,153]    (153,204]    (204,255]    (255,306] 
      53           53           52           52           52           54 
(306,357]    (357,408]    (408,459]    (459,510] 
      52           59           53           54 

OK, good: 'groups' does indeed contain 10 groups with roughly the same number of lines. But some of the values shown in the intervals make no sense to me. Here are the first lines of the density column (sorted in increasing order):

> head(data$densit_pop2)
[1] 14,9 16,7 17,3 18,3 20,2 20,5
509 Levels: 100 1013,2 102,4 102,6 10328 103,6 10375 10396,8 104,2 ... 99,9

I mean, look at the first group. Why 0.492 when 14.9 is my smallest value? And if I count manually how many lines there are between the first one and the value 51.4, I find 76. Why does it say 53 lines? To be clear, the data frame is correctly sorted from lowest to highest.

I am certainly missing something... but what?


There are 2 answers

aosmith (Best Answer)

I think you'll be happy with cut2 once you have a numeric variable to work with. When using commas as your decimal separator, use read.csv2 or use the argument dec = "," when reading in a dataset.
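For example, something like this for the reading step (a sketch using the file and column names from your question; read.csv2 defaults to sep = ";" and dec = ","):

data <- read.csv2("data.csv")                        # sep = ";" and dec = "," by default
# or, equivalently, spell it out with read.csv:
data <- read.csv("data.csv", sep = ";", dec = ",")
str(data$densit_pop2)                                # should now be numeric, not a factor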

y = runif(600, 14.9, 10396.8)   # simulated densities spanning the question's range

require(Hmisc)
summary(cut2(y, m = 60))        # cut2() makes groups of (at least) m = 60 observations

You can do the same thing with cut, but you would need to set your breaks at the appropriate quantiles to get equal groups, which takes a bit more work.

summary(cut(y, breaks = quantile(y, probs = seq(0, 1, 1/10)), include.lowest = TRUE))
IRTFM

Responding to your data: you need to correct errors in data entry:

# swap the decimal comma for a dot, then convert to numeric
data$densit_pop3 <- as.numeric( 
                       sub('\\,', '.',  
                       as.character(data$densit_pop2)))

Then, something along these lines (assuming this is not really a question about loading data from text files):

with(dfrm, by(dens, factor(shops), summary) )   # summary of density for each shop count

As an example of the output one might get:

with(BNP, by( proBNP.A, Sex, summary))

Sex: Female
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
    5.0    55.7   103.6   167.9   193.6  5488.0 3094899 
--------------------------------------------------------------------- 
Sex: Male
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
      5      30      63     133     129    5651 4013760 

If you are trying to plot this to look at the density of densities (which in this case seems like a reasonable request) then try this:

require(lattice)
densityplot(~ dens | factor(shops), data = dfrm)   # one density panel per shop count

(And please do stop calling these "pivot tables". That is an aggregation strategy from Excel and one should really learn to describe the desired output in standard statistical or mathematical jargon.)