Using equivalents of Countifs & Lookups to summarise data in R

191 views Asked by At

I have a file of transactional data (trips between locations) that I am looking to summarise using R, which I am relatively new to. Sample Data

    Start.Date          Start.Area         End.Date            End.Area
    2007-07-12 14:00    New Street         2007-07-12 15:46    Windy Lane
    2007-07-12 15:10    High Street        2007-07-12 18:08    New Street
    2007-07-12 16:42    Kings Street       2007-07-12 17:47    Windy Lane

My aim is to return the occurances for each day (potentially hour) for an area.

Sample Return, in a new data.frame, would be

    Date                Area               Start.Occurances   End.Occurances           
    2007-07-12          New Street         1                  1
    2007-07-12          High Street        1                  0
    2007-07-12          Kings Street       1                  0
    2007-07-12          Windy Lane         0                  2

Ideally I would have carried out the analysis in Excel but it can't cope with the scale of my data. In a spreadsheet I would use countif functions to see how many times the area appears in a given date/time.

I also hope to incorporate days when both Start.Occurances and End.Occurances are zero if possible.

The questions I have seen already regarding Countif functions or Match/Index combination haven't addressed my query so I was hoping somebody out there could help me!

1

There are 1 answers

1
Rorschach On BEST ANSWER

This can be done by first reshaping and then summarising. Here is an example using reshape2 and dplyr (data is dat).

## First reshape the data
library(reshape2)
m <- melt(dat, id.vars=c("Start.Date", "End.Date"), 
          value.name = "Area", variable.name="Area.Pos")

## Summarise, grouping by Area
library(dplyr)
m %>% group_by(Area) %>% 
  summarise(Start.Occurences = sum(Area.Pos == "Start.Area"),
            End.Occurences = sum(Area.Pos == "End.Area"))
#            Area Start.Occurences End.Occurences
# 1   High Street                1              0
# 2  Kings Street                1              0
# 3    New Street                1              1
# 4    Windy Lane                0              2

Another way: stack 'Start.Date' and 'Start.Area' columns on top of the corresponding 'End' columns, renaming the columns to 'Date' and 'Area' with a new column 'Pos' that specifies if it is a 'Start' or 'End'. Then it is easy to summarise by grouping Area, Date, or both.

m <- rbind(`names<-`(dat[,grep("Start", names(dat))], c("Date", "Area")),
           `names<-`(dat[,grep("End", names(dat))], c("Date", "Area")))
m$Pos <- rep(c("Start", "End"), each=nrow(dat))

m %>% group_by(as.Date(Date), Area) %>%
  summarise(Start.Occurences = sum(Pos == "Start"),
            End.Occurences = sum(Pos == "End"))
  as.Date(Date)          Area Start.Occurences End.Occurences
# 1    2007-07-12   High Street                1              0
# 2    2007-07-12  Kings Street                1              0
# 3    2007-07-12    New Street                1              1
# 4    2007-07-12    Windy Lane                0              2