Transforming irregular data into usable format in R

148 views Asked by At

Suppose I work in a company that provides a number of different services to their customers. I have been given a data report on the services provided that I need to analyse. The report is formatted in a way to be easily read and printed, but not suitable for data analysis.

The format of the report is as follows:

Input:

customer <- c(1,2,2,3,3,3)
service1 <- c(1,3,5,1,3,5)
fee1 <- c(100,290,500,100,300,500)
service2 <- c("",4,"",2,4,8)
fee2 <- c("",400,"",200,390,800)

require(data.table)
DT <- data.table(customer, service1, fee1, service2, fee2)

Which prints to:

> DT
   customer service1 fee1 service2 fee2
1:        1        1  100              
2:        2        3  290        4  400
3:        2        5  500              
4:        3        1  100        2  200
5:        3        3  300        4  390
6:        3        5  500        8  800

There are a number of customers and for each there are a number of services they have consumed and the corresponding fees. The services and fees are printed horizontally in four columns and then overflow into a new line. There can be any number of services for each customer, but each service can only occur once for each customer and the fee for a service can differ for each customer. They are probably always printed in the same order, though the solution should not rely on this.

The task is to transform the data into a useable format. I see two different ways to do this.

First option (long format): slice off the last two columns, create a new line for each costumer, and fill in the contents.

Option one would look like this:

    customer service fee
 1:        1       1 100
 2:        2       3 290
 3:        2       4 400
 4:        2       5 500
 5:        3       1 100
 6:        3       2 200
 7:        3       3 300
 8:        3       4 390
 9:        3       5 500
10:        3       8 800

Second option (wide format): slice off all rows but the first for each customer, create new columns for the sliced-off services and then transform the service labels into column headers (and make sure everything is in the right place).

Option two would look like this:

   customer service.1 service.2 service.3 service.4 service.5 service.6 service.7 service.8
1:        1       100                                                                      
2:        2                           290       400       500                              
3:        3                 200       300       390       500                           800

I can work with either format (and transformation between long and wide is fairly easy).

As a starting point, I figured that I would have to find either the number of services for each customer (option 1) or the number of unique services (option 2), expand the data table to the needed size and move around the data.

I feel like data.table should be able to handle this and would prefer a solution using this package due to its efficiency.

1

There are 1 answers

3
David Arenburg On BEST ANSWER

I don't see how this is solvable using melt, but you can use a simple rbind here, for example

res <- rbind(DT[, c(1,2:3)], 
             DT[, c(1,4:5)], 
                 use.names = FALSE)[service1 != ""]
res
#     customer service1 fee1
#  1:        1        1  100
#  2:        2        3  290
#  3:        2        5  500
#  4:        3        1  100
#  5:        3        3  300
#  6:        3        5  500
#  7:        2        4  400
#  8:        3        2  200
#  9:        3        4  390
# 10:        3        8  800

As per your second output, you could try something like

Range <- range(as.numeric(unlist(DT[, c(1, 4)])), na.rm = TRUE)
res[, service1 := factor(service1, levels = Range[1L]:Range[2L])]
dcast(res, customer ~ service1, drop = FALSE, fill = "", value.var = "fee1")
#    customer   1   2   3   4   5 6 7   8
# 1:        1 100                        
# 2:        2         290 400 500        
# 3:        3 100 200 300 390 500     800