Merging Daily and Periodic Data into one Dataframe

344 views Asked by At

I am trying to construct a panel-data data frame which consists of both periodic and 'continous' daily data which should be assigned to each other, so that each row of the new data frame has both, the period, the value for the periodic data and the value and day of one of the days within that period, the data looks similar to this:

> dailycds
         Date CDS
1  30-06-2015 194
2  01-07-2015 195
3  02-07-2015 198
4  03-07-2015 198
5  04-07-2015 199
6  30-06-2016 165
7  01-07-2016 172
8  02-07-2016 213
9  03-07-2016 123
10 04-07-2016 321


> periodicassets
  Period Assets
1 201506   1314
2 201606   2134

And ultimately, I would like it to look like this:

  > df
Period       Date Assets CDS
1 201506 30-06-2015   1314 194
2 201506 01-07-2015   1314 195
3 201506 02-07-2015   1314 198
4 201506 03-07-2015   1314 198
5 201606 30-06-2016   2134 165
6 201606 01-07-2016   2134 172
7 201606 02-07-2016   2134 213
8 201606 03-07-2016   2134 123

So basically, the idea would be to take certain ranges of rows from the daily data and assign (and merge) them to the periodic data. Unfortunately though, I cannot simply do that by extracting the mm-yyyy parts of the dates, as the period of 201506 also contains data on July up to the third, whereas the forth relates to no period and should be dropped, as each period should only contain a certain numbers of days (in this case 4).

Here is the code to great above sample data:

dailycds = data.frame(Date = c("30-06-2015", "01-07-2015", "02-07-2015","03-07-2015","04-07-2015","30-06-2016", "01-07-2016", "02-07-2016","03-07-2016","04-07-2016"),
                      CDS = c(194, 195, 198,198,199,165,172,213,123,321))
dailycds

periodicassets = data.frame(Period = c("201506", "201606"),
                            Assets = c("1314","2134"))
periodicassets

df = data.frame(Period = c("201506", "201506", "201506", "201506", "201606", "201606", "201606", "201606"),
                Date = c("30-06-2015", "01-07-2015", "02-07-2015","03-07-2015", "30-06-2016", "01-07-2016", "02-07-2016", "03-07-2016"),
                Assets = c("1314", "1314", "1314", "1314", "2134", "2134", "2134", "2134"),
                CDS = c(194, 195, 198, 198, 165, 172, 213, 123))

Context and Additional Complications

So as suggested in the given solutions, my previous example was quite specific and probably overly simplified. Thus, to get a little closer to my problem here is some additional context: Ultimately, the periodic data refers to end of month holdings of banks' assets, to which I want to assign the daily CDS data around the time (e.g.) 3 days before the end of the month and 6 days afterwards. So in the panel of course there are multiple banks, for each of which the (same) CDS data has to be assigned to its holdings. (E.g. if I have 2 banks for which I need 3 days before and 6 after the end of the month I have (3+1+6)*2 days.) As pointed out in the comments, I always refer to business/working days in my question, as my time series do not contain any holidays etc..

So, to do justice to the problem, here is a snippet from the original with only one period:

> periodicassets
            BankName Period     value 
  2             BPCE 201412 112189.50
  4  Credit Agricole 201412  81618.76

    Date                CDS
   <dttm>              <chr>
  1 2015-01-12             46.869
  2 2015-01-09 48.121000000000002
  3 2015-01-08 48.625999999999998
  4 2015-01-07 48.801000000000002
  5 2015-01-06 48.633000000000003
  6 2015-01-05 46.670999999999999
  7 2015-01-02 45.158000000000001
  8 2015-01-01              47.32
  9 2014-12-31 47.658000000000001
 10 2014-12-30 45.843000000000004
 11 2014-12-29 47.588999999999999
 12 2014-12-26 47.625999999999998
 13 2014-12-25 47.697000000000003
 14 2014-12-24 47.414999999999999
 15 2014-12-23 48.075000000000003
 16 2014-12-22 48.085999999999999
 17 2014-12-19 47.496000000000002
 18 2014-12-18 46.534999999999997
 19 2014-12-17 48.149000000000001

which can be accessed here: periodic assets, dailycds

When looking through the forum, I found a similar questions such as: create an index for aggregating daily data to match periodic data and create an index for aggregating daily data to match periodic data, however, while the first one tries to aggregate the data, the second one is already blessed with the format I want to have (in object xtime).

2

There are 2 answers

4
Uwe On BEST ANSWER

The key issue of this question is how Period is mapped to Date. From OP's decription I have understood that each period includes the last day of the actual month plus the first three days into the next month, 4 days in total.

This can be solved with some date arithmetic and a right join:

library(data.table)
result <- 
  # coerce to data.table
  setDT(dailycds)[
    # compute period by subtracting 3 days of date
    , Period := format(as.IDate(Date, "%d-%m-%Y") - 3L, "%Y%m")][
      # right join, dropping all rows from dailycds without matching period
      periodicassets, on = "Period"][
        # change column order to be in line with expected result df
      , setcolorder(.SD, names(df))]
result
   Period       Date Assets CDS
1: 201506 30-06-2015   1314 194
2: 201506 01-07-2015   1314 195
3: 201506 02-07-2015   1314 198
4: 201506 03-07-2015   1314 198
5: 201606 30-06-2016   2134 165
6: 201606 01-07-2016   2134 172
7: 201606 02-07-2016   2134 213
8: 201606 03-07-2016   2134 123

There are only 4 rows per period as requested and the result is in line with the expected result df:

all.equal(df, as.data.frame(result[, lapply(.SD, forcats::fct_drop)]))
[1] TRUE

Unused levels have to be dropped to pass the strict checks of all.equal()

Caveat

The code has been tested to work with the sample data provided. In case of continuous daily as well as periodic data it might be necessary to add code to remove days which do not belong to the 4 days period.


Edit: More realistic sample data

The OP has updated his question and is providing more realistic sample data via dropbox. Now, dailycds contains daily data (except for weekends). As already mentioned in the Caveat above, this requires dailycds to be filtered for the relevant days.

The OP isn't clear how the days to be considered before and after the turn of the month are defined. Here, we assume that 3 days before the end of the month and 6 days afterwards refers to calendar days and not business days.

# define day range of interest relativ to turn of the month
days_before <- 3L
days_after  <- 6L
stopifnot(days_before + days_after < 28)

# read data from dropbox links, note ?dl=1 
dailycds <- readRDS(url("https://www.dropbox.com/s/r7v5dq6la0mnn71/dailycds.RDS?dl=1"))
periodicassets <-
  readRDS(url("https://www.dropbox.com/s/gdflcngwp8nm552/periodicassets.RDS?dl=1"))

library(data.table)
# coerce to data.table
setDT(dailycds)[
  # filter calendar dates
  mday(Date) <= days_after | mday(Date) > lubridate::days_in_month(Date) - days_before][
    # compute period by shifting dates from next month into actual month
    # coersion to IDate is required because Date is of class POSIXct 
    , Period := format(as.IDate(Date) - days_after, "%Y%m")][
      # right join, dropping all rows from dailycds without matching period
      setDT(periodicassets), on = "Period"][]
          Date                CDS Period         BankName     value
 1: 2015-01-06 48.633000000000003 201412             BPCE 112189.50
 2: 2015-01-05 46.670999999999999 201412             BPCE 112189.50
 3: 2015-01-02 45.158000000000001 201412             BPCE 112189.50
 4: 2015-01-01              47.32 201412             BPCE 112189.50
 5: 2014-12-31 47.658000000000001 201412             BPCE 112189.50
 6: 2014-12-30 45.843000000000004 201412             BPCE 112189.50
 7: 2014-12-29 47.588999999999999 201412             BPCE 112189.50
 8: 2015-02-06 47.265000000000001 201501             BPCE 103142.06
 9: 2015-02-05 47.073999999999998 201501             BPCE 103142.06
10: 2015-02-04 46.634999999999998 201501             BPCE 103142.06
11: 2015-02-03 46.405000000000001 201501             BPCE 103142.06
12: 2015-02-02             47.567 201501             BPCE 103142.06
13: 2015-01-30 47.396000000000001 201501             BPCE 103142.06
14: 2015-01-29 48.448999999999998 201501             BPCE 103142.06
15: 2015-01-06 48.633000000000003 201412  Credit Agricole  81618.76
16: 2015-01-05 46.670999999999999 201412  Credit Agricole  81618.76
...
26: 2015-02-02             47.567 201501  Credit Agricole  73987.36
27: 2015-01-30 47.396000000000001 201501  Credit Agricole  73987.36
28: 2015-01-29 48.448999999999998 201501  Credit Agricole  73987.36
          Date                CDS Period         BankName     value

Edit 2: Using business days instead of calendar dates.

The OP has clarified that he is using buiness days instead of calendar days. This seemingly minor change of the specification has a severe impact on the way the dates to be included are selected.

Now, always the first 6 entries in each month are picked as well as the last 3 entries before the last trading day of the month (ultimo) and the ultimo itself which results in 3 + 1 + 6 = 10 business days to pick.

# define range of business days relative to the last trading day (ultimo)
days_before <- 3L
days_after  <- 6L
stopifnot(days_before + days_after < 28)

library(data.table)
# read data from dropbox links, note ?dl=1 
dailycds <- readRDS(url("https://www.dropbox.com/s/r7v5dq6la0mnn71/dailycds.RDS?dl=1"))
periodicassets <- readRDS(url("https://www.dropbox.com/s/gdflcngwp8nm552/periodicassets.RDS?dl=1"))
# coerce to data.table
setDT(dailycds)[
  # filter business dates: 
  # for each month pick the first days_after business days into the month 
  # and the last days_before biz days before and including ultimo
  dailycds[, c(head(.I, days_after), tail(.I, days_before + 1L)), 
           by = .(year(Date), month(Date))]$V1][
    # compute period by shifting dates from next month into actual month
    # coersion to IDate is required because Date is of class POSIXct 
    , Period := format(as.IDate(Date) - days_after, "%Y%m")][
      # right join, dropping all rows from dailycds without matching period
      setDT(periodicassets), on = "Period"][]
          Date                CDS Period         BankName     value
 1: 2015-01-06 48.633000000000003 201412             BPCE 112189.50
 2: 2015-01-05 46.670999999999999 201412             BPCE 112189.50
 3: 2015-01-02 45.158000000000001 201412             BPCE 112189.50
 4: 2015-01-01              47.32 201412             BPCE 112189.50
 5: 2014-12-31 47.658000000000001 201412             BPCE 112189.50
 6: 2014-12-30 45.843000000000004 201412             BPCE 112189.50
 7: 2014-12-29 47.588999999999999 201412             BPCE 112189.50
 8: 2014-12-26 47.625999999999998 201412             BPCE 112189.50
 9: 2014-12-25 47.697000000000003 201412             BPCE 112189.50
10: 2014-12-24 47.414999999999999 201412             BPCE 112189.50
11: 2015-02-05 47.073999999999998 201501             BPCE 103142.06
12: 2015-02-04 46.634999999999998 201501             BPCE 103142.06
13: 2015-02-03 46.405000000000001 201501             BPCE 103142.06
14: 2015-02-02             47.567 201501             BPCE 103142.06
15: 2015-01-30 47.396000000000001 201501             BPCE 103142.06
16: 2015-01-29 48.448999999999998 201501             BPCE 103142.06
17: 2015-01-28             49.442 201501             BPCE 103142.06
18: 2015-01-27 49.502000000000002 201501             BPCE 103142.06
19: 2015-01-26              49.73 201501             BPCE 103142.06
20: 2015-01-23 50.917000000000002 201501             BPCE 103142.06
21: 2015-01-06 48.633000000000003 201412  Credit Agricole  81618.76
22: 2015-01-05 46.670999999999999 201412  Credit Agricole  81618.76
...
39: 2015-01-26              49.73 201501  Credit Agricole  73987.36
40: 2015-01-23 50.917000000000002 201501  Credit Agricole  73987.36
          Date                CDS Period         BankName     value

Note that the result data set contains (3 + 1 + 6) * 2 months * 2 banks = 40 rows.

Data from dropbox

In case the dropbox links break:

dailycds <- 
structure(list(Date = structure(c(1424649600, 1424390400, 1424304000, 
1424217600, 1424131200, 1424044800, 1423785600, 1423699200, 1423612800, 
1423526400, 1423440000, 1423180800, 1423094400, 1423008000, 1422921600, 
1422835200, 1422576000, 1422489600, 1422403200, 1422316800, 1422230400, 
1421971200, 1421884800, 1421798400, 1421712000, 1421625600, 1421366400, 
1421280000, 1421193600, 1421107200, 1421020800, 1420761600, 1420675200, 
1420588800, 1420502400, 1420416000, 1420156800, 1420070400, 1419984000, 
1419897600, 1419811200, 1419552000, 1419465600, 1419379200, 1419292800, 
1419206400, 1418947200, 1418860800, 1418774400, 1418688000, 1418601600, 
1418342400, 1418256000, 1418169600, 1418083200, 1417996800, 1417737600, 
1417651200, 1417564800, 1417478400, 1417392000, 1417132800, 1417046400, 
1416960000, 1416873600, 1416787200, 1416528000, 1416441600, 1416355200, 
1416268800, 1416182400, 1415923200, 1415836800, 1415750400, 1415664000, 
1415577600, 1415318400, 1415232000, 1415145600, 1415059200, 1414972800
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), CDS = c("44.259", 
"44.555999999999997", "45.076999999999998", "44.951000000000001", 
"45.762", "45.573", "45.634999999999998", "45.956000000000003", 
"47.064", "47.51", "48.576999999999998", "47.265000000000001", 
"47.073999999999998", "46.634999999999998", "46.405000000000001", 
"47.567", "47.396000000000001", "48.448999999999998", "49.442", 
"49.502000000000002", "49.73", "50.917000000000002", "51.37", 
"52.536999999999999", "49.188000000000002", "47.893999999999998", 
"46.728000000000002", "46.634999999999998", "46.366999999999997", 
"47.012999999999998", "46.869", "48.121000000000002", "48.625999999999998", 
"48.801000000000002", "48.633000000000003", "46.670999999999999", 
"45.158000000000001", "47.32", "47.658000000000001", "45.843000000000004", 
"47.588999999999999", "47.625999999999998", "47.697000000000003", 
"47.414999999999999", "48.075000000000003", "48.085999999999999", 
"47.496000000000002", "46.534999999999997", "48.149000000000001", 
"49.421999999999997", "48.223999999999997", "47.100999999999999", 
"47.484999999999999", "47.491999999999997", "47.052", "46.697000000000003", 
"44.670999999999999", "47.706000000000003", "46.835000000000001", 
"48.66", "46.841999999999999", "48.069000000000003", "49.49", 
"50.155000000000001", "50.155000000000001", "50.49", "52.024000000000001", 
"50.33", "50", "50.67", "53.15", "52.994999999999997", "55.31", 
"50.82", "50.49", "50.832999999999998", "52.241", "51.97", "52.8", 
"50.667000000000002", "51.134999999999998")), .Names = c("Date", 
"CDS"), row.names = c(NA, -81L), class = c("tbl_df", "tbl", "data.frame"))

periodicassets <- 
structure(list(BankName = c(" BPCE", " BPCE", " Credit Agricole", 
" Credit Agricole"), Period = c("201412", "201501", "201412", 
"201501"), value = c(112189.50293406, 103142.064337463, 81618.762099507, 
73987.36251389)), .Names = c("BankName", "Period", "value"), row.names = c(10L, 
11L, 18L, 19L), class = "data.frame")
7
cephalopod On

See if this works for you

library(lubridate)
library(dplyr)
library(tidyr)

periodicassets <- periodicassets %>%
        mutate(Date = ymd(paste(Period, "01", sep = ""))) %>%
        select(-Period)


dailycds$Date <- dmy(dailycds$Date)

full_join(dailycds, periodicassets) %>% 
        arrange(Date) %>% fill(Assets, .direction = "down") %>%
        na.omit

Joining, by = "Date"

         Date CDS Assets
2  2015-06-30 194   1314
3  2015-07-01 195   1314
4  2015-07-02 198   1314
5  2015-07-03 198   1314
6  2015-07-04 199   1314
8  2016-06-30 165   2134
9  2016-07-01 172   2134
10 2016-07-02 213   2134
11 2016-07-03 123   2134
12 2016-07-04 321   2134