This is my first use of the nest/unnest functions in R, and I do not understand the result. I nest, immediately unnest, and then compare the before and after dataframes. Why are the dataframes not identical?
> library(tidyverse)
> concentration_original <- readRDS("./Data/concentration.Rds")
> print(concentration_original, n=15)
# A tibble: 12 x 5
SUBJID WT DOSE TIME CONC
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 79.6 4.02 0 0.74
2 1 79.6 4.02 0.25 2.84
3 1 79.6 4.02 0.570 6.57
4 1 79.6 4.02 1.12 10.5
5 1 79.6 4.02 2.02 9.66
6 1 79.6 4.02 3.82 8.58
7 2 72.4 4.4 0 0
8 2 72.4 4.4 0.27 1.72
9 2 72.4 4.4 0.52 7.91
10 2 72.4 4.4 1 8.31
11 2 72.4 4.4 1.92 8.33
12 2 72.4 4.4 3.5 6.85
>
> concentration_nested <- concentration_original %>% nest(data = c(TIME, CONC))
> concentration_nested
# A tibble: 2 x 4
SUBJID WT DOSE data
<dbl> <dbl> <dbl> <list>
1 1 79.6 4.02 <tibble [6 × 2]>
2 2 72.4 4.4 <tibble [6 × 2]>
>
> concentration_unnested <- unnest(concentration_nested, cols = c(data))
> print(concentration_unnested, n=15)
# A tibble: 12 x 5
SUBJID WT DOSE TIME CONC
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 79.6 4.02 0 0.74
2 1 79.6 4.02 0.25 2.84
3 1 79.6 4.02 0.570 6.57
4 1 79.6 4.02 1.12 10.5
5 1 79.6 4.02 2.02 9.66
6 1 79.6 4.02 3.82 8.58
7 2 72.4 4.4 0 0
8 2 72.4 4.4 0.27 1.72
9 2 72.4 4.4 0.52 7.91
10 2 72.4 4.4 1 8.31
11 2 72.4 4.4 1.92 8.33
12 2 72.4 4.4 3.5 6.85
>
> if (identical(concentration_unnested, concentration_original)) {
+ print("After nest/unnest, we have a dataframe which IS IDENTICAL to the original")
+ } else {
+ print("After nest/unnest, we have a dataframe which IS NOT IDENTICAL to the original")
+ }
[1] "After nest/unnest, we have a dataframe which IS NOT IDENTICAL to the original"
>
> all.equal(concentration_unnested, concentration_original)
[1] "Attributes: < Length mismatch: comparison on first 2 components >"
>
Note that I used all.equal, and its output suggests the problem may have something to do with attributes. If I use all_equal instead, the result is TRUE, but I am still stuck with the identical function saying the dataframes are not the same. Thanks for any help with this!
Added dput of original df and nested/unnested df.
> dput(concentration_original)
structure(list(SUBJID = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
WT = c(79.6, 79.6, 79.6, 79.6, 79.6, 79.6, 72.4, 72.4, 72.4,
72.4, 72.4, 72.4), DOSE = c(4.02, 4.02, 4.02, 4.02, 4.02,
4.02, 4.4, 4.4, 4.4, 4.4, 4.4, 4.4), TIME = c(0, 0.25, 0.57,
1.12, 2.02, 3.82, 0, 0.27, 0.52, 1, 1.92, 3.5), CONC = c(0.74,
2.84, 6.57, 10.5, 9.66, 8.58, 0, 1.72, 7.91, 8.31, 8.33,
6.85)), spec = structure(list(cols = list(SUBJID = structure(list(), class = c("collector_double",
"collector")), WT = structure(list(), class = c("collector_double",
"collector")), DOSE = structure(list(), class = c("collector_double",
"collector")), TIME = structure(list(), class = c("collector_double",
"collector")), CONC = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1), class = "col_spec"), row.names = c(NA,
-12L), class = c("tbl_df", "tbl", "data.frame"))
> dput(concentration_unnested)
structure(list(SUBJID = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
WT = c(79.6, 79.6, 79.6, 79.6, 79.6, 79.6, 72.4, 72.4, 72.4,
72.4, 72.4, 72.4), DOSE = c(4.02, 4.02, 4.02, 4.02, 4.02,
4.02, 4.4, 4.4, 4.4, 4.4, 4.4, 4.4), TIME = c(0, 0.25, 0.57,
1.12, 2.02, 3.82, 0, 0.27, 0.52, 1, 1.92, 3.5), CONC = c(0.74,
2.84, 6.57, 10.5, 9.66, 8.58, 0, 1.72, 7.91, 8.31, 8.33,
6.85)), row.names = c(NA, -12L), class = c("tbl_df", "tbl",
"data.frame"))
>
Additional information: I think I found the problem. The spec= info on the original tibble was attached when the tibble was created with read_csv. When the tibble goes through the nest/unnest transformation, the spec= info is discarded. Another thread, "Remove attributes from data read in readr::read_csv", mentions the spec= info becoming out of sync with the content of the tibble. In that case they suggest removing the spec= attribute:
attr(df, 'spec') <- NULL
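A minimal base R sketch of why removing the attribute helps. It uses a toy data frame rather than my real data, and the "spec" value here is only a plain string standing in for the col_spec object that read_csv attaches (both are assumptions for illustration):

```r
# Toy data frame; the "spec" value is a hypothetical stand-in for readr's col_spec.
with_spec <- data.frame(SUBJID = c(1, 1, 2, 2), TIME = c(0, 0.25, 0, 0.27))
without_spec <- with_spec
attr(with_spec, "spec") <- "stand-in for a col_spec object"

# identical() compares attributes too, so the extra one makes the objects differ.
identical(with_spec, without_spec)
#> [1] FALSE

# Ignoring attributes, the contents match.
all.equal(with_spec, without_spec, check.attributes = FALSE)
#> [1] TRUE

# Dropping the attribute makes the two objects identical again.
attr(with_spec, "spec") <- NULL
identical(with_spec, without_spec)
#> [1] TRUE
```

So identical is strict about attributes, while all.equal can be told to ignore them with check.attributes = FALSE.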
Based on what I was able to find, your original dataframe is not identical to the output because the original carries a spec attribute (an object of class col_spec) and the output does not. I compared the two dataframes using the new waldo package, from the tidyverse team.
It appears that you read in the data using readr, which attached the spec attribute of class col_spec to the resulting df. Nesting the original df removes this attribute. Therefore, when you unnest, the df's are not identical.
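A hedged sketch of how one might check this on a small example. The tibble below is made up to resemble the one in the question, and the "spec" attribute is a plain string standing in for a real col_spec object (both are assumptions). Whether the attribute survives the round trip may depend on the installed tidyr version, so the sketch only lists the attributes rather than asserting a result:

```r
library(tibble)
library(tidyr)

# Small tibble shaped like the one in the question.
original <- tibble(
  SUBJID = c(1, 1, 2, 2),
  WT     = c(79.6, 79.6, 72.4, 72.4),
  TIME   = c(0, 0.25, 0, 0.27),
  CONC   = c(0.74, 2.84, 0, 1.72)
)
# Hypothetical stand-in for the col_spec object that read_csv attaches.
attr(original, "spec") <- "stand-in for a col_spec object"

# Nest and immediately unnest, as in the question.
nested <- nest(original, data = c(TIME, CONC))
round_trip <- unnest(nested, cols = c(data))

# List the attribute names on each side to see whether "spec" survived.
names(attributes(original))
names(attributes(round_trip))

# waldo::compare() reports differences, attributes included, if waldo is installed.
if (requireNamespace("waldo", quietly = TRUE)) {
  print(waldo::compare(original, round_trip))
}
```

If "spec" shows up only in the first list, that attribute alone is enough for identical to return FALSE, even though every column value matches.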