Flatten an unbalanced (ragged) hierarchy

I have a .csv file that stores a ragged hierarchy vertically, one item per row. The indent_nbr column gives the level of each item in the hierarchy, with 0 being the top-level parent.

   item indent_nbr
1     A          0
2     B          1
3     C          2
4     D          3
5     E          4
6     F          4
7     G          4
8     H          5
9     I          5
10    J          5
11    K          5
12    L          4
13    M          5
14    N          5
15    O          5
16    P          5
17    Q          3
18    R          4

I want to flatten this hierarchy to look like this matrix.

      [,1] [,2] [,3] [,4] [,5] [,6]
 [1,] "A"  "B"  "C"  "D"  "E"  NA  
 [2,] "A"  "B"  "C"  "D"  "F"  NA  
 [3,] "A"  "B"  "C"  "D"  "G"  "H" 
 [4,] "A"  "B"  "C"  "D"  "G"  "I" 
 [5,] "A"  "B"  "C"  "D"  "G"  "J" 
 [6,] "A"  "B"  "C"  "D"  "G"  "K" 
 [7,] "A"  "B"  "C"  "D"  "L"  "M" 
 [8,] "A"  "B"  "C"  "D"  "L"  "N" 
 [9,] "A"  "B"  "C"  "D"  "L"  "O" 
[10,] "A"  "B"  "C"  "D"  "L"  "P" 
[11,] "A"  "B"  "C"  "Q"  "R"  NA  

Can someone help me with this?

Please note that I'm limited to using the following packages: base, boot, class, cluster, codetools, compiler, datasets, foreign, graphics, grDevices, grid, KernSmooth, lattice, MASS, Matrix, methods, mgcv, nlme, nnet, parallel, rpart, spatial, splines, stats, stats4, survival, tcltk, tools, translations, utils

There is 1 answer

Answer by jay.sf

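To turn the hierarchical data into a matrix, we can first assign each item to a group g, starting a new group wherever indent_nbr does not increase by exactly 1 from the previous row; each group then corresponds to one output row, i.e. one root-to-leaf path. Next, we create an empty array a with one row per group and one column per level, and place each item at its (group, level) position. This yields the first complete path plus only the new tail of every following path, leaving NAs in between. These can be filled column-wise with the last non-NA value using Ruben's repeat_last (a last-observation-carried-forward helper that has to be defined first), so no extra packages are needed. However, this also overwrites the true NAs to the right of each path's last item, so we record their positions in na_ind beforehand and restore them afterwards.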
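The function below calls repeat_last, which is not part of base R. In case Ruben's original is not at hand, here is a minimal stand-in, assuming only a forward fill is needed (the original has further options that are left out here); it is a sketch, not Ruben's exact code.

```r
# Minimal last-observation-carried-forward helper (assumed stand-in for
# Ruben's repeat_last; forward fill only, leading NAs are kept as NA).
repeat_last <- function(x) {
  ind <- which(!is.na(x))             # positions of non-missing values
  if (is.na(x[1])) ind <- c(1, ind)   # keep a leading run of NAs as NA
  # each kept value is repeated until the next non-missing position
  rep(x[ind], times = diff(c(ind, length(x) + 1)))
}

repeat_last(c("D", NA, NA, "Q"))
# [1] "D" "D" "D" "Q"
```

Applied column by column with apply(a, 2, repeat_last), this fills every gap with the value from the closest row above.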

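> hrr2mat <- \(dat) {
+   # start a new group (= output row) wherever the level does not rise by exactly 1
+   g <- c(0, cumsum(diff(dat$indent_nbr) != 1))
+   # one row per group, one column per level
+   a <- array(dim=c(length(table(g)), length(table(dat$indent_nbr))))
+   a[cbind(g + 1, dat$indent_nbr + 1)] <- dat$item
+   # first column to the right of each row's deepest item; these must stay NA
+   na <- apply(!is.na(a), 1, \(x) max(which(x))) + 1
+   w <- which(na <= ncol(a))
+   # (row, column) index pairs of the true NAs, also when rows differ in depth
+   na_ind <- do.call(rbind, Map(\(i, j) cbind(i, j:ncol(a)), w, na[w]))
+   # fill each column downwards with the last non-NA value, then restore true NAs
+   a <- apply(a, 2, repeat_last)
+   a[na_ind] <- NA
+   return(a)
+ }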
> hrr2mat(dat)
      [,1] [,2] [,3] [,4] [,5] [,6]
 [1,] "A"  "B"  "C"  "D"  "E"  NA  
 [2,] "A"  "B"  "C"  "D"  "F"  NA  
 [3,] "A"  "B"  "C"  "D"  "G"  "H" 
 [4,] "A"  "B"  "C"  "D"  "G"  "I" 
 [5,] "A"  "B"  "C"  "D"  "G"  "J" 
 [6,] "A"  "B"  "C"  "D"  "G"  "K" 
 [7,] "A"  "B"  "C"  "D"  "L"  "M" 
 [8,] "A"  "B"  "C"  "D"  "L"  "N" 
 [9,] "A"  "B"  "C"  "D"  "L"  "O" 
[10,] "A"  "B"  "C"  "D"  "L"  "P" 
[11,] "A"  "B"  "C"  "Q"  "R"  NA  

Not sure how it scales but might be a start.
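As a cross-check, here is a different, loop-based sketch that also uses base R only. It is not the approach described above: it walks the rows once, keeps the current root-to-leaf path in a vector, and writes that path out whenever it reaches a leaf (an item whose successor is not deeper). The name flatten_paths is made up for illustration, and the sketch assumes that levels start at 0 and never jump by more than 1.

```r
# Hypothetical helper: one output row per leaf, one column per level.
flatten_paths <- function(item, level) {
  item <- as.character(item)
  n <- length(item)
  # an item is a leaf if the next item is not deeper, or if it is the last item
  is_leaf <- c(level[-1] <= level[-n], TRUE)
  depth <- max(level) + 1
  path <- rep(NA_character_, depth)
  out <- matrix(NA_character_, nrow = sum(is_leaf), ncol = depth)
  r <- 0
  for (i in seq_len(n)) {
    k <- level[i] + 1
    path[k] <- item[i]
    # entries deeper than the current item belong to an earlier branch
    if (k < depth) path[(k + 1):depth] <- NA
    if (is_leaf[i]) {
      r <- r + 1
      out[r, ] <- path
    }
  }
  out
}

dat <- data.frame(
  item = LETTERS[1:18],
  indent_nbr = c(0, 1, 2, 3, 4, 4, 4, 5, 5, 5, 5, 4, 5, 5, 5, 5, 3, 4)
)
res <- flatten_paths(dat$item, dat$indent_nbr)
```

For the sample data, res should be the same 11 x 6 matrix as in the question, so it can be compared against the result of hrr2mat(dat) with identical().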


Data:

> dput(dat)
structure(list(item = c("A", "B", "C", "D", "E", "F", "G", "H", 
"I", "J", "K", "L", "M", "N", "O", "P", "Q", "R"), indent_nbr = c(0, 
1, 2, 3, 4, 4, 4, 5, 5, 5, 5, 4, 5, 5, 5, 5, 3, 4)), class = "data.frame", row.names = c(NA, 
-18L))