R: How do you summarize data for both leaves and nodes in data.tree?

I am using the data.tree package to summarize information across file folders. Each folder holds a number of files (Value), and for each folder I need the total number of files in that folder plus all of its subfolders.

Example data:

library(data.tree)
data <- data.frame(pathString = c("MainFolder",
                                  "MainFolder/Folder1",
                                  "MainFolder/Folder2",
                                  "MainFolder/Folder3",
                                  "MainFolder/Folder1/Subfolder1",
                                  "MainFolder/Folder1/Subfolder2"),
                   Value = c(1,1,5,2,4,10))
tree <- as.Node(data)  # non-path columns such as Value become node fields
print(tree, "Value")
               levelName Value
1 MainFolder             1
2  ¦--Folder1            1
3  ¦   ¦--Subfolder1     4
4  ¦   °--Subfolder2    10
5  ¦--Folder2            5
6  °--Folder3            2

My current and VERY SLOW solution to the problem:

# Function to sum up file counts per folder + subfolders
total_count <- function(node) {
  results <- sum(as.data.frame(print(node, "Value"))$Value)
  return(results)
}

# Summing up file counts per folder + subfolders
tree$Do(function(node) node$Value_by_folder <- total_count(node))


# Results
print(tree, "Value", "Value_by_folder")
           levelName Value Value_by_folder
1 MainFolder             1              23
2  ¦--Folder1            1              15
3  ¦   ¦--Subfolder1     4               4
4  ¦   °--Subfolder2    10              10
5  ¦--Folder2            5               5
6  °--Folder3            2               2

Do you have a suggestion for doing this more efficiently? I have tried to build a recursive method, and also to use "isLeaf" and "children" on the nodes, but have not been able to make it work.

There are 2 answers

Answer by Christoph Glur (best answer):

This is an efficient way to do this. It uses the data.tree API and stores the value in the tree:

MyAggregate <- function(node) {
  if (node$isLeaf) return (node$Value)
  sum(Get(node$children, "Value_by_folder")) + node$Value
}

tree$Do(function(node) node$Value_by_folder <- MyAggregate(node), traversal = "post-order")
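
To see why the post-order traversal matters, here is a minimal, self-contained sketch that rebuilds the example tree and checks the totals. Post-order visits children before their parent, so each child's Value_by_folder already exists when the parent is processed. The helper name sum_subtree is hypothetical (it is not part of data.tree); the rest uses as.Node, Do and Get from the package.

```r
library(data.tree)

# Example data from the question
data <- data.frame(
  pathString = c("MainFolder",
                 "MainFolder/Folder1",
                 "MainFolder/Folder2",
                 "MainFolder/Folder3",
                 "MainFolder/Folder1/Subfolder1",
                 "MainFolder/Folder1/Subfolder2"),
  Value = c(1, 1, 5, 2, 4, 10)
)
tree <- as.Node(data)

# Hypothetical helper: own Value plus the already-computed totals of the children
sum_subtree <- function(node) {
  if (node$isLeaf) return(node$Value)
  child_totals <- vapply(node$children,
                         function(child) child$Value_by_folder,
                         numeric(1))
  node$Value + sum(child_totals)
}

# Post-order: every child is processed before its parent
tree$Do(function(node) node$Value_by_folder <- sum_subtree(node),
        traversal = "post-order")

# Get() returns the values in the default pre-order, i.e. the printed order
totals <- tree$Get("Value_by_folder")
print(totals)
```

Each node is visited once and only reads its direct children, so the work grows linearly with the number of nodes, whereas the original approach printed and re-summed the whole subtree for every node.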
Answer by F. Privé:

You can do:

get_value_by_folder <- function(tree) {

  res <- rep(NA_real_, tree$totalCount)

  i <- 0
  myApply <- function(node) {
    i <<- i + 1
    force(k <- i)
    res[k] <<- node$Value + `if`(node$isLeaf, 0, sum(sapply(node$children, myApply)))
  }

  myApply(tree)
  res
}

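Capturing the current index in k before the recursive calls is important: R evaluates the right-hand side of the assignment (which recurses and increments i) before it evaluates the index, so writing to res[i] directly would fill res in the wrong order.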

And you get:

> get_value_by_folder(tree)
[1] 23 15  4 10  5  2
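
The reason the index is captured in k before recursing can be illustrated with a small base-R sketch. It assumes nothing beyond base R; the helper bump is hypothetical and only stands in for a recursive call that advances the counter as a side effect.

```r
# Hypothetical helper: returns 10 but also advances the counter i
bump <- function() {
  i <<- i + 1
  10
}

# Direct indexing: the right-hand side runs first and moves i from 1 to 2,
# so the value is written to slot 2 instead of slot 1
res_direct <- rep(NA_real_, 2)
i <- 1
res_direct[i] <- bump()

# Capturing the index first keeps the write in slot 1
res_captured <- rep(NA_real_, 2)
i <- 1
k <- i
res_captured[k] <- bump()

print(res_direct)
print(res_captured)
```

In get_value_by_folder the recursive sapply call plays the role of bump, which is why the position has to be fixed before the right-hand side is evaluated.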

Edit: if you want to store the result in the tree directly:

get_value_by_folder2 <- function(tree) {

  myApply <- function(node) {
    node$Value_by_folder <- node$Value + `if`(node$isLeaf, 0, sum(sapply(node$children, myApply)))
  }

  myApply(tree)
  tree
}

> print(get_value_by_folder2(tree), "Value", "Value_by_folder")
           levelName Value Value_by_folder
1 MainFolder             1              23
2  ¦--Folder1            1              15
3  ¦   ¦--Subfolder1     4               4
4  ¦   °--Subfolder2    10              10
5  ¦--Folder2            5               5
6  °--Folder3            2               2

Note that data.tree nodes are R6 objects (environments) with reference semantics, so the original tree is modified in place.

> print(tree, "Value", "Value_by_folder")
           levelName Value Value_by_folder
1 MainFolder             1              23
2  ¦--Folder1            1              15
3  ¦   ¦--Subfolder1     4               4
4  ¦   °--Subfolder2    10              10
5  ¦--Folder2            5               5
6  °--Folder3            2               2
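
The in-place modification can be shown with a short sketch. It assumes a small two-node tree built by hand; the helper mark_visited and the field name Visited are hypothetical. Clone is the data.tree function for taking an independent copy when the original must stay untouched.

```r
library(data.tree)

# Small hand-built tree; Value is a custom field as in the question
tree <- Node$new("MainFolder", Value = 1)
tree$AddChild("Folder1", Value = 4)

# Hypothetical helper: tags a node and returns nothing to the caller
mark_visited <- function(node) {
  node$Visited <- TRUE
  invisible(NULL)
}

snapshot <- Clone(tree)  # independent copy taken before the change
mark_visited(tree)       # modifies the caller's tree, not a local copy

print(tree$Visited)      # set on the original
print(snapshot$Visited)  # not set on the copy
```

This is why get_value_by_folder2 does not strictly need to return the tree: the caller's object already carries the new field.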