How to obtain files included in initial commit using git2r (libgit2)?


I am using the R package git2r to interface with libgit2. I would like to obtain the list of files that were updated in each commit, similar to the output from git log --stat or git log --name-only. However, I am unable to obtain the files that were included in the initial commit. Below I provide code to set up an example Git repository, as well as the solutions I have attempted based on my research.

Reproducible example

The code below creates a temporary directory in /tmp, creates empty text files, and then commits each file separately.

# Create example Git repo
path <- tempfile("so-git2r-ex-")
dir.create(path)
setwd(path)
# Set the number of fake files
n_files <- 3
file.create(paste0("file", 1:n_files, ".txt"))
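library("git2r")
repo <- init(".")
# Set a repository-local identity so that commit() also works when no
# global Git user is configured (placeholder values)
config(repo, user.name = "Example User", user.email = "user@example.com")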
for (i in 1:n_files) {
  add(repo, sprintf("file%d.txt", i))
  commit(repo, sprintf("Added file %d", i))
}

Option 1 - compare diff of two trees

This SO post recommends you perform a diff comparing the tree object of the desired commit and its parent commit. This works well, except for the initial commit because there is no parent commit to compare it to.

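get_files_from_diff <- function(c1, c2) {
  # Obtain files updated in commit c1.
  # c2 is the commit that preceded c1 (its parent).
  # Diff from the older tree (c2) to the newer tree (c1) so that the
  # "new" side of each delta refers to c1.
  git_diff <- diff(tree(c2), tree(c1))
  files <- sapply(git_diff@files, function(x) x@new_file)
  return(files)
}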

log <- commits(repo)
n <- length(log)
for (i in 1:n) {
  print(i)
  if (i == n) {
    print("Unclear how to obtain list of files from initial commit.")
  } else {
    files <- get_files_from_diff(log[[i]], log[[i + 1]])
    print(files)
  }
}

Option 2 - Parse commit summary

This SO post suggests obtaining commit information, such as the files changed, by parsing the commit summary. This gives output very similar to git log --stat, but again the exception is the initial commit, for which no files are listed. Looking at the source code, the files in the commit summary are obtained via the same method as above, which explains why no files are displayed for the initial commit (it has no parent commit).

for (i in 1:n) {
  summary(log[[i]])
}

Update

This should be possible. The Git command diff-tree has a --root flag that compares the root commit to a NULL tree. From the man page:

   --root
       When --root is specified the initial commit will be shown as a
       big creation event. This is equivalent to a diff against the
       NULL tree.

Furthermore, the libgit2 library has the function git_diff_tree_to_tree, which accepts a NULL tree. Unfortunately, it is unclear to me if it is possible to pass a NULL tree to the git2r C function git2r_diff via the git2r diff method for git-tree objects. Is there a way to create a NULL tree object with git2r?

> tree()
Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function ‘tree’ for signature ‘"missing"’
> tree(NULL)
Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function ‘tree’ for signature ‘"NULL"’
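
As a stopgap while the git2r route is unclear, the --root behaviour quoted above can be reached from R by shelling out to the git command line. The sketch below assumes that a git executable is on the PATH; the helper name files_in_commit_cli is hypothetical and is not part of git2r.

```r
# Hypothetical helper (not part of git2r): list the files touched by a
# commit by calling the git command line directly.
# Assumes a `git` executable is available on the PATH.
files_in_commit_cli <- function(path, sha) {
  system2(
    "git",
    c("-C", shQuote(path),   # run git inside the repository at `path`
      "diff-tree",
      "--root",              # show the root commit as a big creation event
      "--no-commit-id",      # omit the leading commit ID line
      "--name-only",         # print file names only
      "-r",                  # recurse into subdirectories
      sha),
    stdout = TRUE            # capture the output as a character vector
  )
}
```

Here path is the repository directory and sha is the commit's SHA as a character string. For a commit that has a parent, --root has no effect and the files changed relative to that parent are listed.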

Answer by John Blischak

I came up with a solution based on an insight from my colleague: you can obtain the files currently being tracked by inspecting the git_tree object. A commit's tree lists every file tracked as of that commit, and because the root commit is the first commit, every file in its tree must have been added in that commit.

The summary method prints the entries of the tree, and the same information can be captured as a data frame using the as method.

summary(tree(log[[n]]))
#    mode type                                      sha      name
# 1 100644 blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 file1.txt
as(tree(log[[n]]), "data.frame")
#    mode type                                      sha      name
# 1 100644 blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 file1.txt

The function below obtains the files from the root commit. While it is not apparent in this small example, the main complication is that subdirectories are represented as trees, so you need to recursively search the tree to obtain all the filenames.

obtain_files_in_commit_root <- function(repo, commit) {
  # Obtain the files in the root commit of a Git repository
  stopifnot(inherits(repo, "git_repository"),
            inherits(commit, "git_commit"),
            length(parents(commit)) == 0)
  entries <- as(tree(commit), "data.frame")
  files <- character()
  while (nrow(entries) > 0) {
    if (entries$type[1] == "blob") {
      # If the entry is a blob, i.e. file:
      #  - record the name of the file
      #  - remove the entry
      files <- c(files, entries$name[1])
      entries <- entries[-1, ]
    } else if (entries$type[1] == "tree") {
      # If the entry is a tree, i.e. subdirectory:
      #  - lookup the entries for this tree
      #  - add the subdirectory to the name so that path is correct
      #  - remove the entry from beginning and add new entries to end of
      #    data.frame
      new_tree_df <- as(lookup(repo, entries$sha[1]), "data.frame")
      new_tree_df$name <- file.path(entries$name[1], new_tree_df$name)
      entries <- rbind(entries[-1, ], new_tree_df)
    } else {
      # Report the commit by its SHA; sprintf() cannot format the
      # commit object itself
      stop(sprintf("Unknown type %s found in commit %s",
                   entries$type[1], commit@sha))
    }
  }

  return(files)
}

obtain_files_in_commit_root(repo, log[[n]])
# [1] "file1.txt"
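
Since the example repository has no subdirectories, the tree branch of the loop is never exercised above. The sketch below replays the same worklist traversal on mock data in base R, with no repository needed. The mock_trees list, its made-up SHAs and file names, and the helper flatten_tree are all hypothetical stand-ins for what as(lookup(repo, sha), "data.frame") would return.

```r
# Mock object store: each tree "sha" maps to a data frame of its entries.
# All shas and names are invented for illustration.
mock_trees <- list(
  root = data.frame(type = c("blob", "tree"), sha = c("b1", "t_sub"),
                    name = c("README.md", "src"), stringsAsFactors = FALSE),
  t_sub = data.frame(type = c("blob", "tree"), sha = c("b2", "t_deep"),
                     name = c("main.R", "utils"), stringsAsFactors = FALSE),
  t_deep = data.frame(type = "blob", sha = "b3", name = "helpers.R",
                      stringsAsFactors = FALSE)
)

flatten_tree <- function(trees, sha) {
  entries <- trees[[sha]]
  files <- character()
  while (nrow(entries) > 0) {
    if (entries$type[1] == "blob") {
      # A file: record its (already prefixed) path and drop the entry
      files <- c(files, entries$name[1])
      entries <- entries[-1, ]
    } else {
      # A subdirectory: look up its children, prefix them with the
      # parent's path, and append them to the end of the worklist
      children <- trees[[entries$sha[1]]]
      children$name <- file.path(entries$name[1], children$name)
      entries <- rbind(entries[-1, ], children)
    }
  }
  files
}

flatten_tree(mock_trees, "root")
# [1] "README.md"           "src/main.R"          "src/utils/helpers.R"
```

Because each child inherits the full path of its parent entry, nested directories accumulate the correct prefix no matter how deep they are.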