Challenges with Fitting PGLS Models on Binary Trait Subsets Using caper in R

15 views Asked by At

I'm analyzing phylogenetic signal in binary trait data (presence/absence of a gene) across different subsets of a phylogenetic tree. My goal is to calculate Pagel's lambda for these subsets to compare phylogenetic signal strengths under different conditions. While the full dataset works fine, I'm encountering errors when fitting PGLS models on subsets using the caper package in R. The subsets are binary trait data indicating the presence or absence of a specific gene across various genomes.

Here's a condensed version of my code, illustrating the process up to the point where I attempt to fit the PGLS model:

library(caper)
library(ape)

# Setup not shown for brevity: 'tree' is my phylogenetic tree, 'gene_data' contains binary traits

random_sampling_lambda <- function(tree, gene_data, subset_size = 41, n_iterations = 100) {
  lambda_values <- numeric(n_iterations)
  
  for (i in 1:n_iterations) {
    sample_indices <- sample(nrow(gene_data), subset_size, replace = FALSE)
    sampled_data <- gene_data[sample_indices, ]
    
    # Assuming matching between sampled IDs and tree tips is ensured
    pruned_tree <- drop.tip(tree, setdiff(tree$tip.label, sampled_data$Genome_ID))
    comp_data <- comparative.data(phy = pruned_tree, data = sampled_data, names.col = "Genome_ID")
    
    # Attempting to fit PGLS model
    model <- pgls(Gene_1 ~ 1, data = comp_data, lambda = "ML")
    
    # Storing lambda value if model fitting is successful
    if (!is.null(model) && "lambda" %in% names(model) && !is.na(model$lambda)) {
      lambda_values[i] <- model$lambda
    } else {
      lambda_values[i] <- NA
    }
  }
  
  return(lambda_values[!is.na(lambda_values)])
}

# Sample function call (assuming 'tree' and 'gene_data' are defined)
lambda_values <- random_sampling_lambda(tree, gene_data, subset_size = 41, n_iterations = 100)

Sample data for reproducible example:

# Sample gene presence/absence data
gene_data <- structure(list(Genome_ID = c("Genome1", "Genome2", "Genome3", "Genome4"), 
                            Gene_1 = c(1, 0, 1, 0)), 
                       .Names = c("Genome_ID", "Gene_1"), 
                       row.names = c(NA, -4L), 
                       class = "data.frame")

# Sample tree in Newick format
tree_newick <- "((Genome1:0.1,Genome2:0.2):0.3,(Genome3:0.1,Genome4:0.2):0.3);"

Issue: When executing the code, especially during the PGLS model fitting with pgls, I encounter errors indicating problems with the input data. These errors suggest "arguments of length zero" or issues related specifically to handling binary (presence/absence) data in the model.

Question: How can I resolve these errors to successfully compute Pagel's lambda for subsets of binary trait data using caper? Any advice on dealing with binary traits in this phylogenetic analysis context would be greatly appreciated.

What I've tried:

  1. Ensuring each subset contains both presences (1) and absences (0) to maintain variability.
  2. Manually verifying that the pruned trees have the correct size and match the subsets.
  3. Fitting a PGLS model outside the function on the entire dataset without issues, indicating the problem arises with subsets.

Expected Outcome: A series of lambda values for each subset, indicating the phylogenetic signal.

Actual Outcome: Errors during the model fitting process, hindering lambda calculation.

0

There are 0 answers