I have a large undirected graph (about 150k nodes) which is rather sparse, and I want to get the list of all non-links from this graph (so the complement of the set of links that exist). To a first approximation I don't care whether I get this as an igraph object, a matrix, or as an edgelist, though in the end I'll have to transform it into an edgelist. The two approaches I found are neither memory-efficient nor fast, and I would appreciate help in how this can be done efficiently.
Approach 1: get_non_links()
from the LinkPrediction package by galanisl, Link: https://github.com/galanisl/LinkPrediction/blob/master/R/link_predictors.R
This approach takes an igraph object, transforms it into its adjacency matrix, sets the lower triangle of the matrix to 1, and checks which elements of the matrix are zero:
library(igraph)
edgelist <- data.frame(c("A", "B", "C", "A"), c("B", "E", "F", "C"))
g <- graph_from_data_frame(edgelist)
m <- as_adjacency_matrix(g, names = FALSE, sparse = FALSE)
m[lower.tri(m, diag = TRUE)] <- 1
nonlinks <- which(m == 0, arr.ind = TRUE)
This approach takes forever (>8 hours on my graph) and eats up quite a lot of memory.
Approach 2: Enumerating all possible links and finding out which ones exist
edgelist <- data.frame(c("A", "B", "C", "A"), c("B", "E", "F", "C"))
colnames(edgelist) <- c("node1", "node2")
nodes <- union(unique(edgelist[,1]), unique(edgelist[,2]))
possible_links <- expand.grid(nodes, nodes) #which does not make use of the fact that the graph is undirected
edgelist$links_exist <- 1
testlist <- merge(edgelist, possible_links, by.x =c("node1", "node2"), by.y = c("Var1", "Var2"), all.y = TRUE)
non_links <- testlist[is.na(testlist$links_exist), ]
I'm sure both of these approaches could be implemented more efficiently, but I'm not sure orders of magnitude more efficiently. Any hints how this could be done efficiently are much appreciated!
What you are looking for is called the complement graph. In igraph it can be computed with the
complementer()
function.Be careful if your graph is large and sparse. The size of the result will be quadratic in the number of vertices, and may easily fill up all available memory. As a general guideline, exercise caution with graphs that have more than 10000 vertices.