How can I make this combination/clique algorithm more memory and time efficient, perhaps through multiprocessing?

42 views Asked by At

Currently, I have a program that successfully converts a list of tuple pairs into triplets in which everything within the triplet is paired with each other. Here is the fully working Python program so that you can use it for testing:

import itertools
from itertools import combinations
from tqdm import tqdm
import multiprocessing
import time

tuplelist = [
    ("A", "A"),
    ("B", "B"),
    ("C", "C"),
    ("D", "D"),
    ("E", "E"),
    ("A", "B"),
    ("A", "C"),
    ("A", "D"),
    ("B", "C"),
    ("B", "D"),
    ("C", "D"),
    ("C", "E"),
    ("B", "E"),
    ("F", "F"),
    ("A", "F"),
    ("B", "F"),
    ("E", "F"),
    ("G", "G"),
    ("A", "G"),
    ("C", "G"),
    ("F", "G"),
    ("C", "F"),
    ("H", "H"),
    ("A", "H"),
    ("B", "H"),
    ("C", "H"),
    ("D", "H"),

]

print("Now making uniquegene list.")
uniquegenes = list(set(itertools.chain.from_iterable(tuplelist)))
uniquegenes.sort()
print(uniquegenes)
newpairs = []
print("Lexographically sorting pairs now.")
for x in tuplelist:
    temp = sorted(x)
    newpairs.append(tuple(temp))
tuplelist = newpairs
print(len(tuplelist))
print("Now making combinations.")

#------------------------
def combinator(size, theuniquegenes, thetuplelist):
    limiter = size+1
    lengths = [l + 1 for l in tqdm(range(len(theuniquegenes))) if (l + 1 > 2) and (l + 1 < limiter)]
    new_combs = {c for l in tqdm(lengths) for c in combinations(theuniquegenes, l)}
    correlations = {x for x in tqdm(new_combs) if all([c in thetuplelist for c in combinations(x, 2)])}
    tuplelist = thetuplelist + list(correlations)
    return tuplelist
tuplelist = combinator(3, uniquegenes, tuplelist)
print(len(tuplelist))
#------------------------

print("Writing to file.")
with open('NewCombos.txt', 'w') as f:
    for line in tuplelist:
        f.write(f"{line}\n")

This turns 27 pairs into 46 pairs AND triplets (so 19 new triplets). The only relevant code I would like to edit is in between the two lines I commented into the code.

The issue with this code, however, is that if I were to input a massive amount of pairs, the RAM used in the "new_combs =" line is unfeasible. This is because every single triplet imaginable is first formed, then saved to a variable, and only then is it analyzed one by one in the "correlations =" line to check if everything within the triplet correlates to each other and eliminates the triplets that don't have this correlation. tqdm() is only a progress checker.

How can I go about rewriting this function such that every time it generates a new possible triplet combination, it checks whether it has internal correlation, and then moves on? While this is happening, I would like the new program to no longer build up massive RAM usage. Anything is fair game including offloading the memory to a file through pickle or perhaps by using multithreading to do multiple parts at the same time.

Here is my attempt at making such a program, though it it very faulty and probably should not be referenced in your attempt. (This is meant to replace only the area between the commented lines in the original program above.)

def worker(task):
    (length, theuniquegenes, thetuplelist) = task
    limiter = length+1
    new_combs = {c for l in range(length, length+1) if (l > 2) and (l < limiter)
                 for c in combinations(theuniquegenes, l)}
    correlations = {x for x in new_combs if all([c in thetuplelist for c in combinations(x, 2)])}
    return correlations

def combinator(size, theuniquegenes, thetuplelist):
    limiter = size+1
    lengths = [l + 1 for l in range(len(theuniquegenes)) if (l + 1 > 2) and (l + 1 < limiter)]
    tasks = [(length, theuniquegenes, thetuplelist) for length in lengths]
    
    with multiprocessing.Pool(processes=multiprocessing.cpu_count()) as pool:
        result_list = pool.imap(worker, tasks)
    
    correlations = set().union(*result_list)
    print(len(correlations))
    tuplelist = thetuplelist + list(correlations)
    print(len(tuplelist))
    return tuplelist

tuplelist = combinator(3, uniquegenes, tuplelist)

Any help would be highly appreciated.

1

There are 1 answers

0
danzel On

Because you only iterate it once, you can save the RAM (and a second iteration) for new_combs by omitting the intermediate set.

If you use a generator instead:

new_combs = (c for l in tqdm(lengths) for c in combinations(theuniquegenes, l))

...there will only be one new combination in memory at a time.

Additionally, the following line creates an unnecessary list as argument to all:

correlations = {x for x in tqdm(new_combs) if all([c in thetuplelist for c in combinations(x, 2)])}

all can short-circuit, meaning that it returns False as soon as it sees the first False element. Omitting the list comprehension saves memory for the list and CPU resources for evaluating all combinations:

correlations = {x for x in tqdm(new_combs) if all(c in thetuplelist for c in combinations(x, 2))}