Sum selected elements in dict of dicts in Python using one liner instead of for-loop

I used the below dict comprehension

dimer = {(ab+cd):{"1":0,"2":0,"3":0} for cd in 'ACGT' for ab in 'ACGT'}

to generate a dict of dicts, dimer:

dimer = {"AA":{"1":0,"2":0,"3":0}, "AC":{"1":0,"2":0,"3":0}, "AG":{"1":0,"2":0,"3":0}, "AT":{"1":0,"2":0,"3":0}, "CA":{"1":0,"2":0,"3":0}, "CC":{"1":0,"2":0,"3":0}, "CG":{"1":0,"2":0,"3":0}, "CT":{"1":0,"2":0,"3":0}, "GA":{"1":0,"2":0,"3":0}, "GC":{"1":0,"2":0,"3":0}, "GG":{"1":0,"2":0,"3":0}, "GT":{"1":0,"2":0,"3":0}, "TA":{"1":0,"2":0,"3":0}, "TC":{"1":0,"2":0,"3":0}, "TT":{"1":0,"2":0,"3":0}, "TG":{"1":0,"2":0,"3":0}}

Now I want to sum up selected elements. If I hardcode the sums, they look like this:

total_A = dimer["AA"]["1"]+dimer["CA"]["1"]+dimer["GA"]["1"]+dimer["TA"]["1"]+dimer["AA"]["2"]+dimer["CA"]["2"]+dimer["GA"]["2"]+dimer["TA"]["2"]+dimer["AA"]["3"]+dimer["CA"]["3"]+dimer["GA"]["3"]+dimer["TA"]["3"]
total_C = dimer["AC"]["1"]+dimer["CC"]["1"]+dimer["GC"]["1"]+dimer["TC"]["1"]+dimer["AC"]["2"]+dimer["CC"]["2"]+dimer["GC"]["2"]+dimer["TC"]["2"]+dimer["AC"]["3"]+dimer["CC"]["3"]+dimer["GC"]["3"]+dimer["TC"]["3"]
total_G = dimer["AG"]["1"]+dimer["CG"]["1"]+dimer["GG"]["1"]+dimer["TG"]["1"]+dimer["AG"]["2"]+dimer["CG"]["2"]+dimer["GG"]["2"]+dimer["TG"]["2"]+dimer["AG"]["3"]+dimer["CG"]["3"]+dimer["GG"]["3"]+dimer["TG"]["3"]
total_T = dimer["AT"]["1"]+dimer["CT"]["1"]+dimer["GT"]["1"]+dimer["TT"]["1"]+dimer["AT"]["2"]+dimer["CT"]["2"]+dimer["GT"]["2"]+dimer["TT"]["2"]+dimer["AT"]["3"]+dimer["CT"]["3"]+dimer["GT"]["3"]+dimer["TT"]["3"]

The best approach I have come up with to simplify it is using nested for-loops:

total_0 = {i:0 for i in 'ACGT'}   
for i in 'ACGT':    
    for j in 'ACGT':
        for k in '123':
            total_0[i] += dimer[j+i][k]  

Is there any way to compute these sums with a one-liner?

I also have a second set of nested for-loops:

row_sum = {i:{"1":0,"2":0,"3":0} for i in 'ACGT'}   
for i in 'ACGT':    
    for j in 'ACGT':
        for k in '123': 
            row_sum[i][k] += float(dimer[i+j][k])

The hardcoded version looks like this:

row_sum = {"A":{"1":0,"2":0,"3":0},"C":{"1":0,"2":0,"3":0},"G":{"1":0,"2":0,"3":0},"T":{"1":0,"2":0,"3":0}}
for i in range(1,4,1): 
    row_sum["A"][str(i)] = float(dimer["AA"][str(i)]+dimer["AC"][str(i)]+dimer["AG"][str(i)]+dimer["AT"][str(i)])
    row_sum["C"][str(i)] = float(dimer["CA"][str(i)]+dimer["CC"][str(i)]+dimer["CG"][str(i)]+dimer["CT"][str(i)])
    row_sum["G"][str(i)] = float(dimer["GA"][str(i)]+dimer["GC"][str(i)]+dimer["GG"][str(i)]+dimer["GT"][str(i)])
    row_sum["T"][str(i)] = float(dimer["TA"][str(i)]+dimer["TC"][str(i)]+dimer["TG"][str(i)]+dimer["TT"][str(i)])

Can the second set of nested for-loops also be written as a one-liner?

Sorry I am really new to Python. Any help will be appreciated!

There are 2 answers

John La Rooy (best answer)

Firstly, you can collapse the 3 loops into one using a cartesian product like this.

from itertools import product
row_sum = {i: {"1": 0, "2": 0, "3": 0} for i in 'ACGT'}
for i, j, k in product('ACGT', 'ACGT', '123'):    
    row_sum[i][k] += float(dimer[i + j][k])

Here is a one-liner that, for each first letter i, adds up all three counts of every dimer starting with i. It's probably hard to follow if you are new to Python.

{i: sum(sum(dimer[i + j].values()) for j in 'ACGT') for i in 'ACGT'}
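For comparison, here is a minimal sketch of comprehensions that mirror the two loops in the question exactly: total_0 groups by the second letter of the dimer, and row_sum keeps the per-slot breakdown grouped by the first letter. The counts assigned below are made-up illustration data, not values from the question.

```python
# Build the dimer table, then fill in a few made-up counts (illustrative only).
dimer = {ab + cd: {"1": 0, "2": 0, "3": 0} for ab in "ACGT" for cd in "ACGT"}
dimer["AA"]["1"] = 2
dimer["CA"]["3"] = 5
dimer["AG"]["2"] = 7

# total_0[i]: every dimer whose SECOND letter is i, summed over all three slots.
total_0 = {i: sum(dimer[j + i][k] for j in "ACGT" for k in "123") for i in "ACGT"}

# row_sum[i][k]: every dimer whose FIRST letter is i, summed separately per slot k.
row_sum = {i: {k: float(sum(dimer[i + j][k] for j in "ACGT")) for k in "123"}
           for i in "ACGT"}

print(total_0)
print(row_sum["A"])
```

Both expressions are single statements, but the explicit loops (or the product version above) may still be easier to read and debug.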

amaurea

I don't know how this would mesh with the rest of your program, but it may be worth switching to a different data structure. If you represent your collection of dimers as a single numpy array of integers, you get your one-liners and also see large speedups. Your dimers could, for example, be represented like this:

import numpy as np
dimer = np.zeros((4,4,3),dtype=int)

Here 0,1,2,3 in the first index represents whether the first element in the dimer is A,C,G,T, and similarly for the second index, while the third index contains the three different cases you label "1", "2" or "3". So your dimer["AG"]["1"] would here be dimer[0,2,0] (since numpy counts from zero).
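If the data already lives in the dict of dicts, the index mapping described above could be applied with a small conversion helper. This is only a sketch: the names NT, IDX and to_array are hypothetical and not part of the original answer, and the single count set below is made-up illustration data.

```python
import numpy as np

NT = "ACGT"                                # assumed letter order: A=0, C=1, G=2, T=3
IDX = {nt: n for n, nt in enumerate(NT)}   # letter -> array index

def to_array(dimer):
    """Copy a dict of dicts such as dimer["AG"]["1"] into a (4, 4, 3) int array."""
    arr = np.zeros((4, 4, 3), dtype=int)
    for key, slots in dimer.items():
        first, second = key                # e.g. "AG" -> "A", "G"
        for k in "123":
            # slot labels "1", "2", "3" become indices 0, 1, 2
            arr[IDX[first], IDX[second], int(k) - 1] = slots[k]
    return arr

dimer_dict = {a + b: {"1": 0, "2": 0, "3": 0} for a in NT for b in NT}
dimer_dict["AG"]["1"] = 4                  # made-up count for illustration
dimer_arr = to_array(dimer_dict)
print(dimer_arr[0, 2, 0])                  # the entry that was dimer["AG"]["1"]
```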

The advantage of using a structure like this is

  1. It is much faster and more memory efficient if your data set becomes large (if you have 300000 elements per dimer instead of 3, for example).
  2. There are lots of functions available for manipulating numpy arrays. For example, np.sum(dimer,2) would give you the total count of the elements of each dimer.

The aggregate statistics you want could be computed as:

total_0 = np.sum(dimer, (0,2))
row_sum = np.sum(dimer, 1)
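To see that these two calls reproduce the nested loops from the question, the following sketch fills an array with random counts, builds the equivalent dict of dicts from it, and compares both results. The random data and the variable names are assumptions made for the check.

```python
import numpy as np

NT = "ACGT"
rng = np.random.default_rng(0)
arr = rng.integers(0, 10, size=(4, 4, 3))   # random counts, illustration only

# The same data as a dict of dicts, keyed like the question's dimer.
dimer = {NT[a] + NT[b]: {str(k + 1): int(arr[a, b, k]) for k in range(3)}
         for a in range(4) for b in range(4)}

# Nested loops from the question.
total_loop = {i: 0 for i in NT}
row_loop = {i: {"1": 0.0, "2": 0.0, "3": 0.0} for i in NT}
for i in NT:
    for j in NT:
        for k in "123":
            total_loop[i] += dimer[j + i][k]          # i is the second letter
            row_loop[i][k] += float(dimer[i + j][k])  # i is the first letter

# numpy versions: sum over axes 0 and 2, and over axis 1.
total_np = np.sum(arr, (0, 2))   # shape (4,), indexed by second letter
row_np = np.sum(arr, 1)          # shape (4, 3), indexed by first letter and slot

totals_match = all(total_loop[nt] == total_np[n] for n, nt in enumerate(NT))
rows_match = all(row_loop[nt][str(k + 1)] == row_np[n, k]
                 for n, nt in enumerate(NT) for k in range(3))
print(totals_match, rows_match)
```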

As an illustration of the speed differences, for your problem size, the dict approach with for loops takes 20 µs to compute total_0, while the numpy sum takes 5.7 µs. For a 1000 times larger problem, where each dimer has 3000 members, the dict approach takes 22 ms while numpy takes 31 µs. And for a 1,000,000 times larger problem, dicts take 24.5 s while numpy takes 24.3 ms. So for large problem sizes, numpy is roughly 1000 times faster than using dicts.
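A comparison along these lines could be run with timeit, as in the rough sketch below. The problem size (3000 slots per dimer), the repeat count and the helper names are assumptions, and the measured times will differ from the figures quoted above depending on hardware and library versions.

```python
import timeit
import numpy as np

NT = "ACGT"
n_slots = 3000                              # assumed size: 1000x the original 3 slots
slots = [str(k + 1) for k in range(n_slots)]

# Same data in both representations (every count is 1 for simplicity).
arr = np.ones((4, 4, n_slots), dtype=int)
dimer = {a + b: {k: 1 for k in slots} for a in NT for b in NT}

def total_dict():
    """total_0 computed with nested for-loops over the dict of dicts."""
    total = {i: 0 for i in NT}
    for i in NT:
        for j in NT:
            for k in slots:
                total[i] += dimer[j + i][k]
    return total

def total_numpy():
    """total_0 computed with a single numpy reduction."""
    return np.sum(arr, (0, 2))

repeats = 20
t_dict = timeit.timeit(total_dict, number=repeats) / repeats
t_np = timeit.timeit(total_numpy, number=repeats) / repeats
print(f"dict loops: {t_dict * 1e6:.1f} µs, numpy: {t_np * 1e6:.1f} µs")
```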