Given a large list of tuples, how to group by the first element of each tuple and sum the last element of each tuple, without a Pandas DataFrame?


I have a large list of tuples where each tuple contains 9 string elements:

pdf_results = [
("Kohl's - Dallas", '-', "Kohl's Cafe", '03/18/22', 'RC', '8', '0', '16', '8'),
("Kohl's - Dallas", '-', "Kohl's Cafe", '03/18/22', 'SMI', '5', '0', '10', '5'),
("Kohl's - Dallas", '-', "Kohl's Cafe", '03/19/22', 'RC', '8', '0', '16', '8'),
("Kohl's - Dallas", '-', "Kohl's Cafe", '03/19/22', 'SMI', '5', '0', '10', '5'),
("Kohl's - Dallas", '-', "Kohl's Cafe", '03/20/22', 'RC', '8', '0', '16', '8'),
("Kohl's - Dallas", '-', "Kohl's Cafe", '03/20/22', 'SMI', '5', '0', '10', '5'),
("Kohl's - Dallas", '-', "Kohl's Cafe", '03/21/22', 'RC', '8', '0', '16', '8'),
("Kohl's - Dallas", '-', "Kohl's Cafe", '03/21/22', 'SMI', '5', '0', '10', '5'),
("Kohl's - Dallas", '-', "Kohl's Cafe", '03/23/22', 'SMI', '5', '0', '10', '5'),
("Kohl's - Dallas", '-', "Kohl's Cafe", '03/24/22', 'RC', '8', '0', '16', '8'),
("Kohl's - Dallas", '-', "Kohl's Cafe", '03/24/22', 'SMI', '5', '0', '10', '5'),
('Bronx-Lebanon Hospital Center', '-', 'Patient Trayline ', '03/18/22', 'RC', '8', '0', '16', '8'),
('Bronx-Lebanon Hospital Center', '-', 'Patient Trayline ', '03/18/22', 'SMI', '5', '0', '10', '5'),
('Bronx-Lebanon Hospital Center', '-', 'Patient Trayline ', '03/19/22', 'RC', '8', '0', '16', '8'),
('Bronx-Lebanon Hospital Center', '-', 'Patient Trayline ', '03/19/22', 'SMI', '5', '0', '10', '5')
]

Without using a Pandas DataFrame, what is the best way to group by the first element of each tuple and sum the last element within each group? The output should look like this:

desired_output = [
("Kohl's - Dallas", 70),
("Bronx-Lebanon Hospital Center", 26)
]

I've tried using itertools.groupby, which seems to be the most appropriate solution; however, I keep getting stuck on properly iterating, indexing, and summing the last element of each tuple without running into one of the following obstacles:

  1. The last element of each tuple is a string, and converting it to int prevents iteration, raising TypeError: 'int' object is not iterable
  2. A ValueError is raised: invalid literal for int() with base 10: 'b'
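
Errors of this kind can be reproduced in isolation, which may help show where they come from. The sketch below is only an illustration using a single hypothetical value `last`, not the code from the attempt: `sum()` expects an iterable of numbers, so passing it a string or a bare int fails, and `int()` rejects any string that is not a numeric literal.

```python
last = '8'  # hypothetical stand-in for the last element of one tuple

errors = []
for label, attempt in [
    ("sum of a str", lambda: sum(last)),        # iterates characters, then 0 + '8' fails
    ("sum of an int", lambda: sum(int(last))),  # an int cannot be iterated
    ("int of a letter", lambda: int('b')),      # 'b' is not a base-10 literal
]:
    try:
        attempt()
    except (TypeError, ValueError) as exc:
        errors.append((label, type(exc).__name__))
        print(f"{label}: {type(exc).__name__}: {exc}")

# Converting each element first and summing the converted values works.
converted_total = sum(int(s) for s in ['8', '5'])
print(converted_total)
```

The working pattern is to convert each string inside a generator expression and let `sum()` consume the resulting ints.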

Attempt:

from itertools import groupby

def getSiteName(siteChunk):
    return siteChunk[0]

siteNameGroup = groupby(pdf_results, getSiteName)

for key, group in siteNameGroup:
    print(key) # 1st element of tuple as desired
    for pdf_results in group:
        # Raises TypeError: unsupported operand type(s) for +: 'int' and 'str'
        print(sum(pdf_results[8]))
    print()

There are 4 answers

dawg On BEST ANSWER

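Assuming rows with the same first element are adjacent in your list, as they are in your example (groupby only groups consecutive items with equal keys), you can do: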

from itertools import groupby 

for k,v in groupby(pdf_results, key=lambda t: t[0]):
    print(k, sum(int(x[-1]) for x in v))

Prints:

Kohl's - Dallas 70
Bronx-Lebanon Hospital Center 26

If rows with the same first element are not guaranteed to be adjacent, just use a dict to total the elements, keyed by the first entry of the tuple:

res = {}

for t in pdf_results:
    res[t[0]] = res.get(t[0], 0) + int(t[-1])

>>> res
{"Kohl's - Dallas": 70, 'Bronx-Lebanon Hospital Center': 26}
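
To illustrate why adjacency matters: `groupby` only merges consecutive items with equal keys, so an interleaved list yields the same key more than once. The sketch below uses a small made-up list (`rows` and the helper `first` are hypothetical, not taken from the question) and shows that sorting by the key first gives one group per key again.

```python
from itertools import groupby

# Hypothetical, deliberately interleaved sample (not the question's data).
rows = [("A", "1"), ("B", "2"), ("A", "3")]

def first(t):
    """Return the grouping key: the first element of the tuple."""
    return t[0]

# Without sorting, "A" appears twice because its rows are not adjacent.
unsorted_totals = [(k, sum(int(t[-1]) for t in g))
                   for k, g in groupby(rows, key=first)]
print(unsorted_totals)

# Sorting by the same key makes equal keys adjacent before grouping.
sorted_totals = [(k, sum(int(t[-1]) for t in g))
                 for k, g in groupby(sorted(rows, key=first), key=first)]
print(sorted_totals)
```

Sorting adds an O(n log n) step and changes the order of the output, whereas the dict approach needs only a single pass and keeps first-seen order, so it is probably the simpler choice for unordered input.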
TheFaultInOurStars On

Why not use a simple for loop with an empty dictionary?

resultDict = {}
for value in pdf_results:
    if value[0] not in resultDict:
        resultDict[value[0]] = 0
    resultDict[value[0]] += float(value[-1])
print(resultDict)

Output

{"Kohl's - Dallas": 70.0,
'Bronx-Lebanon Hospital Center': 26.0}

If a dictionary is not what you want and you need a list of tuples instead, you can use:

list(resultDict.items())

Output

[("Kohl's - Dallas", 70.0), ('Bronx-Lebanon Hospital Center', 26.0)]
Kelly Bundy On

You're almost there. Just change your

for pdf_results in group:
    print(sum(pdf_results[8]))

to:

print(sum(int(pdf_results[8])
          for pdf_results in group))

(Though I'd also rename to pdf_result, singular.)
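
Put together, the corrected loop might look like the sketch below. It uses a shortened, hypothetical sample in place of the full `pdf_results` list and collects the totals into a list of tuples shaped like `desired_output`; the names `sample` and `totals` are illustrative only.

```python
from itertools import groupby

# Shortened, hypothetical sample; rows for each site are adjacent.
sample = [
    ("Kohl's - Dallas", '-', "Kohl's Cafe", '03/18/22', 'RC', '8', '0', '16', '8'),
    ("Kohl's - Dallas", '-', "Kohl's Cafe", '03/18/22', 'SMI', '5', '0', '10', '5'),
    ('Bronx-Lebanon Hospital Center', '-', 'Patient Trayline ', '03/18/22', 'RC', '8', '0', '16', '8'),
]

def getSiteName(siteChunk):
    return siteChunk[0]

totals = []
for key, group in groupby(sample, getSiteName):
    # Convert each row's last element to int before summing the group.
    totals.append((key, sum(int(pdf_result[8]) for pdf_result in group)))

print(totals)
```

Using a separate loop variable (`pdf_result`) also avoids rebinding the name of the list being grouped.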

Prashanth On

This would also work:

from collections import defaultdict

output = defaultdict(int)

for item in pdf_results:
    output[item[0]] += int(item[-1])

print(list(output.items()))

Output

[("Kohl's - Dallas", 70), ('Bronx-Lebanon Hospital Center', 26)]