There is a column in my csv called cost, which I want to sum based on another column, called factory, to basically create a breakdown of cost by factory. I have rows such as the following, where there are multiple costs for each factory:
Factory,Cost,Cost_Type
Bali,23,0
Sydney,21,1
Sydney,4,2
Denver,8,1
Bali,9,1
I'd like to be able to quickly sum the cost per factory, and save these values to a variable. I think one way to do this is by looping through a list of factories, which then loops through the csv. Here is where I've got to:
factories= ['Bali', 'Sydney', 'Denver']
totalcost = 0
balicost = 0
sydneycost = 0
denvercost = 0
for factory in factories:
for row in csv.reader(costcsv):
if row[0] == factory:
Where I'm stuck is that I don't know how to change the variable which is being added to for the different factories, balicost, sydneycost and denvercost. The simplified version, where I'm just getting the total of the cost column was as follows:
for row in csv.reader(costcsv):
totalcost += float(row[1])
I'm more than open to different approaches than this (I believe dictionaries could come into it), and appreciate any points in the right direction.
[Community wiki, because it's a little tangential.]
When you're processing tabular data in Python, you should consider the
pandas
library. The operation you want to perform is a groupby sum, and that's easily done in two lines:which produces a
Series
object you can index into like a dictionary:Update, using the updated data-- if you also want to handle
Cost_Type
, you have several options. One is to select only the rows with Cost_Type == 1:or you can expand the groupby and group on both
Factory
andCost_Type
simultaneously: