Categorizing data, compute average and standard deviation of each category

3.6k views Asked by At

I'm writing a code to categorize the datas, and get the average and standard deviation. Here are the example of my data.

3917 1 -0.662261 25.148 22.9354 68.8076 
3918 1 12.7649 18.7451 7.68473 69.0063 
3919 1 -9.56836 -23.3265 -61.953 68.8357 
3920 1 11.6292 31.6525 -29.3697 69.1372 
3921 2 26.4837 -66.7897 12.0257 69.2282 
3922 1 -9.81652 14.3788 9.38343 69.1217 
3923 2 39.931 -88.1879 109.498 69.1604 
3924 1 4.5502 3.53887 -6.59604 69.486 
3925 2 13.6801 -24.6628 -5.7568 69.9398 
3926 1 -10.5635 7.05517 -8.82785 70.2263

As you can see, there are 6 columns. I'm thinking of 3 step calculation here.

  1. Categorize these numbers based on 6th column. 6th column is consist of float numbers from 0 ~ n. I hope to generate n sections (or sub matrices, or whatever), like 0~1, 1~2, 2~3 .... n-1 ~ n. The last number should be round up number of last data, because I hope to make sections. For example, if the last number is the 121.2513, the last section should be 120~121 to contain that data.

  2. Reallocate the all other numbers of column 1~5, to the their corresponding subsections based on 6th column. If there are no number in specific sections, just print it as 0. There will be n number of subsections. The number of elements in each subsections will be random.

  3. Get the average and standard deviation of 3th, 4th, and 5th column for each sub sections, and write to the output file with 'number of elements in subsection, beginning number of subsection, and avg and standard deviation of 3th, 4th, and 5th column'

I was trying this with multiple for loops, but it became too complex, and makes error. Is there any other easy way to categorize the data, play with each of the sub section, and print them out in Python? Also, my for loops are not working at all. Any simple example suggestion using this data?

3

There are 3 answers

2
user1470788 On

This task lends itself to the pandas library. (http://pandas.pydata.org/) From what I understood from your post, you wanted to compute the columnwise means and standard deviations. To compute the rowwise statistics, add the parameter, axis=1 to the mean and std functions. In the code below, the example has been saved to "tmp.txt'. The first step loads it; then it is simple to calculate statistics over the dataframe.

import pandas as pd
df =pd.read_csv('tmp.txt',sep=' ',header=None)
means = df.mean()
stds = df.std()

For more information about pandas, take a look at the quick introduction: http://pandas.pydata.org/pandas-docs/stable/10min.html

0
abarnert On

I'll assume you have your matrix in a list of lists, and show you how to get started. (However, as I mentioned in a comment, if you had your matrix in a numpy array or matrix, it would all be a lot easier—and faster, too.)


If you just have that value as a big multi-line string, you can convert it into a list of list of floats like this:

m = [[float(col) for col in line.split()] for line in s.splitlines()]

Now, by "categorize these numbers based on 6th column", it sounds you what you want is to group them by the integer value of that column.

Python comes with a groupby function that does most of what you want here, but you have to sort the data first.

In Python, sorting and grouping and related functions always let you pass a key function. You don't have to sort and group the 6th column, you can sort and group the whole row, using the 6th column as the key.

But actually, you don't want to use the value of the 6th column as a key, you want to use the integer value of the 6th column. For the former, you'd use the itemgetter function that comes in the stdlib, but to do something more complicated, you're better off writing a function for it:

def keyfunc(row):
    return int(row[5])
groups = groupby(sorted(data, key=keyfunc), key=keyfunc)

(If you're doing this repeatedly, you probably want to write a wrapper function that sorts then groups with the same key, so you don't have to repeat yourself, and get it wrong.)

What this gives you is an iterator with iterators inside, which is a little hard to print out:

[(k, list(g)) for k, g in groups]

… but what you get is:

[(68,
  [[3917.0, 1.0, -0.662261, 25.148, 22.9354, 68.8076],
   [3919.0, 1.0, -9.56836, -23.3265, -61.953, 68.8357]]),
 (69,
  [[3918.0, 1.0, 12.7649, 18.7451, 7.68473, 69.0063],
   [3920.0, 1.0, 11.6292, 31.6525, -29.3697, 69.1372],
   [3921.0, 2.0, 26.4837, -66.7897, 12.0257, 69.2282],
   [3922.0, 1.0, -9.81652, 14.3788, 9.38343, 69.1217],
   [3923.0, 2.0, 39.931, -88.1879, 109.498, 69.1604],
   [3924.0, 1.0, 4.5502, 3.53887, -6.59604, 69.486],
   [3925.0, 2.0, 13.6801, -24.6628, -5.7568, 69.9398]]),
 (70, [[3926.0, 1.0, -10.5635, 7.05517, -8.82785, 70.2263]])]

So, each k is the integer the category is grouped on, and each g is all of the rows in that category (in sorted order).

(Note that, because groups is an iterator, if you print this, groups will now be empty.)

And that takes care of your first problem, and I think most of your second (I'm not sure exactly what you wanted there).

For the third, first you need to iterate over the groups:

for k, g in groups:

If you're going to loop over the group multiple times, you want to make a list of it immediately, then loop over that list.

For each group, you want to do some statistical stuff to multiple columns. The simplest way to do that is by using a module that takes care of all the math for you. The stats module on PyPI is a good bet, especially now that (with slight changes) it's probably going to end up in the standard library for an upcoming Python version.

But I'll just show the mean, because that's trivial.

So:

def mean(sequence):
    return sum(sequence) / len(sequence)

for k, g in groups:
    rows = list(g)
    print(k)
    for column_index in 2, 3, 4:
        column = [row[column_index] for row in rows]
        print(mean(column))

There's probably a lot that this code doesn't do that you need, but hopefully it's enough to get you started, and to ask more specific questions when you get stuck.

2
l4mpi On

As others have said you might want to look into numpy or pandas, especially if your dataset is huge. But the task is easy to achieve in pure python as well with the right data structures. In this case I'd simply use a defaultdict for the sections, with the key being the 6th column rounded down, mapping to a tuple of lists (or a more sophisticated data structure) in which you can then collect the column values:

from collections import defaultdict
sections = defaultdict(lambda: ([], [], [], [], []))

with open("mydata.txt") as f:
    #parse all non-empty lines into a list of lists of floats
    lines = [x.strip() for x in f.read().split("\n")]
    data = [map(float, x.split(" ")) for x in lines if x]

for row in data:
    #get the corresponding section (and create it if it doesn't exist)
    section = sections[int(row[5])]
    for x in range(5):
        #append the column values to the respective lists
        section[x].append(row[x])

Now sections contains the data in an easily usable format; simply iterate over it and compute the average and standard deviation of the rows you're interested in:

for s, columns in sections.items():
    #assuming "mean" is a function that calculates the average of a list
    print "mean of section [%i,%i) column 3 is %f" % (s, s+1, mean(columns[2]))

Extending this to multiple/different columns and functions should be trivial.