Sorting an entire CSV by frequency of occurrence in one column


I have a large CSV file, which is a log of caller data.

A short snippet of my file:

CompanyName    HighPriority     QualityIssue
Customer1         Yes             User
Customer1         Yes             User
Customer2         No              User
Customer3         No              Equipment
Customer1         No              Neither
Customer3         No              User
Customer3         Yes             User
Customer3         Yes             Equipment
Customer4         No              User

I want to sort the entire list by the frequency of occurrence of customers, so that it looks like this:

CompanyName    HighPriority     QualityIssue
Customer3         No               Equipment
Customer3         No               User
Customer3         Yes              User
Customer3         Yes              Equipment
Customer1         Yes              User
Customer1         Yes              User
Customer1         No               Neither
Customer2         No               User
Customer4         No               User

I've tried groupby (something like the following), but that only prints out the CompanyName and the frequency, not the other columns:
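
# this collapses to one row per company, so the other columns are gone
df.groupby('CompanyName').size()

which on the data above gives:

CompanyName
Customer1    3
Customer2    1
Customer3    4
Customer4    1
dtype: int64

I also tried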

df['Totals']= [sum(df['CompanyName'] == df['CompanyName'][i]) for i in xrange(len(df))]

and

df = [sum(df['CompanyName'] == df['CompanyName'][i]) for i in xrange(len(df))]

But these give me errors:

ValueError: The wrong number of items passed 1, indices imply 24

I've looked at something like this:

for key, value in sorted(mydict.iteritems(), key=lambda (k,v): (v,k)):
    print "%s: %s" % (key, value)

but this only prints out two columns. I want my entire CSV, with all columns kept, sorted by how often each CompanyName occurs.

Thanks for the help in advance!


There are 4 answers

EdChum (Best Answer)

This seems to do what you want: basically, add a count column by performing a groupby and transform with value_counts, and then you can sort on that column:

# broadcast each company's frequency back onto every row of its group
df['count'] = df.groupby('CompanyName')['CompanyName'].transform(pd.Series.value_counts)
# note that sort_values returns a new, sorted frame
df.sort_values('count', ascending=False)

Output:

  CompanyName HighPriority QualityIssue count
5   Customer3           No         User     4
3   Customer3           No    Equipment     4
7   Customer3          Yes    Equipment     4
6   Customer3          Yes         User     4
0   Customer1          Yes         User     3
4   Customer1           No      Neither     3
1   Customer1          Yes         User     3
8   Customer4           No         User     1
2   Customer2           No         User     1

You can drop the extraneous column using df.drop (which likewise returns a new frame rather than modifying df in place):

df.drop('count', axis=1)

Output:

  CompanyName HighPriority QualityIssue
5   Customer3           No         User
3   Customer3           No    Equipment
7   Customer3          Yes    Equipment
6   Customer3          Yes         User
0   Customer1          Yes         User
4   Customer1           No      Neither
1   Customer1          Yes         User
8   Customer4           No         User
2   Customer2           No         User
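
If you would rather not mutate df at all, the same result can be written as one chained expression (a sketch using assign; see also the 2021 update below about value_counts on recent pandas):

result = (df.assign(count=df.groupby('CompanyName')['CompanyName']
                          .transform(pd.Series.value_counts))
            .sort_values('count', ascending=False)
            .drop('count', axis=1))
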
cel

I think there must be a better way to do it, but this should work:

Preparing the data:

import io
import pandas as pd
data = """
CompanyName  HighPriority     QualityIssue
Customer1         Yes             User
Customer1         Yes             User
Customer2         No              User
Customer3         No              Equipment
Customer1         No              Neither
Customer3         No              User
Customer3         Yes             User
Customer3         Yes             Equipment
Customer4         No              User
"""
# read_table is deprecated in recent pandas; pd.read_csv with the same sep argument is equivalent
df = pd.read_table(io.StringIO(data), sep=r"\s+")

And doing the transformation:

# create a (sorted) data frame that lists the customers with their number of occurrences
count_df = pd.DataFrame(df.CompanyName.value_counts())

# join the count data frame back with the original data frame
new_index = count_df.merge(df[["CompanyName"]], left_index=True, right_on="CompanyName")

# output the original data frame in the order of the new index.
df.reindex(new_index.index)

The output:

  CompanyName HighPriority QualityIssue
3   Customer3           No    Equipment
5   Customer3           No         User
6   Customer3          Yes         User
7   Customer3          Yes    Equipment
0   Customer1          Yes         User
1   Customer1          Yes         User
4   Customer1           No      Neither
8   Customer4           No         User
2   Customer2           No         User

It's probably not intuitive what happens here, but at the moment I cannot think of a better way to do it. I tried to comment as much as possible.

The tricky part here is that the index of count_df consists of the unique customer names. Therefore, I join the index of count_df (left_index=True) with the CompanyName column of df (right_on="CompanyName").

The magic here is that count_df is already sorted by the number of occurrences, which is why no explicit sorting is needed. All we have to do is reorder the rows of the original data frame to match the rows of the joined data frame, and we get the expected result.
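
To see why this works, it helps to print the intermediate count_df (output sketched for an older pandas; since pandas 2.0 the value_counts column is named count instead, and the relative order of the tied 1-count customers may vary):

print(count_df)

           CompanyName
Customer3            4
Customer1            3
Customer4            1
Customer2            1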

AudioBubble

Update 2021

The answers proposed by EdChum and Ilya K. no longer work.


The function pd.Series.value_counts returns a Series with the counts of unique values. But each Series that pd.Series.value_counts is applied to here contains only one unique value, because the earlier groupby already split the CompanyName column into groups of identical values. Thus, for each group the function returns a one-element Series, something like this:

Customer3        4
dtype: int64

That cannot work: transform cannot broadcast a whole Series back into each value of the group. What we actually need is just the integer 4, not a Series.
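
You can reproduce that one-element Series directly, since each group is just a Series holding one repeated value (a quick check; the exact footer line varies across pandas versions):

print(pd.Series(['Customer3'] * 4).value_counts())

Customer3    4
dtype: int64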


However, we can still take advantage of the earlier groupby by counting the number of values in each group, transforming each whole group into its size, and putting the results together into a final Frequency column.

We can replace pd.Series.value_counts with pd.Series.count, or simply pass the function name as the string 'count':

import pandas as pd

df = pd.DataFrame({
    'CompanyName': ['Customer1', 'Customer1', 'Customer2', 'Customer3', 'Customer1',
                    'Customer3', 'Customer3', 'Customer3', 'Customer4'],
    'HighPriority': ['Yes', 'Yes', 'No', 'No', 'No', 'No', 'Yes', 'Yes', 'No'],
    'QualityIssue': ['User', 'User', 'User', 'Equipment', 'Neither',
                     'User', 'User', 'Equipment', 'User'],
})

# 'count' is the built-in groupby aggregation, equivalent to pd.Series.count
df['Frequency'] = df.groupby('CompanyName')['CompanyName'].transform('count')
# sort by frequency first (descending), then by name to break ties deterministically
df.sort_values(['Frequency', 'CompanyName'], inplace=True, ascending=[False, True])

Output:

>>> df

  CompanyName HighPriority QualityIssue  Frequency
3   Customer3           No    Equipment          4
5   Customer3           No         User          4
6   Customer3          Yes         User          4
7   Customer3          Yes    Equipment          4
0   Customer1          Yes         User          3
1   Customer1          Yes         User          3
4   Customer1           No      Neither          3
2   Customer2           No         User          1
8   Customer4           No         User          1
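
As in the accepted answer, the helper column can be removed afterwards if you do not want it in the result (a sketch; drop(columns=...) requires pandas 0.21+):

df = df.drop(columns='Frequency')
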
Ilya K.

A minor note on the top-voted answer: sort was deprecated in favour of sort_values and sort_index.

sort_values will work like this:

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 1], 'b': [1, 2, 3]})
df['count'] = df.groupby('a')['a'].transform(pd.Series.value_counts)
df.sort_values('count', inplace=True, ascending=False)
print('df sorted: \n{}'.format(df))

Output:

df sorted: 
   a  b  count
0  1  1      2
2  1  3      2
1  2  2      1