awk - count number of occurrences of a field in lines containing another specific field


I have the following data structure:

apples    yellow
apples    yellow
apples    green
apples    green
apples    green

grapes    yellow
grapes    yellow
grapes    yellow
grapes    green

lemons    yellow
lemons    green
lemons    green

Important: I don't know beforehand that my list contains apples, grapes and lemons. To count the number of times $1 is yellow and then display $1 with its yellow count next to it, I can use GNU AWK:

awk '$2=="yellow" {yellowfruit[$1]++} END {for (fruit in yellowfruit) print fruit,yellowfruit[fruit]}'

...and get the expected result:

grapes 3
lemons 1
apples 2

How can I add another column which counts green occurrences for each fruit type? I can't do for (fruit in yellowfruit,greenfruit) or, as in bash, for (fruit in yellowfruit greenfruit).


There are 2 answers

one-liner On BEST ANSWER

I found my answer some time ago but never got around to posting it here. It requires only one for loop and the conditional statements are clearer.

awk '{
all[$1]++
if ($2=="yellow") yellowfruit[$1]++
else if ($2=="green") greenfruit[$1]++   # == compares; a single = would assign to $2
} END {for (fruit in all) print fruit,yellowfruit[fruit],greenfruit[fruit]}'

Result:

grapes 3 1
lemons 1 2
apples 2 3
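
One thing the answer above does not cover: a fruit that never appears in one of the colors leaves that counter unset, so its column prints as an empty string. Below is a minimal sketch, assuming the same two-column input, that adds `+0` to force a numeric 0 and pipes through `sort` because `for (fruit in all)` iterates in an unspecified order. The function name `count_colors` and the extra `plums` sample line are made up for illustration and are not part of the original question.

```shell
# Hypothetical helper: reads "fruit color" lines on stdin.
count_colors() {
  awk '
    NF == 2 {
      all[$1]++                          # remember every fruit seen
      if ($2 == "yellow")     yellow[$1]++
      else if ($2 == "green") green[$1]++
    }
    END {
      # +0 turns an unset counter into 0 instead of an empty string
      for (fruit in all) print fruit, yellow[fruit] + 0, green[fruit] + 0
    }' | sort
}

# Illustrative input; plums has no yellow entry.
result=$(printf '%s\n' 'apples yellow' 'apples green' 'apples green' 'plums green' | count_colors)
echo "$result"
```

Only POSIX awk features are used here, so it should not depend on GNU awk specifically.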
Zilla On

You can be more generic and handle any number of unknown color/fruit pairs like this (it uses arrays of arrays, which require GNU awk 4.0 or later):

awk '{if(NF==2){fruit[$2][$1]++}} END{for(color in fruit){for(type in fruit[color]){print color " " type " " fruit[color][type]}}}'

This will give the following output:

yellow lemons 1
yellow apples 2
yellow grapes 3
green lemons 2
green apples 3
green grapes 1
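
If arrays of arrays are not available (they are a GNU awk extension), the same per-pair count can be sketched with the classic `array[a, b]` form, which joins the two subscripts with `SUBSEP` into a single key. This is a rough equivalent under that assumption, not the answer's own code; the function name `count_pairs` and the sample input are hypothetical.

```shell
# Hypothetical helper: reads "fruit color" lines on stdin.
count_pairs() {
  awk '
    NF == 2 { count[$2, $1]++ }          # key is color SUBSEP fruit
    END {
      for (key in count) {
        split(key, part, SUBSEP)         # part[1] = color, part[2] = fruit
        print part[1], part[2], count[key]
      }
    }' | sort                            # for-in order is unspecified, so sort
}

pairs=$(printf '%s\n' 'apples yellow' 'apples green' 'apples green' 'lemons yellow' | count_pairs)
echo "$pairs"
```

The trade-off is that you cannot loop over "all fruits of one color" directly; you have to split each key and filter.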

If you want something more in the style of a matrix, you can add one extra array to track the available colors and use printf instead of print:

awk '{ if(NF==2){fruit[$1][$2]++; colors[$2]=$2}} END{printf("type");for(color in colors){printf("\t%s",colors[color])};printf("\n"); for(type in fruit){printf("%s",type);for(color in colors){ printf("\t%d",fruit[type][color]) }printf("\n")}}'

Which gives:

type    yellow  green
lemons  1       2
apples  2       3
grapes  3       1
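
For a matrix whose columns always match the header, one option is to record fruits and colors in the order they are first seen and loop over those lists by index. The following is a sketch under that assumption, written for POSIX awk with `SUBSEP` keys rather than arrays of arrays; `color_matrix`, `seen_type`, `seen_color` and the sample input are illustrative names, not taken from the answer.

```shell
# Hypothetical helper: reads "fruit color" lines on stdin, prints a tab-separated matrix.
color_matrix() {
  awk '
    NF == 2 {
      # keep first-seen order of fruits and colors in indexed lists
      if (!($1 in seen_type))  { seen_type[$1] = 1;  types[++nt] = $1 }
      if (!($2 in seen_color)) { seen_color[$2] = 1; colors[++nc] = $2 }
      count[$1, $2]++
    }
    END {
      printf "type"
      for (c = 1; c <= nc; c++) printf "\t%s", colors[c]
      printf "\n"
      for (t = 1; t <= nt; t++) {
        printf "%s", types[t]
        # %d prints 0 for a fruit/color pair that never occurred
        for (c = 1; c <= nc; c++) printf "\t%d", count[types[t], colors[c]]
        printf "\n"
      }
    }'
}

matrix=$(printf '%s\n' 'apples yellow' 'apples green' 'apples green' 'plums green' | color_matrix)
echo "$matrix"
```

Because every row walks the same `colors` list, a fruit that lacks a color still gets a 0 in that column instead of shifting the remaining values left.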

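It's a bit messy, but you can simplify it if you don't care about the header. Note that this version only loops over the colors each fruit actually has, so a fruit that lacks a color gets one column fewer: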

awk '{if(NF==2){fruit[$1][$2]++;}} END{for(type in fruit){printf("%s",type);for(color in fruit[type]){printf("\t%d",fruit[type][color]) }printf("\n")}}'

This gives:

lemons  1       2
apples  2       3
grapes  3       1