Sum rows of different columns by match in the second column

86 views Asked by At

I have a file like this:

ID  Category Sample1 Sample2 Sample3
1   A        5       5       5
2   A        5       5       5
3   A        5       5       5
4   B        1       2       3
5   B        1       2       3

And I'm looking for an awk, sed or similar solution to achieve this:

ID  Category Sample1 Sample2 Sample3
1   A        15      15      15
4   B        2       4       6

The idea is to sum rows of matching categories, considering the values of each sample column and remove the repeated IDs.

1

There are 1 answers

0
7stud On

data1.txt:

ID  Category Sample1 Sample2 Sample3
1   A        5       5       5
2   A        5       5       5
3   A        5       5       5
4   B        1       2       3
5   B        1       2       3

Code:

use strict;
use warnings; 
use 5.020;
use autodie;
use Data::Dumper;
use List::MoreUtils qw{ uniq };

open my $INFILE, '<', 'data1.txt';

my %results;
my @cat_order;

#Get names of column headers:
my ($h1, $h2, @sample_names) =  split ' ', <$INFILE>;

while (my $line = <$INFILE>) {
    my($id, $cat, @samples)= split ' ', $line;
    push @cat_order, $cat;

    push @{ $results{$cat}{$h1} }, $id;  #e.g. push results{A}{ID}, 1 

    while ( my($i, $sample) = each @samples ) {
        $results{$cat}{ $sample_names[$i] } += $sample;  #e.g. results{A}{Sample1} += 5   
    }
}

close $INFILE;
open my $OUTFILE, '>', 'results.txt';

my $format = "%-3s %-9s %-9s %-9s %-9s\n";
printf {$OUTFILE} $format, $h1, $h2, @sample_names;

for my $cat (uniq(@cat_order)) {
    printf( $OUTFILE 
        $format, 
        $results{$cat}{$h1}[0],  #e.g. results{A}{ID}[0], which is the first id in the ID array
        $cat,  #e.g. A
        @{ $results{$cat} }{@sample_names}  #e.g. results{A}{'Sample1', 'Sample2', 'Sample3'}  -- a hash slice,
                                            #which returns an array of the values matching those keys.
    )
}

close $OUTFILE;

Output:

$ rm results.txt
remove results.txt? y

$ perl 1.pl 

$ cat results.txt 
ID  Category  Sample1   Sample2   Sample3  
1   A         15        15        15       
4   B         2         4         6  

And with data1.txt:

ID  Category Sample1 Sample2 Sample3
1   A        5       5       5
2   B        1       2       3
3   A        5       5       5
4   C        10      11      12
5   B        1       2       3
6   A        5       5       5
7   C        1       1       1

Output:

$ rm results.txt
remove results.txt? y

$ perl 1.pl 
$ cat results.txt 
ID  Category  Sample1   Sample2   Sample3  
1   A         15        15        15       
2   B         2         4         6        
4   C         11        12        13  

%results:

$VAR1 = {
          'C' => {
                   'ID' => [
                             '4',
                             '7'
                           ],
                   'Sample1' => 11,
                   'Sample3' => 13,
                   'Sample2' => 12
                 },
          'B' => {
                   'ID' => [
                             '2',
                             '5'
                           ],
                   'Sample1' => 2,
                   'Sample3' => 6,
                   'Sample2' => 4
                 },
          'A' => {
                   'Sample1' => 15,
                   'ID' => [
                             '1',
                             '3',
                             '6'
                           ],
                   'Sample2' => 15,
                   'Sample3' => 15
                 }
        };