Miller: selecting columns based on their last value

174 views Asked by At

I'm processing a TSV file with a hundred columns and I would like to pick some of them based on their last values (ie. the value on the last row).

For instance with the following data (and I let you imagine 96 more columns!):

Year    An  Ti  Gi  Na
2000    5   10  2   3
2010    3   2   5   7
2020    6   3   5   6

I would like to select all columns whose value on the last row is at least 5 (and, better, I would like to select all columns whose value on the last row is at least the value of the row Gi).

So in the end, I would like the following output (the column Ti is discarded because its value on the last row is 3 which is below our threshold).

Year    An  Gi  Na
2000    5   2   3
2010    3   5   7
2020    6   5   6

Do you have any ideas?

Thanks!

1

There are 1 answers

4
aborruso On BEST ANSWER

You could write this bash script

#!/bin/bash

gi_value=$(mlr --t2n tail -n 1 then cut -f Gi input.tsv)

gi_columns=$(mlr --t2n tail -n 1 then label 1 then reshape -r '^[a-zA-Z]' -o k,v then filter '$v >= '"$gi_value"'' then cut -f k then nest --ivar "," -f k input.tsv)

mlr --tsv cut -f Year,"$gi_columns" input.tsv

First extract the Gi value

mlr --t2n tail -n 1 then cut -f Gi input.tsv

to populate the gi_value var with 5 value.

Then extract the list of columns of the last row in wich the value is >= 5

mlr --t2n tail -n 1 \
then label 1 \
then reshape -r '^[a-zA-Z]' -o k,v \
then filter '$v >= '"$gi_value"'' \
then cut -f k \
then nest --ivar "," -f k input.tsv

This gives you in output An,Gi,Na.

The last step is to cut An,Gi,Na columns:

mlr --tsv cut -f Year,"$gi_columns" input.tsv

The output is

Year    An      Gi      Na
2000    5       2       3
2010    3       5       7
2020    6       5       6