Sorting unique elements column-wise in a text-file


I have a tab-delimited file like this:

chr1    4359314 4361314 Rp1 -
chr1    4492735 4494735 Sox17   -
chr1    4495330 4498354 Sox17,Sox17,Sox17,Sox17,Sox17,Sox17 -,-,-,-,-,-
chr1    4784698 4786739 Mrpl15,Mrpl15,Mrpl15,Mrpl15 -,-,-,-
chr1    4806788 4809237 Lypla1,Lypla1,Lypla1,RP24-426M1.3,Lypla1,Lypla1,Lypla1,Lypla1   +,+,+,+,+,+,+,+
chr1    4856814 4859038 Tcea1,Tcea1 +,+
chr1    5017735 5020539 Rgs20,Rgs20,Rgs20   -,-,-
chr1    5069018 5071285 Atp6v1h,Rgs20,Rgs20 +,-,-
chr1    5082080 5084154 Atp6v1h,Atp6v1h,Atp6v1h,Atp6v1h +,+,+,+
chr1    5587493 5589941 Oprk1,Oprk1,Oprk1   +,+,+

I want to filter by column 5 and keep only the lines where it contains a mix of "+" and "-", in any number and any order (for example +,-,+ or -,-,+ or +,+,+,+,-). Lines where every sign is the same, such as -,- or -,-,-,- or +,+,+, should be dropped.

Output

chr1    5069018 5071285 Atp6v1h,Rgs20,Rgs20 +,-,-

I tried using extended grep like

cut -f5 file | egrep '(+.*-)|(-.*+)' | head

but I can't make it work for multiple matches per line in any order. Can anyone suggest a minimalist way (regex/one-liner) to do it without spelling out each possible order? (sed/awk preferred)

This brings me to another question: can I do the equivalent of sort -u, but column-wise, i.e. on the comma-separated values within each line?

cut -f5 file | tr ',' '\t'| sort -uk1???

Input

-
-
-,-,-,-,-,-
-,-,-,-
+,+,+,+,+,+,+,+
+,+
-,-,-
+,-,-
+,+,+,+
+,+,+

Output:

-
-
-
-
+
+
-
+-
+
+

1 answer

hek2mgl (best answer)

I would use the following awk command:

awk '$5 !~ /^(\+,)*\+$/ && $5 !~ /^(-,)*-$/' file

It checks whether $5 (column 5) consists entirely of plus signs (+,...,+) or entirely of minus signs (-,...,-). If it matches neither pattern, the line gets printed.

Output:

chr1    5069018 5071285 Atp6v1h,Rgs20,Rgs20 +,-,-
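
The second part of the question (unique values per line) is not covered by the command above. The following is only a sketch of one way to approach it, not part of the original answer: it assumes the file is tab-delimited as described, uses a temporary sample file with three rows taken from the question, and prints each distinct sign once per line in order of first appearance (no sorting). The variable names (tmp, mixed, uniq_signs) are made up for illustration. It also shows an equivalent positive test for the first part: keep the line if column 5 contains both a "+" and a "-".

```shell
# Sample data: three rows from the question, written to a temporary file.
tmp=$(mktemp)
printf 'chr1\t4492735\t4494735\tSox17\t-\n'                 >  "$tmp"
printf 'chr1\t4856814\t4859038\tTcea1,Tcea1\t+,+\n'         >> "$tmp"
printf 'chr1\t5069018\t5071285\tAtp6v1h,Rgs20,Rgs20\t+,-,-\n' >> "$tmp"

# Part 1: keep lines whose column 5 contains both a "+" and a "-".
# [+] is a bracket expression, so the plus sign needs no escaping.
mixed=$(awk -F'\t' '$5 ~ /[+]/ && $5 ~ /-/' "$tmp")

# Part 2: for each line, print every distinct value of column 5 once,
# in order of first appearance, with no separator (as in the question).
uniq_signs=$(awk -F'\t' '{
    n = split($5, a, ",")      # split column 5 on commas
    split("", seen)            # portable way to clear the array
    out = ""
    for (i = 1; i <= n; i++)
        if (!(a[i] in seen)) { seen[a[i]] = 1; out = out a[i] }
    print out
}' "$tmp")

printf '%s\n' "$mixed"
printf '%s\n' "$uniq_signs"

rm -f "$tmp"
```

On a real file, the awk programs can be run directly against it instead of the temporary sample. If sorted output per line is needed rather than first-appearance order, that would need an extra step and is not handled here.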