I have a tab-delimited file like this:
chr1 4359314 4361314 Rp1 -
chr1 4492735 4494735 Sox17 -
chr1 4495330 4498354 Sox17,Sox17,Sox17,Sox17,Sox17,Sox17 -,-,-,-,-,-
chr1 4784698 4786739 Mrpl15,Mrpl15,Mrpl15,Mrpl15 -,-,-,-
chr1 4806788 4809237 Lypla1,Lypla1,Lypla1,RP24-426M1.3,Lypla1,Lypla1,Lypla1,Lypla1 +,+,+,+,+,+,+,+
chr1 4856814 4859038 Tcea1,Tcea1 +,+
chr1 5017735 5020539 Rgs20,Rgs20,Rgs20 -,-,-
chr1 5069018 5071285 Atp6v1h,Rgs20,Rgs20 +,-,-
chr1 5082080 5084154 Atp6v1h,Atp6v1h,Atp6v1h,Atp6v1h +,+,+,+
chr1 5587493 5589941 Oprk1,Oprk1,Oprk1 +,+,+
I want to filter on column 5 and keep only the lines where it contains a mix of "+" and "-", in any number and any order, e.g. +,-,+ or -,-,+ or +,+,+,+,-. Lines where every value is the same, such as -,- or -,-,-,- or +,+,+, should be dropped.
Output
chr1 5069018 5071285 Atp6v1h,Rgs20,Rgs20 +,-,-
I tried using extended grep like
cut -f5 file | egrep '(+.*-)|(-.*+)' | head
but I can't make it work for multiple matches per line in any order. Can anyone suggest a minimalist way (regex/one-liner) to do it without spelling out each possible order? (sed/awk preferred)
This brings me to another question: can I sort -u, but column-wise?
cut -f5 file | tr ',' '\t'| sort -uk1???
Input
-
-
-,-,-,-,-,-
-,-,-,-
+,+,+,+,+,+,+,+
+,+
-,-,-
+,-,-
+,+,+,+
+,+,+
Output:
-
-
-
-
+
+
-
+-
+
+
I would use the following awk command:
It checks whether $5 (column 5) contains a sequence of +,..,+ or a sequence of -,..,-. If not, the line gets printed.
Output:
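A minimal sketch of a command that fits this description, not necessarily the exact one-liner: the regular expression, the temporary sample file and the variable names mixed and uniq_signs are my own assumptions for illustration. The first awk call prints a line only when column 5 is neither all "+" nor all "-". The second one is a possible approach to the column-wise sort -u question: it collapses each comma-separated list to its distinct values, in order of first appearance.

```shell
# Build a small tab-delimited sample (a subset of the rows from the question).
sample=$(mktemp)
printf '%s\t%s\t%s\t%s\t%s\n' \
  chr1 4856814 4859038 'Tcea1,Tcea1' '+,+' \
  chr1 5017735 5020539 'Rgs20,Rgs20,Rgs20' '-,-,-' \
  chr1 5069018 5071285 'Atp6v1h,Rgs20,Rgs20' '+,-,-' \
  chr1 4359314 4361314 'Rp1' '-' > "$sample"

# Keep a line only if column 5 is NOT a pure run of "+" (+,+,...,+)
# and NOT a pure run of "-" (-,-,...,-). A single "+" or "-" also
# counts as a pure run, so those lines are dropped too.
mixed=$(awk -F'\t' '$5 !~ /^(\+(,\+)*|-(,-)*)$/' "$sample")
printf '%s\n' "$mixed"

# Per-line "unique": split column 5 on commas and keep each distinct
# value once, in order of first appearance (so +,-,- becomes +-).
uniq_signs=$(cut -f5 "$sample" | awk -F, '{
  out = ""
  split("", seen)                 # portable way to empty the array
  for (i = 1; i <= NF; i++)
    if (!seen[$i]++) out = out $i
  print out
}')
printf '%s\n' "$uniq_signs"

rm -f "$sample"
```

On the real file, the filter reduces to awk -F'\t' '$5 !~ /^(\+(,\+)*|-(,-)*)$/' file. An equivalent, arguably simpler test is to require both signs to be present: awk -F'\t' '$5 ~ /\+/ && $5 ~ /-/' file.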