I am having trouble matching strings from one table against another and was wondering if someone could lend a hand.

Say I have the following table:

broken
vector
unidentified
synthetic
artificial

And I have a second dataset that looks like this:

org1    Fish
org2    Amphibian
org3    vector
org4    synthetic species
org5    Mammal

Now I want to remove every row of the second table whose second column matches any string from the first table, so that the output looks like this:

org1    Fish
org2    Amphibian
org5    Mammal 

I was thinking of using grep -v in bash but I am not quite sure how to make it loop through all the strings in table 1.

I am trying to work it out in perl but for some reason it returns all my values instead of just the ones that match. Any ideas why?

My script looks like this:

#!/bin/perl -w

($br_str, $dataset) = @ARGV;
open($fh, "<", $br_str) || die "Could not open file $br_str/n $!";

while (<$fh>) {
        $str = $_;
        push @strings, $str;
        next;
    }

open($fh2, "<", $dataset) || die "Could not open file $dataset $!/n";

while (<$fh2>) {
    chomp;
    @tmp = split /\t/, $_;
    $groups = $tmp[1];
    foreach $str(@strings){
        if ($str ne $groups){
            @working_lines = @tmp;
            next;
        }
    }
        print "@working_lines\n";
}

1 Answer

toolic

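chomp your input and use a hash for your first table. In your script the strings read from the first file still end in a newline, so they can never equal a field from the second file, whose lines you did chomp. The inner loop also copies the row whenever any one string differs, which is always the case, so every row gets printed. A hash lookup avoids both problems: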

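use warnings;
use strict;

my ( $br_str, $dataset ) = @ARGV;

# Load the first table into a hash so each lookup is a single key test.
open(my $fh, "<", $br_str ) || die "Could not open file $br_str: $!\n";

my %strings;
while (<$fh>) {
    chomp;    # strip the newline so keys compare equal to the fields below
    $strings{$_}++;
}
close $fh;

open(my $fh2, "<", $dataset ) || die "Could not open file $dataset: $!\n";
while (<$fh2>) {
    chomp;
    my @tmp = split /\s+/, $_;
    my $groups = $tmp[1];    # first word after the ID column
    # Print the row only if its group is not listed in the first table.
    print "$_\n" unless exists $strings{$groups};
}
close $fh2;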

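Note that I used \s+ instead of \t, just to make my copy/paste easier. It does change the result, though: with \s+, $tmp[1] holds only the first word of the second column, so "synthetic species" is looked up as "synthetic" and the row is removed, as in your expected output. With \t the field would be "synthetic species", which is not a key in the hash, and that row would be kept.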
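
As for the grep -v idea from the question: grep does not need a loop, because it can read all of its patterns from a file. The following is only a sketch under some assumptions. The file names patterns.txt and dataset.txt are made up for the example, and the sample data is copied from the question. Unlike the hash lookup, this matches the strings anywhere in the line, including the ID column, and an empty line in the pattern file would match (and so remove) every row.

```shell
#!/bin/sh
# Work in a scratch directory so the sample files do not clobber anything.
workdir=$(mktemp -d)

# Hypothetical file names; contents are the two tables from the question.
printf '%s\n' broken vector unidentified synthetic artificial > "$workdir/patterns.txt"
printf 'org1\tFish\norg2\tAmphibian\norg3\tvector\norg4\tsynthetic species\norg5\tMammal\n' > "$workdir/dataset.txt"

# -v  print only the lines that do NOT match
# -F  treat each pattern as a fixed string, not a regular expression
# -f  read the patterns, one per line, from the given file
result=$(grep -v -F -f "$workdir/patterns.txt" "$workdir/dataset.txt")
printf '%s\n' "$result"

rm -r "$workdir"
```

On real data the core of this is the single command grep -v -F -f patterns.txt dataset.txt. If the match must be restricted to the second column only, the Perl version above is the safer choice.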