Perl remove StopWords from string


I'm using the script below to remove stop words in Perl. I'm running on Windows and could not find a compatible version of either of these modules:

Lingua::EN::StopWordList
Lingua::StopWords qw(getStopWords)

I have an array of stop words, but once I apply the regex below I lose critical whitespace, which causes neighbouring words to run together. Note that every word in the stop-word array is padded with two spaces, one on the left and one on the right.

How can I remove stop words efficiently without losing the whitespace that separates the remaining words?

use strict;
use warnings;
use utf8;
use IO::File;
use String::Util 'trim';

my $inFile = "C:\\Users\\David\\Downloads\\InfoRet\\Explore the ways to get better grades.txt";
my $inFh = IO::File->new($inFile, "r") or die "Cannot open $inFile: $!";
my $lineNum = 0;
my $line = undef;
my $loc = undef;
my $str = undef;

my @stopList = (" the ", " a ", " an ", " of ", " and ", " on ", " in ", " by ", " with ", " at ", " after ", " into ", " their ", " is ",  " that ", " they ", " for ", " to ", " it ", " them ", " which ");

# Skip the first four lines of the file.
for(my $i = 1; $i <= 4; $i++) {
    <$inFh>;
}

while($line = <$inFh>) {
    $lineNum++;
    chomp $line;
    $line =~ s/[\$#@~!&*()\[\];.,:?^`\\\/]+//g;

    for my $planet (@stopList) {
        $loc = index($line, $planet);
        if($loc!=(-1)) {
            #$line =~ s/$str//g;
            $line =~ s/$planet//g;
        }
    }
    print "$line\n";
}

There is 1 answer

mpapec (best answer)
my @stopList = ("the", "a", "an", "of", ..);
my ($rx) = map qr/(?:$_)/, join "|", map qr/\b\Q$_\E\b/, @stopList;

and later,

$line =~ s/$rx//g;
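
For context on why the original version loses whitespace: a pattern such as " the " consumes the space on both sides of the word, so "in the house" collapses to "inhouse". Matching on \b word boundaries removes only the word itself. The substitution then leaves runs of spaces behind, so a cleanup pass is still useful. Below is a minimal, self-contained sketch of that idea, not the answerer's exact code. The helper name remove_stop_words, the shortened stop list, the /i flag and the whitespace-collapsing step are all assumptions added for illustration.

```perl
use strict;
use warnings;

# Shortened stop list for illustration; note there is no space padding.
my @stopList = qw(the a an of and on in by with at);

# Build one alternation, quoting each word, anchored at word boundaries.
# The /i flag (an assumption here) also catches "The", "And", and so on.
my $alternation = join "|", map { quotemeta } @stopList;
my $rx = qr/\b(?:$alternation)\b/i;

# Hypothetical helper: strip stop words, then tidy the leftover whitespace.
sub remove_stop_words {
    my ($line) = @_;
    $line =~ s/$rx//g;         # remove the words, leave surrounding spaces alone
    $line =~ s/\s+/ /g;        # collapse the runs of spaces left behind
    $line =~ s/^\s+|\s+$//g;   # trim leading and trailing whitespace
    return $line;
}

print remove_stop_words("Explore the ways to get a better grade in the class"), "\n";
# prints: Explore ways to get better grade class
```

Compiling the alternation once with qr// means each input line is scanned a single time, instead of once per stop word as in the index loop from the question.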