Phrase search in a text file

295 views Asked by At

Given a phrase like "I am searching for a text" and one text file that contains the list of words.

I have to find the whether each and every combination of the word present in the text file.

For example, I have to search for the occurrence "I", "I am", "I am searching", "I am searching for", "searching for" etc.

I prefer to write this in perl and I needed a optimal solution that runs faster.

Example text file :

I \n
am searching \n
Text \n
searching for \n 
searching for a \n
for searching       ---> my program should not match this 
etc
2

There are 2 answers

2
Uri London On BEST ANSWER

The code below prints all the sub_phrases that you want to match.

$phrase = 'I am searching for a text';
$\ = "\n";

@words = ();
print "Indices:";
while( $phrase =~ /\b\w+\b/g ) {
    push @words, {word => $&, begin => $-[0], end => $+[0]};
}

$num_words = $#words + 1;
print 'there are ', $num_words, ' words';


for( $i=0; $i<$num_words; $i++ ) {
    for( $j=$i; $j<$num_words; $j++ ) {
        ($start,$finish) = ($words[$i]->{begin}, $words[$j]->{end});
        $sub_phrase = substr $phrase, $start, $finish-$start;
        print "$i-$j: $sub_phrase";
    }
}

some explanations:

  1. $\ just to make 'print' easier
  2. $phrase - using your sample
  3. @words is an array of references to records
  4. each record is a hash with the word itself, index to the beginning and index to the end of the word
  5. I've a regular expression, and I'm iterating. I'm looking for a word boundary, 1 or more word character, and a word boundary.
  6. $+ and $- are special variables for the indices of the match of the last RE
  7. $& is a special variable for the match of the last RE
  8. I then have a nested loop: $i, the outer loop variable is the first word. $j is the last word. That covers all the combinations.
  9. I'm calculating $sub_phrase from the beginning of the first word, to the end of the last word.

To complete your exercise, you want to save all the sub_phrase's into an array (instead of 'print' do 'push' into an @permutations). then iterate through your text file, and for each line, try to match against each permutation.

2
Axeman On

You can construct an expression that works for all those cases. Below, I show how to construct one in Perl (although you can just use the product for your purposes).

use List::Util qw<reduce>;

our ( $a, $b );

my $regex       
    = "\n^\n( "
    . join( "\n| "
    , @{( reduce { 
            my $r = ref( $a ) ? $a : [ "$a " ];
            my $s = $r->[0];
            [ "$b (?> [ ] $s)?", @$r ] 
        } 
        reverse split ' ', 'I am searching for a text'
        )}
    )
    . "\n)\n\\s*\n\$"
    ;
say join( "\n# ", split "\n", $regex );

# ^
# ( I (?> [ ] am (?> [ ] searching (?> [ ] for (?> [ ] a (?> [ ] text )?)?)?)?)?
# | am (?> [ ] searching (?> [ ] for (?> [ ] a (?> [ ] text )?)?)?)?
# | searching (?> [ ] for (?> [ ] a (?> [ ] text )?)?)?
# | for (?> [ ] a (?> [ ] text )?)?
# | a (?> [ ] text )?
# | text 
# )
# \s*
# $

map { say foreach m/$regex/xo } <DATA>;
  • I have added the anchors, since you indicated that it should match the whole line.
  • There are spaces in the finished regex, but it uses /x to ignore them. That is why we specify the space with [ ].
  • The grouping notation (?>...) is a variation on the non-capturing (?:...), but fails a lot faster. See http://perldoc.perl.org/perlre.html#(%3f%3epattern)
  • See List::Util::reduce