Phrase search in a text file

Question

292 views Asked by Sishanth At 25 October 2012 at 12:20

Given a phrase such as "I am searching for a text" and a text file that contains a list of words.

I have to find out whether each combination of consecutive words from the phrase is present in the text file.

For example, I have to search for occurrences of "I", "I am", "I am searching", "I am searching for", "searching for", etc.

I would prefer to write this in Perl, and I need an efficient solution that runs fast.

Example text file :

I \n
am searching \n
Text \n
searching for \n 
searching for a \n
for searching       ---> my program should not match this 
etc

Original Q&A

There are 2 answers

Axeman On 25 October 2012 at 15:10

You can construct a single regular expression that works for all those cases. Below, I show how to construct one in Perl (although you can just use the resulting expression for your purposes).

use feature 'say';    # needed for say
use List::Util qw<reduce>;

our ( $a, $b );

my $regex       
    = "\n^\n( "
    . join( "\n| "
    , @{( reduce { 
            my $r = ref( $a ) ? $a : [ "$a " ];
            my $s = $r->[0];
            [ "$b (?> [ ] $s)?", @$r ] 
        } 
        reverse split ' ', 'I am searching for a text'
        )}
    )
    . "\n)\n\\s*\n\$"
    ;
say join( "\n# ", split "\n", $regex );

# ^
# ( I (?> [ ] am (?> [ ] searching (?> [ ] for (?> [ ] a (?> [ ] text )?)?)?)?)?
# | am (?> [ ] searching (?> [ ] for (?> [ ] a (?> [ ] text )?)?)?)?
# | searching (?> [ ] for (?> [ ] a (?> [ ] text )?)?)?
# | for (?> [ ] a (?> [ ] text )?)?
# | a (?> [ ] text )?
# | text 
# )
# \s*
# $

# Print the matched sub-phrase for every line read from the DATA section.
while ( my $line = <DATA> ) {
    say for $line =~ m/$regex/x;
}

I have added the anchors, since you indicated that it should match the whole line.
There are spaces in the finished regex, but the /x flag makes Perl ignore them. That is why a literal space is written as [ ].
The grouping notation (?>...) is a variation on the non-capturing (?:...): it is an atomic group, so the regex engine does not backtrack into it once it has matched, and a non-matching line fails faster. See http://perldoc.perl.org/perlre.html#(%3f%3epattern)
See List::Util::reduce
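
As a rough illustration of the same idea, the sketch below builds an equivalent pattern with a plain loop instead of reduce and tries it on the sample lines from the question. The helper name build_phrase_regex is hypothetical, and the pattern is written without /x, so it is a variation on the answer's regex rather than a copy of it.

```perl
use strict;
use warnings;

# Hypothetical helper: build one anchored pattern that matches any run of
# consecutive words from $phrase, nesting optional atomic groups.
sub build_phrase_regex {
    my ($phrase) = @_;
    my @alternatives;
    my $tail = '';
    for my $word ( reverse split ' ', $phrase ) {
        # Each alternative is the word itself, optionally followed by a
        # space and the alternative that starts at the next word.
        $tail = length($tail)
            ? quotemeta($word) . "(?>[ ]$tail)?"
            : quotemeta($word);
        unshift @alternatives, $tail;
    }
    my $body = join '|', @alternatives;
    return qr/^(?:$body)\s*$/;
}

my $regex = build_phrase_regex('I am searching for a text');

# Sample lines from the question; "for searching" must not match.
my @samples = ( 'I ', 'am searching ', 'Text ', 'searching for ',
                'searching for a', 'for searching' );
for my $line (@samples) {
    printf "%-18s %s\n", "'$line'", ( $line =~ $regex ? 'match' : 'no match' );
}
```

The match is case-sensitive, so the sample line "Text" is not treated as the word "text"; adding the /i flag would be one way to change that, if it is wanted.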

**Uri London** · Accepted Answer · 2012-10-25T14:53:10+00:00

The code below prints all the sub_phrases that you want to match.

use strict;
use warnings;

my $phrase = 'I am searching for a text';
$\ = "\n";    # append a newline to every print

# Record each word together with its start and end offsets in $phrase.
my @words;
print "Indices:";
while ( $phrase =~ /\b\w+\b/g ) {
    push @words, { word => $&, begin => $-[0], end => $+[0] };
}

my $num_words = @words;
print 'there are ', $num_words, ' words';

# $i is the first word of the sub-phrase, $j is the last one.
for my $i ( 0 .. $num_words - 1 ) {
    for my $j ( $i .. $num_words - 1 ) {
        my ( $start, $finish ) = ( $words[$i]{begin}, $words[$j]{end} );
        my $sub_phrase = substr $phrase, $start, $finish - $start;
        print "$i-$j: $sub_phrase";
    }
}

Some explanations:

$\ is set to "\n" just to make 'print' easier: every print ends with a newline.
$phrase uses your sample.
@words is an array of references to records.
Each record is a hash with the word itself, the index of the beginning of the word and the index of its end.
The regular expression is applied repeatedly with /g: it looks for a word boundary, one or more word characters, and a word boundary.
@- and @+ are special arrays holding the start and end offsets of the last successful match; $-[0] and $+[0] are the offsets of the whole match.
$& is a special variable holding the text of the last successful match.
Then there is a nested loop: $i, the outer loop variable, is the first word, and $j is the last word. That covers all the combinations.
$sub_phrase is calculated from the beginning of the first word to the end of the last word.
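
To make the offset bookkeeping concrete, here is a minimal sketch on a shorter phrase. It stores only the start and end offsets of each word (the @spans array is an illustrative name, not part of the answer) and then cuts one sub-phrase out with substr.

```perl
use strict;
use warnings;

my $phrase = 'I am searching';
my @spans;
while ( $phrase =~ /\b\w+\b/g ) {
    # $-[0] is the offset where the whole match starts, $+[0] where it ends.
    push @spans, [ $-[0], $+[0] ];
}

# Slice from the start of word 1 ("am") to the end of word 2 ("searching").
my ( $start, $finish ) = ( $spans[1][0], $spans[2][1] );
my $sub_phrase = substr $phrase, $start, $finish - $start;
print "$sub_phrase\n";    # am searching
```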

To complete your exercise, save all the sub-phrases into an array (instead of 'print', do a 'push' into @permutations), then iterate through your text file and, for each line, try to match it against each saved sub-phrase.
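
One possible way to finish the exercise is sketched below. It is an assumption-laden variation rather than the answer's exact method: the sub-phrases are built with split and join instead of offsets, and they are stored in a hash so that each line of the file needs a single lookup instead of a comparison against every sub-phrase. The helper names sub_phrases and matching_lines are hypothetical, and lines are assumed to separate words with single spaces.

```perl
use strict;
use warnings;

# Hypothetical helper: return every run of consecutive words in $phrase.
sub sub_phrases {
    my ($phrase) = @_;
    my @words = split ' ', $phrase;
    my @result;
    for my $i ( 0 .. $#words ) {
        for my $j ( $i .. $#words ) {
            push @result, join ' ', @words[ $i .. $j ];
        }
    }
    return @result;
}

# Hypothetical helper: keep the lines that are exactly one of the
# sub-phrases, ignoring leading and trailing whitespace.
sub matching_lines {
    my ( $phrase, @lines ) = @_;
    my %wanted = map { $_ => 1 } sub_phrases($phrase);
    my @hits;
    for my $line (@lines) {
        ( my $trimmed = $line ) =~ s/^\s+|\s+$//g;
        push @hits, $trimmed if $wanted{$trimmed};
    }
    return @hits;
}

# In a real script the lines would be read from the text file instead.
my @lines = ( "I \n", "am searching \n", "Text \n", "searching for \n",
              "for searching\n" );
print "$_\n" for matching_lines( 'I am searching for a text', @lines );
```

The hash lookup is exact and case-sensitive, so "Text" and "for searching" are not reported for the sample lines.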
