How do I loop over several files, keeping the base name for further processing?

I have multiple text files that need to be tokenised, POS-tagged and NER-tagged. I am using the C&C taggers and have run through their tutorial, but I am wondering whether there is a way to tag multiple files in one go rather than one by one.

At the moment I tokenise each file as follows:

bin/tokkie --input working/tutorial/example.txt --quotes delete --output working/tutorial/example.tok

then run part-of-speech tagging:

bin/pos --input working/tutorial/example.tok --model models/pos --output working/tutorial/example.pos

and lastly Named Entity Recognition:

bin/ner --input working/tutorial/example.pos --model models/ner --output working/tutorial/example.ner

I am not sure how to write a loop that does this and keeps each output file's name the same as the input, but with an extension indicating the tagging stage. I was thinking of a Bash script, or perhaps Perl, to read the directory, but I am not sure how to invoke the C&C commands from within the script.

At the moment I am doing it manually and it's pretty time-consuming, to say the least!

There are 2 answers

daxim On BEST ANSWER

Untested, likely needs some directory mangling.

use strict;
use warnings;
use autodie qw(:all);    # system() dies on failure; :all needs IPC::System::Simple
use File::Basename qw(basename);

for my $text_file (glob 'working/tutorial/*.txt') {
    # Strip the directory and the .txt suffix, leaving e.g. "example"
    my $base_name = basename($text_file, '.txt');

    # Tokenise: .txt -> .tok
    system 'bin/tokkie',
        '--input'  => "working/tutorial/$base_name.txt",
        '--quotes' => 'delete',
        '--output' => "working/tutorial/$base_name.tok";

    # Part-of-speech tagging: .tok -> .pos
    system 'bin/pos',
        '--input'  => "working/tutorial/$base_name.tok",
        '--model'  => 'models/pos',
        '--output' => "working/tutorial/$base_name.pos";

    # Named entity recognition: .pos -> .ner
    system 'bin/ner',
        '--input'  => "working/tutorial/$base_name.pos",
        '--model'  => 'models/ner',
        '--output' => "working/tutorial/$base_name.ner";
}
Dennis Williamson On

In Bash:

#!/bin/bash
shopt -s nullglob    # run the loop zero times if no .txt files match
dir='working/tutorial'
for file in "$dir"/*.txt
do
    noext=${file%.txt}    # strip the .txt suffix, keeping directory and base name

    bin/tokkie --input "$file" --quotes delete --output "$noext.tok"

    bin/pos --input "$noext.tok" --model models/pos --output "$noext.pos"

    bin/ner --input "$noext.pos" --model models/ner --output "$noext.ner"

done
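
To see how the names are derived without running the taggers, here is a minimal sketch of just the suffix handling. The function name `derive_outputs` is made up for illustration and is not part of C&C. It uses the parameter expansion `${file%.txt}`, which removes a trailing `.txt`; in Bash, `${file/%.txt}` has the same effect here.

```shell
#!/bin/bash
# Hypothetical helper (not part of C&C): print the three output paths
# that the loop would derive from a single input .txt file.
derive_outputs() {
    local file=$1
    local noext=${file%.txt}    # remove a trailing ".txt", if there is one
    printf '%s\n' "$noext.tok" "$noext.pos" "$noext.ner"
}

derive_outputs 'working/tutorial/example.txt'
# prints:
#   working/tutorial/example.tok
#   working/tutorial/example.pos
#   working/tutorial/example.ner
```

Because the expansions are double-quoted wherever they are used, file names containing spaces should come through intact as well.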