Read multiple files with the same extension in one directory in Perl


I currently have an issue with reading files in one directory. I need to take all the fastq files in a directory, run the script on each file, and put the new files in an ‘Edited_sequences’ folder. The script I have is

perl -ne '$i++; if($i<80001){print}' BM2003_TCCCAGAACAAC_L001_R1_001.fastq > ./Edited_sequences/BM2003_TCCCAGAACAAC_L001_R1_001.fastq

It takes the first 80000 lines of one fastq file and writes them to a new file. I have, for example, 2000 fastq files, so I would need to copy and paste the command 2000 times. I know there is a glob command suited to this situation, but I do not know how to use it. Please help me out.


There are 3 answers

mpapec On BEST ANSWER

You can let Perl do the copy/paste for you. The leading *.fastq argument expands to all fastq files, and the last argument, ./Edited_sequences, is the target folder for the new files:

perl -e '$d=pop; `head -n 80000 "$_" > "$d/$_"` for @ARGV' *.fastq ./Edited_sequences
David W. On

You have two choices:

  • Use Perl to read in the 2000 files and run it as part of your program
  • Use the Shell to pass each of those 2000 files to your command line

Here's the bash alternative:

for file in *.fastq
do
    perl -ne '$i++; if($i<80001){print}' "$file" > "./Edited_sequences/$file"
done

This is your same Perl script, but with the shell finding each file. Because bash expands the glob inside the for loop and runs the command once per file, it works however many files match and does not run into the command-line length limit.
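If Perl is not strictly required, the same loop can be sketched with head alone. This is a minimal sketch and not part of the original answer: the function name truncate_fastq and its arguments are assumptions made for illustration. It creates the target folder if it is missing and does nothing when no .fastq file matches.

```shell
#!/usr/bin/env bash
# truncate_fastq is a hypothetical helper name used only for this sketch.
# Usage: truncate_fastq SRC_DIR DEST_DIR [LINES]
truncate_fastq() {
    local src="$1" dest="$2" lines="${3:-80000}"
    local file

    # Create the output folder so the redirection below cannot fail on it.
    mkdir -p "$dest" || return 1

    for file in "$src"/*.fastq; do
        # With no matches the pattern stays literal, so skip anything
        # that is not a regular file.
        [ -f "$file" ] || continue
        # Keep only the first $lines lines; shorter files are copied whole.
        head -n "$lines" "$file" > "$dest/$(basename "$file")"
    done
}
```

From the directory holding the fastq files, it could be called as truncate_fastq . ./Edited_sequences 80000.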

However, I always recommend that you don't execute the commands right away, but first echo them into a file:

for file in *.fastq
do
    echo "perl -ne '\$i++; if(\$i<80001){print}' \
\"$file\" > \"./Edited_sequences/$file\""    >> myoutput.txt
done

Then, you can look at myoutput.txt to make sure it looks good before you actually do any real harm. Once you've determined that myoutput.txt is a good file, you can execute that as a shell script:

$ bash myoutput.txt
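A variation on the same review-first idea is a dry-run switch, so that the commands are printed or executed by the same code. The following is only a sketch under assumptions: the DRY_RUN variable and the trim_one function are hypothetical names, and head stands in for the Perl one-liner.

```shell
#!/usr/bin/env bash
# DRY_RUN and trim_one are hypothetical names used only for this sketch.
# Usage: trim_one FILE DEST_DIR
trim_one() {
    local file="$1" dest="$2"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Dry run: only show the command that would be executed.
        printf 'head -n 80000 "%s" > "%s/%s"\n' "$file" "$dest" "$file"
    else
        # Real run: write the first 80000 lines to the target folder.
        head -n 80000 "$file" > "$dest/$file"
    fi
}
```

Setting DRY_RUN=1 and calling trim_one "$file" ./Edited_sequences inside the for loop prints the commands for review; unsetting DRY_RUN and running the same loop again performs the work.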
rutter On

glob gets you a list of filenames matching a particular pattern. It's frequently written with <> brackets, a lot like reading input (you can think of it as reading filenames from a directory).

This is a simple example that will print the names of every ".fastq" file in the current directory:

print "$_\n" for <*.fastq>;

The important part is <*.fastq>, which gives us the list of filenames matching that pattern (in this case, a file extension). If you need to change which directory your Perl script is working in, you can use chdir.

From there, we can process your files as needed:

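while (my $filename = <*.fastq>) {
    open(my $in, '<', $filename) or die "Cannot read $filename: $!";
    open(my $out, '>', "./Edited_sequences/$filename") or die "Cannot write ./Edited_sequences/$filename: $!";

    # Copy at most the first 80000 lines; stop early if the file is shorter.
    for (1..80000) {
        my $line = <$in>;
        last unless defined $line;
        print $out $line;
    }

    close($in);
    close($out) or die "Cannot close ./Edited_sequences/$filename: $!";
}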