How to regex search/replace with File::Map in a largish text file to avoid an "Out of Memory" error?


UPDATE 2: Solved. See below.

I am in the process of converting a big txt-file from an old DOS-based library program into a more usable format. I just started out in Perl and managed to put together a script such as this one:

BEGIN {undef $/; };
open $in,  '<',  "orig.txt"      or die "Can't read old file: $!"; 
open $out, '>',  "mod.txt"  or die "Can't write new file: $!";
while( <$in> )  
{
$C=s/foo/bar/gm;
print "$C matches replaced.\n"
etc...
print $out $_;
}
close $out;

It is quite fast, but after some time I always get an "Out of Memory" error due to a lack of RAM/swap space (I'm on Win XP with 2GB of RAM and a 1.5GB swap file). After looking around a bit on how to deal with big files, File::Map seemed like a good way to avoid this problem. I'm having trouble implementing it, though. This is what I have for now:

#!perl -w
use strict; 
use warnings;
use File::Map qw(map_file);

my $out = 'output.txt';
map_file my $map, 'input.txt', '<';
$map =~ s/foo/bar/gm;

print $out $map;

However I get the following error: Modification of a read-only value attempted at gott.pl line 8.

Also, I read on the File::Map help page that on non-Unix systems I need to use binmode. How do I do that?

Basically, what I want to do is to "load" the file via File::Map and then run code like the following:

$C=s/foo/bar/gm;
print "$C matches found and replaced.\n"

$C=s/goo/far/gm;
print "$C matches found and replaced.\n"
while(m/complex_condition/gm)
{ 
$C=s/complex/regex/gm;
$run_counter++;
}
print "$C matches replaced. Script looped $run_counter times.\n";
etc...

I hope that I didn't overlook something too obvious but the example given on the File::Map help page only shows how to read from a mapped file, correct?

EDIT:

In order to better illustrate what I currently can't accomplish due to running out of memory I'll give you an example:

On http://pastebin.com/6Ehnx6xA is a sample of one of our exported library records (txt-format). I'm interested in the +Deskriptoren: part starting on line 46. These are thematic classifiers which are organised in a tree hierarchy.

What I want is to expand each classifier with its complete chain of parent nodes, but only if none of the parent nodes is already present before or after the child node in question. This means turning

+Deskriptoren
-foo
-Cultural Revolution
-bar

into

+Deskriptoren
-foo
-History
-Modern History
-PRC
-Cultural Revolution
-bar

The regex I currently use relies on lookbehind and lookahead in order to avoid duplicates and is thus slightly more complicated than s/foo/bar/g;:

s/(?<=\+Deskriptoren:\n)((?:-(?!\QParent-Node\E).+\n)*)-(Child-Node_1|Child-Node_2|...|Child-Node_11)\n((?:-(?!Parent-Node).+\n)*)/${1}-Parent-Node\n-${2}\n${3}/g;

But it works! Until Perl runs out of memory that is... :/

So in essence I need a way to do manipulations on a large file (80MB) over several lines. Processing time is not an issue. This is why I thought of File::Map. Another option could be to process the file in several steps with linked perl-scripts calling each other and then terminating, but I'd like to keep it as much in one place as possible.

UPDATE 2:

I managed to get it working with Schwern's code below. My script now calls the following subroutine which calls two nested subroutines. Example code is at: http://pastebin.com/SQd2f8ZZ

Still not quite satisfied that I couldn't get File::Map to work. Oh well... I guess that the line-approach is more efficient anyway.

Thanks everyone!


There are 3 answers

4
Schwern (BEST ANSWER)

Some simple parsing can break the file down into manageable chunks. The algorithm is:

1. Read until you see `+Deskriptoren:`
2. Read everything after that until the next `+Foo:` line
3. Munge that bit.
4. Goto 1.

Here's the sketch of the code:

use strict;
use warnings;
use autodie;

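# File names taken from the question; adjust as needed.
my $input_file  = 'orig.txt';
my $output_file = 'mod.txt';

# Three-argument open: read the input, write the output.
open my $in,  '<', $input_file;
open my $out, '>', $output_file;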

while(my $line = <$in>) {
    # Print out everything you don't modify
    # this includes the +Deskriptoren line.
    print $out $line;

    # When the start of a description block is seen, slurp in up to
    # the next section.
    if( $line =~ m{^ \+Deskriptoren: }x ) {
        my($section, $next_line) = _read_to_next_section($in);

        # Print the modified description
        print $out _munge_description($section);

        # And the following header line.
        print $out $next_line;
    }
}

sub _read_to_next_section {
    my $in = shift;

    my $section = '';
    my $line;
    while( $line = <$in> ) {
        last if $line =~ /^ \+ /x;
        $section .= $line;
    }

    # When reading the last section, there might not be a next line
    # resulting in $line being undefined.
    $line = '' if !defined $line;
    return($section, $line);
}

# Note, the +Deskriptoren line is not on $description
sub _munge_description {
    my $description = shift;

    # ... whatever you want to do to the description ...

    return $description;
}

I haven't tested it, but something like that should do you. It has the advantage over dealing with the whole file as a string (File::Map or otherwise) that you can deal with each section individually rather than trying to cover every base in one regex. It also will let you develop a more sophisticated parser to deal with things like comments and strings that might mess up the simple parsing above and would be a huge pain to adapt a massive regex to.
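As a rough illustration of what the munging step could look like for the parent-chain expansion described in the question, here is a minimal sketch. The `%parent_of` table and the `_ancestors` helper are hypothetical; only the History / Modern History / PRC / Cultural Revolution chain comes from the question's example, and the real hierarchy would have to be filled in. It assumes each classifier sits on its own line starting with `-`.

```perl
use strict;
use warnings;

# Hypothetical child => parent table (assumption: the real tree has to be
# filled in). Only this one chain appears in the question's example.
my %parent_of = (
    'Cultural Revolution' => 'PRC',
    'PRC'                 => 'Modern History',
    'Modern History'      => 'History',
);

# Return all ancestors of $node, root first. Assumes the table has no cycles.
sub _ancestors {
    my $node = shift;
    my @chain;
    while ( defined( my $parent = $parent_of{$node} ) ) {
        unshift @chain, $parent;
        $node = $parent;
    }
    return @chain;
}

# Takes the lines below the +Deskriptoren: header and returns them with
# missing parent chains inserted in front of each child.
sub _munge_description {
    my $description = shift;

    # split /^/ keeps the trailing newline on every line.
    my @lines = split /^/, $description;

    # Remember every classifier that is already listed in this section.
    my %seen;
    for (@lines) {
        $seen{$1} = 1 if /^-(.+)$/;
    }

    my $out = '';
    for my $line (@lines) {
        if ( $line =~ /^-(.+)$/ ) {
            my @chain = _ancestors($1);

            # Insert the chain only if none of the parents is present yet.
            unless ( grep { $seen{$_} } @chain ) {
                $out .= "-$_\n" for @chain;
                $seen{$_} = 1 for @chain;
            }
        }
        $out .= $line;    # the original line is always kept
    }
    return $out;
}
```

Because each section is only a handful of lines, the lookups stay cheap and the whole file never has to be held in memory at once.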

1
unpythonic

You are using mode <, which is read-only. If you want to modify the contents, you need read-write access, so you should be using +<.

If you are on Windows and need binary mode, then you should open the file separately, set binary mode on the file handle, then map from that handle.

I also noticed that you have an input file and an output file. If you use File::Map, you are changing the file in-place... that is, you can't open the file for reading and change the contents of a different file. You would need to copy the file, then modify the copy. I've done so below.

use strict;
use warnings;

use File::Map qw(map_handle unmap);
use File::Copy;

copy("input.txt", "output.txt") or die "Cannot copy input.txt to output.txt: $!\n";

open my $fh, '+<', "output.txt"
    or die "Cannot open output.txt in r/w mode: $!\n";

binmode($fh);

map_handle my $contents, $fh, '+<';

my $n_changes = ( $contents =~ s/from/to/gm );

unmap($contents);
close($fh);

The documentation for File::Map isn't very good on how errors are signaled, but from the source, it looks as if $contents being undefined would be a good guess.

0
FMc

When you set $/ (the input record separator) to undefined, you are "slurping" the file -- reading the entire content of the file at once (this is discussed in perlvar, for example). Hence the out-of-memory problem.

Instead, process it one line at a time, if you can:

while (my $line = <$in>){
    # Do stuff.
}

In situations where the file is small enough and you do slurp the file, there is no need for the while loop. The first read gets everything:

{
    local $/ = undef;
    my $file_content = <>;
    # Do stuff with the complete file.
}

Update

After seeing your massive regex I would urge you to reconsider your strategy. Tackle this as a parsing problem: process the file one line at a time, storing information about the parser's state as needed. This approach allows you to work with the information using simple, easily understood (even testable) steps.

Your current strategy -- one might call it the slurp and whack with massive regex strategy -- is difficult to understand and maintain (in 3 months will your regex make immediate sense to you?), difficult to test and debug, and difficult to adjust if you discover unanticipated deviations from your initial understanding of the data. In addition, as you've discovered, the strategy is vulnerable to memory limitations (because of the need to slurp the file).
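To make the line-at-a-time idea a bit more concrete, here is a minimal sketch of a parser that keeps a single piece of state: whether it is currently inside a `+Deskriptoren:` block. The name `process_lines` and the `$expand` callback are hypothetical, and the sketch assumes that every section header starts with `+` at the beginning of a line, as in the question's sample record.

```perl
use strict;
use warnings;

# Copy $in to $out line by line. Lines inside a +Deskriptoren: block are
# collected and passed through the $expand callback before being written.
sub process_lines {
    my ( $in, $out, $expand ) = @_;

    my $in_block = 0;    # parser state: are we inside +Deskriptoren:?
    my @block;           # lines collected for the current block

    while ( my $line = <$in> ) {
        if ( $line =~ /^\+/ ) {
            # Any header line closes the block that was open, if any.
            print {$out} $expand->(@block) if $in_block;
            @block    = ();
            $in_block = $line =~ /^\+Deskriptoren:/ ? 1 : 0;
            print {$out} $line;
        }
        elsif ($in_block) {
            push @block, $line;
        }
        else {
            print {$out} $line;    # everything else passes through untouched
        }
    }

    # The file may end while a block is still open.
    print {$out} $expand->(@block) if $in_block;
    return;
}
```

Only one block is held in memory at a time, and the callback can be tested on its own with a few hand-written lines.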

There are many questions on StackOverflow illustrating how one can parse text when the meaningful units span multiple lines. Also see this question, where I delivered similar advice to another questioner.