UPDATE 2: Solved. See below.
I am in the process of converting a big txt file from an old DOS-based library program into a more usable format. I have just started out with Perl and managed to put together a script like this one:
BEGIN { undef $/; }    # slurp mode: <$in> reads the whole file at once
open my $in,  '<', "orig.txt" or die "Can't read old file: $!";
open my $out, '>', "mod.txt"  or die "Can't write new file: $!";
while ( <$in> )
{
    my $C = s/foo/bar/gm;
    print "$C matches replaced.\n";
    # etc...
    print $out $_;
}
close $out;
It is quite fast, but after some time I always get an "Out of Memory" error due to a lack of RAM/swap space (I'm on Win XP with 2 GB of RAM and a 1.5 GB swap file).
After having looked around a bit on how to deal with big files, File::Map seemed like a good way to avoid this problem. I'm having trouble implementing it, though.
This is what I have for now:
#!perl -w
use strict;
use warnings;
use File::Map qw(map_file);

open my $out, '>', 'output.txt' or die "Can't write new file: $!";
map_file my $map, 'input.txt', '<';
$map =~ s/foo/bar/gm;
print $out $map;
However, I get the following error: "Modification of a read-only value attempted at gott.pl line 8."
Also, I read on the File::Map help page that on non-Unix systems I need to use binmode. How do I do that?
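(For reference, this is the generic binmode idiom I know for ordinary filehandles; what I can't figure out is how, or whether, it applies to a mapped scalar. File names here are just placeholders.)

```perl
use strict;
use warnings;

# Make a small sample file so this snippet runs standalone.
open my $mk, '>', 'binmode_demo_in.txt' or die "Can't create sample: $!";
print $mk "foo\n";
close $mk;

open my $in, '<', 'binmode_demo_in.txt' or die "Can't read: $!";
binmode $in;     # raw bytes: no CRLF translation on Windows
open my $out, '>', 'binmode_demo_out.txt' or die "Can't write: $!";
binmode $out;
print $out do { local $/; <$in> };   # copy the contents byte-for-byte
close $out;
close $in;
```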
Basically, what I want to do is to "load" the file via File::Map and then run code like the following:
my $C = s/foo/bar/gm;
print "$C matches found and replaced.\n";
$C = s/goo/far/gm;
print "$C matches found and replaced.\n";
my $run_counter = 0;
while ( m/complex_condition/gm )
{
    $C = s/complex/regex/gm;
    $run_counter++;
}
print "$C matches replaced. Script looped $run_counter times.\n";
# etc...
I hope that I didn't overlook something too obvious, but the example given on the File::Map help page only shows how to read from a mapped file, correct?
EDIT:
In order to better illustrate what I currently can't accomplish due to running out of memory I'll give you an example:
On http://pastebin.com/6Ehnx6xA is a sample of one of our exported library records (txt format). I'm interested in the +Deskriptoren: part starting on line 46. These are thematic classifiers which are organised in a tree hierarchy.
What I want is to expand each classifier with its complete chain of parent nodes, but only if none of those parent nodes are already present before or after the child node in question. This means turning
+Deskriptoren
-foo
-Cultural Revolution
-bar
into
+Deskriptoren
-foo
-History
-Modern History
-PRC
-Cultural Revolution
-bar
The currently used regex makes use of lookbehind and lookahead in order to avoid duplicates, and is thus slightly more complicated than a plain s/foo/bar/g;:
s/(?<=\+Deskriptoren:\n)((?:-(?!\QParent-Node\E).+\n)*)-(Child-Node_1|Child-Node_2|...|Child-Node_11)\n((?:-(?!Parent-Node).+\n)*)/${1}-Parent-Node\n-${2}\n${3}/g;
But it works! Until Perl runs out of memory that is... :/
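To make that concrete, here is a stripped-down, single-parent instance of the same pattern, with "PRC" standing in as the parent and "Cultural Revolution" as the child (the other node names are made up for the demo):

```perl
use strict;
use warnings;

# Insert the parent "-PRC" directly above the child "-Cultural Revolution",
# but only if "-PRC" does not already appear elsewhere in the section.
my $text = "+Deskriptoren:\n-foo\n-Cultural Revolution\n-bar\n";

$text =~ s/
    (?<=\+Deskriptoren:\n)       # anchor to the start of the section
    ((?:-(?!\QPRC\E).+\n)*)      # lines before the child, none of them the parent
    -(Cultural\ Revolution)\n    # the child node itself
    ((?:-(?!\QPRC\E).+\n)*)      # lines after the child, none of them the parent
/${1}-PRC\n-${2}\n${3}/x;

print $text;
```

Running the substitution a second time changes nothing, because the negative lookaheads now fail on the freshly inserted -PRC line.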
So in essence I need a way to do manipulations spanning several lines on a large file (80 MB). Processing time is not an issue. This is why I thought of File::Map. Another option could be to process the file in several steps, with linked Perl scripts calling each other and then terminating, but I'd like to keep it as much in one place as possible.
UPDATE 2:
I managed to get it working with Schwelm's code below. My script now calls the following subroutine which calls two nested subroutines. Example code is at: http://pastebin.com/SQd2f8ZZ
Still, I'm not quite satisfied that I couldn't get File::Map to work. Oh well... I guess the line-based approach is more efficient anyway.
Thanks everyone!
Some simple parsing can break the file down into manageable chunks. The algorithm is: read the file one section at a time, transform each section on its own, and write it out, so that only a single section is ever held in memory. Here's the sketch of the code:
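Something along these lines, assuming each section starts with a line beginning "+" (as "+Deskriptoren:" does); process_section and the sample-building code at the top are illustrative stand-ins:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build a tiny sample input so the sketch runs standalone.
open my $mk, '>', 'orig.txt' or die "Can't create sample: $!";
print $mk "+Titel:\nfoo one\n+Deskriptoren:\n-foo\n";
close $mk;

# Placeholder per-section transformation; the real script would run the
# Deskriptoren-expanding substitutions here instead.
sub process_section {
    my ($section) = @_;
    $section =~ s/foo/bar/g;
    return $section;
}

open my $in,  '<', 'orig.txt' or die "Can't read old file: $!";
open my $out, '>', 'mod.txt'  or die "Can't write new file: $!";

my $section = '';
while ( my $line = <$in> ) {
    if ( $line =~ /^\+/ and length $section ) {   # a new section begins
        print $out process_section($section);
        $section = '';
    }
    $section .= $line;
}
print $out process_section($section) if length $section;   # flush the last one

close $out;
close $in;
```

Only one section is ever in $section, so memory use stays flat no matter how big the input file gets.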
I haven't tested it, but something like that should do you. It has the advantage over dealing with the whole file as one string (File::Map or otherwise) that you can deal with each section individually rather than trying to cover every base in one regex. It will also let you develop a more sophisticated parser to deal with things like comments and strings that might mess up the simple parsing above and that would be a huge pain to adapt a massive regex to.