Regex matching over multiple lines


I'm currently trying to do some basic cleaning on a PDF so I can convert it to ePub for use on my e-reader. All I'm doing is removing page numbers (easy) and footnotes (stumped so far). Basically, I'd like an expression that finds the tag pattern at the beginning of every footnote (<br> followed by a newline, a number, and either a letter or a quotation mark), then selects that pattern and everything after it until it reaches the <hr/> tag at the beginning of the next page. Here's some sample text:

The phantoms, for so they then seemed, were flitting on the other side of <br>
the deck, and, with a noiseless celerity, were casting loose the tackles and bands <br>
of the boat which swung there. This boat had always been deemed one of the spare boats <br>
technically called the captain’s, on account of its hanging from the starboard quarter.<br>
The figure that now stood by its bows was tall and swart, with one white tooth <br>
evilly protruding from its steel-like lips. <br>
 <br>
1 "Hardly" had they pulled out from under the ship’s lee, when a <br>
fourth keel, coming from the windward side, pulled round under the stern, <br>
and showed the five strangers <br>
127 <br>
<br>
<hr/>

Since all of the footnotes are formatted this way, I want to select every group of lines that begins with " <br>" (note the leading space) and ends with the <hr/> tag. This is my first time really trying to use regex, so I've tried to hack together some attempts at solutions:

  1. \s<br>\n\d+\s[a-zA-Z“].*: This correctly selects <br> and the first line of the footnote, but stops at the break. \s<br>\n\d+\s[a-zA-Z“].*\n.*\n.*\n.*\n.*\n.* selects the correct number of lines, but this will obviously only work for footnotes that happen to have three lines of text.

  2. \s<br>\n\d+\s[a-zA-Z“]((.*\n)*)<hr\/> starts at the correct place at the first footnote, but then ends up selecting the entirety of the rest of the document. My interpretation of this expression is "start with <br>, a number followed by a space followed by a letter or quotation mark, then select everything including newlines until you reach <hr/>."

  3. \s<br>\n\d+\s[a-zA-Z“]((?:.*\r?\n?)*)<hr\/>\n same idea as (2), with the same result, though I am not familiar enough with regex to quite understand what is going on with this one.

Basically, my problem is that my expressions either exclude newlines (and ignore the end pattern) or include every newline and return the entirety of the text (and obviously still ignore the end pattern).

How do I get it to return just the text between the patterns, including the newlines?

lordadmira On BEST ANSWER

Your tries were pretty close. In the first one you probably need to set the flag that allows . to match line feeds, which it normally doesn't. In your second, you need to make the "match anything" part non-greedy by writing .*? instead of .*. Otherwise .* matches as much as it can, which is the entire rest of the text up to the last <hr/>.
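As a quick illustration of that flag, here is a minimal sketch. The two-line $sample string is made up for the demonstration and is not taken from the book text; the only point is how . behaves with and without the s flag.

```perl
use strict;
use warnings;

# Hypothetical two-line sample; any text with an embedded newline would do.
my $sample = "first line\nsecond line\n";

# Without /s, "." does not match "\n", so .* stops at the end of the first line.
my ($without_s) = $sample =~ /(first.*)/;

# With /s, "." also matches "\n", so the greedy .* runs to the end of the string.
my ($with_s) = $sample =~ /(first.*)/s;

print "without /s: [$without_s]\n";
print "with /s:    [$with_s]\n";
```

The first capture holds only "first line", while the second runs through the final newline. That is the same difference you saw between your first attempt and the later ones.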

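It would be something like this, where the m flag on the end lets ^ match at the start of any line rather than only at the start of the whole text: /^ <br>\n\d+\s[a-zA-Z"“](.*?\n)*?<hr\/>/m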
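To see the greedy versus non-greedy difference in isolation, here is a small sketch. It assumes a made-up two-page sample ($pages) with plain ASCII quotes, and it uses a simplified footnote pattern with the i, s and m flags rather than the exact expression above.

```perl
use strict;
use warnings;

# Hypothetical sample: two pages, each ending in a footnote and an <hr/> tag.
my $pages = join "\n",
    'Body of page one. <br>',
    ' <br>',
    '1 "First" footnote text <br>',
    '12 <br>',
    '<hr/>',
    'Body of page two. <br>',
    ' <br>',
    '2 "Second" footnote text <br>',
    '13 <br>',
    '<hr/>',
    'Closing text.',
    '';

# Greedy: .* takes as much as it can, so the match runs to the LAST <hr/>.
my ($greedy) = $pages =~ m{(^ <br>\n\d+ +[A-Z"].*<hr/>)}ism;

# Lazy: .*? takes as little as it can, so the match ends at the FIRST <hr/>.
my ($lazy) = $pages =~ m{(^ <br>\n\d+ +[A-Z"].*?<hr/>)}ism;

# Count how many <hr/> tags each match swallowed.
my $greedy_count = () = $greedy =~ m{<hr/>}g;
my $lazy_count   = () = $lazy   =~ m{<hr/>}g;

print "greedy match contains $greedy_count <hr/> tag(s)\n";
print "lazy match contains $lazy_count <hr/> tag(s)\n";
```

The greedy match swallows both footnotes and the page between them, which is the "selects the rest of the document" behaviour from your second attempt. The lazy match stops at the end of the first footnote.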

But anyway, this is something that is best done in Perl. Many of the advanced regex features found in other tools originated in Perl.

use strict;
use diagnostics;

our $text =<<EOF;
The figure that now stood by its bows was tall and swart, with one white tooth <br>
evilly protruding from its steel-like lips. <br>
 <br>
1 "Hardly" had they pulled out from under the ship’s lee, when a <br>
fourth keel, coming from the windward side, pulled round under the stern, <br>
and showed the five strangers <br>
127 <br>
<br>
<hr/>
More text.
EOF

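# Footnote pattern: a line holding only " <br>", then the footnote number,
# spaces, and a letter or quote, then (lazily) everything up to the next <hr/>.
our $regex = qr{^ <br>\n\d+ +[A-Z"“].*?<hr/>}ism;
# Replace the first footnote; the parentheses capture the removed text in $1.
$text =~ s/($regex)/<!-- Removed -->/;
print "Removed text:\n[$1]\n\n";
print "New text:\n[$text]\n";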

That prints:

Removed text:
[ <br>
1 "Hardly" had they pulled out from under the ship’s lee, when a <br>
fourth keel, coming from the windward side, pulled round under the stern, <br>
and showed the five strangers <br>
127 <br>
<br>
<hr/>]

New text:
[The figure that now stood by its bows was tall and swart, with one white tooth <br>
evilly protruding from its steel-like lips. <br>
<!-- Removed -->
More text.
]

The qr operator compiles a regular expression so that it can be stored in a variable. The ^ at the beginning anchors the match at the beginning of a line. The ism on the end is three flags: i makes the match case insensitive, s treats the text as a single string so that . also matches line feeds, and m treats the text as multiple embedded lines so that ^ matches at the beginning of every line in the string, not just at the very start. To replace every footnote rather than only the first one, add the g flag to the end of the substitution: s///g.
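For the global case, a sketch along these lines should work. It assumes a made-up two-page sample ($book) with plain ASCII quotes and a pattern trimmed to match; on real text you would read the file into the string instead.

```perl
use strict;
use warnings;

# Hypothetical sample: two pages, each ending in a footnote and an <hr/> tag.
my $book = join "\n",
    'Body of page one. <br>',
    ' <br>',
    '1 "First" footnote text <br>',
    '12 <br>',
    '<hr/>',
    'Body of page two. <br>',
    ' <br>',
    '2 "Second" footnote text <br>',
    '13 <br>',
    '<hr/>',
    'Closing text.',
    '';

# Same idea as the pattern above, with ASCII quotes only (an assumption of this sketch).
my $footnote = qr{^ <br>\n\d+ +[A-Z"].*?<hr/>}ism;

# With /g every footnote is replaced; in scalar context s///g returns
# the number of substitutions it made.
my $removed = $book =~ s/$footnote/<!-- Removed -->/g;

print "Footnotes removed: $removed\n";
print "New text:\n[$book]\n";
```

The flags given to qr travel with the compiled pattern, so they do not need to be repeated on the s///g itself. Note that the page number and the <hr/> tag are removed along with each footnote, just as in the single-replacement version.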

The Perl regex documentation explains everything. https://perldoc.perl.org/perlretut

See also Multiline replace in perl with extended expressions not working.

HTH