How do I split a file into multiple files based on a RegEx pattern?

I would like to split a file into multiple files based on a particular regex pattern. I provide a reproducible example below. If there is an easier solution, I would also welcome it!

I have a directory with the following files:

page1.html page2.html page3.html

Say my page1.html looked like this:

<strong>Hello world</strong>

<p>ABC, Page (1 whatever).</p>
<p>Some text</p>

<p>DEF, Page (1 ummm what).</p>
<p>Some text</p>

<p>THE<em><strong><span class="underline">GHI</span></strong></em>JK <em><strong><span class="underline">the</span></strong></em>LMNOP<em><strong><span class="underline">Q</span></strong></em>RS.<p> ABC, Page (1).</p>

I want to split page1.html to:

page1_0.html

<strong>Hello world</strong>

page1_1.html

<p>ABC, Page (1 whatever).</p>
<p>Some text</p>

page1_2.html

<p>DEF, Page (1 ummm what).</p>
<p>Some text</p>

<p>THE<em><strong><span class="underline">GHI</span></strong></em>JK <em><strong><span class="underline">the</span></strong></em>LMNOP<em><strong><span class="underline">Q</span></strong></em>RS.<p> ABC, Page (1).</p>

I want code that identifies the line with the following pattern:

[0 to 10 characters in the beginning] , Page (1 [0 to 10 characters here]). </p>

I currently have the following code:

for filename in *.html; gcsplit -z -f "${filename%.*}_" --suffix-format="%d.html" $filename /'Page (1'/ '{*}'

But this is creating a page1_3.html containing the following text:

<p>THE<em><strong><span class="underline">GHI</span></strong></em>JK <em><strong><span class="underline">the</span></strong></em>LMNOP<em><strong><span class="underline">Q</span></strong></em>RS.<p> ABC, Page (1).</p>

But when I run this:

for filename in *.html; gcsplit -z -f "${filename%.*}_" --suffix-format="%d.html" $filename /'^.{0,10}, Page \(1.{0,10}\).\<\/p\>'/ '{*}'

This just outputs the file page1_0.html.

What is the issue with my regex? Are there any alternative ways to achieve what I'm trying to do?

There are 2 answers

lordadmira

You could do it with this short Perl script.

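# chunker.pl: split the input into page1_N.html files at each matching line
use 5.022;            # the <<>> operator needs Perl 5.22 or later
use strict;
use diagnostics;
use B "perlstring";   # perlstring() quotes a string for the debug output

our $i = 0;
our $fmt = "page1_%d.html";   # output name template, hard-coded for page1.html
our $fn = sprintf $fmt, $i;

open our $fh, ">", $fn or die $!;
print "opened $fn\n";
while (<<>>) {
  printf "read line $.: %s\n", perlstring $_;
  # A matching line starts a new chunk, so switch files before printing it.
  if (m{^.{0,10}?, Page \(1 [^)]{0,10}?\)\.</p>}) {
    print "break matched line $.\n";
    $fn = sprintf $fmt, ++$i;
    open $fh, ">", $fn or die $!;
    print "opened $fn\n";
  }
  print $fh $_;
}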

Prints:

$ perl chunker.pl page1.html

opened page1_0.html
read line 1: "<strong>Hello world</strong>\n"
read line 2: "\n"
read line 3: "<p>ABC, Page (1 whatever).</p>\n"
break matched line 3
opened page1_1.html
read line 4: "<p>Some text</p>\n"
read line 5: "\n"
read line 6: "<p>DEF, Page (1 ummm what).</p>\n"
break matched line 6
opened page1_2.html
read line 7: "<p>Some text</p>\n"
read line 8: "\n"
read line 9: "<p>THE<em><strong><span class=\"underline\">GHI</span></strong></em>JK <em><strong><span class=\"underline\">the</span></strong></em>LMNOP<em><strong><span class=\"underline\">Q</span></strong></em>RS.<p> ABC, Page (1).</p>\n"
read line 10: "\n"
read line 11: "\n"



$ for f in page1_*.html; do echo "$f:"; cat $f; echo; done;
page1_0.html:
<strong>Hello world</strong>


page1_1.html:
<p>ABC, Page (1 whatever).</p>
<p>Some text</p>


page1_2.html:
<p>DEF, Page (1 ummm what).</p>
<p>Some text</p>

<p>THE<em><strong><span class="underline">GHI</span></strong></em>JK <em><strong><span class="underline">the</span></strong></em>LMNOP<em><strong><span class="underline">Q</span></strong></em>RS.<p> ABC, Page (1).</p>


I think the problem with your regex was that you needed non-greedy matching.

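.{0,10}? zero to ten characters, matched minimally
, Page \(1 the literal text ", Page (1" (followed by a space in the script above)
[^)]{0,10}? zero to ten characters other than a closing parenthesis, matched minimally
\)\.</p> then the closing parenthesis, a literal dot, and </p>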
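
To see which lines a pattern like this treats as break points before splitting anything, it can be run through grep first. This is a minimal sketch, assuming a grep that supports -E; the sample text is a shortened, hypothetical stand-in for the question's page1.html, and the variable names are made up for illustration. grep -E has no lazy quantifiers, so the ? after each {0,10} is dropped here. That should not change which lines match, only how much each quantifier consumes.

```shell
# Shortened sample modelled on the question's page1.html (the long last line is abbreviated).
sample='<strong>Hello world</strong>
<p>ABC, Page (1 whatever).</p>
<p>Some text</p>
<p>DEF, Page (1 ummm what).</p>
<p>THE<em>GHI</em>JK LMNOPQRS.<p> ABC, Page (1).</p>'

# ERE form of the pattern from the Perl script, without the lazy "?" modifiers.
pattern='^.{0,10}, Page \(1 [^)]{0,10}\)\.</p>'

# -n prefixes every matching line with its line number.
matches=$(printf '%s\n' "$sample" | grep -nE "$pattern")
printf '%s\n' "$matches"
```

With this sample, only lines 2 and 4 should be listed: the last line is rejected both because more than ten characters precede ", Page" and because "(1)" has no space after the 1.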

HTH

AudioBubble

^.{0,10}, Page \(1.{0,10}\).\<\/p\>

What is the issue with my regex?

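It's not a POSIX BRE (basic regular expression), which is what csplit expects. In a BRE the interval braces are written \{ and \}, and an unescaped ( or ) is a literal parenthesis, whereas \( and \) delimit a group. Try:

^.\{0,10\}, Page (1.\{0,10\}).<\/p>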

The / is written as \/ only because the pattern is passed to the csplit tool as a /REGEXP/[OFFSET] argument. You may also want to change that last . to \. so that it matches only a literal dot character.
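
Putting the BRE form into the loop from the question might look like the sketch below. It is an illustration under assumptions rather than a tested drop-in: it works in a throwaway directory on a shortened, hypothetical copy of page1.html, the names workdir and CSPLIT are made up here, and it assumes a GNU csplit is available either as gcsplit or as csplit (the -z and --suffix-format options come from the question's own command). The final dot is escaped as \. as suggested above.

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir"

# Shortened stand-in for the question's page1.html.
cat > page1.html <<'EOF'
<strong>Hello world</strong>

<p>ABC, Page (1 whatever).</p>
<p>Some text</p>

<p>DEF, Page (1 ummm what).</p>
<p>Some text</p>

<p>THE<em>GHI</em>JK LMNOPQRS.<p> ABC, Page (1).</p>
EOF

# Prefer gcsplit (GNU coreutils installed with a g- prefix); otherwise assume csplit is GNU.
CSPLIT=$(command -v gcsplit || command -v csplit)

# The glob is expanded once, before the loop runs, so the new pieces are not re-split.
for filename in *.html; do
  "$CSPLIT" -z -f "${filename%.*}_" --suffix-format='%d.html' \
    "$filename" '/^.\{0,10\}, Page (1.\{0,10\})\.<\/p>/' '{*}'
done

ls
```

On this sample the loop should leave page1_0.html, page1_1.html and page1_2.html next to the original, with the long last line staying in page1_2.html. Running the loop a second time in the same directory would also pick up the generated pieces, so the outputs may be better written to a separate directory in real use.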