I would like to split a file into multiple files based on a particular regex pattern. I provide a reproducible example below. If there is an easier solution, I would also welcome it!
I have a directory with the following files:
page1.html page2.html page3.html
Say my page1.html looked like this:
<strong>Hello world</strong>
<p>ABC, Page (1 whatever).</p>
<p>Some text</p>
<p>DEF, Page (1 ummm what).</p>
<p>Some text</p>
<p>THE<em><strong><span class="underline">GHI</span></strong></em>JK <em><strong><span class="underline">the</span></strong></em>LMNOP<em><strong><span class="underline">Q</span></strong></em>RS.<p> ABC, Page (1).</p>
I want to split page1.html into:
page1_0.html
<strong>Hello world</strong>
page1_1.html
<p>ABC, Page (1 whatever).</p>
<p>Some text</p>
page1_2.html
<p>DEF, Page (1 ummm what).</p>
<p>Some text</p>
<p>THE<em><strong><span class="underline">GHI</span></strong></em>JK <em><strong><span class="underline">the</span></strong></em>LMNOP<em><strong><span class="underline">Q</span></strong></em>RS.<p> ABC, Page (1).</p>
I want code that identifies the line with the following pattern:
[0 to 10 characters in the beginning] , Page (1 [0 to 10 characters here]). </p>
I currently have the following code:
for filename in *.html; do gcsplit -z -f "${filename%.*}_" --suffix-format="%d.html" "$filename" /'Page (1'/ '{*}'; done
But this is creating a page1_3.html containing the following text:
<p>THE<em><strong><span class="underline">GHI</span></strong></em>JK <em><strong><span class="underline">the</span></strong></em>LMNOP<em><strong><span class="underline">Q</span></strong></em>RS.<p> ABC, Page (1).</p>
But when I run this:
for filename in *.html; do gcsplit -z -f "${filename%.*}_" --suffix-format="%d.html" "$filename" /'^.{0,10}, Page \(1.{0,10}\).\<\/p\>'/ '{*}'; done
This outputs only the file page1_0.html.
What is the issue with my regex? Are there any alternative ways to achieve what I'm trying to do?
You could do it with this short Perl script.
Prints:
I think the problem with your regex was that you needed non-greedy matching.
.{0,10}?       zero to ten characters, minimally
, Page \(1
[^)]{0,10}?    zero to ten non-closing-parenthesis characters, minimally
\)\.</p>       then the closing

HTH
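For comparison, here is a rough shell sketch of the split the question asks for. It is only an illustration under some assumptions: `split_pages` is a hypothetical helper name, the marker is taken to be "at most 10 characters, then `, Page (1`, then at most 10 characters other than `)`, then `).</p>`", and matching is done with `grep -E` (POSIX extended regex). Extended regex has no non-greedy quantifiers, so this sketch leans on the `^` anchor and the `[^)]` class instead of the `?` modifiers shown above.

```shell
# Hypothetical helper: split one HTML file into <base>_0.html, <base>_1.html, ...
# A new output file is started at every line that matches the marker pattern.
split_pages() {
  file=$1
  base=${file%.*}
  # ERE: up to 10 chars, ", Page (1", up to 10 non-")" chars, then ").</p>"
  re='^.{0,10}, Page \(1[^)]{0,10}\)\.</p>'
  n=0
  : > "${base}_${n}.html"   # create (or truncate) the first output file
  while IFS= read -r line || [ -n "$line" ]; do
    # Move on to the next output file whenever the current line is a marker.
    if printf '%s\n' "$line" | grep -Eq "$re"; then
      n=$((n + 1))
      : > "${base}_${n}.html"
    fi
    printf '%s\n' "$line" >> "${base}_${n}.html"
  done < "$file"
}
```

It could be called as `for filename in *.html; do split_pages "$filename"; done` in a directory that holds only the original pages. Note that if the very first line of a file is itself a marker, this sketch leaves an empty `_0` file behind, whereas `csplit -z` would drop it.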