I have a chunk of text extracted from a tabular layout that resembles this:
Waiting Period             30 days of employment    30 days of employment    30 days of employment
Benefit amount             Flat $150,000            Flat $100,000            Flat $60,000
Maximum benefit            $150,000                 $100,000                 $60,000
Contributions              Noncontributory          Noncontributory          Noncontributory
Participation requirement  100.00%                  100.00%                  100.00%
---
Benefit amount       Flat $40,000             Flat $20,000
Maximum benefit      $40,000                  $20,000
Compulsory coverage  Yes                      Yes
Contributions        Noncontributory          Noncontributory
Waiting Period       30 days of employment    30 days of employment
Phrases like Waiting Period or Contributions are labels for the row. A variable number of columns then follows, separated by a variable number of whitespace characters.
I am struggling to write a regular expression that targets a particular row by its label and then extracts the content of each of its columns, however many there are. I think I have constructed the label identifier and the capture groups to identify the columns, but the expression seems to stop at the first match.
(?:\s*Waiting Period)(?:(?:\s{2,})(.*?)(?:\s{2,}|\n|$))
The above expression in preg_match_all:
preg_match_all("/(?:\s*Waiting Period)(?:(?:\s{2,})(.*?)(?:\s{2,}|\n))/", $input_lines, $output_array);
produces:
array(2) {
0 => array(2){
0 => Waiting Period 30 days of employment
1 =>
Waiting Period 30 days of employment
}
1 => array(2){
0 => 30 days of employment
1 => 30 days of employment
}
}
As you can see, the match correctly identifies the target rows based on the label, extracts the first column and quits. I don't know how to instruct the process to keep going until each matched row has been processed to the end of the line.
My question is: is my regex approach salvageable to accomplish my objective? Or, have I misunderstood preg_match_all and will only ever get one instance of the capture subgroup?
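This is because
(?:\s{2,}|\n)
matches two or more whitespace characters or a newline character. The lazy (.*?) before it therefore stops as soon as it reaches the next run of consecutive whitespace, which is the gap after the first column, so only that first column is captured.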
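One way around this is to stop trying to capture every column with a single pattern. A repeated capture group only keeps its last iteration, so a single group cannot return a variable number of columns. The sketch below takes a two-step approach instead: it matches each line that starts with the label, then splits the rest of that line with preg_split. The function name extractRow is hypothetical, and the code assumes that columns are separated by at least two spaces or tabs, as the \s{2,} in the question's pattern implies.

```php
<?php
// Hypothetical helper: returns one array of column values for every row
// whose label matches $label.
function extractRow(string $text, string $label): array
{
    $rows = [];

    // Step 1: find each line that starts with the label and capture the rest
    // of that line. The /m flag makes ^ and $ match at line boundaries.
    $pattern = '/^[ \t]*' . preg_quote($label, '/') . '[ \t]{2,}(.+)$/m';
    preg_match_all($pattern, $text, $matches);

    foreach ($matches[1] as $rest) {
        // Step 2: split the remainder on runs of two or more spaces or tabs
        // (assumed column separator).
        $rows[] = preg_split('/[ \t]{2,}/', trim($rest), -1, PREG_SPLIT_NO_EMPTY);
    }

    return $rows;
}

// Nowdoc (quoted TXT) so that the dollar signs are not treated as variables.
$input_lines = <<<'TXT'
Waiting Period    30 days of employment    30 days of employment    30 days of employment
Benefit amount    Flat $150,000    Flat $100,000    Flat $60,000
---
Benefit amount    Flat $40,000    Flat $20,000
Waiting Period    30 days of employment    30 days of employment
TXT;

print_r(extractRow($input_lines, 'Waiting Period'));
```

With this sample input, the call returns two rows, the first with three column values and the second with two. Passing a different label, such as 'Benefit amount', selects the other rows in the same way.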