PHP Regex to extract columns of text delimited by multiple spaces


I have a chunk of text extracted from a tabular layout that resembles this:

Waiting Period                             30 days of employment                 30 days of employment                    30 days of employment
 Benefit amount                                   Flat $150,000                        Flat $100,000                              Flat $60,000
 Maximum benefit                                    $150,000                              $100,000                                  $60,000
 Contributions                                  Noncontributory                       Noncontributory                           Noncontributory
   Participation requirement                         100.00%                               100.00%                                  100.00%
---
Benefit amount                                            Flat $40,000                                                  Flat $20,000
 Maximum benefit                                             $40,000                                                       $20,000
 Compulsory coverage                                            Yes                                                           Yes
 Contributions                                           Noncontributory                                               Noncontributory
Waiting Period                                        30 days of employment                                      30 days of employment

Phrases like Waiting Period or Contributions are the row labels. Each label is followed by a variable number of columns, separated by variable amounts of whitespace.

I am struggling to land on a regular expression that can target a particular row by its label and then extract the contents of however many columns follow it. I think I have constructed the label identifier and the capture group for the columns correctly, but the expression seems to stop after the first column.

(?:\s*Waiting Period)(?:(?:\s{2,})(.*?)(?:\s{2,}|\n|$))

The above expression in preg_match_all:

preg_match_all("/(?:\s*Waiting Period)(?:(?:\s{2,})(.*?)(?:\s{2,}|\n))/", $input_lines, $output_array);

produces:

array(2) {
0   =>  array(2){
                    0   =>  Waiting Period                             30 days of employment
                    1   =>  
                            Waiting Period                                        30 days of employment 
                }
1   =>  array(2){
                    0   =>  30 days of employment
                    1   =>  30 days of employment   
                }
}

As you can see, the match correctly identifies the target rows by their label, extracts the first column, and then quits. I don't know how to make it keep going so that each matched row is processed all the way to the end of its line.

My question is: is my regex approach salvageable to accomplish my objective? Or, have I misunderstood preg_match_all and will only ever get one instance of the capture subgroup?

Answer by Avinash Raj:

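This happens because the trailing group (?:\s{2,}|\n) matches two or more whitespace characters or a newline, so the lazy (.*?) stops as soon as it reaches the first run of whitespace after the first column. A capture group also holds only one value per match, so it can never return several columns from the same row. Capture the rest of the line instead; in preg_match_all, add the m modifier so that ^ matches at the start of every line: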

^\s*Waiting Period\s{2,}(.*)
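
To turn the captured remainder of each line into separate columns, one option is to split it on runs of two or more whitespace characters. The following is only a minimal sketch under a few assumptions: the helper name extractColumns and the variable $sample are made up for illustration, \h (horizontal whitespace) is swapped in for \s around the label so the match cannot spill onto the next line, and values are assumed never to contain two consecutive spaces.

```php
<?php
// Hypothetical helper (not from the original post): for every line that
// starts with $label, return the list of column values found on that line.
function extractColumns(string $text, string $label): array
{
    // m: ^ and $ match at every line boundary.
    // \h matches horizontal whitespace only, so the label match stays on one line.
    $pattern = '/^\h*' . preg_quote($label, '/') . '\h{2,}(.*)$/m';
    preg_match_all($pattern, $text, $matches);

    $rows = [];
    foreach ($matches[1] as $rest) {
        // Split the remainder of the line on gaps of two or more whitespace characters.
        $rows[] = preg_split('/\s{2,}/', trim($rest), -1, PREG_SPLIT_NO_EMPTY);
    }
    return $rows;
}

// Shortened sample input in the same shape as the text in the question.
// A nowdoc is used so that the dollar signs are not treated as variables.
$sample = <<<'TXT'
Waiting Period          30 days of employment        30 days of employment        30 days of employment
 Benefit amount              Flat $150,000               Flat $100,000               Flat $60,000
---
 Benefit amount                   Flat $40,000                      Flat $20,000
Waiting Period               30 days of employment             30 days of employment
TXT;

// Prints one entry per matching row; each entry is the array of its columns.
print_r(extractColumns($sample, 'Waiting Period'));
```

Splitting after the match keeps the pattern simple and returns the columns grouped by row, which a single preg_match_all capture group cannot do on its own.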
