preg_split regex for splitting on a range but retaining some of that range as the string suffix

I am working with the following text:

This is some text
which I am working   on.
This text has whitespace before the new line but after this word 
Another line.

I am using preg_split to split on Unicode separator characters and on every other whitespace character except the newline, like so:

preg_split("/\p{Z}|[^\S\n]/u", $data, -1, PREG_SPLIT_OFFSET_CAPTURE);

The PREG_SPLIT_OFFSET_CAPTURE flag is there because I absolutely need to retain the positions of the strings.

I would like preg_split to keep each newline with its preceding word. At the moment a newline can end up at the beginning of the following word, or even as an element on its own.

Expected output when working correctly:

This
is
some
text\n
which
I
am
working
on.\n
This
text
has
whitespace
before
the
new
line
but
after
this
word\n
Another
line.

Can anyone explain how this might be achieved? Thanks

1 answer

Avinash Raj

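Use a lookbehind to split at the zero-width boundary that exists right after each newline character, and a negative lookahead so that a space directly before a newline is not consumed as a delimiter.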

<?php
$str = <<<EOT
This is some text
which I am working   on.
This text has whitespace before the new line but after this word 
Another line.
EOT;
$splits = preg_split("~(?<=\n)|\p{Z}+(?!\n)~", $str);
print_r($splits);
?>

Output:

Array
(
    [0] => This
    [1] => is
    [2] => some
    [3] => text

    [4] => which
    [5] => I
    [6] => am
    [7] => working
    [8] => on.

    [9] => This
    [10] => text
    [11] => has
    [12] => whitespace
    [13] => before
    [14] => the
    [15] => new
    [16] => line
    [17] => but
    [18] => after
    [19] => this
    [20] => word 

    [21] => Another
    [22] => line.
)
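
The question also uses the /u modifier and PREG_SPLIT_OFFSET_CAPTURE, which the snippet above leaves out. As a minimal sketch, assuming the same sample text, the same pattern can be combined with both. The variable name $parts is only illustrative, and the offsets preg_split reports are byte offsets, not character offsets, even with /u.

```php
<?php
// Sample text from the question; the third line ends with a space before "\n".
$str = "This is some text\n"
     . "which I am working   on.\n"
     . "This text has whitespace before the new line but after this word \n"
     . "Another line.";

// Same pattern as in the answer, with /u added so \p{Z} is matched against
// UTF-8 input, and PREG_SPLIT_OFFSET_CAPTURE so every piece carries its offset.
$parts = preg_split('~(?<=\n)|\p{Z}+(?!\n)~u', $str, -1, PREG_SPLIT_OFFSET_CAPTURE);

// Each element is [piece, byteOffset], e.g. $parts[3] is ["text\n", 13]
// and $parts[8] is ["on.\n", 39].
foreach ($parts as [$piece, $offset]) {
    printf("%3d => %s\n", $offset, var_export($piece, true));
}
```

Note that the space before the third newline is still kept ("word \n"), exactly as in the output above, because the negative lookahead refuses to split on it. If that trailing space is unwanted, it would have to be trimmed in a separate step, since a split alone cannot join "word" and "\n" across the space.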