The goal of my regex is to split on any Unicode whitespace except newline, and to ensure that each newline character stays attached to the preceding non-whitespace token. This currently works, but only when a single whitespace character precedes the \n.
My current regex:
$data = "the\nquick\n brown fox jumped \nover the lazy dog.";
$tokenized = preg_split("~(?<=\n)|\p{Z}+(?!\n)~u", $data, -1, PREG_SPLIT_OFFSET_CAPTURE);
Current result (I have written \n wherever a newline character is present in a token):
Array
(
    [0] => Array
        (
            [0] => the\n
            [1] => 0
        )
    [1] => Array
        (
            [0] => quick\n
            [1] => 4
        )
    [2] => Array
        (
            [0] =>
            [1] => 10
        )
    [3] => Array
        (
            [0] => brown
            [1] => 11
        )
    [4] => Array
        (
            [0] => fox
            [1] => 17
        )
    [5] => Array
        (
            [0] => jumped
            [1] => 21
        )
    [6] => Array
        (
            [0] => \n
            [1] => 31
        )
    [7] => Array
        (
            [0] => over
            [1] => 33
        )
    [8] => Array
        (
            [0] => the
            [1] => 38
        )
    [9] => Array
        (
            [0] => lazy
            [1] => 42
        )
    [10] => Array
        (
            [0] => dog.
            [1] => 47
        )
)
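A plausible reading of this result is that it comes from backtracking in `\p{Z}+(?!\n)`. With one space before the newline, the lookahead fails and there is nothing to give back, so no split happens. With two or more, the quantifier gives back one character, the lookahead then sees a space instead of a newline, and the match succeeds one character short. The reduced inputs below are made up for illustration and are not taken from the real data:

```php
<?php
// Same pattern as above, applied to two hypothetical reduced inputs.
$pattern = "~(?<=\n)|\p{Z}+(?!\n)~u";

// One space before the newline: \p{Z}+ cannot backtrack, so nothing is split there.
$one = preg_split($pattern, "jumped \nover");

// Two spaces before the newline: \p{Z}+ backtracks to a single space,
// the lookahead then sees the second space (not "\n"), and the split succeeds.
$two = preg_split($pattern, "jumped  \nover");

var_dump($one); // "jumped \n", "over"
var_dump($two); // "jumped", " \n", "over"
```

The leftover " \n" piece in the second case looks like the stray newline token in the output above.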
Expected result:
Array
(
    [0] => Array
        (
            [0] => the\n
            [1] => 0
        )
    [1] => Array
        (
            [0] => quick\n
            [1] => 4
        )
    [2] => Array
        (
            [0] => brown
            [1] => 10
        )
    [3] => Array
        (
            [0] => fox
            [1] => 16
        )
    [4] => Array
        (
            [0] => jumped\n
            [1] => 20
        )
    [5] => Array
        (
            [0] => over
            [1] => 27
        )
    [6] => Array
        (
            [0] => the
            [1] => 32
        )
    [7] => Array
        (
            [0] => lazy
            [1] => 36
        )
    [8] => Array
        (
            [0] => dog.
            [1] => 41
        )
)
Any advice greatly appreciated. Thanks.
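One possible direction, as a minimal sketch rather than a definitive answer, is to match tokens instead of splitting on separators. A split can only return contiguous pieces of the input, so "jumped \n" could never come back as "jumped\n" without a second step. The helper name `tokenize` is hypothetical, and the sketch assumes that "whitespace" means `\p{Z}` exactly as in the pattern above (so a tab would count as a token character) and that a newline with no word before it can be dropped. The offsets it reports are byte offsets into the original `$data`, so they will not necessarily match the numbers in the expected result.

```php
<?php
// Sketch only: tokenize() is a hypothetical helper, not a PHP built-in.
// Assumption: whitespace is \p{Z}, as in the original pattern.
function tokenize(string $data): array
{
    // A token is a run of characters that are neither separators nor newlines,
    // optionally followed by any separators and a single newline.
    preg_match_all('~[^\p{Z}\n]+(?:\p{Z}*\n)?~u', $data, $matches, PREG_OFFSET_CAPTURE);

    $tokens = [];
    foreach ($matches[0] ?? [] as [$text, $offset]) {
        // Remove separators sitting between the word and its newline,
        // so "jumped \n" becomes "jumped\n". The offset is left untouched.
        $tokens[] = [preg_replace('~\p{Z}+(?=\n)~u', '', $text), $offset];
    }
    return $tokens;
}

$data = "the\nquick\n brown fox jumped \nover the lazy dog.";
$tokenized = tokenize($data);

// Token text only: the\n, quick\n, brown, fox, jumped\n, over, the, lazy, dog.
print_r(array_column($tokenized, 0));
```

Each entry keeps the same `[text, offset]` shape that `PREG_SPLIT_OFFSET_CAPTURE` produces, so code that consumes the result of `preg_split` should need little change.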