preg_split regex lookbehind for multiple matches

The goal of my regex is to split on any Unicode whitespace except newline, and to ensure that each newline character stays attached to the non-whitespace token that precedes it. This currently works, but only when a single whitespace character comes before the \n.

Using my current Regex:

    $data  = "the\nquick\n brown fox jumped     \nover the lazy dog.";
    $tokenized = preg_split("~(?<=\n)|\p{Z}+(?!\n)~u", $data, -1, PREG_SPLIT_OFFSET_CAPTURE);

Current Result (I have added \n where the "\n" character is present):

Array
(
    [0] => Array
        (
            [0] => the\n

            [1] => 0
        )

    [1] => Array
        (
            [0] => quick\n

            [1] => 4
        )

    [2] => Array
        (
            [0] => 
            [1] => 10
        )

    [3] => Array
        (
            [0] => brown
            [1] => 11
        )

    [4] => Array
        (
            [0] => fox
            [1] => 17
        )

    [5] => Array
        (
            [0] => jumped
            [1] => 21
        )

    [6] => Array
        (
            [0] =>  \n
            [1] => 31
        )

    [7] => Array
        (
            [0] => over
            [1] => 33
        )

    [8] => Array
        (
            [0] => the
            [1] => 38
        )

    [9] => Array
        (
            [0] => lazy
            [1] => 42
        )

    [10] => Array
        (
            [0] => dog.
            [1] => 47
        )
)

Expected result:

Array
(
    [0] => Array
        (
            [0] => the\n
            [1] => 0
        )

    [1] => Array
        (
            [0] => quick\n
            [1] => 4
        )

    [2] => Array
        (
            [0] => brown
            [1] => 10
        )

    [3] => Array
        (
            [0] => fox
            [1] => 16
        )

    [4] => Array
        (
            [0] => jumped\n
            [1] => 20
        )

    [5] => Array
        (
            [0] => over
            [1] => 27
        )

    [6] => Array
        (
            [0] => the
            [1] => 32
        )

    [7] => Array
        (
            [0] => lazy
            [1] => 36
        )

    [8] => Array
        (
            [0] => dog.
            [1] => 41
        )
)

Any advice greatly appreciated. Thanks.
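
One possible approach, offered as a sketch rather than a definitive fix: preg_split only ever returns contiguous pieces of the subject string, so a token such as "jumped\n" cannot come out of "jumped     \n" unless the spaces in between are removed first. The example below therefore normalizes the string before splitting. The helper name tokenize() is hypothetical, the input is assumed to be valid UTF-8, and the offsets it reports refer to the normalized string (which is what the expected offsets above appear to assume, e.g. "brown" at 10).

```php
<?php
// Sketch only: tokenize() is a hypothetical helper name, not part of the question.
function tokenize(string $data): array
{
    // Step 1: remove Unicode separators (\p{Z}) that touch a newline,
    // so "jumped     \n" becomes "jumped\n" and "\n brown" becomes "\nbrown".
    $normalized = preg_replace('~\p{Z}*\n\p{Z}*~u', "\n", $data);

    // Step 2: split immediately after each newline (lookbehind),
    // or on any remaining run of separators.
    return preg_split(
        '~(?<=\n)|\p{Z}+~u',
        $normalized,
        -1,
        PREG_SPLIT_NO_EMPTY | PREG_SPLIT_OFFSET_CAPTURE
    );
}

$data = "the\nquick\n brown fox jumped     \nover the lazy dog.";
$tokenized = tokenize($data);
print_r($tokenized);
```

Note that \p{Z} does not include \n (a control character), so the negative lookahead from the original pattern is not needed once the string has been normalized. If offsets into the original, unnormalized string are required, this sketch would not be sufficient as written.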
