preg_split regex lookback for multiple matches

Asked by Drew on 19 November 2014 at 14:32 (87 views)

The goal of my regex is to split on any Unicode whitespace except newline, and to ensure that each newline character stays attached to the preceding non-whitespace token. Currently this works, but only when a single whitespace character precedes the \n.

My current regex:

    $data  = "the\nquick\n brown fox jumped     \nover the lazy dog.";
    $tokenized = preg_split("~(?<=\n)|\p{Z}+(?!\n)~u", $data, -1, PREG_SPLIT_OFFSET_CAPTURE);

Current result (I have written \n wherever a newline character is present):

Array
(
    [0] => Array
        (
            [0] => the\n

            [1] => 0
        )

    [1] => Array
        (
            [0] => quick\n

            [1] => 4
        )

    [2] => Array
        (
            [0] => 
            [1] => 10
        )

    [3] => Array
        (
            [0] => brown
            [1] => 11
        )

    [4] => Array
        (
            [0] => fox
            [1] => 17
        )

    [5] => Array
        (
            [0] => jumped
            [1] => 21
        )

    [6] => Array
        (
            [0] =>  \n
            [1] => 31
        )

    [7] => Array
        (
            [0] => over
            [1] => 33
        )

    [8] => Array
        (
            [0] => the
            [1] => 38
        )

    [9] => Array
        (
            [0] => lazy
            [1] => 42
        )

    [10] => Array
        (
            [0] => dog.
            [1] => 47
        )
)
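The stray " \n" token at offset 31 appears to come from backtracking: `\p{Z}+` first takes all five spaces, the lookahead `(?!\n)` rejects that match, and the engine then gives back one space so that the lookahead passes. A minimal sketch that isolates this behaviour follows; the shortened `$sample` string and the variable names are assumptions chosen for illustration, and the possessive variant is shown only for contrast.

```php
<?php
// Shortened sample (an assumption for illustration): five spaces before the newline.
$sample = "jumped     \nover";

// Greedy quantifier: backtracks from five spaces to four so that (?!\n) passes.
preg_match('~\p{Z}+(?!\n)~u', $sample, $greedy, PREG_OFFSET_CAPTURE);
$greedyLength = strlen($greedy[0][0]); // number of spaces consumed by the delimiter
$greedyOffset = $greedy[0][1];         // byte offset where the delimiter starts

// Possessive quantifier (++): cannot give spaces back, so no delimiter matches here.
$possessiveCount = preg_match('~\p{Z}++(?!\n)~u', $sample);

var_dump($greedyLength, $greedyOffset, $possessiveCount);
```

The possessive form only stops the run of spaces from being split; the spaces themselves would still remain inside the token.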

Expected result:

Array
(
    [0] => Array
        (
            [0] => the\n
            [1] => 0
        )

    [1] => Array
        (
            [0] => quick\n
            [1] => 4
        )

    [2] => Array
        (
            [0] => brown
            [1] => 10
        )

    [3] => Array
        (
            [0] => fox
            [1] => 16
        )

    [4] => Array
        (
            [0] => jumped\n
            [1] => 20
        )

    [5] => Array
        (
            [0] => over
            [1] => 27
        )

    [6] => Array
        (
            [0] => the
            [1] => 32
        )

    [7] => Array
        (
            [0] => lazy
            [1] => 36
        )

    [8] => Array
        (
            [0] => dog.
            [1] => 41
        )
)

Any advice greatly appreciated. Thanks.
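
One possible approach, offered as a sketch rather than a confirmed answer: the expected offsets (10 for "brown", 20 for "jumped\n") seem to describe a string in which the whitespace touching a newline has already been removed. A single preg_split call cannot produce that, because every token it returns is a contiguous slice of the input. The helper name `tokenize_keep_newlines` is hypothetical, and the two-step design (normalize, then split) is an assumption about what is wanted.

```php
<?php
// Hypothetical helper (not from the original post): normalize first, then split.
function tokenize_keep_newlines(string $data): array
{
    // Step 1: delete runs of Unicode separators (\p{Z}) that directly
    // precede or follow a newline, so "jumped     \n" becomes "jumped\n".
    $normalized = preg_replace('~\p{Z}+(?=\n)|(?<=\n)\p{Z}+~u', '', $data);

    // Step 2: split right after each newline, or on the remaining separator runs.
    // PREG_SPLIT_NO_EMPTY drops empty tokens; offsets refer to $normalized.
    return preg_split(
        '~(?<=\n)|\p{Z}+~u',
        $normalized,
        -1,
        PREG_SPLIT_NO_EMPTY | PREG_SPLIT_OFFSET_CAPTURE
    );
}

$data = "the\nquick\n brown fox jumped     \nover the lazy dog.";
$tokenized = tokenize_keep_newlines($data);
print_r($tokenized);
```

On the sample string this yields the nine tokens listed under "Expected result", with offsets measured against the normalized string rather than the original input.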
