Split text into an array of N elements with a balanced number of words in each element


I want to split a large text into 10 pieces (roughly equal parts). I use this function:

function chunk($msg) {
    $msg = preg_replace('/[\r\n]+/', ' ', $msg);
    // define the character length of each text piece
    $chunks = wordwrap($msg, 10000, "\n");
    return explode("\n", $chunks);
}

$arrayys = chunk($t);
foreach ($arrayys as $partt) {
    echo $partt . "<br/><br/><br/>";
}

But is it possible to define the number of words in each text piece (instead of the character length)? How can I split the text into words in such a situation?


There are 4 answers

user1915746 (BEST ANSWER)

I would suggest using explode() (http://php.net/manual/en/function.explode.php) to split the string by spaces. You will then get an array of words that you can iterate over to build your text pieces, as sketched below.
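A minimal sketch of that approach (the chunkWords() helper and the fixed group count of 10 are illustrative, not part of the answer):

<?php
// Hypothetical helper: split $msg into roughly $groups pieces with a
// balanced number of words in each piece.
function chunkWords(string $msg, int $groups = 10): array {
    $msg = preg_replace('/\s+/', ' ', trim($msg));            // collapse whitespace, as in the question
    $words = explode(' ', $msg);                              // split the string by spaces
    $perGroup = max(1, (int) ceil(count($words) / $groups));  // balanced word count per piece
    return array_map(
        fn(array $group) => implode(' ', $group),             // rebuild each text piece
        array_chunk($words, $perGroup)
    );
}

foreach (chunkWords($t) as $part) {
    echo $part . "<br/><br/><br/>";
}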

Shivanshu

From the docs:

<?php
$text = "ABCDEFGHIJK.";
$newtext = wordwrap($text,3,"\n",true);
echo "$newtext\n";
?>

OUTPUT:

ABC
DEF
GHI
JK.

Shankar Narayana Damodaran

You can do something like this; it breaks your text into equal parts. The text in $str is 20 characters long, so it is broken into 10 parts of 2 characters each.

Say your large text is 1000 characters long; you would then get 10 equal parts of 100 characters each.

<?php
$div = 10; // split equally into 10 parts (note: by characters, not by words)
$str = "abcdefghijklmnopqrst";
print_r(array_chunk(str_split($str), strlen($str) / $div));

OUTPUT:

Array
(
    [0] => Array
        (
            [0] => a
            [1] => b
        )

    [1] => Array
        (
            [0] => c
            [1] => d
        )

    [2] => Array
        (
            [0] => e
            [1] => f
        )

    [3] => Array
        (
            [0] => g
            [1] => h
        )

    [4] => Array
        (
            [0] => i
            [1] => j
        )

    [5] => Array
        (
            [0] => k
            [1] => l
        )

    [6] => Array
        (
            [0] => m
            [1] => n
        )

    [7] => Array
        (
            [0] => o
            [1] => p
        )

    [8] => Array
        (
            [0] => q
            [1] => r
        )

    [9] => Array
        (
            [0] => s
            [1] => t
        )

)
mickmackusa
  • find the offset of each "word" in the text,
  • count the words then divide by 10 to determine the desired number of words per group,
  • isolate the first offset of each group,
  • extract segments of the original text between one group's first word offset and the next group's first word offset.

Code: (Demo)

$offsets = array_keys(str_word_count($text, 2)); // offset of each word in the text
$totalPerGroup = intdiv(count($offsets), 10);    // desired number of words per group
$chunks = array_chunk($offsets, $totalPerGroup); // group the word offsets
$starts = array_column($chunks, 0);              // first word offset of each group
var_export(
    array_map(
        // extract from one group's first offset up to the next group's first offset
        fn($start, $end) => substr($text, $start, $end ? $end - $start : $end),
        $starts,
        array_slice($starts, 1) + [null]
    )
);

Sample input:

$text = <<<TEXT
The answer was within her reach. It was hidden in a box and now that box sat directly in front of her. She'd spent years searching for it and could hardly believe she'd finally managed to find it. She turned the key to unlock the box and then gently lifted the top. She held her breath in anticipation of finally knowing the answer she had spent so much of her time in search of. As the lid came off she could see that the box was empty.
TEXT;

Output:

array (
  0 => 'The answer was within her reach. It was ',
  1 => 'hidden in a box and now that box ',
  2 => 'sat directly in front of her. She\'d spent ',
  3 => 'years searching for it and could hardly believe ',
  4 => 'she\'d finally managed to find it. She turned ',
  5 => 'the key to unlock the box and then ',
  6 => 'gently lifted the top. She held her breath ',
  7 => 'in anticipation of finally knowing the answer she ',
  8 => 'had spent so much of her time in ',
  9 => 'search of. As the lid came off she ',
  10 => 'could see that the box was empty.',
)

Of course, to remove the trailing whitespace from each piece, wrap the substr() call in rtrim(), for example:
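A minimal tweak of the array_map() call above (same variables as in the demo code):

var_export(
    array_map(
        // rtrim() drops the trailing space that substr() keeps
        fn($start, $end) => rtrim(substr($text, $start, $end ? $end - $start : $end)),
        $starts,
        array_slice($starts, 1) + [null]
    )
);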