How to prevent Simple HTML DOM plaintext from concatenating words together for sequential <div> elements

Question

1.1k views Asked by limeygent At 20 December 2016 at 16:14

I'm parsing a web page which contains the following excerpt:

<div>foo</div><div>bar</div>

With the following code:

$html = file_get_html("http://example.com");
$pt = $html->plaintext;
echo $pt;

$pt returns "foobar". What I want is "foo bar", i.e. add a blank space between words that are in separate elements.

There are other elements besides <div> where I'm seeing this behaviour, so the solution has to be generic across all element types that can contain viewable text.

Is there a way to either manipulate the $html object to add spaces between elements or have plaintext add spaces after each word it finds? I can deal with having a double space in the resulting $pt.

I tried $html = str_replace("</", " </", $html); but the result was null, likely because I was trying to edit an object rather than a string, and the object got mangled.
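A minimal sketch of what is probably going on here, assuming the parsed object can be converted to a string: str_replace() does not edit the object in place, it returns a new plain string, so a later ->plaintext lookup has no object to work on. FakeDocument below is a hypothetical stand-in written for this illustration, not the Simple HTML DOM class.

```php
<?php
// Hypothetical stand-in for a parsed document object; NOT the Simple HTML DOM class.
class FakeDocument
{
    private $markup;

    public function __construct($markup)
    {
        $this->markup = $markup;
    }

    // Lets PHP convert the object to a string wherever a string is expected.
    public function __toString()
    {
        return $this->markup;
    }
}

$doc = new FakeDocument('<div>foo</div><div>bar</div>');

// str_replace() converts its subject to a string and returns a string.
$result = str_replace('</', ' </', $doc);

var_dump(is_object($doc));     // the original object is left untouched
var_dump(is_string($result));  // the return value is a plain string, not an object
```

If this is the cause, doing the replacement on the raw markup before it is parsed avoids the problem.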

Update

Based on some feedback, I tried the following:

$webString = file_get_contents("http://example.com");
$webString = str_replace("</", " </", $webString);  // add a space before every closing tag

$html = new simple_html_dom();
$html->load($webString);

$pt = $html->plaintext;
echo $pt;

This has the desired result, but I don't know if there is a more efficient way of doing this.

There are 4 answers

dave On 20 December 2016 at 16:25

The text gets concatenated when you read the plaintext property of the whole document. The following should get you an array of the divs instead, so each one stays separate.

$html = file_get_html("http://example.com");
$pt = $html->find('div');
print_r($pt);
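Once the text is available piece by piece rather than as one string, the pieces can be joined with any separator. The sketch below illustrates that idea with a swapped-in technique: PHP's bundled DOM extension (DOMDocument and DOMXPath) rather than Simple HTML DOM. It collects every text node, so it is not limited to div elements. The function name textNodesJoined is hypothetical, and the snippet assumes the DOM extension is enabled.

```php
<?php
// Swapped-in technique: PHP's bundled DOM extension, not Simple HTML DOM.
// textNodesJoined() is a hypothetical helper name used only for this sketch.
function textNodesJoined(string $markup): string
{
    $doc = new DOMDocument();

    // Collect libxml warnings internally instead of printing them for imperfect markup.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($markup);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $parts = [];

    // Visit every text node except those inside <script> or <style>.
    $query = '//text()[not(ancestor::script) and not(ancestor::style)]';
    foreach ($xpath->query($query) as $node) {
        $text = trim($node->textContent);
        if ($text !== '') {
            $parts[] = $text;
        }
    }

    // Join the pieces with a single space.
    return implode(' ', $parts);
}

echo textNodesJoined('<div>foo</div><div>bar</div>'), PHP_EOL; // foo bar
```

Note that this also inserts a space wherever inline markup splits a word, for example around a bold fragment in the middle of a word.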

John Joseph On 21 December 2016 at 10:08

If you use file_get_contents to get a string rather than an object of the HTML, you can use preg_match_all to get all div tags, then apply strip_tags to every matching tag using array_walk to leave you with the value.

Try this:

$str = file_get_contents("some_file_with_your_html.php");
// Assume the above returns something like the below
$str = "<div>sdsd</div><div id='some_id_1' attribute>test</div><div><div>inside</div></div><div><h1>header</h1></div><p>sdscdsds</p><div>another</div>";

// Match each <div>, with any optional attributes, up to the nearest closing </div>
// (the U modifier makes the quantifiers lazy)
$tagsFound = preg_match_all('|<div\b[^>]*>(.*)</div>|U', $str, $matches);
if ($tagsFound) {
    // Apply the anonymous function to each array value, leaving only its text
    array_walk($matches[0], function (&$value, $index) {
        $value = strip_tags($value);
    });
}

This will leave you with an array of text within the HTML:

print ('<pre>');
print_r($matches[0]);
print ('</pre>');

Array
    (
        [0] => sdsd
        [1] => test
        [2] => inside
        [3] => header
        [4] => another
    )

You can then do an implode on the resulting array to get your words separated by a space, if you need to.
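As a minimal sketch of that last step, assuming the array looks like the output shown above ($words and $sentence are names chosen for this illustration):

```php
<?php
// Sample values copied from the output above.
$words = ['sdsd', 'test', 'inside', 'header', 'another'];

// Join the entries with a single space between them.
$sentence = implode(' ', $words);

echo $sentence, PHP_EOL; // sdsd test inside header another
```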

References:

https://www.php.net/manual/en/function.preg-match-all.php

https://www.php.net/manual/en/function.array-walk.php

https://www.php.net/manual/en/function.strip-tags.php

https://www.php.net/manual/en/pcre.pattern.php

tawfiq On 05 September 2020 at 10:10

I had this problem too: I wanted the plaintext with the bold parts preserved, and I ran into the same concatenation problem. The approach is simple. First, find all the bold text and store it in an array. Next, grab the inner text of the element you want. Finally, strip the tags. (One extra step, which applied only to my case, was replacing the strings from the bold array with text from a table.)

$elements = $html->find('p');
foreach ($elements as $key => $element) {
    $text = $element->innertext;
    $text = strip_tags($text);
    // one extra step, for my case only: replace the bold texts here
}

limeygent On 07 January 2017 at 02:36 (Accepted Answer)

Because there is no way to determine in advance which elements will contribute text to plaintext, you can read the complete page as a string and then use str_replace to insert a space before each closing tag (</htmltag>).

The other suggested answers in here depend on knowing what elements contain readable text, but this is not known ahead of time.

This seems to generate the desired effect:

require_once 'simple_html_dom.php';  // assumed path; adjust to where your copy of the library lives

$webString = file_get_contents("http://example.com");
$webString = str_replace("</", " </", $webString);  // add a space before every closing tag

$html = new simple_html_dom();
$html->load($webString);

$pt = $html->plaintext;
echo $pt;
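The effect of the replacement can be checked without fetching a page. The sketch below is only an approximation under stated assumptions: it uses PHP's built-in strip_tags() as a stand-in for plaintext (the two are not guaranteed to behave identically), and it additionally collapses the double spaces the replacement can leave behind. The function name spacedPlainText is hypothetical.

```php
<?php
// spacedPlainText() is a hypothetical helper; strip_tags() stands in for
// Simple HTML DOM's plaintext and may differ from it on real pages.
function spacedPlainText(string $markup): string
{
    // Add a space before every closing tag so adjacent elements do not run together.
    $spaced = str_replace('</', ' </', $markup);

    // Remove the tags, keeping only the text between them.
    $text = strip_tags($spaced);

    // Collapse runs of whitespace into one space and trim both ends.
    return trim(preg_replace('/\s+/', ' ', $text));
}

echo spacedPlainText('<div>foo</div><div>bar</div>'), PHP_EOL; // foo bar
```

Without the str_replace() step, the same input yields "foobar", which is the behaviour described in the question.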
