How to prevent Simple HTML DOM plaintext from concatenating words together for sequential <div> elements


I'm parsing a web page which contains the following excerpt:

<div>foo</div><div>bar</div>

With the following code:

$html = file_get_html("http://example.com");
$pt = $html->plaintext;
echo $pt;

$pt contains "foobar". What I want is "foo bar", i.e. a blank space added between words that come from separate elements.

There are other elements besides <div> where I'm seeing this behaviour, so the solution has to be generic across all element types that can contain viewable text.

Is there a way to either manipulate the $html object to add spaces between elements or have plaintext add spaces after each word it finds? I can deal with having a double space in the resulting $pt.

I tried $html = str_replace("</", " </", $html);, but the result is null, probably because I'm trying to edit an object rather than a string, which leaves the object mangled.

Update

Based on some feedback, I tried the following:

$webString = file_get_contents("http://example.com");
$webString = str_replace ( "</" , " </" , $webString );  // add a space before all <tag> closures.

$html = new simple_html_dom();
$html->load($webString);

$pt = $html->plaintext;
echo $pt;

This has the desired result, but I don't know if there is a more efficient way of doing this.


There are 4 answers

Answer by limeygent (best answer)

Because there is no way to know in advance which elements will contribute text to plaintext, read the complete page as a string and use str_replace to insert a space before every closing tag (</tag>).

The other answers here depend on knowing which elements contain readable text, but that is not known ahead of time.

This seems to generate the desired effect:

$webString = file_get_contents("http://example.com");
$webString = str_replace ( "</" , " </" , $webString );  // add a space before all <tag> closures.

$html = new simple_html_dom();
$html->load($webString);

$pt = $html->plaintext;
echo $pt;
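
The effect of the replacement can be checked without fetching a page. The following is a minimal sketch, assuming PHP's built-in strip_tags as a stand-in for plaintext (the two are not identical) and a hypothetical helper named spacedText that is not part of Simple HTML DOM:

```php
<?php
// Hypothetical helper (not part of Simple HTML DOM): pads every closing tag
// with a space, strips the tags, then collapses runs of whitespace.
function spacedText(string $markup): string
{
    $padded = str_replace('</', ' </', $markup); // same replacement as above
    $text = strip_tags($padded);                 // stand-in for ->plaintext
    return trim(preg_replace('/\s+/', ' ', $text));
}

echo spacedText('<div>foo</div><div>bar</div>'), "\n"; // prints "foo bar"
```

The whitespace collapse is optional; it can be dropped if double spaces in the result are acceptable.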

Answer by dave

The text gets concatenated when you read the plaintext property. The following should get you an array of the div elements instead, so that each one can be handled separately.

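$html = file_get_html("http://example.com");
$divs = $html->find('div');  // array of element nodes, one per <div>
// print_r() on the node objects tends to produce very large output, so print each node's text instead
foreach ($divs as $div) {
    echo $div->plaintext, "\n";
}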

Answer by John Joseph

If you use file_get_contents to get the HTML as a string rather than an object, you can use preg_match_all to find all div tags, then use array_walk to apply strip_tags to every match, which leaves you with just the text.

Try this:

$str = file_get_contents("some_file_with_your_html.php");
// Assume the above returns something like the below
$str = "<div>sdsd</div><div id='some_id_1' attribute>test</div><div><div>inside</div></div><div><h1>header</h1></div><p>sdscdsds</p><div>another</div>";

// Match each <div>, with or without attributes, up to the nearest closing </div>
$tagsFound = preg_match_all('|<div\b[^>]*>(.*)</div>|Us', $str, $matches);
if ($tagsFound) {
    // Strip the remaining tags from each match, leaving only its text
    array_walk($matches[0], function (&$value) {
        $value = strip_tags($value);
    });
}

This leaves you with an array containing the text of each div in the HTML (text outside a div, such as the <p> in the sample string, is not matched):

print('<pre>');
print_r($matches[0]);
print('</pre>');

Array
(
    [0] => sdsd
    [1] => test
    [2] => inside
    [3] => header
    [4] => another
)

You can then do an implode on the resulting array to get your words separated by a space, if you need to.
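
As a minimal sketch of that last step, assuming the array holds the values printed above (the variable names $texts and $joined are illustrative only):

```php
<?php
// Sample values matching the array printed above.
$texts = ['sdsd', 'test', 'inside', 'header', 'another'];

// Join the extracted pieces with a single space between them.
$joined = implode(' ', $texts);

echo $joined, "\n"; // prints "sdsd test inside header another"
```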

References:

https://www.php.net/manual/en/function.preg-match-all.php

https://www.php.net/manual/en/function.array-walk.php

https://www.php.net/manual/en/function.strip-tags.php

https://www.php.net/manual/en/pcre.pattern.php

Answer by tawfiq

I ran into the same concatenation problem when I wanted to get the plain text while keeping track of the bold text. What I did: first, find all the bold elements and store their text in an array; next, grab the inner text of the element you want; finally, strip the tags. (One extra step, which applied only to my case, was to replace the bold text afterwards using the array.)

$elements = $html->find('p');
foreach ($elements as $element) {
    // take the element's inner HTML and strip the tags from it
    $text = strip_tags($element->innertext);
    // one extra step for my case only: replace the bold text here
}
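
The three steps can be sketched with built-in functions alone. This is only an illustration under assumptions: the sample string stands in for an element's innertext, bold text is assumed to be marked with plain <b> tags without attributes, and the names $inner, $bolds and $text are hypothetical.

```php
<?php
// Stand-in for $element->innertext (hypothetical sample markup).
$inner = 'This is <b>bold</b> and <b>strong</b> text';

// Step 1: collect the text of every <b>...</b> pair into an array.
preg_match_all('|<b>(.*?)</b>|s', $inner, $found);
$bolds = $found[1];

// Steps 2 and 3: take the inner markup and strip its tags.
$text = strip_tags($inner);

print_r($bolds);
echo $text, "\n"; // prints "This is bold and strong text"
```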