I'm parsing a web page which contains the following excerpt:
<div>foo</div><div>bar</div>
With the following code:
$html = file_get_html("http://example.com");
$pt = $html->plaintext;
echo $pt;
$pt returns "foobar". What I want is "foo bar", i.e. add a blank space between words that are in separate elements.
There are other elements besides <div> where I'm seeing this behaviour, so the solution has to be generic across all element types that can contain viewable text.
Is there a way to either manipulate the $html object to add spaces between elements or have plaintext add spaces after each word it finds? I can deal with having a double space in the resulting $pt.
I tried $html = str_replace("</", " </", $html); but the result is null, likely because I'm trying to edit an object, not a string, and then the object got mangled.
Update
Based on some feedback, I tried the following:
$webString = file_get_contents("http://example.com");
$webString = str_replace ( "</" , " </" , $webString ); // add a space before every closing tag
$html = new simple_html_dom();
$html->load($webString);
$pt = $html->plaintext;
echo $pt;
This has the desired result, but I don't know if there is a more efficient way of doing this.
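For reference, the preprocessing step from the update can be pulled into a small helper. This is only a sketch: addSpaceBeforeClosingTags and collapseWhitespace are hypothetical names, not part of Simple HTML DOM, and PHP's built-in strip_tags() stands in for plaintext so the snippet runs without the library.

```php
<?php
// Hypothetical helper (not part of Simple HTML DOM): pads every closing tag
// with a leading space so text from adjacent elements does not run together.
function addSpaceBeforeClosingTags(string $markup): string
{
    return str_replace('</', ' </', $markup);
}

// Optional clean-up: collapse the runs of whitespace the padding leaves behind.
function collapseWhitespace(string $text): string
{
    return trim(preg_replace('/\s+/', ' ', $text));
}

// Same excerpt as in the question.
$padded = addSpaceBeforeClosingTags('<div>foo</div><div>bar</div>');

// strip_tags() is an assumption used here in place of $html->plaintext.
echo collapseWhitespace(strip_tags($padded)), PHP_EOL; // prints "foo bar"
```

With the library itself, the padded string would be passed to $html->load() as in the update, and collapseWhitespace() could be applied to $html->plaintext if the double spaces matter.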
Because you have no way of determining ahead of time which elements will generate a result for plaintext, you can read the complete page as a string and then do a str_replace to add a space before each closing tag (</htmltag>).
The other suggested answers here depend on knowing which elements contain readable text, but this is not known ahead of time.
This seems to generate the desired effect: