I have the following HTML:
<html>
<body>
<p style="text-align:center;margin-bottom:0pt;margin-top:0pt;text-indent:0%;font-weight:bold;font-family:Times New Roman;font-size:10pt;font-style:normal;text-transform:none;font-variant: normal;">
<a name="_marker_1"></a>
<a name="bananabread"></a>
<font style="font-weight:bold;font-family:Times New Roman;font-size:10pt;font-style:normal;text-transform:none;font-variant: normal;">
<a name="bananabread"></a>Ban</font> <font style="font-weight:bold;font-family:Times New Roman;font-size:10pt;font-style:normal;text-transform:none;font-variant: normal;">ana Bread</font>
</p>
<p style="text-align:center;margin-top:10pt;margin-bottom:0pt;text-indent:0%;font-weight:bold;font-family:Times New Roman;font-size:10pt;font-style:normal;text-transform:none;font-variant: normal;">The Best You Ever Tasted</p>
<p style="margin-top:24pt;margin-bottom:0pt;text-indent:7.69%;font-style:italic;font-family:Times New Roman;font-size:10pt;font-weight:normal;text-transform:none;font-variant: normal;">If you don't agree that this is the best banana bread you have ever eaten well I would suggest you see your doctor</p>
<p style="margin-top:10pt;margin-bottom:0pt;text-indent:7.69%;font-family:Times New Roman;font-size:10pt;font-weight:normal;font-style:normal;text-transform:none;font-variant: normal;">Lots of text here describing what I am trying to capture</p>
<p style="text-align:center;margin-bottom:0pt;margin-top:0pt;text-indent:0%;font-weight:bold;font-family:Times New Roman;font-size:10pt;font-style:normal;text-transform:none;font-variant: normal;">
<a name="_marker_2"></a>
<a name="bananapudding"></a>
<font style="font-weight:bold;font-family:Times New Roman;font-size:10pt;font-style:normal;text-transform:none;font-variant: normal;">
<a name="bananapudding"></a>Banana</font>
<font style="font-weight:bold;font-family:Times New Roman;font-size:10pt;font-style:normal;text-transform:none;font-variant: normal;">Pudding</font>
</p>
<p style="text-align:center;margin-top:10pt;margin-bottom:0pt;text-indent:0%;font-weight:bold;font-family:Times New Roman;font-size:10pt;font-style:normal;text-transform:none;font-variant: normal;">Creamy and Satisfying</p>
<p style="margin-top:24pt;margin-bottom:0pt;text-indent:7.69%;font-style:italic;font-family:Times New Roman;font-size:10pt;font-weight:normal;text-transform:none;font-variant: normal;">This is the same recipe your mother used when you were ten!</p>
<p style="margin-top:10pt;margin-bottom:0pt;text-indent:7.69%;font-family:Times New Roman;font-size:10pt;font-weight:normal;font-style:normal;text-transform:none;font-variant: normal;">Lots of text here describing what I am trying to capture</p>
</body>
</html>
I am trying to write an XPath expression to identify Banana Bread. My initial efforts were successful:
b_tree.xpath('.//*[starts-with(text(),"Banana Bread")]')
But I noticed some error cases, and on investigation they look like the HTML above: another element is added inside the content I am searching for. Sometimes, as above, it is a possibly unneeded font element; sometimes it is an anchor.
I worked with this answer (Related) but have not been successful.
I can check for elements that have text_content(), clean up the text_content and then string-match to reach my ultimate goal, but I am hoping to learn to better apply XPath to these types of problems.
To be absolutely clear, I need the text_content of the p element, but sometimes I just need the text of a font element. My existing XPath expression works fine in the cases where there is no intervening element. When I open a page, I do not know what structure was imposed on the document.
When the text() expression is applied to an element whose text content is interrupted by other elements, it returns a node-set consisting of multiple text nodes, of which starts-with considers only the first. If you replace text() with a single dot (.), you get the text value of the element, which is the concatenation of all its descendant text nodes, and that's what you want.
But there is still a problem with the whitespace in an element like the <p> that holds the Banana Pudding heading. Writing spaces as dots and line feeds as underscores, the text value of this element is
_.._.._.._....Banana_..Pudding_
Therefore you must apply normalize-space, which normalizes this to Banana.Pudding, so that an expression such as
.//*[starts-with(normalize-space(.),"Banana Pudding")]
finds this occurrence.
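The three variants discussed so far can be compared side by side. This is only a sketch: it assumes lxml (which the question's b_tree.xpath call suggests), the snippet is a trimmed copy of the second heading with attributes omitted, and the variable names are made up for illustration.

```python
from lxml import html

# Trimmed copy of the "Banana Pudding" heading from the question
# (style attributes omitted; line breaks kept, since they are the problem).
snippet = """<html><body>
<p>
<a name="_marker_2"></a>
<a name="bananapudding"></a>
<font>
<a name="bananapudding"></a>Banana</font>
<font>Pudding</font>
</p>
<p>Creamy and Satisfying</p>
</body></html>"""

tree = html.fromstring(snippet)

# 1. text(): starts-with only looks at the first text node of each element.
by_text = tree.xpath('.//*[starts-with(text(), "Banana Pudding")]')

# 2. The dot: full text value, but line feeds are still in the way.
by_dot = tree.xpath('.//*[starts-with(., "Banana Pudding")]')

# 3. normalize-space(.): whitespace trimmed and collapsed before comparing.
by_norm = tree.xpath('.//*[starts-with(normalize-space(.), "Banana Pudding")]')

print([el.tag for el in by_text])
print([el.tag for el in by_dot])
print([el.tag for el in by_norm])
```

Only the third query should return the heading <p>. Because it uses *, enclosing elements whose text begins with the same words can match as well.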
However, Banana Bread cannot be found, because it does not exist on the page. The <p> that holds the first heading has a normalized text value of Ban.ana.Bread (spaces written as dots), and you don't expect the space inside the word Banana. normalize-space removes spaces and line feeds that are invisible on the rendered page, but the two spaces in Ban.ana.Bread are both visible.
If there were no space between the two <font> elements, an expression of the form .//*[starts-with(normalize-space(.),"Banana Bread")] would detect 3 elements: the <html>, the <body> and the <p>, because "Banana Bread" would be the first words in each of them. So you might do better to restrict the test to the element you actually want, for example the <p>, instead of matching every element with *.
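As a rough illustration of both points, the sketch below restricts the match to <p> elements and shows what the first heading's text really is. It assumes lxml; the helper name headings and the trimmed snippet are hypothetical, and the phrase is passed as an XPath variable so it does not have to be quoted inside the expression.

```python
from lxml import html

# Trimmed copy of the first heading from the question (attributes omitted).
# Note the space between the two <font> elements.
snippet = """<html><body>
<p>
<a name="_marker_1"></a>
<a name="bananabread"></a>
<font>
<a name="bananabread"></a>Ban</font> <font>ana Bread</font>
</p>
<p>The Best You Ever Tasted</p>
</body></html>"""

tree = html.fromstring(snippet)


def headings(root, phrase):
    """Return <p> elements whose whitespace-normalized text starts with phrase."""
    # $phrase is an XPath variable; lxml binds it from the keyword argument.
    return root.xpath(
        './/p[starts-with(normalize-space(.), $phrase)]', phrase=phrase
    )


# The text as it actually appears in the markup, with the space inside "Ban ana".
as_written = headings(tree, "Ban ana Bread")

# The text one would expect to search for; it is not on the page.
as_expected = headings(tree, "Banana Bread")

print(len(as_written), len(as_expected))
```

Matching on p rather than * keeps the enclosing elements out of the result, at the cost of missing headings that live in some other element.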