XPath "not" includes elements regardless

My code extracts only the text from the body of a web page. I am trying to exclude the text of class="menu" elements from the body of this page:

<div id="pre-header-links-inner" class="header-links"><ul id="menu-top-bar" class="menu"><li id="menu-item-22" class="menu-item menu-item-type-custom menu-item-object-custom menu-item-22"><a href="tel:000-000-0000">Main Line: +1 000-000-0000</a></li>
<li id="menu-item-23" class="menu-item menu-item-type-custom menu-item-object-custom menu-item-23"><a href="tel:100000000000">Sales: tel:000-000-0000</a></li>
<li id="menu-item-24" class="menu-item menu-item-type-custom menu-item-object-custom menu-item-24"><a href="mailto:[email protected]">Email: [email protected]</a></li>
</ul></div>         
        </div>
        </div>
        </div>
        <!-- #pre-header -->

        <div id="header">
        <div id="header-core">

            <div id="logo">
            <a href="https://www.example.com/" class="custom-logo-link" rel="home" itemprop="url"><img width="253" height="50" src="https://www.example.com/logo.png" class="custom-logo" alt="Domain" itemprop="logo" /></a>           </div>

            <div id="header-links" class="main-navigation">
            <div id="header-links-inner" class="header-links">

                <ul id="menu-main-navigation" class="menu"><li id="menu-item-71" class="menu-item menu-item-type-post_type menu-item-object-page menu-item-home current-menu-item page_item page-item-2 current_page_item"><a href="https://www.example.com/"><span>Home</span></a></li>
<li id="menu-item-70" class="menu-item menu-item-type-post_type menu-item-object-page"><a href="https://www.example.com"><span>About Us</span></a></li>
<li id="menu-item-108" class="menu-item menu-item-type-post_type menu-item-object-page"><a href="https://www.example.com/services/"><span>Services</span></a></li>
<li id="menu-item-124" class="menu-item menu-item-type-post_type menu-item-object-page"><a href="https://www.example.com/api/"><span>API</span></a></li>
<li id="menu-item-68" class="menu-item menu-item-type-post_type menu-item-object-page"><a href="https://www.example.com/contact-us/"><span>Contact Us</span></a></li>
</ul>               

            </div>
            </div>
            <!-- #header-links .main-navigation -->

            <div id="header-nav"><a class="btn-navbar" data-toggle="collapse" data-target=".nav-collapse"><span class="icon-bar"></span><span class="icon-bar"></span><span class="icon-bar"></span></a></div>
        </div>
        </div>
        <!-- #header -->

        <div id="header-responsive"><div id="header-responsive-inner" class="responsive-links nav-collapse collapse"><ul id="menu-main-navigation-1" class=""><li id="res-menu-item-71" class="menu-item menu-item-type-post_type menu-item-object-page menu-item-home current-menu-item page_item page-item-2 current_page_item"><a href="https://example.com/"><span>Home</span></a></li>
<li id="res-menu-item-70" class="menu-item menu-item-type-post_type menu-item-object-page"><a href="https://www.example.com/about-us/"><span>About Us</span></a></li>
<li id="res-menu-item-108" class="menu-item menu-item-type-post_type menu-item-object-page"><a href="https://www.example.com/services/"><span>Services</span></a></li>
<li id="res-menu-item-124" class="menu-item menu-item-type-post_type menu-item-object-page"><a href="https://www.example.com/api/"><span>API</span></a></li>
<li id="res-menu-item-68" class="menu-item menu-item-type-post_type menu-item-object-page"><a href="https://www.example.com/contact-us/"><span>Contact Us</span></a></li>
</ul></div></div>
                <div id="header-sticky">
        <div id="header-sticky-core">

            <div id="logo-sticky">
            <a href="https://www.example.com/" class="custom-logo-link" rel="home" itemprop="url"><img width="253" height="50" src="https://www.example.com/logo.png" class="custom-logo" alt="Logo" itemprop="logo" /></a>         </div>

            <div id="header-sticky-links" class="main-navigation">
            <div id="header-sticky-links-inner" class="header-links">

                <ul id="menu-main-navigation-2" class="menu"><li id="menu-item-71" class="menu-item menu-item-type-post_type menu-item-object-page menu-item-home current-menu-item page_item page-item-2 current_page_item"><a href="https://www.example.com/"><span>Home</span></a></li>
<li id="menu-item-70" class="menu-item menu-item-type-post_type menu-item-object-page"><a href="https://www.example.com/about-us/"><span>About Us</span></a></li>
<li id="menu-item-108" class="menu-item menu-item-type-post_type menu-item-object-page"><a href="https://www.example.com/services/"><span>Services</span></a></li>
<li id="menu-item-124" class="menu-item menu-item-type-post_type menu-item-object-page"><a href="https://www.example.com/api/"><span>API</span></a></li>
<li id="menu-item-68" class="menu-item menu-item-type-post_type menu-item-object-page"><a href="https://www.example.com/contact-us/"><span>Contact Us</span></a></li>
</ul>   

The strange thing is that when I run the following line:

text = "".join(tree.xpath("//body//*[not(@class='menu')]//text()")).strip()

it returns the entire plain-text content as-is (i.e., including the text from the class="menu" elements).

However, when I remove the not keyword:

text = "".join(tree.xpath("//body//*[(@class='menu')]//text()")).strip()

... it correctly identifies the class="menu" elements and isolates their text perfectly:

Main Line: +000-000-0000
Sales: +1 000-000-0000
Email: [email protected]
Home
About Us
Services
API
Contact Us
Home
About Us
Services
API
Contact Us

What am I doing wrong? I'd like it to return the text from everything EXCEPT elements where class='menu'.

There is 1 answer

Michael Kay

it returns the entire plain-text source code

You need to be clear about the distinction between what the XPath expression SELECTS, and what the application handling the XPath result DISPLAYS.

XPath returns a set of nodes, and it is very common practice for the calling application to display each of those nodes by showing the entire subtree rooted at that node. But it's not XPath that's doing this; it's the calling application. Your selection criteria determine what nodes are selected by the XPath expression, but they don't affect which descendants of those selected nodes are displayed by the calling application.
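
For the specific query in the question, a likely explanation (my reading, not something the answer above states) is that `//body//*[not(@class='menu')]` also selects the elements around and inside each menu: the wrapping `<div>`, the `<li>` items and the `<a>` links carry no `class="menu"` themselves, so the trailing `//text()` walks down from them into the menu text. Below is a minimal sketch, assuming lxml (which the `tree.xpath(...)` call suggests) and a small made-up page standing in for the real one, that filters the text nodes by their ancestors instead:

```python
from lxml import html

# Hypothetical, cut-down stand-in for the page in the question.
PAGE = """
<html><body>
  <div class="header-links">
    <ul class="menu">
      <li class="menu-item"><a href="#">Home</a></li>
      <li class="menu-item"><a href="#">About Us</a></li>
    </ul>
  </div>
  <p>Welcome to the site.</p>
</body></html>
"""

tree = html.fromstring(PAGE)


def joined(text_nodes):
    """Join text nodes with ' | ', skipping whitespace-only ones."""
    return " | ".join(t.strip() for t in text_nodes if t.strip())


# Query from the question: the <div>, <li> and <a> elements do not have
# class="menu", so they are selected, and //text() then returns every
# text node beneath them, menu text included.
original = joined(tree.xpath("//body//*[not(@class='menu')]//text()"))

# Alternative: select the text nodes directly and drop any that have an
# ancestor element with class="menu".
filtered = joined(tree.xpath("//body//text()[not(ancestor::*[@class='menu'])]"))

print("original:", original)
print("filtered:", filtered)
```

Note that `@class='menu'` is an exact string comparison. In the HTML from the question, the `<ul id="menu-main-navigation-1">` element has `class=""`, so its items would still be returned by the filtered query unless the predicate is widened to cover it.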