Does xpath in js2xml let me do something like contains when selecting?

When scraping a page with the following JavaScript code, I want to know the value being assigned to myProp2.

myProp1={col1: 'firstName', col2: 'lastName'};
myProp2='data';

js2xml gives an xpath(), but it isn't letting me do something like contains(), which I can do in Scrapy's xpath().

I was hoping to do:

xpath('//assign[contains(., "myProp2")]/right/*')

to get the value being assigned to myProp2 but it appears that contains(), which I use in Scrapy, isn't available.

My workaround is to do an xpath() select twice, then iterate them in parallel, and grab the target value only after getting a match on the identifier:

import js2xml
from io import StringIO
from lxml import etree

f = StringIO(
"""
<html>
<head>
<script type='text/javascript'>
  myProp1={col1: 'firstName', col2: 'lastName'};
  myProp2='data';
</script>
</head>
<body>
  This has test javascript.
</body>
</html>
""")
tree = etree.parse(f)
for script in tree.xpath('//script/text()'):
    jstree = js2xml.parse(script)
    idtree = jstree.xpath('//assign/left/*')
    valtree = jstree.xpath('//assign/right/*')
    for ids, vals in zip(idtree, valtree):
        id = js2xml.jsonlike.make_dict(ids)
        val = js2xml.jsonlike.make_dict(vals)
        if id == 'myProp2':
            print(val)

I'll be doing this in numerous spots, so something that gives the functionality like contains() does would be useful.

It is probably there somehow and I'm just not figuring it out. Is there some way to do this within js2xml's xpath()?


Update: This ended up being a basic xpath expression question and not something related specifically to js2xml.

For anyone else with a beginner XPath question like this one: I've since learned that XPath tester sites are a great help when learning how to write XPath expressions.

There are 2 answers

paul trmbrth On BEST ANSWER

js2xml.parse returns an lxml XML tree representing the JavaScript instructions. The identifiers on the left-hand side of assignments are stored in attributes, not in text nodes, so contains(., ...) on an assign node never sees them. You can, however, apply contains() to the attributes of the assign node's descendants.

Let's first look at the XML that js2xml gives you:

>>> s = '''
... myProp1={col1: 'firstName', col2: 'lastName'};
... myProp2='data';'''
>>> import js2xml
>>> jstree = js2xml.parse(s)
>>> print(js2xml.pretty_print(jstree))
<program>
  <assign operator="=">
    <left>
      <identifier name="myProp1"/>
    </left>
    <right>
      <object>
        <property name="col1">
          <string>firstName</string>
        </property>
        <property name="col2">
          <string>lastName</string>
        </property>
      </object>
    </right>
  </assign>
  <assign operator="=">
    <left>
      <identifier name="myProp2"/>
    </left>
    <right>
      <string>data</string>
    </right>
  </assign>
</program>

You can see that "myProp2":

  • is the value of a name attribute
  • of an identifier element,
  • child of a left element
  • within an assign statement.
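
One way to confirm this is to run both kinds of predicate against that XML. The following is a minimal sketch, assuming lxml is installed; the XML string is copied from the js2xml output above rather than produced by js2xml.parse, and the variable names are illustrative only.

```python
from lxml import etree

# XML copied from the js2xml output above (second assignment only).
xml = '''<program>
  <assign operator="=">
    <left>
      <identifier name="myProp2"/>
    </left>
    <right>
      <string>data</string>
    </right>
  </assign>
</program>'''
tree = etree.fromstring(xml)

# The string value of <assign> is built from descendant text nodes only,
# so the identifier name (stored in an attribute) is not part of it.
text_matches = tree.xpath('//assign[contains(., "myProp2")]')

# The assigned string literal is a text node, so it is visible to contains(., ...).
value_matches = tree.xpath('//assign[contains(., "data")]')

# Testing the attribute itself finds the assignment.
attr_matches = tree.xpath('//assign[contains(left/identifier/@name, "myProp2")]')

print(len(text_matches), len(value_matches), len(attr_matches))
```

The first query comes back empty, which is the behaviour described in the question: contains() is available, it simply has nothing to match in the text content.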

You can use contains() on the @name attribute and call make_dict on the right element's child (the actual data you want):

>>> js2xml.jsonlike.make_dict(
...     jstree.xpath(
...         '//assign[contains(left//@name, "myProp2")]/right/*')[0]
... )
'data'
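
If the same lookup is needed in several places, the expression can be wrapped in a small helper. This is only a sketch: the function name assigned_values is hypothetical (it is not part of js2xml), it relies on lxml's support for XPath variables ($name), and the demo tree is built from a trimmed copy of the XML shown above instead of calling js2xml.parse.

```python
from lxml import etree


def assigned_values(jstree, name):
    """Return the elements assigned to identifiers whose name contains `name`.

    Hypothetical helper, not part of js2xml. The name is passed as an
    XPath variable, so it does not have to be quoted into the expression.
    """
    return jstree.xpath(
        '//assign[contains(left//@name, $name)]/right/*', name=name)


# Stand-in for js2xml.parse(script): a trimmed copy of the XML shown earlier.
jstree = etree.fromstring('''<program>
  <assign operator="=">
    <left><identifier name="myProp1"/></left>
    <right>
      <object>
        <property name="col1"><string>firstName</string></property>
      </object>
    </right>
  </assign>
  <assign operator="=">
    <left><identifier name="myProp2"/></left>
    <right><string>data</string></right>
  </assign>
</program>''')

nodes = assigned_values(jstree, 'myProp2')
print([n.tag for n in nodes])
```

With a real js2xml tree, each returned element can then be passed to js2xml.jsonlike.make_dict, as in the call above.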
Stephen Gornick On

Paul had the best answer to the question about how to use contains() for this.

Here's another expression that gives the same result without contains(). It uses a nested predicate instead, which makes it easier to see where the match should occur.

//assign[left/identifier[@name="myProp2"]]/right/*
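
The two styles of predicate are not fully interchangeable: contains() is a substring test, while @name="..." is an exact comparison. The sketch below illustrates the difference, assuming lxml is installed. The myProp20 assignment is a hypothetical addition made up for this example, and the XML is written by hand in the shape js2xml produces rather than generated by js2xml.parse.

```python
from lxml import etree

# Hand-written XML in the shape js2xml produces. The second assignment
# (myProp20) is hypothetical and exists only to show the difference.
jstree = etree.fromstring('''<program>
  <assign operator="=">
    <left><identifier name="myProp2"/></left>
    <right><string>data</string></right>
  </assign>
  <assign operator="=">
    <left><identifier name="myProp20"/></left>
    <right><string>other</string></right>
  </assign>
</program>''')

# Substring test: also matches myProp20, because it contains "myProp2".
loose = jstree.xpath('//assign[contains(left//@name, "myProp2")]/right/*')

# Exact test: matches only the identifier named exactly myProp2.
exact = jstree.xpath('//assign[left/identifier[@name="myProp2"]]/right/*')

print([n.text for n in loose])
print([n.text for n in exact])
```

When the full identifier name is known in advance, the exact predicate is probably the safer choice; contains() is more useful when only part of the name is known.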