When scraping a page with the following JavaScript code, I want to know the value being assigned to myProp2.
myProp1={col1: 'firstName', col2: 'lastName'};
myProp2='data';
js2xml gives an xpath(), but it isn't letting me do something like contains(), which I can do in Scrapy's xpath().
I was hoping to do:
xpath('//assign[contains(., "myProp2")]/right/*')
to get the value being assigned to myProp2 but it appears that contains(), which I use in Scrapy, isn't available.
My workaround is to do an xpath() select twice, then iterate them in parallel, and grab the target value only after getting a match on the identifier:
import js2xml
try:
    from StringIO import StringIO  # Python 2
except ImportError:
    from io import StringIO  # Python 3
from lxml import etree
f = StringIO(
"""
<html>
<head>
<script type='text/javascript'>
myProp1={col1: 'firstName', col2: 'lastName'};
myProp2='data';
</script>
</head>
<body>
This has test javascript.
</body>
</html>
""")
tree = etree.parse(f)
for script in tree.xpath('//script/text()'):
    jstree = js2xml.parse(script)
    idtree = jstree.xpath('//assign/left/*')
    valtree = jstree.xpath('//assign/right/*')
    for ids, vals in zip(idtree, valtree):
        id = js2xml.jsonlike.make_dict(ids)
        val = js2xml.jsonlike.make_dict(vals)
        if id == 'myProp2':
            print(val)
I'll be doing this in numerous spots, so something that gives the functionality like contains() does would be useful.
It is probably there somehow and I'm just not figuring it out. Is there some way to do this within js2xml's xpath()?
Update: This ended up being a basic xpath expression question and not something related specifically to js2xml.
For anyone else reading this having an xpath beginner question like this, I've since learned that there are xpath tester sites which are a great help when learning how to write xpath expressions.
js2xml.parse returns an lxml XML tree representing the JavaScript instructions. But the identifiers for assignments do not appear as text nodes in the output XML, so you usually cannot use contains(., ...) on an assign node directly. You can, however, use it on the attributes of some of its children.
Let's first look at the XML that js2xml gives you:
You can see that "myProp2" is the value of the name attribute of an identifier element, inside the left element of the assign statement.
You can use contains() on the @name attribute and call make_dict on the right element's child (the actual data you want):