Here is the link https://www.google.com/about/careers/search#!t=jo&jid=34154& from which I have to extract the content under "Job details".
Job details
Team or role: Software Engineering // How to write xpath
Job type: Full-time // How to write xpath
Last updated: Oct 17, 2014 // How to write xpath
Job location(s): Seattle, WA, USA; Kirkland, WA, USA // How to write a regex to extract city, state and country separately for each job. I also need to filter USA, Canada and UK jobs separately.
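Once the location text has been extracted, it can be split with plain Python string methods instead of a regex. This is a sketch assuming every entry on this page follows the "City, State, Country" pattern; the variable names are illustrative:

```python
text = "Seattle, WA, USA; Kirkland, WA, USA"

# Split on ';' into individual locations, then on ',' into their fields
# (assumption: every entry here follows "City, State, Country").
locations = [tuple(p.strip() for p in loc.split(',')) for loc in text.split(';')]
# locations -> [('Seattle', 'WA', 'USA'), ('Kirkland', 'WA', 'USA')]

# Group by country so USA, Canada and UK jobs can be filtered separately.
wanted = ('USA', 'Canada', 'UK')
by_country = {c: [l for l in locations if l[-1] == c] for c in wanted}
```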
Here is the HTML from which the above content should be extracted:
<div class="detail-content">
<div>
<div class="greytext info" style="display: inline-block;">Team or role:</div>
<div class="info-text" style="display: inline-block;">Software Engineering</div> // How to write xpath for this one
</div>
<div>
<div class="greytext info" style="display: inline-block;">Job type:</div>
<div class="info-text" style="display: inline-block;" itemprop="employmentType">Full-time</div> // How to write xpath for job type
</div>
<div style="display: none;" aria-hidden="true">
<div class="greytext info" style="display: inline-block;">Job level:</div>
<div class="info-text" style="display: inline-block;"></div>
</div>
<div style="display: none;" aria-hidden="true">
<div class="greytext info" style="display: inline-block;">Salary:</div>
<div class="info-text" style="display: inline-block;"></div>
</div>
<div>
<div class="greytext info" style="display: inline-block;">Last updated:</div>
<div class="info-text" style="display: inline-block;" itemprop="datePosted"> Oct 17, 2014</div> // How to write xpath for posted date
</div>
<div>
<div class="greytext info" style="display: inline-block;">Job location(s):</div>
<div class="info-text" style="display: inline-block;">Seattle, WA, USA; Kirkland, WA, USA</div> // How to write regex to extract city, state and country separately
</div>
</div>
</div>
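For the snippet above, the fields that carry an `itemprop` attribute can be targeted directly by that attribute. Here is a minimal standard-library sketch against a trimmed copy of the HTML (ElementTree needs well-formed XML, and only supports a subset of XPath):

```python
import xml.etree.ElementTree as ET

# A trimmed, well-formed copy of the snippet above.
html = """
<div class="detail-content">
  <div>
    <div class="greytext info">Team or role:</div>
    <div class="info-text">Software Engineering</div>
  </div>
  <div>
    <div class="greytext info">Job type:</div>
    <div class="info-text" itemprop="employmentType">Full-time</div>
  </div>
  <div>
    <div class="greytext info">Last updated:</div>
    <div class="info-text" itemprop="datePosted"> Oct 17, 2014</div>
  </div>
</div>
"""
root = ET.fromstring(html)

# Fields with an itemprop attribute can be selected by that attribute:
job_type = root.find(".//div[@itemprop='employmentType']").text
posted = root.find(".//div[@itemprop='datePosted']").text.strip()

# "Team or role:" has no itemprop; with full XPath 1.0 (scrapy/lxml) a
# sibling-axis expression would work, e.g.:
#   //div[normalize-space(text())='Team or role:']/following-sibling::div[1]/text()
```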
Here is the spider code:
def parse_listing_page(self, response):
    selector = Selector(response)
    item = googleSpiderItem()
    item['CompanyName'] = "Google"
    item['JobDetailUrl'] = response.url
    item['Title'] = selector.xpath("//a[@class='heading detail-title']/span[@itemprop='name title']/text()").extract()
    item['City'] = selector.xpath("//a[@class='source sr-filter']/span[@itemprop='name']/text()").extract().re('(.)\,.')
    item['State'] = selector.xpath("//a[@class='source sr-filter']/span[@itemprop='name']/text()").extract().re('\,(.)')
    item['Jobtype'] = selector.xpath(".//*[@id='75015001']/div[2]/div[7]/div[2]/div[5]/div[2]/text()").extract()
    Description = selector.xpath("string(//div[@itemprop='description'])").extract()
    item['Description'] = [d.encode('UTF-8') for d in Description]
    print "Done!"
    yield item
The output is:
Traceback (most recent call last):
File "/usr/lib64/python2.7/site-packages/twisted/internet/base.py", line 824, in runUntilCurrent
call.func(*call.args, **call.kw)
File "/usr/lib64/python2.7/site-packages/twisted/internet/task.py", line 638, in _tick
taskObj._oneWorkUnit()
File "/usr/lib64/python2.7/site-packages/twisted/internet/task.py", line 484, in _oneWorkUnit
result = next(self._iterator)
File "/usr/lib64/python2.7/site-packages/scrapy/utils/defer.py", line 57, in <genexpr>
work = (callable(elem, *args, **named) for elem in iterable)
--- <exception caught here> ---
File "/usr/lib64/python2.7/site-packages/scrapy/utils/defer.py", line 96, in iter_errback
yield next(it)
File "/usr/lib64/python2.7/site-packages/scrapy/contrib/spidermiddleware/offsite.py", line 26, in process_spider_output
for x in result:
File "/usr/lib64/python2.7/site-packages/scrapy/contrib/spidermiddleware/referer.py", line 22, in <genexpr>
return (_set_referer(r) for r in result or ())
File "/usr/lib64/python2.7/site-packages/scrapy/contrib/spidermiddleware/urllength.py", line 33, in <genexpr>
return (r for r in result or () if _filter(r))
File "/usr/lib64/python2.7/site-packages/scrapy/contrib/spidermiddleware/depth.py", line 50, in <genexpr>
return (r for r in result or () if _filter(r))
File "/home/sureshp/Downloads/wwwgooglecom/wwwgooglecom/spiders/googlepage.py", line 49, in parse_listing_page
item['City'] = selector.xpath("//a[@class='source sr-filter']/span[@itemprop='name']/text()").extract().re('(.*)\,.')
exceptions.AttributeError: 'list' object has no attribute 're'
I have noticed some typo errors in your parse code and fixed them; here is the modified code:
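Below is a sketch of the corrected method. The root cause, per the traceback, is that `.extract()` returns a plain Python list, while `.re()` is a method of scrapy's `SelectorList`, so `.re()` has to be called *before* `.extract()`. The two regex patterns are illustrative assumptions for "City, State, Country" strings:

```python
import re

# Illustrative patterns for "Seattle, WA, USA"-style strings (assumption:
# the extracted text always follows "City, State, Country").
CITY_RE = r'^([^,]+),'      # everything before the first comma
STATE_RE = r',\s*([^,]+),'  # the field between the two commas

def parse_listing_page(self, response):
    # Selector and googleSpiderItem come from the spider module, as above.
    selector = Selector(response)
    item = googleSpiderItem()
    item['CompanyName'] = "Google"
    item['JobDetailUrl'] = response.url
    name_xpath = "//a[@class='source sr-filter']/span[@itemprop='name']/text()"
    # .re() belongs to the SelectorList, so call it *before* .extract()
    # ever turns the result into a plain list:
    item['City'] = selector.xpath(name_xpath).re(CITY_RE)
    item['State'] = selector.xpath(name_xpath).re(STATE_RE)
    yield item
```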
To get City, State and Country separately for each job, you can loop over the selector results:
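For example (a sketch: `names` stands in for the list that `selector.xpath(...).extract()` would return, one string per `<span itemprop="name">`):

```python
# Stand-in for selector.xpath(...).extract() output.
names = ["Seattle, WA, USA", "Kirkland, WA, USA"]

jobs = []
for name in names:
    parts = [p.strip() for p in name.split(',')]
    if len(parts) == 3:                      # "City, State, Country"
        city, state, country = parts
    else:                                    # e.g. "London, UK" has no state
        city, country = parts[0], parts[-1]
        state = ''
    jobs.append({'city': city, 'state': state, 'country': country})

# Country filtering then becomes a simple comprehension:
usa_jobs = [j for j in jobs if j['country'] == 'USA']
```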