How to write regex and XPath for the link below?

Here is the link https://www.google.com/about/careers/search#!t=jo&jid=34154& from which I have to extract the content under job details.

Job details

Team or role: Software Engineering // How to write the XPath
Job type: Full-time // How to write the XPath
Last updated: Oct 17, 2014 // How to write the XPath
Job location(s): Seattle, WA, USA; Kirkland, WA, USA // How to write the regex to extract city, state and country separately for each job. I also need to filter USA, Canada and UK jobs separately.
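For the location string, here is a sketch with Python's `re`, assuming every location has the form `City, State, Country` and multiple locations are joined by semicolons (the sample string is inlined for illustration):

```python
import re

# Sample location string as it appears on the job page.
locations = "Seattle, WA, USA; Kirkland, WA, USA"

# Each location is "City, State, Country"; several locations are joined by ";".
LOC_RE = re.compile(r'\s*([^,;]+),\s*([^,;]+),\s*([^,;]+)')

parsed = LOC_RE.findall(locations)
# parsed == [('Seattle', 'WA', 'USA'), ('Kirkland', 'WA', 'USA')]

# Keep only jobs in the countries of interest.
wanted = {'USA', 'Canada', 'UK'}
by_country = {}
for city, state, country in parsed:
    if country in wanted:
        by_country.setdefault(country, []).append((city, state))
```

`findall` returns one `(city, state, country)` tuple per location, so the same pattern works for jobs with one or many locations.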

Here is the HTML from which to extract the above content:

<div class="detail-content">
<div>
<div class="greytext info" style="display: inline-block;">Team or role:</div>
<div class="info-text" style="display: inline-block;">Software Engineering</div> // How to write xpath for this one
</div>
<div>
<div class="greytext info" style="display: inline-block;">Job type:</div>
<div class="info-text" style="display: inline-block;" itemprop="employmentType">Full-time</div> // How to write xpath for job type
</div>
<div style="display: none;" aria-hidden="true">
<div class="greytext info" style="display: inline-block;">Job level:</div>
<div class="info-text" style="display: inline-block;"></div>
</div>
<div style="display: none;" aria-hidden="true">
<div class="greytext info" style="display: inline-block;">Salary:</div>
<div class="info-text" style="display: inline-block;"></div>
</div>
<div>
<div class="greytext info" style="display: inline-block;">Last updated:</div>
<div class="info-text" style="display: inline-block;" itemprop="datePosted"> Oct 17, 2014</div> // How to write xpath for the posted date
</div>
<div>
<div class="greytext info" style="display: inline-block;">Job location(s):</div>
<div class="info-text" style="display: inline-block;">Seattle, WA, USA; Kirkland, WA, USA</div> // How to write regex to extract city, state and country separately
</div>
</div>
</div>
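With Scrapy or lxml (full XPath 1.0), you can anchor on the grey label and step to its sibling, e.g. `//div[contains(@class,'greytext')][normalize-space()='Job type:']/following-sibling::div[@class='info-text']/text()`, and the same pattern works for `Last updated:` and `Job location(s):`. The stdlib's ElementTree supports only a small XPath subset (no `following-sibling` axis), but the same label/value pairing can be sketched like this on a cleaned-up copy of the markup above:

```python
import xml.etree.ElementTree as ET

# Cleaned-up copy of the markup shown above (styles and comments removed).
SAMPLE = """<div class="detail-content">
<div>
<div class="greytext info">Team or role:</div>
<div class="info-text">Software Engineering</div>
</div>
<div>
<div class="greytext info">Job type:</div>
<div class="info-text">Full-time</div>
</div>
<div>
<div class="greytext info">Last updated:</div>
<div class="info-text"> Oct 17, 2014</div>
</div>
<div>
<div class="greytext info">Job location(s):</div>
<div class="info-text">Seattle, WA, USA; Kirkland, WA, USA</div>
</div>
</div>"""

def job_details(html):
    root = ET.fromstring(html)
    details = {}
    for row in root:  # each row div holds a label div and a value div
        label = row.find("div[@class='greytext info']")
        value = row.find("div[@class='info-text']")
        if label is not None and value is not None:
            details[label.text.strip()] = (value.text or "").strip()
    return details

details = job_details(SAMPLE)
```

This gives you a `{'Job type:': 'Full-time', ...}` mapping, which is also more robust than positional paths like `div[2]/div[7]/...` if Google reorders the rows.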

Here is the spider code:

def parse_listing_page(self,response):
        selector = Selector(response)
        item=googleSpiderItem()
        item['CompanyName'] = "Google" 
        item ['JobDetailUrl'] = response.url
        item['Title'] = selector.xpath("//a[@class='heading detail-title']/span[@itemprop='name title']/text()").extract()
        item['City'] = selector.xpath("//a[@class='source sr-filter']/span[@itemprop='name']/text()").extract().re('(.)\,.')
        item['State'] = selector.xpath("//a[@class='source sr-filter']/span[@itemprop='name']/text()").extract().re('\,(.)')
        item['Jobtype'] = selector.xpath(".//*[@id='75015001']/div[2]/div[7]/div[2]/div[5]/div[2]/text()").extract()
        Description = selector.xpath("string(//div[@itemprop='description'])").extract()
        item['Description'] = [d.encode('UTF-8') for d in Description]
        print "Done!"
        yield item

The output is:

 Traceback (most recent call last):
   File "/usr/lib64/python2.7/site-packages/twisted/internet/base.py", line 824, in runUntilCurrent
     call.func(*call.args, **call.kw)
   File "/usr/lib64/python2.7/site-packages/twisted/internet/task.py", line 638, in _tick
     taskObj._oneWorkUnit()
   File "/usr/lib64/python2.7/site-packages/twisted/internet/task.py", line 484, in _oneWorkUnit
     result = next(self._iterator)
   File "/usr/lib64/python2.7/site-packages/scrapy/utils/defer.py", line 57, in <genexpr>
     work = (callable(elem, *args, **named) for elem in iterable)
 --- <exception caught here> ---
   File "/usr/lib64/python2.7/site-packages/scrapy/utils/defer.py", line 96, in iter_errback
     yield next(it)
   File "/usr/lib64/python2.7/site-packages/scrapy/contrib/spidermiddleware/offsite.py", line 26, in process_spider_output
     for x in result:
   File "/usr/lib64/python2.7/site-packages/scrapy/contrib/spidermiddleware/referer.py", line 22, in <genexpr>
     return (_set_referer(r) for r in result or ())
   File "/usr/lib64/python2.7/site-packages/scrapy/contrib/spidermiddleware/urllength.py", line 33, in <genexpr>
     return (r for r in result or () if _filter(r))
   File "/usr/lib64/python2.7/site-packages/scrapy/contrib/spidermiddleware/depth.py", line 50, in <genexpr>
     return (r for r in result or () if _filter(r))
   File "/home/sureshp/Downloads/wwwgooglecom/wwwgooglecom/spiders/googlepage.py", line 49, in parse_listing_page
    

     item['City'] = selector.xpath("//a[@class='source sr-filter']/span[@itemprop='name']/text()").extract().re('(.*)\,.')
 exceptions.AttributeError: 'list' object has no attribute 're'

There is 1 answer.

aberna, BEST ANSWER

I noticed some typos in your parse code.

I fixed them; now the output is:

{'City': [u'Seattle, WA, USA', u'Kirkland, WA, USA'],
 'CompanyName': 'Google',
 'Description': [u"Google's software engineers develop the next-generation technologies that change how millions of users connect, explore, and interact with information and one another. Our ambitions reach far beyond just Search. Our products need to handle information at the the scale of the web. We're looking for ideas from every area of computer science, including information retrieval, artificial intelligence, natural language processing, distributed computing, large-scale system design, networking, security, data compression, and user interface design; the list goes on and is growing every day. As a software engineer, you work on a small team and can switch teams and projects as our fast-paced business grows and evolves. We need our engineers to be versatile and passionate to tackle new problems as we continue to push technology forward.?\nWith your technical expertise you manage individual projects priorities, deadlines and deliverables. You design, develop, test, deploy, maintain, and enhance software solutions.\n\nSeattle/Kirkland engineering teams are involved in the development of several of Google?s most popular products: Cloud Platform, Hangouts/Google+, Maps/Geo, Advertising, Chrome OS/Browser, Android, Machine Intelligence. Our engineers need to be versatile and willing to tackle new problems as we continue to push technology forward."],
 'JobDetailUrl': 'https://www.google.com/about/careers/search?_escaped_fragment_=t%3Djo%26jid%3D34154%26',
 'Jobtype': [],
 'State': [u'Seattle, WA, USA', u'Kirkland, WA, USA'],
 'Title': [u'Software Engineer']}

Here is the modified code:

from scrapy.spider import Spider
from scrapy.selector import Selector
from Google.items import GoogleItem
import re
class DmozSpider(Spider):
    name = "google"
    allowed_domains = ["google.com"]
    start_urls = [
    "https://www.google.com/about/careers/search#!t=jo&jid=34154&",
    ]

    def parse(self, response):
        selector = Selector(response)
        item=GoogleItem()
        item['Description'] = selector.xpath("string(//div[@itemprop='description'])").extract()
        item['CompanyName'] = "Google"  
        item['JobDetailUrl'] = response.url
        item['Title'] = selector.xpath("//a[@class='heading detail-title']/span[@itemprop='name title']/text()").extract()
        item['City'] = selector.xpath("//a[@class='source sr-filter']/span[@itemprop='name']/text()").extract()
        item['State'] = selector.xpath("//a[@class='source sr-filter']/span[@itemprop='name']/text()").extract()
        item['Jobtype'] = selector.xpath(".//*[@id='75015001']/div[2]/div[7]/div[2]/div[5]/div[2]/text()").extract()

        yield item

To get City, State and Nation separately you can loop over the selector results:

for p in selector.xpath("//a[@class='source sr-filter']/span[@itemprop='name']/text()").extract():
    city, state, nation = [part.strip() for part in p.split(',')]
    item['City'] = city
    item['State'] = state
    item['Nation'] = nation
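Note that assigning inside the loop overwrites `item['City']` on every pass, so only the last location survives. If you want every location kept on the item, collect them into lists instead; here is a plain-Python sketch with the extracted strings inlined:

```python
# What the XPath .extract() call returns for this job.
extracted = [u'Seattle, WA, USA', u'Kirkland, WA, USA']

cities, states, nations = [], [], []
for p in extracted:
    # "City, State, Country" -> three stripped parts.
    city, state, nation = [part.strip() for part in p.split(',')]
    cities.append(city)
    states.append(state)
    nations.append(nation)
```

You can then assign `item['City'] = cities` and so on, or yield one item per location, depending on how you want to filter by country downstream.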