Malformed URL from crawl

I'm quite new to crawling. I crawled a webpage and extracted hyperlinks, which I then fed to Apache Nutch 1.18. All of the URLs were rejected as malformed. What I'm trying to do is crawl a projects database page, extract its hyperlinks, and then crawl each linked page separately.

I crawled the database page using Scrapy and saved the result as a JSON file. Then I parsed the JSON file to extract the links and fed them to Nutch for a deep crawl of each page.

I tried to validate these links, and they all come back as wrong:

from urllib.parse import urlparse

def url_check(url):
    # A URL needs at least a scheme (e.g. 'https') and a network
    # location (the host) before a crawler will accept it.
    min_attr = ('scheme', 'netloc')
    try:
        result = urlparse(url)
        if all(getattr(result, attr) for attr in min_attr):
            print('correct')
        else:
            print('wrong')
    except ValueError:
        print('wrong')
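
If the extracted hrefs are relative or scheme-relative, that would explain why they open fine in a browser (which resolves them against the current page) yet fail this check. A few illustrative calls, with placeholder URLs:

url_check('https://example.org/projects/1')   # correct
url_check('/projects/1')                      # wrong: no scheme, no netloc
url_check('//example.org/projects/1')         # wrong: no scheme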

My goal now is to fix these links so that Nutch will accept them.
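
One way to do that, assuming the bad links are relative paths extracted from the database page, is to resolve each one against that page's base URL with urllib.parse.urljoin. A minimal sketch (the base URL below is a placeholder):

from urllib.parse import urljoin, urlparse

BASE_URL = 'https://example.org/'  # placeholder: the projects database page

def fix_url(url):
    # Relative links are resolved against BASE_URL;
    # already-absolute links pass through unchanged.
    fixed = urljoin(BASE_URL, url)
    result = urlparse(fixed)
    return fixed if all([result.scheme, result.netloc]) else None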

This is the code I used to extract the links from the JSON file:

import codecs
import simplejson

if __name__ == '__main__':
    print('starting link extraction')
    fname = "aifos.json"
    with codecs.open(fname, "rb", encoding='utf-8') as f:
        links_data = f.read()
    json_data = simplejson.loads(links_data)

    # Each scraped item stores its hrefs in the 'link' field as a
    # list (extract() returns a list), so extend() flattens them.
    all_links = []
    for item in json_data:
        all_links.extend(item['link'])
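
Nutch's injector reads seeds from a plain text file with one URL per line, so the collected links can then be written out like this (the seed path is a placeholder):

with codecs.open("seeds/seed.txt", "w", encoding='utf-8') as out:
    out.write('\n'.join(all_links))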

Can someone help? I have tried a few suggestions, but they keep failing.

Please note that I'm not trying to validate the URLs; I have already found that they are invalid. I am trying to fix them. The URLs all work when I access them in a browser. I'm not sure now whether there is something wrong with my original crawl code; please see it below. The 'link' field is what I'm having problems with.

    def parse_dir_content(self, response):
        items = AifosItem()

        #all_projects = response.css('div.node__content')
        title = response.css('span::text').extract()
        country = response.css('.details__item::text').extract()
        # Note: a::attr(href) returns the raw href attribute values,
        # which may be relative paths rather than absolute URLs.
        link = response.css('dd.details__item.details__item--long a::attr(href)').extract()
        short_description = response.css('.field.field--name-field-short-description.field--type-text-long.field--label-hidden').extract()
        long_description = response.css('.field.field--name-field-long-description.field--type-text-long.field--label-hidden').extract()
        #long_description = response.css('.node__content--main').extract()

        items['title'] = title
        items['country'] = country
        items['link'] = link
        items['short_description'] = short_description
        items['long_description'] = long_description

        yield items
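
If the hrefs are indeed relative, the cleanest fix may be in the spider itself, before the links ever reach the JSON file: Scrapy's response.urljoin() resolves an href against the URL of the page being parsed. A sketch of that one change:

        # Inside parse_dir_content: resolve each href against the page URL
        link = [response.urljoin(href) for href in
                response.css('dd.details__item.details__item--long a::attr(href)').extract()]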

Edit: The summary is this: how do I fix malformed URLs for a crawler? The URLs work when clicked on, but the crawler rejects them as malformed, and when I test them I get the error that they are not valid. Did I miss a parse step? This is why I added my Scrapy crawl code, which was used to extract these URLs from the parent page.

1 Answer

Phoenix:

Fixed this now. I found a way to repair the URLs here: How can I prepend the 'http://' protocol to a url when necessary?
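
For reference, a minimal sketch of that prepend step, assuming any link without a scheme should default to http:

from urllib.parse import urlparse

def prepend_scheme(url, default_scheme='http'):
    # Only prepend when the URL has no scheme yet, e.g.
    # 'www.example.org/p/1' -> 'http://www.example.org/p/1'.
    if not urlparse(url).scheme:
        url = default_scheme + '://' + url.lstrip('/')
    return url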

This fixed the missing protocols, but I also found that I needed to update my regex-urlfilter.txt in Nutch, as I had put in a regular expression that made the injector reject non-matching URLs. A bit embarrassing, that.
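
For anyone hitting the same thing: regex-urlfilter.txt is evaluated top to bottom, with '+' accepting and '-' rejecting at the first matching rule, so an overly narrow '+' rule combined with a final '-.' silently rejects every other URL. An illustrative excerpt (placeholder domain):

# Accept anything on the target site:
+^https?://([a-z0-9-]+\.)*example\.org/
# Reject everything else:
-.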