I'm quite new to crawling. I crawled a web page and extracted its hyperlinks, which I then fed to Apache Nutch 1.18. All of the URLs were rejected as malformed. What I'm trying to do is crawl a projects database page, extract the hyperlinks to the individual projects, and then crawl each of those pages separately.
I crawled the database page with Scrapy and saved the result as a JSON file. I then parsed the JSON file to extract the links and fed them to Nutch for a deep crawl of each page.
I tried to validate the links, and every one of them comes back as wrong:
from urllib.parse import urlparse

def url_check(url):
    # A URL is only absolute if it has at least a scheme (http/https) and a host.
    min_attr = ('scheme', 'netloc')
    try:
        result = urlparse(url)
        if all(getattr(result, attr) for attr in min_attr):
            print('correct')
        else:
            print('wrong')
    except ValueError:
        print('wrong')
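For example, a relative or scheme-less link fails this check, while the fully qualified form passes (the URLs below are made-up placeholders):

url_check('/projects/some-project')                          # 'wrong': no scheme or host
url_check('www.example.org/projects/some-project')           # 'wrong': still no scheme
url_check('https://www.example.org/projects/some-project')   # 'correct'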
My goal now is to fix these links so that Nutch will accept them.
This is the code I used to extract the links from the JSON file:
import codecs
import simplejson

if __name__ == '__main__':
    print('starting link extraction')
    fname = "aifos.json"
    # Read the Scrapy output file and parse it as JSON.
    with codecs.open(fname, "rb", encoding='utf-8') as f:
        links_data = f.read()
    json_data = simplejson.loads(links_data)
    all_links = []
    for item in json_data:
        # Each item holds the 'link' field scraped from the projects page.
        website = item['link']
        all_links.append(website)
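If the end goal is to hand these links to Nutch, the injector wants a plain-text seed file with one URL per line, so a possible continuation of the script (the file and directory names are just examples) is:

# Write one URL per line so the file can be used as a Nutch seed list,
# e.g. bin/nutch inject crawl/crawldb urls/ with seed.txt inside urls/.
with open("seed.txt", "w", encoding="utf-8") as out:
    for link in all_links:
        # 'link' came from extract(), so each entry may itself be a list of hrefs.
        if isinstance(link, list):
            out.writelines(href + "\n" for href in link)
        else:
            out.write(link + "\n")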
Can someone help? I have tried a few suggestions, but they keep failing.
Please note that I'm not trying to validate the URLs; I already know they fail validation. I'm trying to fix them. The URLs all work: I have accessed them in a browser. I'm not sure now whether something is wrong with my original crawl code, so I've included it below. The 'link' field is what I'm having problems with.
def parse_dir_content(self, response):
    # Spider callback: pull the fields for one project entry into an item.
    items = AifosItem()
    #all_projects = response.css('div.node__content')
    title = response.css('span::text').extract()
    country = response.css('.details__item::text').extract()
    # href attributes exactly as they appear in the page; these may be relative links.
    link = response.css('dd.details__item.details__item--long a::attr(href)').extract()
    short_description = response.css('.field.field--name-field-short-description.field--type-text-long.field--label-hidden').extract()
    long_description = response.css('.field.field--name-field-long-description.field--type-text-long.field--label-hidden').extract()
    #long_description = response.css('.node__content--main').extract()
    items['title'] = title
    items['country'] = country
    items['link'] = link
    items['short_description'] = short_description
    items['long_description'] = long_description
    yield items
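If the hrefs on the page are relative (which would explain why urlparse sees no scheme or netloc), one option is to resolve them against the page URL inside the callback with Scrapy's response.urljoin. This is just a sketch of that idea, not a claim about what the page actually contains:

# Resolve each extracted href against the page's own URL so the stored
# links are absolute (scheme + host) instead of relative paths.
link = [response.urljoin(href) for href in
        response.css('dd.details__item.details__item--long a::attr(href)').extract()]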
Edit: The summary is this: how do I fix malformed URLs for a crawler? These URLs work when clicked on, but the crawler rejects them as malformed, and when I test them I get an error saying they are not valid. Did I miss a parsing step? This is why I added my Scrapy crawl code, which was used to extract these URLs from the parent page.
I have fixed this now. I found a way to fix the URLs here: How can I prepend the 'http://' protocol to a url when necessary?
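For completeness, the fix that question points at boils down to prepending a scheme when one is missing. A minimal sketch (the helper name and default scheme are my own, adjust as needed):

from urllib.parse import urlparse

def fix_url(url, default_scheme='http'):
    # Prepend a scheme when the URL has none, e.g. 'www.example.org/page'
    # or a protocol-relative '//example.org/page', so Nutch's injector accepts it.
    parsed = urlparse(url)
    if parsed.scheme:
        return url
    if url.startswith('//'):
        return default_scheme + ':' + url
    return default_scheme + '://' + url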
This fixed the missing protocols for Nutch, but I also found that I needed to update my regex-urlfilter.txt in Nutch, as I had put in a regular expression that made the injector reject any URLs that didn't match it. A bit embarrassing, that.
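For anyone who hits the same thing: as far as I understand, regex-urlfilter.txt is applied at inject time, the first matching +/- rule wins, and a URL that matches no rule at all is dropped, so an overly narrow '+' pattern silently rejects everything else. Something along these lines (the domain is only a placeholder) lets the project URLs through:

# skip file:, ftp: and mailto: URLs
-^(file|ftp|mailto):

# accept the project pages (placeholder domain; broaden or tighten as needed)
+^https?://www\.example\.org/

# reject everything else
-.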