I am trying to search for and extract content from a site using Perl's WWW::Mechanize. It worked fine in the beginning, but after a few executions I started getting 403 Forbidden instead of the search results:
use WWW::Mechanize;

my $m   = WWW::Mechanize->new();
my $url = "http://site.com/search?q=$keyword";
$m->get($url);
my $c = $m->content;
print $c;
How can I solve this problem? Please give me some suggestions.
Before beginning to scrape a site, you should make sure that you are authorized to do so. Most sites have a Terms of Service (TOS) that lays out how you may use the site. Most sites disallow automated access and place strong restrictions on reuse of their intellectual property.
A site can defend against unwanted access on three levels:
Conventions: The /robots.txt that almost every site has should be honored by your programs. Do not assume that a library you are using will take care of that; honoring robots.txt is your responsibility. The Stack Overflow robots.txt, for example, has Disallow rules covering asking questions and the site search. So it seems SO doesn't like bots asking questions or using the site search. Who would have guessed?
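If you want to check robots.txt programmatically, the WWW::RobotRules module (historically shipped with libwww-perl) parses the file and answers allow/deny questions. A minimal sketch, where the robot name, keyword, and target URL are placeholders you would replace:

use strict;
use warnings;
use LWP::Simple qw(get);
use WWW::RobotRules;

my $robot_name = 'my-search-bot/0.1';               # how your bot identifies itself
my $rules      = WWW::RobotRules->new($robot_name);

# Fetch and parse the site's robots.txt before requesting anything else.
my $robots_url = 'http://site.com/robots.txt';
my $robots_txt = get($robots_url);
$rules->parse($robots_url, $robots_txt) if defined $robots_txt;

my $keyword = 'perl';                               # example query term
my $target  = "http://site.com/search?q=$keyword";
if ($rules->allowed($target)) {
    print "robots.txt allows fetching $target\n";
}
else {
    print "robots.txt forbids fetching $target - so don't.\n";
}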
It is also expected that a developer will use the API and similar services to access the content. Stack Overflow, for example, has very customizable RSS feeds, publishes snapshots of its database, even offers an online interface for DB queries, and has an API you can use.
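As an illustration, here is a rough sketch of searching via the Stack Exchange API instead of scraping the HTML search page. The endpoint, version number, and parameters shown are my assumptions about the public API, so check the API documentation before relying on them:

use strict;
use warnings;
use LWP::UserAgent;
use URI::Escape qw(uri_escape);
use JSON::PP qw(decode_json);

my $keyword = 'mechanize';
my $ua = LWP::UserAgent->new( agent => 'my-search-tool/0.1 (contact: you@example.com)' );

# The API serves gzip-compressed JSON; decoded_content transparently inflates it.
my $url = 'https://api.stackexchange.com/2.3/search'
        . '?order=desc&sort=activity&site=stackoverflow&intitle=' . uri_escape($keyword);
my $res = $ua->get($url);
die 'Request failed: ' . $res->status_line . "\n" unless $res->is_success;

my $data = decode_json( $res->decoded_content );
print "$_->{title}\n" for @{ $data->{items} };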
Legal: (IANAL!) Before accessing a site for anything other than your personal, immediate consumption, you should read the TOS, or whatever they are called. They state if and how you may access the site and reuse content. Be aware that all content carries some copyright. The copyright system is effectively global, so you aren't exempt from the TOS just by being in a different country from the site owner.
You implicitly accept the TOS by using a site (by any means).
Some sites license their content to everybody. Good examples are Wikipedia and Stack Overflow, which license user submissions under CC-BY-SA (or rather, the submitting users license their content to the site under this license). They cannot restrict the reuse of the content, but they can restrict access to it. E.g. the Wikipedia TOS has a section, Refraining from certain activities, which among other things forbids disrupting the site by placing an undue burden on it.
Of course, this is mainly meant to disallow a DDOS, but while bots are an important part of Wikipedia, other sites do tend to frown on them.
Technical measures: … like letting connections from an offending IP time out, or sending a 403 error (which is very polite). Some of these measures may be automated (e.g. triggered by user-agent strings, weird referrers, URL hacking, fast requests) or applied by watchful sysadmins tailing the logs.
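If you do have permission to crawl, you can avoid tripping those automated defenses by identifying your bot honestly and pacing its requests. Here is a rough sketch building on the code from the question; the agent string, contact address, and 5-second delay are arbitrary placeholders:

use strict;
use warnings;
use WWW::Mechanize;

# Identify the bot and give the admins a way to reach you.
my $m = WWW::Mechanize->new(
    agent     => 'example-search-bot/0.1 (+mailto:you@example.com)',
    autocheck => 0,   # handle HTTP errors ourselves instead of dying
);

for my $keyword (qw(foo bar baz)) {
    my $res = $m->get("http://site.com/search?q=$keyword");
    if ($res->is_success) {
        printf "%d bytes of results for '%s'\n", length($m->content), $keyword;
    }
    else {
        warn 'Got ' . $res->status_line . " for '$keyword'\n";
    }
    sleep 5;   # space out requests so the site isn't hammered
}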
If the TOS etc. don't make it clear that you may use a bot on the site, you can always ask the site owner for written permission to do so.
If you think there was a misunderstanding and you are being blocked despite regular use of the site, you can always contact the owner/admin/webmaster and ask them to re-open your access.