I am trying to search for and extract content from a site using Perl's WWW::Mechanize. It worked fine in the beginning, but after a few executions I started getting 403 Forbidden instead of the search results:
use WWW::Mechanize;

my $m   = WWW::Mechanize->new();
my $url = "http://site.com/search?q=$keyword";
$m->get($url);
my $c = $m->content;
print $c;
How can I solve this problem? Please give me some suggestions.
Before beginning to scrape a site, you should make sure that you are authorized to do so. Most sites have a Terms of Service (TOS) that lays out how you may use the site. Most sites disallow automated access and place strong restrictions on reuse of their intellectual property.
A site can defend against unwanted access on three levels:
Conventions: The /robots.txt that almost every site has should be honored by your programs. Do not assume that a library you are using will take care of that; honoring robots.txt is your responsibility. The Stack Overflow robots.txt, for example, has Disallow rules covering asking questions and the site search. So it seems SO doesn't like bots asking questions or using the site search. Who would have guessed?
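If you want to check robots.txt programmatically, the WWW::RobotRules module (historically shipped with libwww-perl) parses the file and answers allow/deny questions. A minimal sketch, where the robot name, keyword, and target URL are placeholders you would replace:

use strict;
use warnings;
use LWP::Simple qw(get);
use WWW::RobotRules;

my $robot_name = 'my-search-bot/0.1';               # how your bot identifies itself
my $rules      = WWW::RobotRules->new($robot_name);

# Fetch and parse the site's robots.txt before requesting anything else.
my $robots_url = 'http://site.com/robots.txt';
my $robots_txt = get($robots_url);
$rules->parse($robots_url, $robots_txt) if defined $robots_txt;

my $keyword = 'perl';                               # example query term
my $target  = "http://site.com/search?q=$keyword";
if ($rules->allowed($target)) {
    print "robots.txt allows fetching $target\n";
}
else {
    print "robots.txt forbids fetching $target - so don't.\n";
}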
It is also expected that a developer will use the API and similar services to access the content. Stack Overflow, for example, has very customizable RSS feeds, publishes snapshots of its database, even offers an online interface for DB queries, and has an API you can use.
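As an illustration, here is a rough sketch of searching via the Stack Exchange API instead of scraping the HTML search page. The endpoint, version number, and parameters shown are my assumptions about the public API, so check the API documentation before relying on them:

use strict;
use warnings;
use LWP::UserAgent;
use URI::Escape qw(uri_escape);
use JSON::PP qw(decode_json);

my $keyword = 'mechanize';
my $ua = LWP::UserAgent->new( agent => 'my-search-tool/0.1 (contact: you@example.com)' );

# The API serves gzip-compressed JSON; decoded_content transparently inflates it.
my $url = 'https://api.stackexchange.com/2.3/search'
        . '?order=desc&sort=activity&site=stackoverflow&intitle=' . uri_escape($keyword);
my $res = $ua->get($url);
die 'Request failed: ' . $res->status_line . "\n" unless $res->is_success;

my $data = decode_json( $res->decoded_content );
print "$_->{title}\n" for @{ $data->{items} };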
Legal: (IANAL!) Before accessing a site for anything other than your personal, immediate consumption, you should read the TOS, or whatever they are called. They state if and how you may access the site and reuse content. Be aware that all content carries some copyright. The copyright system is effectively global, so you aren't exempt from the TOS just by being in a different country from the site owner.
You implicitly accept the TOS by using a site (by any means).
Some sites license their content to everybody. Good examples are Wikipedia and Stack Overflow, which license user submissions under CC-BY-SA (or rather, the submitting users license their content to the site under this license). They cannot restrict the reuse of the content, but they can restrict access to it. E.g. the Wikipedia TOS has a section, Refraining from certain activities, which among other things forbids disrupting the site by placing an undue burden on it.
Of course, this is mainly meant to disallow a DDOS, but while bots are an important part of Wikipedia, other sites do tend to frown on them.
Technical measures: … like letting connections from an offending IP time out, or sending a 403 error (which is very polite). Some of these measures may be automated (e.g. triggered by user-agent strings, weird referrers, URL hacking, fast requests) or applied by watchful sysadmins tailing the logs.
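If you do have permission to crawl, you can avoid tripping those automated defenses by identifying your bot honestly and pacing its requests. Here is a rough sketch building on the code from the question; the agent string, contact address, and 5-second delay are arbitrary placeholders:

use strict;
use warnings;
use WWW::Mechanize;

# Identify the bot and give the admins a way to reach you.
my $m = WWW::Mechanize->new(
    agent     => 'example-search-bot/0.1 (+mailto:you@example.com)',
    autocheck => 0,   # handle HTTP errors ourselves instead of dying
);

for my $keyword (qw(foo bar baz)) {
    my $res = $m->get("http://site.com/search?q=$keyword");
    if ($res->is_success) {
        printf "%d bytes of results for '%s'\n", length($m->content), $keyword;
    }
    else {
        warn 'Got ' . $res->status_line . " for '$keyword'\n";
    }
    sleep 5;   # space out requests so the site isn't hammered
}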
If the TOS etc. don't make it clear that you may use a bot on the site, you can always ask the site owner for written permission to do so.
If you think there was a misunderstanding and you are being blocked despite regular use of the site, you can always contact the owner/admin/webmaster and ask them to re-open your access.