I am trying to crawl a website, more specifically a Google Site that uses SAML authentication, with ManifoldCF and index the crawled data into Apache Solr. But when I crawl the URL, I get a 302 redirect to the login page, followed by the status RESPONSECODENOTINDEXABLE.
I am not sure whether I have authenticated correctly. ManifoldCF offers HTTP basic authentication, NTLM authentication, and session-based access credentials. I used the session-based method, which looks more like form-based authentication than SAML authentication.
Has anybody used ManifoldCF to crawl a website that has SAML authentication? And if not with ManifoldCF, has anyone been able to accomplish this with Apache Nutch? I am afraid it also provides only HTTP Basic, Digest, and NTLM authentication.
Any insight would be helpful. I can provide more information about the issue if anyone here thinks it can be accomplished easily. Basically, when I crawl https://sites.google.com/a/my-sub-domain.com, it redirects to the SSO login page and the crawler refuses to crawl any further, reporting the 302 response. It's an intranet-based website.
Not sure whether this helps, but try it out. In Nutch, you can provide credentials for logging in to a page through the httpclient-auth.xml file in the conf directory. There you can specify your host name along with the credentials.
You can add any number of credentials to this configuration in the same way.
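As a rough sketch of what such a file might look like, here is an httpclient-auth.xml with two credential entries. The host names, ports, realm, user names, and passwords are all placeholders (assumptions, not values from the question), and this only covers the Basic, Digest, and NTLM schemes mentioned above, so it may well not get past a SAML SSO redirect.

```xml
<?xml version="1.0"?>
<!-- conf/httpclient-auth.xml: every value below is a hypothetical placeholder -->
<auth-configuration>

  <!-- Credentials sent only to the host/port/realm named in the authscope -->
  <credentials username="crawler-user" password="crawler-password">
    <authscope host="intranet.example.com" port="443" realm="login" scheme="basic"/>
  </credentials>

  <!-- A second entry; add further <credentials> blocks in the same way -->
  <credentials username="other-user" password="other-password">
    <authscope host="wiki.example.com" port="80"/>
  </credentials>

</auth-configuration>
```

Each credentials element pairs one user name and password with the authscope elements it applies to, so the crawler only sends those credentials to the listed hosts.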
To crawl an HTTPS site, change the plugin.includes property from protocol-http to protocol-httpclient in nutch-site.xml (in the conf directory).
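The change might look roughly like the following property. Only the protocol-httpclient part comes from the advice above; the rest of the plugin list is an assumption based on a typical Nutch 1.x setup, so keep whatever other plugins your own plugin.includes value already lists.

```xml
<!-- Inside the <configuration> element of the Nutch config file -->
<property>
  <name>plugin.includes</name>
  <!-- protocol-http replaced by protocol-httpclient; the remaining plugins
       are an assumed example list and vary by Nutch version -->
  <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```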