How to crawl a website that has SAML authentication using ManifoldCF or Nutch?


I am trying to crawl a website (more specifically, a Google Site protected by SAML authentication) using ManifoldCF and index the crawled data into Apache Solr. But when I crawl the URL, I get a 302 redirect to the login page, and the crawl then reports RESPONSECODENOTINDEXABLE.

I am not sure whether I have authenticated correctly. ManifoldCF offers HTTP basic authentication, NTLM authentication, and session-based access credentials. I used the session-based method, which looks more like form-based authentication than SAML authentication.

Has anybody crawled a website that has SAML authentication using ManifoldCF? If not with ManifoldCF, has anyone been able to accomplish this with Apache Nutch? I am afraid it also provides only HTTP Basic, Digest, and NTLM authentication.

Any insight would be helpful. I can provide more information about the issue if anyone here thinks it can easily be accomplished. Basically, when I crawl https://sites.google.com/a/my-sub-domain.com, it redirects to the SSO login page and the crawler refuses to crawl any further, reporting a 302 response. It is an intranet-based website.


There are 3 answers

0
User1203 On

Not sure whether this helps, but try it out. In Nutch, you can provide credentials for logging in to a page through the httpclient-auth.xml file in the conf directory. There you can specify your host name along with the credentials.

<auth-configuration>
   <credentials username="admin" password="admin123">
      <authscope host="hostname" realm="login"/>
      <default/>
   </credentials>
</auth-configuration>

Similarly you can add any number of credentials to this configuration.

To crawl an HTTPS site, change the plugin.includes property in nutch-site.xml so that it uses protocol-httpclient instead of protocol-http.

0
user1264641 On

We have modified the logic in the Nutch protocol-selenium plugin to handle SSO flows. You need to wait for the redirect to the SSO page, handle the SSO login using Selenium, and then wait for the redirect back to the original page.
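To illustrate the "wait for the redirect" steps, here is a minimal Python sketch (not our actual plugin code, which lives inside Nutch). It assumes a driver object that exposes a `current_url` attribute, as Selenium's WebDriver does; the function name `wait_for_host` and the host names in the usage note are hypothetical.

```python
import time
from urllib.parse import urlsplit


def wait_for_host(driver, expected_host, timeout=30.0, poll=0.5):
    """Poll driver.current_url until its host equals expected_host.

    Returns the URL that matched, or raises TimeoutError once the
    deadline has passed. `driver` only needs a `current_url` attribute.
    """
    deadline = time.monotonic() + timeout
    while True:
        url = driver.current_url
        host = urlsplit(url).hostname
        if host == expected_host:
            return url
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still on {host!r}, expected {expected_host!r}")
        time.sleep(poll)
```

You would call it once with the IdP host (for example `wait_for_host(driver, "idp.example.com")`) before filling in the login form, and once more with the site's own host afterwards. With Selenium itself, `WebDriverWait` combined with `expected_conditions.url_contains` should give a similar result.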

If two-factor authentication is required, things become more complex. In that case you can configure Google Authenticator (if your IdP allows it) and use it to get the TOTP.
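For the TOTP part, a crawler can compute the code itself if it holds the shared secret. The sketch below follows RFC 6238 with HMAC-SHA1, a 30-second step and 6 digits, which are the usual Google Authenticator defaults. It assumes you saved the Base32 secret that was shown when the account was enrolled; the function name `totp` is just for illustration.

```python
import base64
import hashlib
import hmac
import struct
import time


def totp(secret_b32, at=None, step=30, digits=6):
    """Return the RFC 6238 TOTP code for a Base32-encoded shared secret."""
    # Normalise the secret: drop spaces, upper-case, restore '=' padding.
    cleaned = secret_b32.replace(" ", "").upper()
    cleaned += "=" * (-len(cleaned) % 8)
    key = base64.b32decode(cleaned)

    # The moving factor is the number of whole time steps since the epoch.
    now = time.time() if at is None else at
    counter = int(now // step)

    # HOTP (RFC 4226): HMAC the counter, then apply dynamic truncation.
    mac = hmac.new(key, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)
```

Calling `totp(secret)` right before submitting the second-factor form gives the current code. Keep in mind that storing the secret next to the crawler weakens the second factor, so check that your IdP policy allows this.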

There is no standard way to crawl files behind authentication. You can configure the driver to always download files and then use the downloaded file.

You can also handle the auth flow using another HTTP client. If you need the content of dynamic pages (after all JS and Ajax requests have completed), Selenium is the best choice, and if you are already using it, you can move the auth part to Selenium as well.
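If you go the plain HTTP client route, the awkward step is usually the SAML HTTP-POST binding: after login, the IdP returns an HTML page with an auto-submitting form whose hidden fields (SAMLResponse and usually RelayState) must be posted to the service provider. Here is a rough sketch of extracting that form with the Python standard library only. The class and function names are hypothetical, and it assumes the first form on the page is the SAML one.

```python
from html.parser import HTMLParser


class SamlFormParser(HTMLParser):
    """Collect the first form's action URL and all hidden input fields."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and self.action is None:
            self.action = attrs.get("action")
        elif tag == "input" and (attrs.get("type") or "").lower() == "hidden":
            name = attrs.get("name")
            if name:
                self.fields[name] = attrs.get("value") or ""


def extract_saml_post(html):
    """Return (action_url, fields) for the SAML auto-post form in html."""
    parser = SamlFormParser()
    parser.feed(html)
    if parser.action is None or "SAMLResponse" not in parser.fields:
        raise ValueError("no SAML auto-post form found")
    return parser.action, parser.fields
```

The returned fields can then be URL-encoded with `urllib.parse.urlencode` and posted to the action URL by a client that keeps cookies (for example an opener built with `http.cookiejar`), so that the session cookie set by the site is reused for the actual crawl.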

0
user1264641 On

There is no support in Nutch for SSO authentication using SAML. You need to handle it by writing a custom plugin. We have extended the protocol-selenium plugin to handle SAML flows.