Getting all iframes and base64 codes present in HTML pages using crawler4j


I am using crawler4j to crawl some websites and it is working fine. I am able to download all the files present in a website, and now I have a new task ahead of me: I need to extract iframes, base64 codes and, if possible, other embedded codes as well.

Till now, this is what I am doing in my visit method:

    String place = "<iframe";
    if (page.getParseData() instanceof HtmlParseData) {
        HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
        String text = htmlParseData.getText();
        String html = htmlParseData.getHtml();
        List<WebURL> links = htmlParseData.getOutgoingUrls();

        // Split the raw HTML on whitespace and look for tokens that open an iframe tag.
        String[] result = html.split("\\s+");
        for (int i = 0; i < result.length; i++) {    // iterate over the tokens, not over html.length()
            if (result[i].startsWith(place)) {       // startsWith(), because the token is usually "<iframe" plus attributes
                System.out.println("iframe found at token " + i);
            }
        }

        System.out.println("Text length: " + text.length());
        System.out.println("Html length: " + html.length());
        System.out.println("Number of outgoing links: " + links.size());
    }

I have added the above if block to pick out the iframes of a given HTML page, and it works almost perfectly.
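
Just for reference, the same scan could also be done without splitting the HTML at all, by walking it with indexOf. This is only a rough sketch of the idea (nothing crawler4j-specific, and it assumes the html string from the snippet above):

    // Rough sketch: count occurrences of "<iframe" directly in the HTML string.
    // Still a plain string scan, so it has all the weaknesses of the version above.
    int from = 0;
    int count = 0;
    while ((from = html.indexOf("<iframe", from)) != -1) {
        System.out.println("iframe found at character offset " + from);
        count++;
        from += "<iframe".length();   // continue searching after this match
    }
    System.out.println("Total iframes found: " + count);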

I know this is a bad way of extracting iframes from an HTML page. I tried many other ways to extract iframes and other embedded codes from HTML pages, but failed. After going through the source code I found a Java class, HtmlContentHandler, which can satisfy my requirement. It looks like I have to call its startElement method with the necessary parameters in order to get the required codes:

    public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
    }

So in my visit method I created an HtmlContentHandler object and tried to call the startElement method mentioned above:

    HtmlContentHandler ecode = new HtmlContentHandler();
    ecode.startElement(url, localName, qName, attributes);

Now the problem is with the parameters of that method. For the first parameter (uri) I am sending the URL that was crawled, but I have no idea what values I have to send for the rest of the parameters!

Can someone help me with this? One more thing: I know that many other tools could make my work easier, but I want to do this with crawler4j instead!

Thank you!!

1 Answer

Answer by bosnjak (accepted):

I don't use Java very much, and I haven't used crawler4j, but here are my two cents.

The class you refer to, HtmlContentHandler, is used by HtmlParser as the actual handler for extracting links from the parsed web page.
That said, you are not the one who should call the startElement() method; rather, it is called by the parser for each element it encounters. When it is called, the arguments are populated to tell you the specifics of that element.
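
For example (just my guess at typical values; the namespace uri in particular depends on the parser that is used), for a tag like `<iframe src="http://example.com/frame.html">` the callback would receive something along these lines:

    // Hypothetical illustration of the arguments for <iframe src="http://example.com/frame.html">:
    // uri        -> the element's namespace, e.g. "http://www.w3.org/1999/xhtml" (parser dependent)
    // localName  -> "iframe"
    // qName      -> "iframe" (possibly prefixed, e.g. "html:iframe")
    // attributes -> a SAX Attributes object; attributes.getValue("src") gives "http://example.com/frame.html"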
Here is an example of this (not tested; I don't really know what I'm doing):

    // Needed imports (assuming crawler4j's HtmlContentHandler and the Tika classes it is built on):
    // import java.io.ByteArrayInputStream; import java.io.InputStream;
    // import org.apache.tika.metadata.Metadata; import org.apache.tika.parser.ParseContext;
    // import org.apache.tika.parser.html.HtmlParser;
    // import edu.uci.ics.crawler4j.parser.HtmlContentHandler;

    HtmlParser htmlParser = new HtmlParser();
    HtmlContentHandler contentHandler = new HtmlContentHandler();
    // I presume the `Page page` is present in the scope
    InputStream inputStream = new ByteArrayInputStream(page.getContentData());
    Metadata metadata = new Metadata();
    ParseContext parseContext = new ParseContext();
    // and finally parse (parse() can throw IOException, SAXException and TikaException)
    htmlParser.parse(inputStream, contentHandler, metadata, parseContext);

If you want to modify the behavior of the content handler, you should extend a ContentHandler implementation and override startElement() yourself, in a similar manner to what HtmlContentHandler does. You can do it just to investigate the contents of those function arguments if you like; it should give you a better understanding. A rough sketch of such a handler follows.
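
For example, a handler that collects iframe sources (and flags base64 data: URIs) could look roughly like this. The class name IframeCollectingHandler is made up, and it extends the plain SAX DefaultHandler instead of crawler4j's HtmlContentHandler, so it does nothing except the iframe bookkeeping:

    import java.util.ArrayList;
    import java.util.List;

    import org.xml.sax.Attributes;
    import org.xml.sax.SAXException;
    import org.xml.sax.helpers.DefaultHandler;

    public class IframeCollectingHandler extends DefaultHandler {

        private final List<String> iframeSources = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName, String qName,
                                 Attributes attributes) throws SAXException {
            // The parser calls this once for every opening tag; we only care about <iframe>.
            if ("iframe".equalsIgnoreCase(localName) || "iframe".equalsIgnoreCase(qName)) {
                String src = attributes.getValue("src");
                if (src != null) {
                    if (src.startsWith("data:")) {
                        // data: URIs are where base64-encoded content usually hides
                        System.out.println("embedded (possibly base64) iframe found");
                    }
                    iframeSources.add(src);
                }
            }
        }

        public List<String> getIframeSources() {
            return iframeSources;
        }
    }

You would then pass an instance of this handler as the second argument to htmlParser.parse(...) above, and read getIframeSources() once the parsing is done.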

But then again, I might be completely wrong :)