I am using crawler4j to crawl some websites and it is working fine. I am able to download all the files present on a website, and now I have a new task ahead of me: I need to extract iframes, base64 data and other embedded code as well, if possible.
This is what I am doing so far in my visit method:
String place = "<iframe";
if (page.getParseData() instanceof HtmlParseData) {
    HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
    String text = htmlParseData.getText();
    String html = htmlParseData.getHtml();
    List<WebURL> links = htmlParseData.getOutgoingUrls();
    // System.out.println("html source code: " + html);
    // Split the raw HTML on whitespace and compare each token with "<iframe".
    String[] result = html.split("\\s+");
    // Loop over the number of tokens, not html.length(), to stay inside the array.
    for (int i = 0; i < result.length; i++) {
        if (result[i].equals(place)) {
            System.out.println("iframe found at token " + i);
        }
    }
    System.out.println("Text length: " + text.length());
    System.out.println("Html length: " + html.length());
    System.out.println("Number of outgoing links: " + links.size());
}
I added the code above to find the iframes in the given HTML page. It works almost perfectly.
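For comparison, the same string-scanning idea can be sketched with `java.util.regex` instead of splitting on whitespace. This is only a rough sketch: the class name `EmbeddedCodeScanner` and its methods are made up for illustration, and regular expressions on HTML are fragile (they do not understand comments, scripts or unusual quoting), so treat it as a quick check rather than a real parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of crawler4j.
class EmbeddedCodeScanner {
    // Opening <iframe ...> tag, matched case-insensitively.
    private static final Pattern IFRAME_TAG =
            Pattern.compile("<iframe\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // Data URI with a base64 payload, e.g. data:image/png;base64,AAAA
    private static final Pattern BASE64_DATA_URI =
            Pattern.compile("data:[^,\\s\"']*;base64,[A-Za-z0-9+/=]+", Pattern.CASE_INSENSITIVE);

    // Collects every match of the pattern in document order.
    private static List<String> findAll(Pattern pattern, String html) {
        List<String> matches = new ArrayList<>();
        Matcher m = pattern.matcher(html);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    static List<String> findIframeTags(String html) {
        return findAll(IFRAME_TAG, html);
    }

    static List<String> findBase64DataUris(String html) {
        return findAll(BASE64_DATA_URI, html);
    }
}
```

Inside the visit method this could be called as `EmbeddedCodeScanner.findIframeTags(html)` with the string returned by `htmlParseData.getHtml()`.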
I know that this is a bad way of extracting iframes from an HTML page. I tried many other ways to extract iframes and other embedded code from HTML pages but failed. After going through the source code I found a Java class that can satisfy my requirement. As you can see from the URL above, I have to call the startElement method of the HtmlContentHandler class with the necessary parameters in order to get the required code.
`public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException { }`
So in my visit method I created an HtmlContentHandler object and tried to call the startElement method mentioned above.
HtmlContentHandler ecode=new HtmlContentHandler();
ecode.startElement(url,localName,qName,attributes);
Now the problem is with the parameters of that method. I am passing the crawled URL as the uri parameter, and I have no idea what values I have to send for the rest of the parameters.
Can someone help me with this? One more thing: I know that many other tools could make my work easier, but I want to do this in crawler4j.
Thank you!!
I don't use Java very much, and I haven't used crawler4j, but here are my two cents.
The class you refer to, `HtmlContentHandler`, is a class that is used by `HtmlParser` as an actual handler for extracting the links from the parsed web page.
That said, you are not the one that should call the `startElement()` function, but rather it will be called by the parser for each element that it encounters. And when called, those arguments are populated to let you know the specifics of the element.
This would be an example of this (not tested, I don't really know what I'm doing):
If you want to modify the behavior of the content handler, you should write your own `ContentHandler` and override `startElement()` yourself, in a similar manner to what `HtmlContentHandler` does. You can do it just to investigate the content of those function arguments if you like; it should give you a better understanding...
But then again, I might be completely wrong :)
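To make the callback idea concrete, here is a minimal sketch of a handler that overrides `startElement()` and records the `src` of every iframe it is told about. It uses the SAX parser that ships with the JDK as a stand-in, on the assumption that crawler4j's HTML parser drives its content handler the same way. The class name `IframeCollectingHandler` and the `collect` helper are made up for illustration, and the JDK parser only accepts well-formed markup, which real web pages often are not.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler, not part of crawler4j.
class IframeCollectingHandler extends DefaultHandler {
    private final List<String> iframeSources = new ArrayList<>();

    // Called by the parser once per opening tag; we never call it ourselves.
    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        // The default JDK factory is not namespace-aware, so the tag name arrives in qName.
        if ("iframe".equalsIgnoreCase(qName)) {
            String src = attributes.getValue("src");
            iframeSources.add(src == null ? "" : src);
        }
    }

    List<String> getIframeSources() {
        return iframeSources;
    }

    // Parses a well-formed XHTML string and returns the iframe sources in document order.
    static List<String> collect(String xhtml) {
        IframeCollectingHandler handler = new IframeCollectingHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input", e);
        }
        return handler.getIframeSources();
    }
}
```

Note that the handler is only passed to the parser; the parser fills in `uri`, `localName`, `qName` and `attributes` for each element, which is why there is nothing sensible to pass when calling `startElement()` by hand.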