How to download text contained in JavaScript files via crawler4j?


I'm trying to use crawler4j to extract text from some websites. So far I have changed FILTERS so that URLs ending in .js are no longer excluded:

 // "js" is not listed, so URLs ending in .js are not filtered out
 private static final Pattern FILTERS = Pattern.compile(".*(\\.(css|gif|jpg"
        + "|png|mp3|zip|gz))$");

I do not know how to store this text in a file, or whether text in .js files requires a different method than regular page text.


There is 1 answer

rzo1 On

"visit" is called after the web crawler has successfully processed a page. The fetched content is then available in the Page object that is passed to it.

You can then use the methods that Page provides to write out the crawled JavaScript content, e.g. by reading the binary content with page.getContentData() and parsing or storing it.

 @Override
 public void visit(Page page) {
     // raw bytes of the fetched resource (here: the JavaScript file)
     byte[] content = page.getContentData();
     // parse or store the binary content here
 }
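As a rough sketch of the "write it down" step, the helper below stores the raw bytes of a crawled script in a file. JsContentStore, its method names, the hashed file name and the "crawl-output" directory are all made up for illustration and are not part of crawler4j. The crawler4j side is shown only as a comment so that the snippet compiles on its own.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Locale;

// Hypothetical helper, not part of crawler4j.
final class JsContentStore {

    private JsContentStore() {}

    /** True if the URL, ignoring any query string or fragment, ends in ".js". */
    static boolean isJavaScriptUrl(String url) {
        String path = url.split("[?#]", 2)[0];
        return path.toLowerCase(Locale.ROOT).endsWith(".js");
    }

    /**
     * Writes the bytes to dir/<hash-of-url>.js and returns the file path.
     * Hashing the URL is only a simple way to get a file-system-safe name;
     * two URLs with the same hash code would overwrite each other.
     */
    static Path save(Path dir, String url, byte[] content) throws IOException {
        Files.createDirectories(dir);
        String name = Integer.toHexString(url.hashCode()) + ".js";
        return Files.write(dir.resolve(name), content);
    }
}

// Assumed usage inside your WebCrawler subclass (needs crawler4j on the classpath):
//
// @Override
// public void visit(Page page) {
//     String url = page.getWebURL().getURL();
//     if (JsContentStore.isJavaScriptUrl(url)) {
//         try {
//             JsContentStore.save(Paths.get("crawl-output"), url, page.getContentData());
//         } catch (IOException e) {
//             // log and continue with the next page
//         }
//     }
// }
```

Working on the raw bytes avoids having to decide how the script was parsed. Depending on the content type the server sends for the scripts, you may also need config.setIncludeBinaryContentInCrawling(true) in the controller, as the image crawler example does; check this against your crawler4j version.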

An example (it deals with images, but the approach is basically the same) can be found here: https://github.com/yasserg/crawler4j/blob/master/src/test/java/edu/uci/ics/crawler4j/examples/imagecrawler/ImageCrawler.java
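To confirm that the filter from the question really lets .js URLs through, a small stand-alone check like the one below can help. FilterCheck and passesFilter are hypothetical names, the URLs are placeholders, and the method only imitates the common "visit everything that does not match FILTERS" idiom; it is not crawler4j code.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical stand-alone check, not part of crawler4j.
final class FilterCheck {

    // Equivalent to the pattern in the question: "js" is not in the list.
    static final Pattern FILTERS =
            Pattern.compile(".*(\\.(css|gif|jpg|png|mp3|zip|gz))$");

    /** A URL passes (would be visited) if it does NOT match FILTERS. */
    static boolean passesFilter(String url) {
        return !FILTERS.matcher(url.toLowerCase(Locale.ROOT)).matches();
    }

    public static void main(String[] args) {
        System.out.println(passesFilter("https://example.com/static/app.js")); // true
        System.out.println(passesFilter("https://example.com/logo.png"));      // false
    }
}
```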