Blocking Task on Java web application, and request timeout on Heroku server


I am new to Java web programming and am trying to build a web crawler based on the Crawler4j sample code.

My problem is that when I submit the POST request, the crawling task (which is blocking) takes some time to finish. Heroku enforces a request timeout of 30 seconds, which makes it impossible to run the crawl synchronously. The same program works fine on my local machine.

From what I read, it's not possible to change Heroku's timeout with the basic/free offer.

I was wondering whether it is possible to launch this as an asynchronous task (I know it can be started with CrawlController.startNonBlocking()) and then wait for it to finish, so that I can show the results of the crawling operation.

@Override
protected void doPost(HttpServletRequest request, HttpServletResponse response)
        throws ServletException, IOException {

    String url = request.getParameter("url");

    // Crawl configuration
    CrawlConfig config = new CrawlConfig();
    String crawlStorageFolder = "/tmp/temp_storage";
    config.setCrawlStorageFolder(crawlStorageFolder);
    config.setPolitenessDelay(1);      // delay between requests, in milliseconds
    config.setMaxDepthOfCrawling(2);
    config.setMaxPagesToFetch(5);
    config.setResumableCrawling(false);

    int numberOfCrawlers = 1;

    PageFetcher pageFetcher = new PageFetcher(config);
    RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
    RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);

    CrawlController controller;
    try {
        controller = new CrawlController(config, pageFetcher, robotstxtServer);
    } catch (Exception e) {
        // Fail the request instead of continuing with a null controller
        throw new ServletException("Could not create the crawl controller", e);
    }

    controller.addSeed(url);

    // Blocks until the crawl has finished
    controller.start(Crawler.class, numberOfCrawlers);
    // Methods showing the results of the crawling ...
}

Answer by Matthias Steinbauer (best answer)

Hi, you have essentially answered the question yourself: you should use some sort of background job to perform the crawling. However, you should not run it in the web tier. Heroku has dedicated worker roles for that.

The basic idea is that your browser talks to the web process. The web process instructs a background worker to perform the job and reports the successful job submission back to the user's browser. You then use some JavaScript to call back to the web frontend at regular intervals and check the progress of the background job.
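As a rough illustration of that flow, here is a minimal sketch in plain Java. Everything in it is hypothetical: the class CrawlJobQueue and its methods submit, status, result and runNext are names invented for this example, and the in-memory maps only stand in for whatever shared queue or database sits between the web process and the worker. On Heroku the two run as separate processes, so a real setup needs an external queue as described in the article linked below.

```java
import java.util.Map;
import java.util.Queue;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.function.Function;

// Hypothetical sketch: an in-memory stand-in for the shared queue/store
// that would sit between the web process and the worker process.
class CrawlJobQueue {

    enum Status { QUEUED, RUNNING, DONE, UNKNOWN }

    private final Queue<String> pending = new ConcurrentLinkedQueue<>();
    private final Map<String, String> seeds = new ConcurrentHashMap<>();
    private final Map<String, Status> statuses = new ConcurrentHashMap<>();
    private final Map<String, String> results = new ConcurrentHashMap<>();

    // Web side (e.g. called from doPost): record the job and return at once,
    // so the HTTP request finishes well within the timeout.
    String submit(String seedUrl) {
        String jobId = UUID.randomUUID().toString();
        seeds.put(jobId, seedUrl);
        statuses.put(jobId, Status.QUEUED);
        pending.add(jobId);
        return jobId;
    }

    // Web side: what a polling endpoint would report back to the browser.
    Status status(String jobId) {
        return statuses.getOrDefault(jobId, Status.UNKNOWN);
    }

    // Web side: the stored result, or null while the job is not done.
    String result(String jobId) {
        return results.get(jobId);
    }

    // Worker side: take the next job, run the (blocking) crawl, store the result.
    // Returns false when there is nothing to do.
    boolean runNext(Function<String, String> crawl) {
        String jobId = pending.poll();
        if (jobId == null) {
            return false;
        }
        statuses.put(jobId, Status.RUNNING);
        results.put(jobId, crawl.apply(seeds.get(jobId)));
        statuses.put(jobId, Status.DONE);
        return true;
    }
}
```

In this sketch the servlet would call submit and send the job id back to the browser, the browser would poll an endpoint backed by status and result, and the worker would call runNext in a loop, passing a function that wraps the blocking crawl.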

On cloud stacks like Heroku, please refrain from using library-based background jobs that might launch background threads in the web tier. This is usually not supported on cloud stacks and is bad practice on any other web stack.

The recommended approach is laid out in this Heroku help article, and its "Approach" section explains it with a sequence diagram:

https://devcenter.heroku.com/articles/background-jobs-queueing

https://devcenter.heroku.com/articles/background-jobs-queueing#approach

Sorry that this is not a straightforward code example. Still, I hope it helps.