How to write a cron job for Heritrix3 web crawling?


I built a job to crawl web data with Heritrix 3.0. To run it, I have to start Heritrix.java as a Java application so that the server comes up, open a browser at https://localhost:8443 to build the job, launch the job, and then unpause it. How can I set up a cron job so that the crawl runs automatically? Please use Java.

Answered by Du-Lacoste

I automated this for my FYP. You can use Java, but the Heritrix documentation describes these calls as curl requests, so the easiest and fastest approach is a shell script that invokes curl.

Get Current Status of Engine:

curl -v -k -u admin:admin --anyauth --location -H "Accept: application/xml" https://localhost:8443/engine
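
For unattended runs it can help to confirm that the engine answers before sending any actions. The following is a minimal sketch, assuming the default URL and credentials used in the command above; the function name engine_is_up and the CURL override variable are hypothetical names introduced only for illustration.

```shell
#!/bin/sh
# Sketch: check that the Heritrix engine answers before doing anything else.
# ENGINE_URL and CREDENTIALS mirror the values used in the curl commands here;
# engine_is_up and CURL are illustrative names, not part of Heritrix.
ENGINE_URL="${ENGINE_URL:-https://localhost:8443/engine}"
CREDENTIALS="${CREDENTIALS:-admin:admin}"
CURL="${CURL:-curl}"

engine_is_up() {
    # -o /dev/null discards the response body;
    # -w '%{http_code}' prints only the final HTTP status code.
    code=$("$CURL" -s -k -u "$CREDENTIALS" --anyauth --location \
        -H "Accept: application/xml" -o /dev/null -w '%{http_code}' "$ENGINE_URL")
    # Succeeds only when the engine resource returned 200.
    [ "$code" = "200" ]
}
```

A cron script could then start with something like `engine_is_up || exit 1`, so that nothing further is attempted while the server is down.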

Create new job for crawling in the Engine:

curl -v -d "createpath=myjob&action=create" -k -u admin:admin --anyauth --location \
-H "Accept: application/xml" https://localhost:8443/engine

Build the Job:

curl -v -d "action=build" -k -u admin:admin --anyauth --location -H "Accept: application/xml" https://localhost:8443/engine/job/myjob

Launch the Job:

curl -v -d "action=launch" -k -u admin:admin --anyauth --location -H "Accept: application/xml" https://localhost:8443/engine/job/myjob
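
To run this from cron, the calls can be chained in one script. The sketch below assumes the engine is already running (for example, started from the Heritrix distribution with bin/heritrix -a admin:admin) and that the job myjob was created once beforehand. It sends build, launch, and then unpause to the job resource; the launch and unpause action names follow the Heritrix REST API as I understand it. The names post_action, run_crawl, LAUNCH_WAIT, and DRY_RUN are hypothetical, and the wait before unpausing is a guess that may need tuning.

```shell
#!/bin/sh
# Sketch: build, launch and unpause an existing Heritrix 3 job without the web UI.
# Assumes the engine is already running and the job was created once beforehand.
# post_action, run_crawl, LAUNCH_WAIT and DRY_RUN are illustrative names.
ENGINE_URL="${ENGINE_URL:-https://localhost:8443/engine}"
CREDENTIALS="${CREDENTIALS:-admin:admin}"
JOB_NAME="${JOB_NAME:-myjob}"
LAUNCH_WAIT="${LAUNCH_WAIT:-10}"   # seconds to wait before unpausing (a guess)

post_action() {
    # $1 = action name; POSTs "action=<name>" to the job resource.
    url="$ENGINE_URL/job/$JOB_NAME"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Dry run: print what would be sent instead of contacting the server.
        echo "POST action=$1 $url"
        return 0
    fi
    # Note: curl's exit status here reflects transport failures only,
    # not HTTP error responses.
    curl -s -k -u "$CREDENTIALS" --anyauth --location \
        -H "Accept: application/xml" -d "action=$1" "$url" > /dev/null
}

run_crawl() {
    post_action build || return 1
    post_action launch || return 1
    # Give the job time to finish preparing before it is unpaused.
    sleep "$LAUNCH_WAIT"
    post_action unpause
}

# Only contact the server when invoked as: ./run_crawl.sh run
if [ "${1:-}" = "run" ]; then
    run_crawl
fi
```

Saved as, say, run_crawl.sh, a crontab entry such as `0 2 * * * /path/to/run_crawl.sh run >> /tmp/heritrix_cron.log 2>&1` would start the crawl every day at 02:00; the path, log file, and schedule are placeholders. Running it with DRY_RUN=1 prints the requests without sending them.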