How to write a cron job for Heritrix3 web crawling?

Question

Asked by 莫绮静 on 17 May 2017 at 08:34 (150 views)

I built a job to crawl web data with Heritrix 3.0. To run it, I currently have to start Heritrix.java as a Java application to bring up the server, open https://localhost:8443 in a browser to build the job, launch the job, and then unpause it. How can I set up a cron job that runs the crawl automatically? Please use Java.

There is 1 answer

**Du-Lacoste** · Answer 1 · 2023-05-06T03:14:54+00:00

I automated this for my FYP. You can use Java, but the Heritrix documentation describes these calls as curl commands, so the simplest and fastest approach is a shell script that invokes curl for each step.

Get the current status of the engine:

curl -v -k -u admin:admin --anyauth --location \
-H "Accept: application/xml" https://localhost:8443/engine

Create a new crawl job in the engine:

curl -v -d "createpath=myjob&action=create" -k -u admin:admin --anyauth --location \
-H "Accept: application/xml" https://localhost:8443/engine

Build the job:

curl -v -d "action=build" -k -u admin:admin --anyauth --location \
-H "Accept: application/xml" https://localhost:8443/engine/job/myjob

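Launch the job (note that the action is "launch" and is sent to the job URL; "action=rescan" against /engine only rescans the jobs directory):

curl -v -d "action=launch" -k -u admin:admin --anyauth --location \
-H "Accept: application/xml" https://localhost:8443/engine/job/myjob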
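
To connect this back to the cron part of the question, the calls above can be chained in one script that cron invokes. The following is only a sketch under some assumptions: the Heritrix server is already running, the job `myjob` has already been created, the `unpause` action is sent after launch because the question describes unpausing as a separate step, and the `DRY_RUN` switch, the function names, and the fixed 10-second wait are illustrative choices rather than anything Heritrix prescribes.

```shell
#!/usr/bin/env bash
# Sketch: drive a Heritrix 3 job through its REST API so cron can start a crawl.
# Assumptions: server already running, job already created; URL, credentials
# and job name below are placeholders taken from the answer's examples.

HERITRIX_URL="${HERITRIX_URL:-https://localhost:8443/engine}"
HERITRIX_AUTH="${HERITRIX_AUTH:-admin:admin}"
JOB="${JOB:-myjob}"

# POST one action to the job. With DRY_RUN=1, only print what would be sent
# (DRY_RUN is a hypothetical switch added here for safe testing).
job_action() {
  local action="$1"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "POST action=${action} ${HERITRIX_URL}/job/${JOB}"
    return 0
  fi
  curl -sS -f -k -u "$HERITRIX_AUTH" --anyauth --location \
       -H "Accept: application/xml" \
       -d "action=${action}" "${HERITRIX_URL}/job/${JOB}"
}

# Build, launch, then unpause; stop at the first step that fails.
run_crawl() {
  job_action build  || return 1
  job_action launch || return 1
  # Assumed pause so the launch can settle before unpausing; tune as needed.
  [ "${DRY_RUN:-0}" = "1" ] || sleep 10
  job_action unpause
}

# Only start a crawl when invoked as: crawl.sh run
if [ "${1:-}" = "run" ]; then
  run_crawl
fi
```

Assuming the script is saved as `/path/to/crawl.sh` (a hypothetical path) and made executable, a crontab entry such as `0 2 * * * /path/to/crawl.sh run >> /tmp/heritrix-crawl.log 2>&1` would start the crawl every day at 02:00. Running it once by hand with `DRY_RUN=1` first shows which requests would be sent without contacting the server.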
