I built a job to crawl web data with Heritrix 3.0. However, I have to run Heritrix.java
as a Java application to start the server, then open a browser at https://localhost:8443
to build my job, launch it, and finally unpause it. How can I set up a cron job so the web crawl runs automatically? Please use Java.
How to write a cron job for Heritrix3 web crawling?
Asked by 莫绮静
I have this automated for my FYP. You can use Java, but according to the Heritrix documentation the calls are cURL requests, so the best, easiest, and fastest approach is to use shell scripts that invoke cURL to get the task done.
Get Current Status of Engine:
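A minimal sketch of the status call, wrapped in a shell function so it can be reused from a script. The URL comes from the question; the `admin:admin` credentials and the `HERITRIX_URL` / `HERITRIX_AUTH` variable names are placeholders I am assuming here, so substitute whatever you started Heritrix with.

```shell
#!/bin/sh
# Assumed values (not from the original post): adjust to your setup.
HERITRIX_URL="${HERITRIX_URL:-https://localhost:8443}"   # URL from the question
HERITRIX_AUTH="${HERITRIX_AUTH:-admin:admin}"            # placeholder credentials

# GET /engine returns the engine status as XML.
# -k accepts Heritrix's self-signed certificate; --anyauth lets curl
# negotiate the digest authentication that Heritrix uses.
engine_status() {
    curl -s -k -u "$HERITRIX_AUTH" --anyauth --location \
        -H "Accept: application/xml" \
        "$HERITRIX_URL/engine"
}
```

Calling `engine_status` from a script prints the XML response, which is a quick way to check that the engine is up before doing anything else.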
Create new job for crawling in the Engine:
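Roughly, creating a job is a POST to the engine with `action=create` and a `createpath` naming the job. This is a sketch under the same assumptions as before: placeholder credentials, and a job name (`myjob` in the usage note) that is purely illustrative.

```shell
#!/bin/sh
# Assumed values (not from the original post): adjust to your setup.
HERITRIX_URL="${HERITRIX_URL:-https://localhost:8443}"
HERITRIX_AUTH="${HERITRIX_AUTH:-admin:admin}"            # placeholder credentials

# POST action=create to /engine; createpath is the name of the new job.
# Usage: create_job <job-name>
create_job() {
    job_name="$1"
    curl -s -k -u "$HERITRIX_AUTH" --anyauth --location \
        -d "createpath=${job_name}&action=create" \
        "$HERITRIX_URL/engine"
}
```

For example, `create_job myjob`. The new job's `crawler-beans.cxml` still needs your seeds and settings before it can be built.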
Build the Job:
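Building should look something like the following: a POST with `action=build` to the job's own URL under `/engine/job/`. As above, the credentials and the job name are assumptions for illustration.

```shell
#!/bin/sh
# Assumed values (not from the original post): adjust to your setup.
HERITRIX_URL="${HERITRIX_URL:-https://localhost:8443}"
HERITRIX_AUTH="${HERITRIX_AUTH:-admin:admin}"            # placeholder credentials

# POST action=build to /engine/job/<job-name> to build the job's configuration.
# Usage: build_job <job-name>
build_job() {
    job_name="$1"
    curl -s -k -u "$HERITRIX_AUTH" --anyauth --location \
        -d "action=build" \
        "$HERITRIX_URL/engine/job/${job_name}"
}
```

For example, `build_job myjob`.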
Launch the Job:
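A sketch of the launch call, plus the unpause step the question mentions, since a launched job waits in the paused state until it is unpaused. Both are POSTs to the job URL. The credentials, job name, and function names are assumptions of mine, not something taken from the Heritrix docs.

```shell
#!/bin/sh
# Assumed values (not from the original post): adjust to your setup.
HERITRIX_URL="${HERITRIX_URL:-https://localhost:8443}"
HERITRIX_AUTH="${HERITRIX_AUTH:-admin:admin}"            # placeholder credentials

# Send a single action (launch, unpause, ...) to /engine/job/<job-name>.
# Usage: job_action <job-name> <action>
job_action() {
    curl -s -k -u "$HERITRIX_AUTH" --anyauth --location \
        -d "action=$2" \
        "$HERITRIX_URL/engine/job/$1"
}

# Launch a built job.
launch_job()  { job_action "$1" launch; }

# Resume a job that is sitting in the paused state after launch.
unpause_job() { job_action "$1" unpause; }
```

A script would call `launch_job myjob` and then `unpause_job myjob`, possibly with a short `sleep` in between so the job has time to reach the paused state. To run it on a schedule, save the calls in a script and add a crontab line such as `0 2 * * * /path/to/run_crawl.sh` (the path and the 02:00 schedule are hypothetical). Heritrix itself must already be running for any of these calls to work.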