Nutch 2.3 REST curl syntax


I'm trying to use curl to test the Nutch 2.x REST API. I can start the nutchserver and inject URLs, but I can't get the generate step to work.

Here's what I've done:

curl -i -X POST -H "Content-Type:application/json" http://localhost:8081/job/create -d '{"crawlId":"crawl-01","type":"INJECT","confId":"default","args":{"seedDir":"/Users/username/myNutchFolder/apache-nutch-2.3/runtime/local/urls/"}}'

When I look at the jobs, this one shows that it finished and injected the expected number of URLs.

Then I try to generate using:

curl -i -X POST -H "Content-Type:application/json" http://localhost:8081/job/create -d '{"crawlId":"crawl-01","type":"GENERATE","confId":"default","args":{}}'

This fails with the following job status:

{
    "args": {},
    "confId": "default",
    "crawlId": "crawl-01",
    "id": "crawl-01-default-GENERATE-94689123",
    "msg": "ERROR: java.lang.RuntimeException: job failed: name=[crawl-01]generate: null, jobid=job_local473690964_0003",
    "result": null,
    "state": "FAILED",
    "type": "GENERATE"
},

I can't find any documentation beyond the official API page, https://wiki.apache.org/nutch/NutchRESTAPI#Create_job, so I was hoping someone here knows how to use the REST API to run a crawl (inject, generate, fetch, parse, updatedb). Any help in understanding why my generate job failed would be greatly appreciated.


There is 1 answer

jgloves On BEST ANSWER

From the user mailing list, I learned that the args to use for generate are:

"normalize": boolean
"filter": boolean
"crawlId": String
"curTime": long
"batch": String