scala nutch gora-cassandra - RuntimeException: job failed


I'm trying to run Nutch and load the crawled data into Cassandra.

I've got these dependencies in my sbt file

"org.apache.gora" % "gora-cassandra" % "0.3",
"org.apache.nutch" % "nutch" % "2.2.1",
"com.datastax.cassandra" % "cassandra-driver-core" % "2.1.2"

I am kicking off the job like this:

ToolRunner.run(NutchConfiguration.create(), new Crawler(), Array("urls"));

However, I am hitting this rather vague error (edit: updated to show the full logs from the start of the request):

[Ljava.lang.String;@526950c7
****file:/home/abdev/Working/Qordaoba/gl/web-crawling-services/crawling-services/urls
[error] play - Cannot invoke the action, eventually got an error: java.lang.RuntimeException: job failed: name=generate: null, jobid=job_local_0002
[error] application - 

! @6kemm159h - Internal server error, for (POST) [/nutch/job] ->

play.api.Application$$anon$1: Execution exception[[RuntimeException: job failed: name=generate: null, jobid=job_local_0002]]
    at play.api.Application$class.handleError(Application.scala:296) ~[play_2.11-2.3.6.jar:2.3.6]
    at play.api.DefaultApplication.handleError(Application.scala:402) [play_2.11-2.3.6.jar:2.3.6]
    at play.core.server.netty.PlayDefaultUpstreamHandler$$anonfun$3$$anonfun$applyOrElse$4.apply(PlayDefaultUpstreamHandler.scala:320) [play_2.11-2.3.6.jar:2.3.6]
    at play.core.server.netty.PlayDefaultUpstreamHandler$$anonfun$3$$anonfun$applyOrElse$4.apply(PlayDefaultUpstreamHandler.scala:320) [play_2.11-2.3.6.jar:2.3.6]
    at scala.Option.map(Option.scala:145) [scala-library-2.11.1.jar:na]
Caused by: java.lang.RuntimeException: job failed: name=generate: null, jobid=job_local_0002
    at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:54) ~[nutch-2.2.1.jar:na]
    at org.apache.nutch.crawl.GeneratorJob.run(GeneratorJob.java:199) ~[nutch-2.2.1.jar:na]
    at org.apache.nutch.crawl.Crawler.runTool(Crawler.java:68) ~[nutch-2.2.1.jar:na]
    at org.apache.nutch.crawl.Crawler.run(Crawler.java:152) ~[nutch-2.2.1.jar:na]
    at org.apache.nutch.crawl.Crawler.run(Crawler.java:250) ~[nutch-2.2.1.jar:na]

In Cassandra, the keyspace webpage and the tables sc, p and f are created before the error is thrown.

Edit: if I put all of the jars listed below (sorry, it's a long list) in my lib folder, the job runs, and the first few log lines are about connecting to Cassandra. I don't see those log lines when I rely on the sbt dependencies alone.
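One way to compare the two setups might be to check where a given class is actually loaded from in each case. The following is only a minimal sketch using plain JDK reflection; the object name `WhereIsClass` and the `locate` helper are made up for illustration, and the two class names are simply the ones that appear in the logs and stack trace in this question. It assumes it is run from inside the application, so that it sees the same classloader as the crawl job.

```scala
object WhereIsClass {
  // Hypothetical helper: returns the jar or directory a class was loaded from,
  // or None if the class cannot be loaded at all.
  def locate(className: String): Option[String] =
    try {
      val cls = Class.forName(className)
      // The code source is null for classes loaded by the bootstrap loader (JDK classes).
      Option(cls.getProtectionDomain.getCodeSource)
        .flatMap(cs => Option(cs.getLocation))
        .map(_.toString)
        .orElse(Some("<bootstrap / JDK>"))
    } catch {
      // NoClassDefFoundError and similar failures are LinkageErrors.
      case _: ClassNotFoundException | _: LinkageError => None
    }

  def main(args: Array[String]): Unit = {
    val names = Seq(
      "org.apache.gora.cassandra.store.CassandraStore", // named in the InjectorJob log line
      "org.apache.nutch.crawl.GeneratorJob"             // named in the stack trace
    )
    names.foreach { n =>
      println(s"$n -> ${locate(n).getOrElse("NOT FOUND")}")
    }
  }
}
```

Running this once with only the sbt dependencies and once with the lib folder populated should show whether the same classes resolve to different jars, or fail to resolve, in the failing setup.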

Logs when running with the jar files below:

SLF4J: The following set of substitute loggers may have been accessed
SLF4J: during the initialization phase. Logging calls during this
SLF4J: phase were not honored. However, subsequent logging calls to these
SLF4J: loggers will work as normally expected.
SLF4J: See also http://www.slf4j.org/codes.html#substituteLogger
SLF4J: org.webjars.WebJarExtractor
[info] Compiling 5 Scala sources and 1 Java source to /home/abdev/Working/Qordaoba/gl/web-crawling-services/crawling-services/target/scala-2.11/classes...
14/12/10 07:31:03 INFO play: Application started (Dev)
14/12/10 07:31:03 INFO slf4j.Slf4jLogger: Slf4jLogger started
[Ljava.lang.String;@3a6f1296
14/12/10 07:31:05 INFO connection.CassandraHostRetryService: Downed Host Retry service started with queue size -1 and retry delay 10s
14/12/10 07:31:05 INFO service.JmxMonitor: Registering JMX me.prettyprint.cassandra.service_Test Cluster:ServiceType=hector,MonitorType=hector
14/12/10 07:31:06 INFO crawl.InjectorJob: InjectorJob: Using class org.apache.gora.cassandra.store.CassandraStore as the Gora storage class.
14/12/10 07:31:06 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/12/10 07:31:06 INFO input.FileInputFormat: Total input paths to process : 1

Full list of jar files:

activation-1.1.jar
antlr-3.2.jar
aopalliance-1.0.jar
apache-cassandra-1.2.19.jar
apache-cassandra-clientutil-1.2.19.jar
apache-cassandra-thrift-1.2.19.jar
apache-nutch-2.2.1.jar
asm-3.2.jar
avro-1.3.3.jar
commons-beanutils-1.7.0.jar
commons-beanutils-core-1.8.0.jar
commons-cli-1.1.jar
commons-cli-1.2.jar
commons-codec-1.2.jar
commons-codec-1.4.jar
commons-collections-3.2.1.jar
commons-configuration-1.6.jar
commons-digester-1.8.jar
commons-el-1.0.jar
commons-httpclient-3.1.jar
commons-io-2.4.jar
commons-lang-2.6.jar
commons-logging-1.1.1.jar
commons-math-2.1.jar
commons-net-1.4.1.jar
compress-lzf-0.8.4.jar
concurrentlinkedhashmap-lru-1.3.jar
cql-internal-only-1.4.1.zip
crawler-commons-0.2.jar
cxf-api-2.5.2.jar
cxf-common-utilities-2.5.2.jar
cxf-rt-bindings-xml-2.5.2.jar
cxf-rt-core-2.5.2.jar
cxf-rt-frontend-jaxrs-2.5.2.jar
cxf-rt-transports-common-2.5.2.jar
cxf-rt-transports-http-2.5.2.jar
elasticsearch-0.19.4.jar
geronimo-javamail_1.4_spec-1.7.1.jar
geronimo-stax-api_1.0_spec-1.0.1.jar
gora-cassandra-0.3.jar
gora-core-0.3.jar
guava-11.0.2.jar
guava-13.0.1.jar
hadoop-core-1.2.0.jar
hamcrest-core-1.3.jar
hector-core-1.1-4.jar
high-scale-lib-1.1.2.jar
hsqldb-2.2.8.jar
httpclient-4.1.1.jar
httpcore-4.1.jar
icu4j-4.0.1.jar
jackson-core-asl-1.8.8.jar
jackson-core-asl-1.9.2.jar
jackson-jaxrs-1.7.1.jar
jackson-mapper-asl-1.8.8.jar
jackson-mapper-asl-1.9.2.jar
jackson-xc-1.7.1.jar
jamm-0.2.5.jar
jaxb-api-2.2.2.jar
jaxb-impl-2.2.3-1.jar
jbcrypt-0.3m.jar
jdom-1.1.jar
jersey-core-1.8.jar
jersey-json-1.8.jar
jersey-server-1.8.jar
jettison-1.3.1.jar
jetty-6.1.26.jar
jetty-client-6.1.26.jar
jetty-sslengine-6.1.26.jar
jetty-util5-6.1.26.jar
jetty-util-6.1.26.jar
jline-0.9.1.jar
jline-1.0.jar
json-simple-1.1.jar
jsr305-1.3.9.jar
jsr311-api-1.1.1.jar
junit-4.11.jar
juniversalchardet-1.0.3.jar
libthrift-0.7.0.jar
log4j-1.2.16.jar
lucene-analyzers-3.6.0.jar
lucene-core-3.6.0.jar
lucene-highlighter-3.6.0.jar
lucene-memory-3.6.0.jar
lucene-queries-3.6.0.jar
lz4-1.1.0.jar
metrics-core-2.2.0.jar
neethi-3.0.1.jar
org.osgi.core-4.0.0.jar
org.restlet.ext.jackson-2.0.5.jar
org.restlet-2.0.5.jar
oro-2.0.8.jar
paranamer-2.2.jar
paranamer-ant-2.2.jar
paranamer-generator-2.2.jar
qdox-1.10.1.jar
serializer-2.7.1.jar
servlet-api-2.5-6.1.14.jar
servlet-api-2.5-20081211.jar
slf4j-api-1.6.6.jar
slf4j-api-1.7.2.jar
slf4j-log4j12-1.6.1.jar
slf4j-log4j12-1.7.2.jar
snakeyaml-1.6.jar
snappy-java-1.0.5.jar
snaptree-0.1.jar
solr-solrj-3.4.0.jar
spring-aop-3.0.6.RELEASE.jar
spring-asm-3.0.6.RELEASE.jar
spring-beans-3.0.6.RELEASE.jar
spring-context-3.0.6.RELEASE.jar
spring-core-3.0.6.RELEASE.jar
spring-expression-3.0.6.RELEASE.jar
spring-web-3.0.6.RELEASE.jar
stax2-api-3.1.1.jar
stax-api-1.0.1.jar
stax-api-1.0-2.jar
thrift-python-internal-only-0.7.0.zip
tika-core-1.3.jar
woodstox-core-asl-4.1.1.jar
wsdl4j-1.6.2.jar
wstx-asl-3.2.7.jar
xercesImpl-2.9.1.jar
xml-apis-1.3.04.jar
xmlenc-0.52.jar
xmlParserAPIs-2.6.2.jar
xmlschema-core-2.0.1.jar
zookeeper-3.3.1.jar
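
The list above contains more than one version of some artifacts (for example guava-11.0.2.jar and guava-13.0.1.jar). Whether that has anything to do with the failure is not established here, but it is easy to enumerate them. This is a rough sketch: the object name `DuplicateJars` and the file-name pattern are assumptions for illustration, and the pattern simply treats everything before the first "-<digit>" as the artifact name.

```scala
object DuplicateJars {
  // Assumed naming scheme: <artifact>-<version starting with a digit>.jar or .zip
  private val JarName = """^(.+?)-(\d[\w.\-]*)\.(?:jar|zip)$""".r

  // Returns each artifact that appears with more than one version, with its versions.
  def duplicates(fileNames: Seq[String]): Map[String, Seq[String]] =
    fileNames
      .collect { case JarName(artifact, version) => artifact -> version }
      .groupBy(_._1)
      .map { case (artifact, pairs) => artifact -> pairs.map(_._2).distinct.sorted }
      .filter { case (_, versions) => versions.size > 1 }

  def main(args: Array[String]): Unit = {
    // A few names copied from the list above; pass the full list to check everything.
    val sample = Seq(
      "guava-11.0.2.jar", "guava-13.0.1.jar",
      "slf4j-api-1.6.6.jar", "slf4j-api-1.7.2.jar",
      "log4j-1.2.16.jar"
    )
    duplicates(sample).toSeq.sortBy(_._1).foreach { case (artifact, versions) =>
      println(s"$artifact: ${versions.mkString(", ")}")
    }
  }
}
```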

Thanks, Brent
