I am trying to implement the Hortonworks data pipeline example on an actual cluster. I have HDP 2.2 installed on the cluster, but I am getting the following error in the UI for the Processes and Datasets tabs:
Failed to load data. Error: 400 Bad Request
I have all services running except for HBase, Kafka, Knox, Ranger, Slider and Spark.
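Since the UI only reports a bare 400, the same entities can also be validated and submitted from the Falcon CLI, which usually prints the underlying cause; a minimal sketch, assuming the definitions below are saved locally under hypothetical filenames and run as the falcon user on a node with the Falcon client:
# Hypothetical local filenames; validate first, then submit cluster -> feeds -> processes.
falcon entity -type cluster -file cluster.xml -validate
falcon entity -type cluster -file cluster.xml -submit
falcon entity -type feed -file rawEmailFeed.xml -submit
falcon entity -type feed -file cleansedEmailFeed.xml -submit
falcon entity -type process -file rawEmailIngestProcess.xml -submitAndSchedule
falcon entity -type process -file cleanseEmailProcess.xml -submitAndSchedule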
I have read the Falcon entity specification, which describes the individual tags for the cluster, feed, and process definitions, and I modified the XML configuration files for the cluster, feeds, and processes as follows:
Cluster Definition
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<cluster name="primaryCluster" description="Analytics1" colo="Bangalore" xmlns="uri:falcon:cluster:0.1">
<interfaces>
<interface type="readonly" endpoint="hftp://node3.com.analytics:50070" version="2.6.0"/>
<interface type="write" endpoint="hdfs://node3.com.analytics:8020" version="2.6.0"/>
<interface type="execute" endpoint="node1.com.analytics:8050" version="2.6.0"/>
<interface type="workflow" endpoint="http://node1.com.analytics:11000/oozie/" version="4.1.0"/>
<interface type="messaging" endpoint="tcp://node1.com.analytics:61616?daemon=true" version="5.1.6"/>
</interfaces>
<locations>
<location name="staging" path="/user/falcon/primaryCluster/staging"/>
<location name="working" path="/user/falcon/primaryCluster/working"/>
</locations>
<ACL owner="falcon" group="hadoop"/>
</cluster>
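If the staging and working locations referenced in the cluster definition do not already exist in HDFS with suitable ownership, the cluster submission can fail; a sketch of creating them, using the paths and ACL from the definition above (the world-writable staging permission is an assumption based on my reading of the HDP docs):
# Run with HDFS superuser rights; paths come from the <locations> section above.
sudo -u hdfs hadoop fs -mkdir -p /user/falcon/primaryCluster/staging /user/falcon/primaryCluster/working
sudo -u hdfs hadoop fs -chown -R falcon:hadoop /user/falcon/primaryCluster
sudo -u hdfs hadoop fs -chmod 777 /user/falcon/primaryCluster/staging
sudo -u hdfs hadoop fs -chmod 755 /user/falcon/primaryCluster/working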
Feed Definitions
RawEmailFeed
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<feed name="rawEmailFeed" description="Raw customer email feed" xmlns="uri:falcon:feed:0.1">
<tags>externalSystem=USWestEmailServers,classification=secure</tags>
<groups>churnAnalysisDataPipeline</groups>
<frequency>hours(1)</frequency>
<timezone>UTC</timezone>
<late-arrival cut-off="hours(4)"/>
<clusters>
<cluster name="primaryCluster" type="source">
<validity start="2014-02-28T00:00Z" end="2016-03-31T00:00Z"/>
<retention limit="days(3)" action="delete"/>
</cluster>
</clusters>
<locations>
<location type="data" path="/user/falcon/input/enron/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
<location type="stats" path="/none"/>
<location type="meta" path="/none"/>
</locations>
<ACL owner="falcon" group="users" permission="0755"/>
<schema location="/none" provider="none"/>
</feed>
cleansedEmailFeed
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<feed name="cleansedEmailFeed" description="Cleansed customer emails" xmlns="uri:falcon:feed:0.1">
<tags>owner=USMarketing,classification=Secure,externalSource=USProdEmailServers,externalTarget=BITools</tags>
<groups>churnAnalysisDataPipeline</groups>
<frequency>hours(1)</frequency>
<timezone>UTC</timezone>
<clusters>
<cluster name="primaryCluster" type="source">
<validity start="2014-02-28T00:00Z" end="2016-03-31T00:00Z"/>
<retention limit="days(10)" action="delete"/>
</cluster>
</clusters>
<locations>
<location type="data" path="/user/falcon/processed/enron/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
</locations>
<ACL owner="falcon" group="users" permission="0755"/>
<schema location="/none" provider="none"/>
</feed>
Process Definitions
rawEmailIngestProcess
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<process name="rawEmailIngestProcess" xmlns="uri:falcon:process:0.1">
<tags>pipeline=churnAnalysisDataPipeline,owner=ETLGroup,externalSystem=USWestEmailServers</tags>
<clusters>
<cluster name="primaryCluster">
<validity start="2014-02-28T00:00Z" end="2016-03-31T00:00Z"/>
</cluster>
</clusters>
<parallel>1</parallel>
<order>FIFO</order>
<frequency>hours(1)</frequency>
<timezone>UTC</timezone>
<outputs>
<output name="output" feed="rawEmailFeed" instance="now(0,0)"/>
</outputs>
<workflow name="emailIngestWorkflow" version="2.0.0" engine="oozie" path="/user/falcon/apps/ingest/fs"/>
<retry policy="periodic" delay="minutes(15)" attempts="3"/>
<ACL owner="falcon" group="hadoop"/>
</process>
cleanseEmailProcess
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<process name="cleanseEmailProcess" xmlns="uri:falcon:process:0.1">
<tags>pipeline=churnAnalysisDataPipeline,owner=ETLGroup</tags>
<clusters>
<cluster name="primaryCluster">
<validity start="2014-02-28T00:00Z" end="2016-03-31T00:00Z"/>
</cluster>
</clusters>
<parallel>1</parallel>
<order>FIFO</order>
<frequency>hours(1)</frequency>
<timezone>UTC</timezone>
<inputs>
<input name="input" feed="rawEmailFeed" start="now(0,0)" end="now(0,0)"/>
</inputs>
<outputs>
<output name="output" feed="cleansedEmailFeed" instance="now(0,0)"/>
</outputs>
<workflow name="emailCleanseWorkflow" version="5.0" engine="pig" path="/user/falcon/apps/pig/id.pig"/>
<retry policy="periodic" delay="minutes(15)" attempts="3"/>
<ACL owner="falcon" group="hadoop"/>
</process>
I have not made any changes to the ingest.sh, workflow.xml, and id.pig files. They are present in the HDFS locations /user/falcon/apps/ingest/fs (ingest.sh and workflow.xml) and /user/falcon/apps/pig (id.pig). I was also not sure whether the hidden .DS_Store file was required, so I did not copy it to those HDFS locations.
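To confirm the application files are where the process definitions expect them and that the falcon user can read them, a quick listing of the paths given above:
hadoop fs -ls /user/falcon/apps/ingest/fs    # expect ingest.sh and workflow.xml
hadoop fs -ls /user/falcon/apps/pig          # expect id.pig
hadoop fs -ls -R /user/falcon | head -50     # spot-check ownership and permissions under /user/falcon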
ingest.sh
#!/bin/bash
# curl -sS http://sandbox.hortonworks.com:15000/static/wiki-data.tar.gz | tar xz && hadoop fs -mkdir -p $1 && hadoop fs -put wiki-data/*.txt $1
curl -sS http://bailando.sims.berkeley.edu/enron/enron_with_categories.tar.gz | tar xz && hadoop fs -mkdir -p $1 && hadoop fs -put enron_with_categories/*/*.txt $1
workflow.xml
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<workflow-app xmlns="uri:oozie:workflow:0.4" name="shell-wf">
<start to="shell-node"/>
<action name="shell-node">
<shell xmlns="uri:oozie:shell-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<exec>ingest.sh</exec>
<argument>${feedInstancePaths}</argument>
<file>${wf:appPath()}/ingest.sh#ingest.sh</file>
<!-- <file>/tmp/ingest.sh#ingest.sh</file> -->
<!-- <capture-output/> -->
</shell>
<ok to="end"/>
<error to="fail"/>
</action>
<kill name="fail">
<message>Shell action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end"/>
</workflow-app>
id.pig
A = load '$input' using PigStorage(',');
B = foreach A generate $0 as id;
store B into '$output' USING PigStorage();
I am not exactly sure how the process flow works in the HDP example and would really appreciate it if someone could clear that up.
Specifically, I do not understand where the argument $1 passed to ingest.sh comes from. I believe it is the HDFS location where the incoming data is to be stored, and I noticed that workflow.xml has the tag <argument>${feedInstancePaths}</argument>.
Where does feedInstancePaths get its value from? I guess I am getting the error because the feed is not being stored in the proper location, but it may be a different problem.
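As far as I understand, the resolved value of feedInstancePaths should be visible in the configuration of the Oozie workflow that Falcon launches for rawEmailIngestProcess (for example in the Oozie web UI's job configuration view); a sketch of locating that job from the CLI, using the Oozie endpoint from the cluster definition and a placeholder job id:
# List Falcon-launched coordinator and workflow jobs, then inspect one; <workflow-job-id> is a placeholder.
oozie jobs -oozie http://node1.com.analytics:11000/oozie -jobtype coordinator -len 10
oozie jobs -oozie http://node1.com.analytics:11000/oozie -jobtype wf -len 10
oozie job -oozie http://node1.com.analytics:11000/oozie -info <workflow-job-id>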
The user falcon also has 755 permissions on all HDFS directories under /user/falcon.
Any help and suggestions would be appreciated.
You are running your own cluster, but this tutorial needs the resources referenced in the shell script (ingest.sh):
I guess your cluster is not addressed as sandbox.hortonworks.com, and you also do not have the required resource wiki-data.tar.gz. This tutorial only works with the provided sandbox.
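A quick manual test of what ingest.sh does, run from a cluster node as the falcon user (assuming outbound internet access and a hypothetical /tmp/enron_test target directory in place of $1), can show whether the data source is reachable from your cluster:
# Mirrors the active line of ingest.sh with a fixed target path instead of $1.
curl -sS http://bailando.sims.berkeley.edu/enron/enron_with_categories.tar.gz | tar xz
hadoop fs -mkdir -p /tmp/enron_test
hadoop fs -put enron_with_categories/*/*.txt /tmp/enron_test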