Cascalog Hadoop version support

I notice that the Cascalog getting started guide specifies a version of Hadoop:

:profiles { :dev {:dependencies [[org.apache.hadoop/hadoop-core "1.0.3"]]}}

If my group uses a different version of Hadoop, am I out of luck? More broadly, which Hadoop versions does Cascalog interoperate with?

Daniel Canas

The simple answer: as of Aug 10 2014, Cascalog is at version 2.1.1 and by default uses Cascading 2.5.3 and Hadoop 1.2.1. So yes, if your team is not using Hadoop 1.x, you're out of luck for now.

However, Cascalog could be ported to Hadoop 2.x. Cascading 2.5.x supports Hadoop 2. From the Cascading docs, section "Hadoop 1 vs Hadoop 2":

Cascading 2.5 supports both Hadoop 1.x and 2.x by providing two Java dependencies, cascading-hadoop.jar and cascading-hadoop2-mr1.jar. These dependencies can be interchanged but the hadoop2-mr1.jar introduces new and deprecates older API calls where appropriate. It should be pointed out hadoop2-mr1.jar only supports MapReduce 1 API conventions. With this naming scheme new API conventions can be introduced without risk of naming collisions on dependencies.

The following is a naive guide to updating Cascalog for Hadoop 2.x:

  • Update the cascading-hadoop jar in the project file
  • Update the Hadoop version in the HADOOP-VERSION config file
  • Find all uses of deprecated Cascading API calls and update them to the new conventions
  • Compile and fix warnings/errors
  • Recur (repeat the previous two steps until the build is clean)
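
As a rough illustration of the first step, here is what the dependency swap might look like in a Leiningen project file. This is an untested sketch, not a known-working configuration. The project name, the conjars repository URL, the Clojure version, the Hadoop 2.4.1 version, and the choice of Hadoop 2 artifacts are all assumptions. It also assumes cascalog-core pulls in cascading/cascading-hadoop transitively, and it shows the swap in a consuming project rather than in Cascalog's own build, which is where the steps above would actually be applied.

```clojure
;; project.clj (hypothetical sketch; all versions here are assumptions)
(defproject my-group/cascalog-hadoop2-experiment "0.1.0-SNAPSHOT"
  ;; Cascading artifacts are published to conjars (URL assumed).
  :repositories {"conjars" "http://conjars.org/repo"}
  :dependencies [[org.clojure/clojure "1.5.1"]
                 ;; Exclude the Hadoop 1 flavour of Cascading that
                 ;; Cascalog is assumed to bring in transitively...
                 [cascalog/cascalog-core "2.1.1"
                  :exclusions [cascading/cascading-hadoop]]
                 ;; ...and substitute the Hadoop 2 (MR1 API) flavour
                 ;; named in the Cascading docs quoted above.
                 [cascading/cascading-hadoop2-mr1 "2.5.3"]]
  ;; Hadoop itself stays in the :dev profile, as in the getting
  ;; started guide, since the cluster provides it at runtime.
  :profiles {:dev {:dependencies
                   [[org.apache.hadoop/hadoop-common "2.4.1"]
                    [org.apache.hadoop/hadoop-mapreduce-client-core "2.4.1"]]}})
```

Because Cascalog 2.1.1 is compiled against the Hadoop 1 jar, a swap like this may still fail at compile time or runtime wherever deprecated Cascading calls are used, which is what the remaining steps are meant to address.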

I'm no expert in the Cascalog source, but uses of the Cascading API can be found with a few lines of grep, and upgrading the API calls seems pretty straightforward, if a little tedious.