There is some information on the web indicating that Mahout's XMLInputFormat can be used to process XML efficiently on Hadoop, but I haven't been able to find an example of how to get it working. Can someone point me in the right direction?
I'm using Cascalog/Clojure.
Have a look at this post, which reads an XML file using a custom Hadoop RecordReader implementation:
http://javatute.com/javatute/faces/post/hadoop/2014/reading-simple-xml-file-using-hadoop.xhtml
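In case the link goes stale, here is a rough sketch of the idea such a record reader is built around: scan the byte stream for a start tag, then copy everything up to and including the matching end tag, and emit that as one record. This is plain Java with no Hadoop or Mahout dependency, so it is not the actual Mahout class and not the code from the linked post. The names `XmlRecordScanner`, `readUntilMatch` and `split` are made up for illustration, and the matcher assumes the first byte of each tag (normally `<`) does not occur again inside that tag. As far as I know, Mahout's input format takes the two tags from the job configuration keys `xmlinput.start` and `xmlinput.end`, but please verify that against the Mahout version you use.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper illustrating tag-based XML record splitting.
final class XmlRecordScanner {

    // Reads bytes until `match` has been seen in full. If `out` is non-null,
    // every byte read is also copied into it. Returns false on end of stream.
    // Assumption: match[0] does not occur again later in `match`.
    static boolean readUntilMatch(InputStream in, byte[] match, ByteArrayOutputStream out)
            throws IOException {
        int i = 0;
        while (true) {
            int b = in.read();
            if (b == -1) {
                return false;
            }
            if (out != null) {
                out.write(b);
            }
            if (b == (match[i] & 0xFF)) {
                i++;
                if (i == match.length) {
                    return true;
                }
            } else {
                // Restart, allowing the current byte to begin a new match.
                i = (b == (match[0] & 0xFF)) ? 1 : 0;
            }
        }
    }

    // Returns every span that begins with startTag and ends with endTag.
    static List<String> split(String xml, String startTag, String endTag) {
        byte[] start = startTag.getBytes(StandardCharsets.UTF_8);
        byte[] end = endTag.getBytes(StandardCharsets.UTF_8);
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        List<String> records = new ArrayList<>();
        try {
            // Skip ahead to the next start tag, discarding what precedes it.
            while (readUntilMatch(in, start, null)) {
                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                buf.write(start, 0, start.length);
                // Copy the record body through the end tag; drop a truncated record.
                if (!readUntilMatch(in, end, buf)) {
                    break;
                }
                records.add(new String(buf.toByteArray(), StandardCharsets.UTF_8));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }

    public static void main(String[] args) {
        String xml = "<docs><doc id=\"1\">a</doc><doc id=\"2\">b</doc></docs>";
        // The trailing space in "<doc " keeps the scanner from matching "<docs>".
        for (String record : split(xml, "<doc ", "</doc>")) {
            System.out.println(record);
        }
    }
}
```

In a real job, each record found this way would become one value handed to the mapper, and the reader would also have to respect the boundaries of its input split. From Cascalog you would still need to wrap the input format in a tap, which this sketch does not cover.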