JCascalog to query Thrift data on HDFS

I read Nathan Marz's book on the Lambda Architecture, and I'm currently building a proof of concept of it.

I'm having difficulty building my JCascalog query.

This is the relevant part of my Thrift schema:

union ArticlePropertyValue {
  1: decimal quantity,
  2: string name;
}

union ArticleID {
  1: int id;
}

struct ArticleProperty {
   1: required ArticleID id;
   2: required ArticlePropertyValue property;
}

union DataUnit {
  1: TicketProperty ticket_property;
  2: ArticleProperty article_property;
}

I stored some data with Pail in the folder /home/tickets

Now I want to query this data: I want the sum of the quantities, grouped by article name. So I first need to get the names, and then the quantities. Both requests also give me the ID of the article.

For example, the name request (id_article, name) would return: (1, pasta) - (2, pasta2) - (3, pasta)

The quantity request (id_article, quantity) would return: (1, 2) - (2, 1) - (3, 1)
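To make the expected result concrete, here is a minimal plain-Java sketch of the computation on the sample rows above: join the two requests on the article ID, then sum the quantities per name. It does not use JCascalog or HDFS at all. The class name ArticleQuantityJoin and the helper sumQuantityByName are hypothetical, and quantities are assumed to be integers for simplicity.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class ArticleQuantityJoin {

    // Hypothetical helper (not part of JCascalog): inner-joins (id, name)
    // with (id, quantity) rows on the article ID and sums quantities per name.
    public static Map<String, Integer> sumQuantityByName(
            Map<Integer, String> nameById, int[][] idQuantityRows) {
        // TreeMap keeps the names sorted so the printed output is stable.
        Map<String, Integer> sumByName = new TreeMap<>();
        for (int[] row : idQuantityRows) {
            String name = nameById.get(row[0]);
            if (name == null) {
                continue; // inner join: skip quantities with no matching name
            }
            sumByName.merge(name, row[1], Integer::sum);
        }
        return sumByName;
    }

    public static void main(String[] args) {
        // Sample rows from the name request: (id_article, name)
        Map<Integer, String> nameById = new HashMap<>();
        nameById.put(1, "pasta");
        nameById.put(2, "pasta2");
        nameById.put(3, "pasta");

        // Sample rows from the quantity request: (id_article, quantity)
        int[][] idQuantityRows = { {1, 2}, {2, 1}, {3, 1} };

        // prints {pasta=3, pasta2=1}
        System.out.println(sumQuantityByName(nameById, idQuantityRows));
    }
}
```

So for the sample data, the query should produce (pasta, 3) and (pasta2, 1).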

  Tap source = splitDataTap("/home/florian/Workspace/tickets");
  Api.execute(
          new StdoutTap(),
          new Subquery("?name", "?sum")
            .predicate(source, "_", "?data")
            .predicate(new ExtractArticleName(), "?data")
                .out("?id", "?name")
            .predicate(new ExtractArticleQuantity(), "?data")
                .out("?id", "?quantity")
            .predicate(new Sum(), "?quantity")
                .out("?sum")
          );

The problem is that I don't know how to merge the results. How can I perform a join with Cascalog on data stored in HDFS?

There is 1 answer

Sahil

I guess you want to store the result of this query in HDFS. Say the result is to be saved in the "/data" folder, in simple text format; then you need to do this:

Subquery subquery = new Subquery("?name", "?sum")
            .predicate(source, "_", "?data")
            .predicate(new ExtractArticleName(), "?data")
                .out("?id", "?name")
            .predicate(new ExtractArticleQuantity(), "?data")
                .out("?id", "?quantity")
            .predicate(new Sum(), "?quantity")
                .out("?sum");

Api.execute(Api.hfsTextline("/data"), subquery);