I read Nathan Marz's book on the Lambda Architecture, and I'm currently building a proof of concept of this solution.
I'm having trouble building my JCascalog query.
Here is the part of my Thrift schema that is relevant:
union ArticlePropertyValue {
  1: double quantity;
  2: string name;
}
union ArticleID {
  1: i32 id;
}
struct ArticleProperty {
  1: required ArticleID id;
  2: required ArticlePropertyValue property;
}
union DataUnit {
  1: TicketProperty ticket_property;
  2: ArticleProperty article_property;
}
I stored some data with Pail in the folder /home/tickets.
Now I want to run a query on this data: I want to get the sum of the quantities grouped by article name. So first I need to get the names, and then the quantities. For each one I can get the article ID.
For example, the name query would return (id_article, name) pairs: (1, pasta) - (2, pasta2) - (3, pasta)
The quantity query would return (id_article, quantity) pairs: (1, 2) - (2, 1) - (3, 1)
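ExtractArticleName and ExtractArticleQuantity are custom CascalogFunctions. Roughly, ExtractArticleName looks like the simplified sketch below (the exact Thrift-generated accessor names such as getArticle_property() are approximate), and ExtractArticleQuantity is the same except that it emits the quantity field instead of the name:

import cascading.flow.FlowProcess;
import cascading.operation.FunctionCall;
import cascading.tuple.Tuple;
import cascalog.CascalogFunction;

public class ExtractArticleName extends CascalogFunction {
    public void operate(FlowProcess process, FunctionCall call) {
        DataUnit data = (DataUnit) call.getArguments().getObject(0);
        // Only emit a tuple when the DataUnit holds an article property
        // whose value is a name.
        if (data.getSetField() == DataUnit._Fields.ARTICLE_PROPERTY) {
            ArticleProperty prop = data.getArticle_property();
            if (prop.getProperty().getSetField() == ArticlePropertyValue._Fields.NAME) {
                call.getOutputCollector().add(
                    new Tuple(prop.getId().getId(), prop.getProperty().getName()));
            }
        }
    }
}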
Tap source = splitDataTap("/home/florian/Workspace/tickets");
Api.execute(
new StdoutTap(),
new Subquery("?name", "?sum")
.predicate(source, "_", "?data")
.predicate(new ExtractArticleName(), "?data")
.out("?id", "?name")
.predicate(new ExtractArticleQuantity(), "?data")
.out("?id", "?quantity")
.predicate(new Sum(), "?quantity")
.out("?sum")
);
The problem is that I don't know how to merge the two results. How can I perform a join with Cascalog on data stored in HDFS?
I guess you want to store the result of this query in HDFS, in which case you need to do the following.
Say the data is to be saved in the "/data" folder in simple text format; then you need to do this:
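Here is a minimal sketch of that idea, assuming JCascalog's Api.hfsTextline helper is available (if it is not, a Cascading Hfs tap with a TextLine scheme does the same job). Only the sink changes compared with the query in the question:

import jcascalog.Api;
import jcascalog.Subquery;
import jcascalog.op.Sum;

Api.execute(
    Api.hfsTextline("/data"),   // sink: plain text files under /data on HDFS
    new Subquery("?name", "?sum")
        .predicate(source, "_", "?data")
        .predicate(new ExtractArticleName(), "?data").out("?id", "?name")
        .predicate(new ExtractArticleQuantity(), "?data").out("?id", "?quantity")
        .predicate(new Sum(), "?quantity").out("?sum")
);

Cascading will then write the output as part-* text files under /data, one line per (?name, ?sum) pair.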