Is there a performance issue when merging views in a lambda architecture with Spark?


I did some study of the lambda architecture as implemented with Spark, and from the article below I learned that the way to merge the batch and real-time views is "realTimeView.unionAll(batchView).groupBy......". However, when the data behind the batchView is very large, wouldn't this approach have a performance issue?

For example, if the batchView contains 100,000,000 rows, then Spark has to groupBy 100,000,000 rows every time a client requests the merged view, which is obviously very slow.
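For what it's worth, the cost concern usually goes away if the batch view is served pre-aggregated (one row per hash tag) rather than as raw events: the merge then combines only |distinct tags| + |real-time rows| entries, not 100,000,000 input rows. A minimal plain-Java sketch of that key-wise merge (names like `MergeViews` are my own, not from the article; the article itself does this with Spark DataFrames):

```java
import java.util.HashMap;
import java.util.Map;

public class MergeViews {
    // Merge two already-aggregated per-tag count views by summing
    // counts per key. Cost is proportional to the number of distinct
    // tags plus the size of the real-time view, not the raw row count
    // that produced the batch view.
    public static Map<String, Long> merge(Map<String, Long> batchView,
                                          Map<String, Long> realTimeView) {
        Map<String, Long> merged = new HashMap<>(batchView);
        realTimeView.forEach((tag, count) -> merged.merge(tag, count, Long::sum));
        return merged;
    }
}
```

So the open question is really whether the serving layer exposes the batch view in this pre-aggregated form, or whether each request re-groups the raw rows.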

https://dzone.com/articles/lambda-architecture-with-apache-spark

DataFrame realTimeView = streamingService.getRealTimeView();
DataFrame batchView = servingService.getBatchView();
DataFrame mergedView = realTimeView.unionAll(batchView)
                                   .groupBy(realTimeView.col(HASH_TAG.getValue()))
                                   .sum(COUNT.getValue())
                                   .orderBy(HASH_TAG.getValue());
List<Row> merged = mergedView.collectAsList();
return merged.stream()
             .map(row -> new HashTagCount(row.getString(0), row.getLong(1)))
             .collect(Collectors.toList());

There are 0 answers