Is there a performance issue when merging views in a lambda architecture with Spark?


I did some study of the lambda architecture as implemented with Spark, and from the article below I learned that the way to merge the batch and real-time views is "realTimeView.unionAll(batchView).groupBy......". However, when the data behind the batchView is very large, wouldn't this approach have a performance issue?

For example, if the batchView contains 100,000,000 rows, then Spark has to groupBy 100,000,000 rows every time a client requests the merged view, which is obviously very slow.
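For what it's worth, the cost concern usually goes away if the batch view is served pre-aggregated (one row per hash tag) rather than as raw events: the merge then combines only |distinct tags| + |real-time rows| entries, not 100,000,000 input rows. A minimal plain-Java sketch of that key-wise merge (names like `MergeViews` are my own, not from the article; the article itself does this with Spark DataFrames):

```java
import java.util.HashMap;
import java.util.Map;

public class MergeViews {
    // Merge two already-aggregated per-tag count views by summing
    // counts per key. Cost is proportional to the number of distinct
    // tags plus the size of the real-time view, not the raw row count
    // that produced the batch view.
    public static Map<String, Long> merge(Map<String, Long> batchView,
                                          Map<String, Long> realTimeView) {
        Map<String, Long> merged = new HashMap<>(batchView);
        realTimeView.forEach((tag, count) -> merged.merge(tag, count, Long::sum));
        return merged;
    }
}
```

So the open question is really whether the serving layer exposes the batch view in this pre-aggregated form, or whether each request re-groups the raw rows.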

https://dzone.com/articles/lambda-architecture-with-apache-spark

DataFrame realTimeView = streamingService.getRealTimeView();
DataFrame batchView = servingService.getBatchView();
DataFrame mergedView = realTimeView.unionAll(batchView)
                                   .groupBy(realTimeView.col(HASH_TAG.getValue()))
                                   .sum(COUNT.getValue())
                                   .orderBy(HASH_TAG.getValue());
List<Row> merged = mergedView.collectAsList();
return merged.stream()
             .map(row -> new HashTagCount(row.getString(0), row.getLong(1)))
             .collect(Collectors.toList());

There are 0 answers