Apache Beam - what are the limits of the Deduplication function?


I have a Google Dataflow pipeline built using Apache Beam. The application receives about 50M records every day, and to ignore duplicate records we are planning to use the Deduplication function provided by the Beam framework.

The documentation states neither the maximum input count for which the Deduplication function works nor the maximum duration for which it can persist the data.

Would it be good design to simply throw 50M records at the deduplication function, of which around half would be duplicates, and keep the persistence duration at 7 days?
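For reference, a minimal sketch (Python SDK) of the usage we have in mind; the Pub/Sub topic name and deduplicating on the raw payload are placeholders for illustration, not our actual setup:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.deduplicate import Deduplicate
from apache_beam.utils.timestamp import Duration

# Sketch only: the Pub/Sub topic stands in for our real source; in
# production the pipeline ingests ~50M records/day.
options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    deduped = (
        p
        | 'Read' >> beam.io.ReadFromPubSub(
            topic='projects/my-project/topics/records')
        # Drop any payload already seen within the last 7 days of
        # processing time (the persistence duration we are considering).
        | 'Dedup7Days' >> Deduplicate(
            processing_time_duration=Duration(seconds=7 * 24 * 60 * 60)))
```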


1 Answer

guillaume blaquiere answered:

The deduplication function, as described in the link you provided, performs deduplication per window.

If you have a window of 1 hour and your duplicates arrive every 3 hours, the function doesn't deduplicate them, because they fall in different windows.

So you can define a window of 1 day or more; there is no hard limit. The data is stored on the workers (for persistence) and also kept in memory (for efficiency), so the more data you have, the bigger and stronger the worker configuration must be to handle that quantity of data.
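For example, a rough sketch (Python SDK, with illustrative timestamps and record ids) of daily fixed windows followed by per-key deduplication: duplicates inside the same day's window are dropped, while the same id landing in the next day's window is kept.

```python
import apache_beam as beam
from apache_beam.transforms.deduplicate import DeduplicatePerKey
from apache_beam.transforms.window import FixedWindows, TimestampedValue
from apache_beam.utils.timestamp import Duration

# Illustrative only: three events, two sharing the same id within the
# same day, one arriving on the next day.
events = [
    ({'id': 'x', 'v': 1}, 0),      # day 1
    ({'id': 'x', 'v': 2}, 3600),   # day 1, duplicate id -> dropped
    ({'id': 'x', 'v': 3}, 90000),  # day 2, different window -> kept
]

with beam.Pipeline() as p:
    _ = (
        p
        | beam.Create(events)
        | 'Timestamp' >> beam.MapTuple(lambda rec, ts: TimestampedValue(rec, ts))
        | 'DailyWindows' >> beam.WindowInto(FixedWindows(24 * 60 * 60))
        | 'KeyById' >> beam.Map(lambda rec: (rec['id'], rec))
        # State is kept per key and per window, so deduplication only
        # applies within a single daily window.
        | 'DedupPerWindow' >> DeduplicatePerKey(
            event_time_duration=Duration(seconds=24 * 60 * 60))
        | 'Values' >> beam.Values()
        | beam.Map(print))
```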