I have a Google Dataflow pipeline built using Apache Beam. The application receives about 50M records every day, and to ignore duplicate records we are planning to use the Deduplication function provided by the Beam framework.
The documentation states neither the maximum input count for which the Deduplication function works nor the maximum duration for which it can persist the data.
Would it be good design to simply throw 50M records a day at the deduplication function, of which around half would be duplicates, and keep the persistence duration at 7 days?
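For context, here is a minimal sketch of the planned usage with the Beam Java SDK (the bounded `Create.of` source, the `String` element type, and the class name are placeholders for illustration; the real input is a streaming source):

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.Deduplicate;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class DedupPlan {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Stand-in for the real streaming source (~50M records/day).
    PCollection<String> records = pipeline.apply(Create.of("r1", "r2", "r1"));

    // Drop any element already seen within the last 7 days.
    // Deduplicate keeps per-key-and-window state; the default
    // time domain is processing time.
    PCollection<String> deduped =
        records.apply(
            "Deduplicate",
            Deduplicate.<String>values().withDuration(Duration.standardDays(7)));

    pipeline.run().waitUntilFinish();
  }
}
```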
The deduplication function, as described in the link you provided, performs deduplication per window.
If you have a 1-hour window and a duplicate arrives 3 hours after the original, the function won't deduplicate them, because the two records fall into different windows.
So you can define a window of 1 day, or more; there is no documented limit. The deduplication state is persisted on the workers (for durability) and also kept in memory (for efficiency), so the more data you retain, the stronger and bigger the worker configuration must be to manage that quantity of data.
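As a sketch of that approach in the Java SDK (assuming a `records` PCollection&lt;String&gt; already exists in your pipeline), you could widen the window to a full day before deduplicating, so duplicates arriving hours apart still share the same window and therefore the same deduplication state:

```java
import org.apache.beam.sdk.transforms.Deduplicate;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

// `records` is a hypothetical streaming PCollection<String>.
PCollection<String> deduped =
    records
        // Group elements into 1-day fixed windows.
        .apply("DailyWindows",
            Window.<String>into(FixedWindows.of(Duration.standardDays(1))))
        // Deduplicate within each window, keeping state for up to a day.
        .apply("DedupWithinWindow",
            Deduplicate.<String>values().withDuration(Duration.standardDays(1)));
```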