Let's say I have a DataSet that looks like this:
Name | Grade
---------------
Josh | 94
Josh | 87
Amanda | 96
Karen | 78
Amanda | 90
Josh | 88
I would like to create a new DataSet in which each name has 3 rows; the additional rows (if any) are sampled from the existing rows of the same name (so Karen will have three identical rows, for example).
How do I do that without looping through each name?
Data preparation:
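A minimal sketch of the data preparation, assuming Spark with the Scala API in a `spark-shell` or notebook session. The app name and the variable name `df` are illustrative choices; the rows are the ones from the question.

```scala
import org.apache.spark.sql.SparkSession

// In spark-shell a session already exists; getOrCreate() simply reuses it.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("three-rows-per-name") // hypothetical app name
  .getOrCreate()
import spark.implicits._

// Sample data copied from the question.
val df = Seq(
  ("Josh", 94),
  ("Josh", 87),
  ("Amanda", 96),
  ("Karen", 78),
  ("Amanda", 90),
  ("Josh", 88)
).toDF("Name", "Grade")

df.show()
```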
Perform the following only if your data is skewed for a `Name`: add a random number, and keep the top 3 random numbers for each `Name`.

Now aggregate the values for each `Name` and duplicate the list 3 times to ensure we have at least 3 records for each `Name`. Finally, take the first 3 values and `explode`.

Notes:
- After the `Window` step there are at most 3 values in `grade_list`, hence duplicating it 3 times won't harm.
- If you skip the `Window` step, you can use a combination of `when( size($"grade_list") === n, ).otherwise()` to avoid unnecessary duplication.
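The steps described above could be sketched roughly as follows. This is one possible implementation, not necessarily the original code: it assumes Spark 2.4 or later (for `array_repeat`, `flatten` and `slice`), and the names `random`, `rn`, `capped`, `result` and `resultNoWindow` are illustrative. The `when(...).otherwise(...)` branch is only one way to fill in the condition mentioned in the notes.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(
  ("Josh", 94), ("Josh", 87), ("Amanda", 96),
  ("Karen", 78), ("Amanda", 90), ("Josh", 88)
).toDF("Name", "Grade")

// Optional step for skewed names: attach a random number to every row and
// keep at most 3 rows per Name, ranked by that random number.
val byName = Window.partitionBy($"Name").orderBy($"random")
val capped = df
  .withColumn("random", rand())
  .withColumn("rn", row_number().over(byName))
  .filter($"rn" <= 3)
  .drop("random", "rn")

// Collect the grades per Name, repeat the list 3 times so that it holds
// at least 3 elements, keep the first 3, and explode back into rows.
val result = capped
  .groupBy($"Name")
  .agg(collect_list($"Grade").as("grade_list"))
  .withColumn("grade_list", slice(flatten(array_repeat($"grade_list", 3)), 1, 3))
  .select($"Name", explode($"grade_list").as("Grade"))

// Variant without the Window step: repeat the list only when it is too short.
// Names with more than 3 grades keep the first 3 collected, not a random 3.
val resultNoWindow = df
  .groupBy($"Name")
  .agg(collect_list($"Grade").as("grade_list"))
  .withColumn(
    "grade_list",
    when(size($"grade_list") >= 3, $"grade_list")
      .otherwise(flatten(array_repeat($"grade_list", 3))))
  .select($"Name", explode(slice($"grade_list", 1, 3)).as("Grade"))

result.orderBy($"Name").show()
```

With the sample data, every name ends up with exactly three rows: Karen's single grade appears three times, and one of Amanda's two grades is repeated. The row order inside `grade_list` is not guaranteed, so which of Amanda's grades is repeated may differ between runs.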