Let's say I have a DataSet that looks like this:
Name | Grade
---------------
Josh | 94
Josh | 87
Amanda | 96
Karen | 78
Amanda | 90
Josh | 88
I would like to create a new DataSet where each name has 3 rows, and any additional rows are sampled from the rows with the same name (so Karen will have three identical rows, for example).
How do I do that without looping through each name?
Data preparation:
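A minimal sketch of the preparation step, rebuilding the sample data from the question. The local `SparkSession`, the app name, and the variable name `df` are assumptions made for illustration; in `spark-shell` the session already exists as `spark`.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for experimenting; reuse `spark` if you already have one.
val spark = SparkSession.builder().master("local[*]").appName("three-rows-per-name").getOrCreate()
import spark.implicits._

// Sample data copied from the question.
val df = Seq(
  ("Josh", 94), ("Josh", 87), ("Amanda", 96),
  ("Karen", 78), ("Amanda", 90), ("Josh", 88)
).toDF("Name", "Grade")

df.show()
```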
Perform the following step only if your data is skewed for a `Name`: add a random number column, and keep only the rows with the top 3 random numbers for each `Name`.
Now, aggregate the values for each `Name` and duplicate them 3 times to ensure we have at least 3 records for each `Name`. Then finally take the first 3 values and `explode`.
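The steps above could be sketched roughly as follows. This is an illustration rather than the original code: the column names `rnd`, `rn` and `grade_list`, the variable names, and the choice of `array_repeat` plus `flatten` to do the duplication are assumptions. The block rebuilds the question's sample data so it runs on its own, and it needs Spark 2.4 or later for the array functions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{array_repeat, collect_list, explode, flatten, rand, row_number, slice}

val spark = SparkSession.builder().master("local[*]").appName("three-rows-per-name").getOrCreate()
import spark.implicits._

val df = Seq(
  ("Josh", 94), ("Josh", 87), ("Amanda", 96),
  ("Karen", 78), ("Amanda", 90), ("Josh", 88)
).toDF("Name", "Grade")

// Optional step for skewed names: keep at most 3 randomly chosen rows per Name.
val byName = Window.partitionBy($"Name").orderBy($"rnd".desc)
val top3 = df
  .withColumn("rnd", rand())
  .withColumn("rn", row_number().over(byName))
  .filter($"rn" <= 3)
  .drop("rnd", "rn")

val result = top3
  .groupBy($"Name")
  .agg(collect_list($"Grade").as("grade_list"))
  // Repeat the whole list 3 times so that even a single grade yields 3 entries.
  .withColumn("grade_list", flatten(array_repeat($"grade_list", 3)))
  // Keep only the first 3 entries (slice is 1-based).
  .withColumn("grade_list", slice($"grade_list", 1, 3))
  // Turn each array entry back into its own row.
  .select($"Name", explode($"grade_list").as("Grade"))

result.show()
```

With this sketch, the extra rows for a short list are repeats of its first elements rather than independent random draws. If random picks matter, one option is to wrap the collected list in `shuffle` before repeating it.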
Notes:
Filtering the top 3 rows first caps the size of `grade_list`, hence duplicating it 3 times won't harm.
If you skip the `Window` step, you can use a combination of `when(size($"grade_list") === n, ).otherwise()` to avoid unnecessary duplication.
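One way the `when(...).otherwise(...)` idea might look without the `Window` step is sketched below. The contents of both branches are assumptions, since the note leaves them open: here a list that already has `n` entries is kept as is, and any other list is repeated `n` times and then trimmed. The names `n`, `grouped` and `result` are likewise illustrative, and the sample data is rebuilt so the block runs on its own.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array_repeat, collect_list, explode, flatten, size, slice, when}

val spark = SparkSession.builder().master("local[*]").appName("three-rows-per-name").getOrCreate()
import spark.implicits._

val df = Seq(
  ("Josh", 94), ("Josh", 87), ("Amanda", 96),
  ("Karen", 78), ("Amanda", 90), ("Josh", 88)
).toDF("Name", "Grade")

val n = 3 // target number of rows per Name

val grouped = df.groupBy($"Name").agg(collect_list($"Grade").as("grade_list"))

val result = grouped
  .withColumn(
    "grade_list",
    // Assumed branches: skip duplication when the list already has exactly n entries.
    when(size($"grade_list") === n, $"grade_list")
      .otherwise(flatten(array_repeat($"grade_list", n)))
  )
  // Trim to n entries; this also handles names that started with more than n rows.
  .withColumn("grade_list", slice($"grade_list", 1, n))
  .select($"Name", explode($"grade_list").as("Grade"))

result.show()
```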