I know my question may sound too naive for this community, but I can't figure it out myself and need some insight from more experienced people. The point I am trying to address is how to use the xgboost package to impute missing values in a data.frame.
In other words, my task is to produce a script that uses the xgboost library to impute missing values (i.e. NAs) in a data set. I think I know how to make xgboost work for prediction on complete data sets, but I don't quite get what the right way is to set it up to fill in NAs.
I would greatly appreciate any help/insights on the matter. Primarily, it would be great to know what the community thinks about the logic/algorithm, which I would then be able to code.
Below is an extract from a 300-row file that I have as input for xgboost. My task is to impute the NAs in Var_1 using information from the other 4 variables (Year, Month, Country, Var_2); Var_1 depends on the other variables in the data set. A rough sketch of the approach I currently have in mind follows the data extract.
Year Month Country Var_1 Var_2
<dbl> <dbl> <dbl> <dbl> <dbl>
2021 1 36 599 16022.
2021 1 56 99224 612030.
2021 1 124 1159 33535.
2021 1 156 28 16119.
2021 1 208 215 68027.
2021 1 251 84898 1479103.
2021 1 276 142545 634540.
2021 1 344 14 1397.
2021 1 372 1893654 6993299.
2021 1 380 176 2483.
2021 1 392 58 22968.
2021 1 484 61028 74968.
2021 1 528 7742 125132.
2021 1 616 313 4877.
2021 1 703 55 510.
2021 1 724 364573 2829297.
2021 1 826 NA 1622357.
2021 1 842 575020 3536907.
2021 2 36 45 3604.
2021 2 56 46395 278347.
2021 2 124 11760 90933.
2021 2 208 29768 393233.
2021 2 251 31397 177593.
2021 2 276 4316 23745.
2021 2 320 3926 25794.
2021 2 344 13859 93853.
2021 2 372 1233218 4527629.
2021 2 380 209 5354.
2021 2 392 NA 230771.
2021 2 528 16891 164109.
2021 2 591 2 3443.
2021 2 699 5505 91394.
2021 2 703 144 995.
2021 2 710 6464 5112.
2021 2 724 122337 1037545.
2021 2 752 433 4598.
2021 2 757 23 1672.
2021 2 784 671 21453.
2021 2 826 NA 703348.
2021 2 842 906031. 5314218.
2021 3 56 655375 4037557.
2021 3 124 25789 200786.
2021 3 156 65 60928.
2021 3 170 5 1391.
2021 3 208 21383 489739.
2021 3 251 105713 416264.
2021 3 276 63063 438859.
2021 3 344 9 18422.
2021 3 372 1192275 5794848.
2021 3 376 972 22502.
2021 3 392 11910 327853.
2021 3 528 11472 59048.
2021 3 554 23 1473.
2021 3 702 9058 86097.
2021 3 703 14 309.
2021 3 710 2369 38682.
2021 3 724 67483 534919.
2021 3 752 4222 31359.
2021 3 757 145 13019.
2021 3 764 13133 72650.
2021 3 826 NA 1114520.
2021 3 842 1029945 6163697.
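For what it's worth, here is the logic I am considering, sketched in R (untested; the column names match my data, but the data frame name df and the hyperparameters such as nrounds and the objective are just placeholder assumptions): train an xgboost regressor on the rows where Var_1 is observed, then predict Var_1 for the rows where it is NA.

library(xgboost)

impute_var1 <- function(df) {
  predictors <- c("Year", "Month", "Country", "Var_2")
  obs  <- df[!is.na(df$Var_1), ]   # rows with Var_1 observed -> training set
  miss <- df[ is.na(df$Var_1), ]   # rows with Var_1 missing  -> to be imputed

  dtrain <- xgb.DMatrix(data = as.matrix(obs[, predictors]), label = obs$Var_1)
  dmiss  <- xgb.DMatrix(data = as.matrix(miss[, predictors]))

  # placeholder settings; I have not tuned anything yet
  fit <- xgb.train(params  = list(objective = "reg:squarederror"),
                   data    = dtrain,
                   nrounds = 100,
                   verbose = 0)

  # fill the NAs in Var_1 with the model's predictions
  df$Var_1[is.na(df$Var_1)] <- predict(fit, dmiss)
  df
}

Is something along these lines the right way to set it up, or is there a more principled approach?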
I googled and read a lot of papers claiming that xgboost is a great tool for imputing missing values, but I wasn't able to find any examples of code someone actually used to perform the imputation. I read the manual at link, which doesn't say a word about using the algorithm for imputation. Finally, I came across the mixgb package, which turned out to be a great tool for this procedure.
I tried and used the library in my research along with mice, missForest, Amelia, tidymodels, and some other packages.
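For completeness, this is roughly how I ended up calling mixgb (the exact arguments are from memory, so treat them as assumptions); it returns a list of m multiply-imputed copies of the data set:

library(mixgb)

imputed_sets <- mixgb(data = df, m = 5)   # assumed call: 5 imputed data sets
head(imputed_sets[[1]])                   # first completed data set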