How to use xgboost for imputation?


My question may sound naive for this community, but I can't figure it out myself and need some insight from more experienced people. I am trying to work out how to use the xgboost package to impute missing values in a data.frame.

In other words, my task is to write a script that uses the xgboost library to impute missing values (NAs) in a data set. I think I know how to make xgboost predict from complete data sets, but I don't quite understand the right way to set it up to fill in NAs.

I would greatly appreciate any help or insight on the matter. In particular, it would be great to know what the community thinks about the logic/algorithm, which I would then be able to code.

Below is an extract from a 300-row file that I use as input for xgboost. My task is to impute the NAs in the Var_1 variable using information from the other 4 variables (Year, Month, Country, Var_2). Var_1 depends on the other variables in the data set.

  Year Month Country   Var_1      Var_2
 <dbl> <dbl>   <dbl>   <dbl>      <dbl>
  2021     1      36     599     16022.
  2021     1      56   99224    612030.
  2021     1     124    1159     33535.
  2021     1     156      28     16119.
  2021     1     208     215     68027.
  2021     1     251   84898   1479103.
  2021     1     276  142545    634540.
  2021     1     344      14      1397.
  2021     1     372 1893654   6993299.
  2021     1     380     176      2483.
  2021     1     392      58     22968.
  2021     1     484   61028     74968.
  2021     1     528    7742    125132.
  2021     1     616     313      4877.
  2021     1     703      55       510.
  2021     1     724  364573   2829297.
  2021     1     826      NA   1622357.
  2021     1     842  575020   3536907.
  2021     2      36      45      3604.
  2021     2      56   46395    278347.
  2021     2     124   11760     90933.
  2021     2     208   29768    393233.
  2021     2     251   31397    177593.
  2021     2     276    4316     23745.
  2021     2     320    3926     25794.
  2021     2     344   13859     93853.
  2021     2     372 1233218   4527629.
  2021     2     380     209      5354.
  2021     2     392      NA    230771.
  2021     2     528   16891    164109.
  2021     2     591       2      3443.
  2021     2     699    5505     91394.
  2021     2     703     144       995.
  2021     2     710    6464      5112.
  2021     2     724  122337   1037545.
  2021     2     752     433      4598.
  2021     2     757      23      1672.
  2021     2     784     671     21453.
  2021     2     826      NA    703348.
  2021     2     842  906031.  5314218.
  2021     3      56  655375   4037557.
  2021     3     124   25789    200786.
  2021     3     156      65     60928.
  2021     3     170       5      1391.
  2021     3     208   21383    489739.
  2021     3     251  105713    416264.
  2021     3     276   63063    438859.
  2021     3     344       9     18422.
  2021     3     372 1192275   5794848.
  2021     3     376     972     22502.
  2021     3     392   11910    327853.
  2021     3     528   11472     59048.
  2021     3     554      23      1473.
  2021     3     702    9058     86097.
  2021     3     703      14       309.
  2021     3     710    2369     38682.
  2021     3     724   67483    534919.
  2021     3     752    4222     31359.
  2021     3     757     145     13019.
  2021     3     764   13133     72650.
  2021     3     826      NA   1114520.
  2021     3     842 1029945   6163697.
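For data shaped like the extract above, one common way to frame the problem is as plain regression: train xgboost on the rows where Var_1 is observed, then predict Var_1 for the rows where it is NA. The following is only a minimal sketch under stated assumptions. The toy data frame is made up to mimic the column layout, the hyperparameters (`eta`, `max_depth`, `nrounds`) are arbitrary placeholders, and `Var_1_imputed` is a hypothetical column name. It uses `xgb.DMatrix`, `xgb.train`, and `predict` from the xgboost R package.

```r
library(xgboost)

# Toy data in the same shape as the extract (values are invented for illustration).
set.seed(1)
df <- data.frame(
  Year    = 2021,
  Month   = rep(1:3, each = 20),
  Country = rep(seq(10, 200, by = 10), times = 3),
  Var_2   = round(runif(60, 500, 5e6))
)
df$Var_1 <- round(0.2 * df$Var_2 * runif(60, 0.5, 1.5))
df$Var_1[c(5, 25, 45)] <- NA   # pretend these cells are missing

predictors <- c("Year", "Month", "Country", "Var_2")
observed   <- !is.na(df$Var_1)

# Train only on rows where Var_1 is known.
dtrain <- xgb.DMatrix(
  data  = as.matrix(df[observed, predictors]),
  label = df$Var_1[observed]
)
fit <- xgb.train(
  params  = list(objective = "reg:squarederror", eta = 0.1, max_depth = 3),
  data    = dtrain,
  nrounds = 100
)

# Predict Var_1 for the rows where it is missing and fill those cells in.
dmiss <- xgb.DMatrix(data = as.matrix(df[!observed, predictors]))
df$Var_1_imputed <- df$Var_1
df$Var_1_imputed[!observed] <- predict(fit, dmiss)
```

Two caveats about this sketch: Country is a numeric code and is treated here as an ordinary number, so one-hot encoding it may be more appropriate; and because Var_1 spans several orders of magnitude, modelling `log1p(Var_1)` and back-transforming the predictions might behave better. This yields a single point estimate per missing cell and does not reflect imputation uncertainty.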

I googled and read a lot of papers claiming that xgboost is a great tool for imputing missing values, but I wasn't able to find any examples of code that actually performs the imputation. I read the manual at link, which doesn't say a word about using the algorithm for imputation. Finally, I came across the mixgb package, which turned out to be a great tool for this procedure.

I tried the library and now use it in my research along with mice, missForest, Amelia, tidymodels, and some other packages.
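For anyone landing here with the same question, a minimal sketch of a mixgb call might look like the following. It assumes the `mixgb()` function with its `data` and `m` arguments as shown in the package README, and it runs on an invented toy data frame rather than the real file; `imputed_list` is a hypothetical name.

```r
library(mixgb)

# Toy data mimicking the column layout of the extract (values are invented).
set.seed(1)
df <- data.frame(
  Year    = 2021,
  Month   = as.numeric(rep(1:3, each = 20)),
  Country = rep(seq(10, 200, by = 10), times = 3),
  Var_2   = round(runif(60, 500, 5e6))
)
df$Var_1 <- round(0.2 * df$Var_2 * runif(60, 0.5, 1.5))
df$Var_1[c(5, 25, 45)] <- NA   # pretend these cells are missing

# Multiple imputation: m = 5 asks for five completed copies of the data.
imputed_list <- mixgb(data = df, m = 5)

# Each list element is one completed data set; inspect the first.
head(imputed_list[[1]])
```

Unlike a single fit-and-predict pass, this returns several completed data sets, so downstream analyses can be run on each and pooled.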
