I have an .xdf file on an HDFS cluster that is around 10 GB and has nearly 70 columns. I want to read it into an R object so that I can perform some transformation and manipulation. I googled around and came across two functions:
rxReadXdf
rxXdfToDataFrame
Could anyone tell me which function is preferred here, as I want to read the data and perform the transformations in parallel on each node of the cluster?
Also, if I read and transform the data in chunks, do I have to merge the output of each chunk?
Thanks for your help in advance.
Cheers, Amit
Note that rxReadXdf and rxXdfToDataFrame have different arguments and do slightly different things:

- rxReadXdf has a numRows argument, so use this if you want to read the top 1000 (say) rows of the dataset
- rxXdfToDataFrame supports rxTransforms, so use this if you want to manipulate your data in addition to reading it
- rxXdfToDataFrame also has the maxRowsByCols argument, which is another way of capping the size of the input

So in your case, you want to use rxXdfToDataFrame, since you're transforming the data in addition to reading it. rxReadXdf is a bit faster in the local compute context if you just want to read the data (no transforms). This is probably also true for HDFS, but I haven't checked.

However, are you sure that you want to read the data into a data frame? You can use rxDataStep to run (almost) arbitrary R code on an xdf file, while still leaving your data in that format. See the linked documentation page for how to use the transforms arguments.
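To make the two options concrete, here is a minimal sketch. It assumes a RevoScaleR session (Microsoft ML Server / Revolution R), a hypothetical xdf path, and a hypothetical numeric column `val`; for HDFS you would additionally point the file at your cluster and set an appropriate compute context.

```r
library(RevoScaleR)

xdf <- RxXdfData("mydata.xdf")  # hypothetical path; on HDFS, use an HDFS file system object

# Option 1: pull the data into a data frame, applying a transform on the
# way in and capping how much is read into memory.
df <- rxXdfToDataFrame(
  xdf,
  maxRowsByCols = 1e6,                    # cap on rows * cols read into memory
  transforms    = list(logVal = log(val)) # 'val' is a hypothetical column
)

# Option 2 (often preferable for a 10 GB file): keep the data in xdf format
# and let rxDataStep apply the transform chunk by chunk. The chunks are
# written to a single output xdf, so there is no manual merging to do.
rxDataStep(
  inData     = xdf,
  outFile    = "mydata_transformed.xdf",  # hypothetical output path
  transforms = list(logVal = log(val)),
  overwrite  = TRUE
)
```

The second form also answers the chunking question: rxDataStep processes the file a chunk at a time and accumulates the results in the output xdf itself, so no merge step is needed afterwards.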