I have an .xdf file on an HDFS cluster which is around 10 GB and has nearly 70 columns. I want to read it into an R object so that I can perform some transformation and manipulation. I tried Googling it and came across two functions:
rxReadXdf
rxXdfToDataFrame
Could anyone tell me which function is preferred, given that I want to read the data and perform the transformations in parallel on each node of the cluster?
Also, if I read and transform the data in chunks, do I have to merge the output of each chunk?
Thanks for your help in advance.
Cheers, Amit
Note that `rxReadXdf` and `rxXdfToDataFrame` have different arguments and do slightly different things:

- `rxReadXdf` has a `numRows` argument, so use this if you want to read the top 1000 (say) rows of the dataset
- `rxXdfToDataFrame` supports `rxTransforms`, so use this if you want to manipulate your data in addition to reading it
- `rxXdfToDataFrame` also has the `maxRowsByCols` argument, which is another way of capping the size of the input

So in your case, you want to use `rxXdfToDataFrame` since you're transforming the data in addition to reading it. `rxReadXdf` is a bit faster in the local compute context if you just want to read the data (no transforms). This is probably also true for HDFS, but I haven't checked this.

However, are you sure that you want to read the data into a data frame? You can use `rxDataStep` to run (almost) arbitrary R code on an xdf file, while still leaving your data in that format. See the linked documentation page for how to use the transforms arguments.
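To make the options above concrete, here is a minimal sketch of all three calls. It assumes the RevoScaleR package is available; the file paths and the column name `x` used in the transforms are placeholders, not from your dataset:

```r
library(RevoScaleR)

# Placeholder data source; substitute your actual xdf path
xdf <- RxXdfData("/path/to/mydata.xdf")

# 1. Read only the first 1000 rows, no transforms: rxReadXdf
head1k <- rxReadXdf(xdf, numRows = 1000)

# 2. Read into a data frame, applying a transform on the way in,
#    and cap the input size: rxXdfToDataFrame
df <- rxXdfToDataFrame(xdf,
                       transforms = list(logX = log(x)),  # 'x' is a placeholder column
                       maxRowsByCols = 1e8)

# 3. Transform while keeping the data in xdf format: rxDataStep
rxDataStep(inData = xdf,
           outFile = RxXdfData("/path/to/mydata_transformed.xdf"),
           transforms = list(logX = log(x)),
           overwrite = TRUE)
```

With `rxDataStep`, the chunking and merging are handled for you: each chunk is transformed and written to the output xdf, so there is no separate merge step to perform yourself.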