Which functions should I use to work with an XDF file on HDFS?


I have an .xdf file on an HDFS cluster. It is around 10 GB and has nearly 70 columns. I want to read it into an R object so that I can perform some transformations and manipulation. I searched on Google and came across two functions:

rxReadXdf

rxXdfToDataFrame

Could anyone tell me which function is preferred, given that I want to read the data and perform the transformation in parallel on each node of the cluster?

Also, if I read and transform the data in chunks, do I have to merge the output of each chunk?

Thanks for your help in advance.

Cheers, Amit

Hong Ooi

Note that rxReadXdf and rxXdfToDataFrame have different arguments and do slightly different things:

  • rxReadXdf has a numRows argument, so use this if you want to read the top 1000 (say) rows of the dataset
  • rxXdfToDataFrame supports rxTransforms, so use this if you want to manipulate your data in addition to reading it
  • rxXdfToDataFrame also has the maxRowsByCols argument, which is another way of capping the size of the input

So in your case, you want to use rxXdfToDataFrame since you're transforming the data in addition to reading it. rxReadXdf is a bit faster in the local compute context if you just want to read the data (no transforms). This is probably also true for HDFS, but I haven’t checked this.
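To make the difference concrete, here is a minimal sketch of both calls. The path and the column names `x` and `logX` are made up for illustration, and I am assuming the file can be addressed through `RxXdfData` with an `RxHdfsFileSystem`; as noted, I have not verified how the two functions behave against HDFS.

```r
# Hypothetical HDFS path; replace with the real location of your file.
xdfPath <- "/share/mydata.xdf"

# Guarded so the script is harmless on a machine without RevoScaleR.
if (requireNamespace("RevoScaleR", quietly = TRUE)) {
  library(RevoScaleR)

  # Assumption: the default HDFS connection settings are correct for the cluster.
  hdfs <- RxHdfsFileSystem()
  xdf  <- RxXdfData(xdfPath, fileSystem = hdfs)

  # rxReadXdf: read only the first 1000 rows into a data frame.
  top <- rxReadXdf(xdf, numRows = 1000)

  # rxXdfToDataFrame: read and transform in one call.
  # `x` is a made-up input column, `logX` the derived column.
  # maxRowsByCols caps rows * columns of the resulting data frame.
  df <- rxXdfToDataFrame(xdf,
                         transforms = list(logX = log(x)),
                         maxRowsByCols = 1e6)
}
```

Either way the result is an ordinary in-memory data frame, so a 10 GB file is unlikely to fit unless you cap the rows or drop columns.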

However, are you sure that you want to read the data into a data frame? You can use rxDataStep to run (almost) arbitrary R code on an xdf file, while still leaving your data in that format. See the linked documentation page for how to use the transforms arguments.
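A rough sketch of the rxDataStep approach, assuming a numeric column called `amount` (hypothetical, as are the paths and the function name `addLogAmount`). The transform function receives each chunk as a named list of vectors and returns the modified list. rxDataStep writes every processed chunk to the output file itself, so with this pattern you should not need to merge chunk outputs by hand. To have the work run on the cluster nodes you would also set a Hadoop compute context beforehand, which is not shown here.

```r
# Transform function applied to one chunk at a time.
# It receives a named list of column vectors and returns that list,
# here with one derived column added.
# `amount` and `amountLog` are hypothetical column names.
addLogAmount <- function(dataList) {
  dataList$amountLog <- log1p(dataList$amount)
  dataList
}

# Guarded so the script is harmless on a machine without RevoScaleR.
if (requireNamespace("RevoScaleR", quietly = TRUE)) {
  library(RevoScaleR)

  # Assumption: the default HDFS connection settings are correct for the cluster.
  hdfs   <- RxHdfsFileSystem()
  inXdf  <- RxXdfData("/share/input.xdf",  fileSystem = hdfs)  # hypothetical path
  outXdf <- RxXdfData("/share/output.xdf", fileSystem = hdfs)  # hypothetical path

  # Each chunk of inXdf is passed through addLogAmount and written to outXdf;
  # the data stays in xdf format throughout.
  rxDataStep(inData = inXdf, outFile = outXdf,
             transformFunc = addLogAmount,
             overwrite = TRUE)
}
```

Because `addLogAmount` is plain R, you can try it on a small list such as `list(amount = c(0, 1))` before running it over the full file.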