No size difference between Hadoop SequenceFile and TextFile?

I am trying to compress my Spark output files, and I found that SequenceFiles can be used for this.

I saved the file in Java like this:

JavaPairRDD<Text, Text> result = ...
result.coalesce(1).saveAsNewAPIHadoopFile(outputPath.toString() + ".seq", Text.class, Text.class, SequenceFileOutputFormat.class);

However, I couldn't see any size difference between the saveAsTextFile output and this SequenceFile output. I have seen different methods for creating SequenceFiles, but most of them use Scala; I have to use Java, so I used the method above.
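For reference, the text output I am comparing against is written from the same RDD like this (result and outputPath are the same as above; the ".txt" suffix is just a placeholder):

// Baseline: plain text output, with no compression codec passed in
result.coalesce(1).saveAsTextFile(outputPath.toString() + ".txt");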

The result pair RDD looks something like this:

1, 123.456, 123.457, 123.458, ...
2, 123.789, 123.790, 123.791, ...
...
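In case it matters for compression, the pairs are roughly built as in the sketch below. This is simplified: sc is an existing JavaSparkContext, and the parallelized sample lines just mirror the rows shown above.

import java.util.Arrays;
import org.apache.hadoop.io.Text;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import scala.Tuple2;

// sc is an existing JavaSparkContext; the sample lines mirror the rows shown above
JavaRDD<String> lines = sc.parallelize(Arrays.asList(
        "1, 123.456, 123.457, 123.458",
        "2, 123.789, 123.790, 123.791"));

// Split each line into an (id, comma-separated values) pair of Hadoop Text objects
JavaPairRDD<Text, Text> result = lines.mapToPair(line -> {
    int comma = line.indexOf(',');
    return new Tuple2<Text, Text>(new Text(line.substring(0, comma)),
                                  new Text(line.substring(comma + 1).trim()));
});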

Am I doing something wrong, or do I misunderstand the SequenceFile concept completely?

By the way, this output file will then be used in R for data analysis, and I can't use Spark SQL, DataFrames, etc.

If you have other suggestions, such as Parquet or Avro, that don't require DataFrames, that would be very welcome.

I just need to compress my files, and it should be possible to decompress them or use them directly through Hadoop APIs or R libraries.
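On the Hadoop/Spark side, I expect to read the output back roughly as sketched below; sc and outputPath are the same placeholders as above (a compressed SequenceFile would be decompressed transparently on read, as the codec is recorded in the file header).

// Read the SequenceFile back through the Spark/Hadoop API
JavaPairRDD<Text, Text> readBack =
        sc.sequenceFile(outputPath.toString() + ".seq", Text.class, Text.class);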
