Installing parquet-tools


I am trying to install parquet tools on a FreeBSD machine.

I cloned this repo: git clone https://github.com/apache/parquet-mr

Then I did cd parquet-mr/parquet-tools

Then I did `mvn clean package -Plocal`

As specified here: https://github.com/apache/parquet-mr/tree/master/parquet-tools

This is what I got:

[Screenshot of the Maven build output; image not available]

Why is this dependency error here? How do I get around it?


There are 5 answers

4
Zoltan On

parquet-tools is just one module of parquet-mr. It depends on some of the other modules.

When you build from a source version that corresponds to a release, those other modules will be available to Maven, because release artifacts are published as a part of the release process.

However, when building from a snapshot version, you have to make those dependencies available yourself. There are two ways to do so:

Option 1: Build and install all modules of the parent directory:

git clone https://github.com/apache/parquet-mr
cd parquet-mr
mvn install -Plocal

This will put the snapshot artifacts in your local ~/.m2 directory. Subsequently, you can (re)build just parquet-tools like you initially tried, because now the snapshot artifacts will already be available from ~/.m2.
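
To confirm that Option 1 really did install the snapshot artifacts, you can look inside the local repository. The following is only a minimal sketch: it assumes the default repository location (~/.m2/repository) and the org.apache.parquet group ID used by recent parquet-mr versions, and list_parquet_artifacts is a made-up helper name, not something Maven provides.

```shell
#!/bin/sh
# list_parquet_artifacts is a hypothetical helper, not part of Maven or parquet-mr.
# It prints the names of the parquet-* artifact directories in a local Maven repository.
# Assumption: artifacts are stored under org/apache/parquet (group ID of recent versions).
list_parquet_artifacts() {
    repo="${1:-$HOME/.m2/repository}"
    for dir in "$repo"/org/apache/parquet/parquet-*; do
        # The glob stays unexpanded when nothing matches, so check for a real directory.
        [ -d "$dir" ] && basename "$dir"
    done
    return 0
}
```

Calling list_parquet_artifacts with no argument inspects ~/.m2/repository. If it prints nothing, the install step most likely did not complete.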

Option 2: Build parquet-tools from the parent directory, asking Maven to also build the modules it depends on along the way:

git clone https://github.com/apache/parquet-mr
cd parquet-mr
mvn package -pl parquet-tools -am -Plocal

Option 1 builds more projects than option 2, so if you only need parquet-tools, the latter is the better choice. Note, though, that both will probably require a Thrift compiler to be installed.
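
Since a missing Thrift compiler only shows up partway through the build, it can save time to check for the required tools first. This is a rough sketch rather than an official check: missing_tools is a hypothetical helper, and the tool list in the usage note (git, java, mvn, thrift) is my assumption about what a source build needs.

```shell
#!/bin/sh
# missing_tools is a hypothetical helper: it prints every named command
# that cannot be found on PATH, one per line, and prints nothing if all exist.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
    return 0
}
```

For example, running missing_tools git java mvn thrift before either option lists whatever still has to be installed. Empty output means all four commands were found.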

3
Nagev On

On Ubuntu 20, I install via pip:

python3 -m pip install parquet-tools

Haven't tried on FreeBSD but I'd imagine it would also work. See related answer for a caveat on using pip on FreeBSD.

And you can view a file with:

parquet-tools show filename.parquet
1
Shashank On

Parquet tools is a utility for reading Parquet files. You can clone it from GitHub and build it with Maven.

1. git clone https://github.com/Parquet/parquet-mr.git 
2. cd parquet-mr/parquet-tools/ 
3. mvn clean package -Plocal 


Or you can download a stable release and build it locally.

1. Download a stable Parquet release:

    https://github.com/apache/parquet-mr/archive/apache-parquet-1.8.2.tar.gz

2. Build it locally with Maven:

 D:\parquet>cd parquet-tools && mvn clean package -Plocal


3. Test it (copy a Parquet file into the target directory):

 D:\parquet\parquet-tools\target>java -jar parquet-tools-1.8.2.jar schema out.parquet

(where out.parquet is my parquet file under target directory)

// Print the whole Parquet file

D:\parquet\parquet-tools\target>java -jar parquet-tools-1.8.2.jar cat out.parquet

// Print the first few records of the Parquet file

D:\parquet\parquet-tools\target>java -jar parquet-tools-1.8.2.jar head -n5 out.parquet
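
Because the jar name contains the version number, hard-coding it in commands is easy to get wrong. On a Unix-like system such as FreeBSD, something along these lines could pick up whichever jar the build produced. It is only a sketch: find_tools_jar is a hypothetical helper, and it assumes the runnable jar is named parquet-tools-<version>.jar and sits in the directory you pass in.

```shell
#!/bin/sh
# find_tools_jar is a hypothetical helper: it prints the first file matching
# parquet-tools-*.jar in the given directory (default: current directory).
# Assumption: the build left the runnable jar there, e.g. in parquet-tools/target.
find_tools_jar() {
    dir="${1:-.}"
    for jar in "$dir"/parquet-tools-*.jar; do
        if [ -f "$jar" ]; then
            echo "$jar"
            return 0
        fi
    done
    echo "no parquet-tools jar found in $dir" >&2
    return 1
}
```

From the parquet-tools directory, the schema command would then look like java -jar "$(find_tools_jar target)" schema out.parquet. If the build produced several matching jars, this simply takes the first one in glob order, which may not be the one you want.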
0
Wheezil On

Some answers have a broken link for the jar download, but you can get it from Maven Central.

However... this jar and others like it are built so that the hadoop dependencies are "provided" and if you build from source, you'll get that default. So you need to set -Dhadoop.scope=compile when you build, or the result will only work when run on a hadoop node using the "hadoop ..." command.
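
As a sketch of what that build invocation might look like, the helper below just assembles the Maven arguments so you can inspect them before running anything. maven_build_args is a made-up name. The -pl, -am and -Plocal flags are taken from Zoltan's answer, and I have not verified that hadoop.scope is still needed (or honored) when -Plocal is active on every parquet-mr version.

```shell
#!/bin/sh
# maven_build_args is a hypothetical helper that prints Maven arguments for a
# build whose Hadoop dependencies use the given scope (default: compile).
# Assumption: these flags can be combined; this is unverified across parquet-mr versions.
maven_build_args() {
    scope="${1:-compile}"
    echo "package -pl parquet-tools -am -Plocal -Dhadoop.scope=$scope"
}
```

Run from the parquet-mr root, mvn $(maven_build_args) would then build parquet-tools with the Hadoop classes bundled, so the jar can be started with plain java -jar.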

To make matters worse, this tool apparently disables System.out and System.err, so exceptions that cause main() to fail are never printed and you'll be left wondering what happened.

I also found that the default settings for the maven-license-plugin caused it to fail the build when files showed up that it didn't expect (e.g. nbactions.xml if you use NetBeans).

2
jeffhu On

I know the question specifies FreeBSD, but if you're on a Mac, you can install it with Homebrew:

brew install parquet-tools