Issues saving to Hive table from Pig


I am using HCatalog to read and write data to Hive from Pig Script as follows:

A = LOAD 'customer' USING org.apache.hcatalog.pig.HCatLoader();

B = LOAD 'address' USING org.apache.hcatalog.pig.HCatLoader();

C = JOIN A by cmr_id,B by cmr_id;

STORE C INTO 'cmr_address_join' USING org.apache.hcatalog.pig.HCatStorer();

Table definition for customer is:

cmr_id                  int                     
name                    string                   

Address:

addr_id                 int                     
cmr_id                  int                     
address                 string                  

cmr_address_join:

cmr_id                  int                     
name                    string                  
addr_id                 int                     
address                 string    

When I run this, Pig throws the following error:

ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1115: Column names should all be in lowercase. Invalid name found: A::cmr_id

I believe this is because Pig is trying to match the Pig-generated field names against the Hive columns, and they do not match exactly (A::cmr_id versus cmr_id). I think HCatStorer expects the alias to be cmr_id and not A::cmr_id. I wish HCatStorer ignored the alias prefix and only considered the field name.

grunt>  DESCRIBE C;

C: {A::cmr_id: int,A::name: chararray,B::addr_id: int,B::cmr_id: int,B::address: chararray}

Is there a way to drop the prefix of the field in Pig (i.e. A::)? Or if someone has a workaround or a solution, it would be great.

I know we can use the following to explicitly assign aliases and get this to work.

D = foreach C generate A::cmr_id as cmr_id,A::name as name, B::addr_id as addr_id, B::address as address;

STORE D INTO 'cmr_address_join' USING org.apache.hcatalog.pig.HCatStorer();

But my problem is, I have many tables, each with hundreds of columns. It would become tedious to specify the aliases as above.

Any help to fix this would be greatly appreciated.

There are 2 answers

Answer by Piyush Jindal:

You can use positional references ($0, $1, and so on) to access the columns; just make sure to rename each one to the target column name, for example: $0 AS cmr_id.
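A minimal sketch of this positional workaround, assuming the schema of C shown in the question (A::cmr_id, A::name, B::addr_id, B::cmr_id, B::address, in that order):

```pig
-- Project by position and rename; skip $3, the duplicate B::cmr_id
D = FOREACH C GENERATE
        $0 AS cmr_id,    -- A::cmr_id
        $1 AS name,      -- A::name
        $2 AS addr_id,   -- B::addr_id
        $4 AS address;   -- B::address

STORE D INTO 'cmr_address_join' USING org.apache.hcatalog.pig.HCatStorer();
```

Note this only saves typing the A::/B:: prefixes; you still have to list every column, and the positions must match the field order reported by DESCRIBE C.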

Answer by Lester Martin:

Yep, no joy on this one, and it is unlikely you could get the exact behavior you are wishing for, especially since your join-returned relation will contain both of the join keys (for example, A::cmr_id and B::cmr_id), so dropping the prefixes would create a name collision. You hit on the only real solution already: project it appropriately with FOREACH/GENERATE and rename the columns. In practice you will likely have to do this for real Hive structures anyway, since the columns must not only be named correctly but also be in the correct sequence. Not to mention that a "real" Hive table would be unlikely to store the join key's value twice.

The only other solution I can think of (and I'm not recommending it) would be to STORE C as a file on HDFS and configure a non-managed (likely EXTERNAL) Hive table to point to the directory you just stored the file into. You could also pre-create a Hive view that sequences the columns and trims the extras (like the duplicate cmr_id), so that you could then do a new LOAD using HCatLoader and use that alias for the HCatStorer STORE command. This might look better in your Pig script, but you would still have to do most of the work (just in Hive), and it will definitely have a performance impact, since you would have to write, and then read back, the HDFS file represented by C before saving it into the desired Hive table.
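For what it's worth, a rough sketch of that detour (paths, table names, and the tab delimiter here are illustrative assumptions, not from the question):

```pig
-- In Pig: write the joined relation to a plain HDFS directory instead of a Hive table
STORE C INTO '/tmp/cmr_address_join_raw' USING PigStorage('\t');
```

```sql
-- In Hive: point an EXTERNAL table at that directory; column order must match
-- the field order of C, including the duplicate join key
CREATE EXTERNAL TABLE cmr_address_join_raw (
  cmr_id   INT,
  name     STRING,
  addr_id  INT,
  cmr_id_b INT,
  address  STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/tmp/cmr_address_join_raw';

-- Then load the managed table, dropping the duplicate key
INSERT INTO TABLE cmr_address_join
SELECT cmr_id, name, addr_id, address FROM cmr_address_join_raw;
```

As noted above, this just moves the column-listing work into Hive and adds an extra write/read of the intermediate data.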