I am using HCatalog
to read and write data to Hive from Pig Script as follows:
A = LOAD 'customer' USING org.apache.hcatalog.pig.HCatLoader();
B = LOAD 'address' USING org.apache.hcatalog.pig.HCatLoader();
C = JOIN A by cmr_id,B by cmr_id;
STORE C INTO 'cmr_address_join' USING org.apache.hcatalog.pig.HCatStorer();
Table definition for customer is:
cmr_id int
name string
Address:
addr_id int
cmr_id int
address string
cmr_address_join:
cmr_id int
name string
addr_id int
address string
When I run this, Pig throws the following error:
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1115: Column names should all be in lowercase. Invalid name found: A::cmr_id
I believe this may be because Pig is trying to match Pig generated file names with the Hive columns and it is not exactly matching (A::cmr_id versus cmr_id
). I think HCatalogStorer
is expecting the alias to be cmr_id
and not A::cmr_id
. I wish HCatalogStorer
ignored the alias prefix and only considered the field name.
grunt> DESCRIBE C;
C: {A::cmr_id: int,A::name: chararray,B::addr_id: int,B::cmr_id: int,B::address: chararray}
Is there a way to drop the prefix of the field in Pig (i.e. A::)? Or if someone has a workaround or a solution, it would be great.
I know we can use the following to explicitly add an alias and get this to work.
D = foreach C generate A::cmr_id as cmr_id,A::name as name, B::addr_id as addr_id, B::address as address;
STORE D INTO 'cmr_address_join' USING org.apache.hcatalog.pig.HCatStorer();
But my problem is, I have many tables each having hundreds of columns. It would become tedious to specify the alias as above.
Any help to fix this would be greatly appreciated.
You can use $0, $1 and so on to access the columns and please do rename them as column name for example : $0 as cmr_id