I have a table with three partition columns:
create table tn(
col1 string,
etc...
)
partitioned by (
time_key date,
region string,
city string
)
stored as orc
tblproperties ("orc.compress"="ZLIB");
A city partition can hold anywhere from a few MB to a few hundred MB. I'm trying to optimize storage so that small files are merged into a single file of roughly one block size (128 MB), and larger files are split accordingly.
The source table has 200 files of around 150 MB each. It's not partitioned.
I load the data with a simple insert statement:
INSERT INTO TABLE tn PARTITION (time_key, region, city)
SELECT * FROM source_tn;
With the following settings I get a NullPointerException:
set hive.merge.tezfiles=true;
set hive.merge.smallfiles.avgsize=128000000;
set hive.merge.size.per.task=128000000;
set hive.merge.orcfile.stripe.level=true;
set hive.auto.convert.join=false;
If I run the insert without these settings it works fine, so there isn't anything wrong with the data. The problem in that case is that each city subpartition ends up with around 200 files, and the total number of files in a time_key partition reaches 30-40 thousand.
What's the problem and what can I do?
I'm using Hive on Tez.
Setting this to false helped.
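Independent of the merge settings, one way to cut the number of files per city partition might be to send all rows for a given partition to the same reducer with DISTRIBUTE BY, and to compact partitions that are already fragmented with ALTER TABLE ... CONCATENATE. This is only a sketch: it assumes source_tn has columns named time_key, region and city, and the partition values in the second statement are hypothetical placeholders.

```sql
-- Sketch only: assumes source_tn exposes columns named time_key, region and city
-- (SELECT * maps the last three columns to the partition columns by position).
INSERT INTO TABLE tn PARTITION (time_key, region, city)
SELECT * FROM source_tn
DISTRIBUTE BY time_key, region, city;

-- Compact one existing ORC partition in place.
-- The partition values below are hypothetical placeholders.
ALTER TABLE tn PARTITION (time_key='2020-01-01', region='some_region', city='some_city') CONCATENATE;
```

With DISTRIBUTE BY, each partition is typically written by a single reducer, so it tends to end up as one file rather than around 200. The trade-off is that a very large city lands on one reducer, which can slow the job down and produce a file well above 128 MB.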