For the sake of example, let's say I have a parquet file in S3 partitioned by the column date, with the following layout:
s3://my_bucket/path/my_table/date=*
So when I load the table using Spark, for example, it shows the following:
+---+-------+--------+
| id|country|    date|
+---+-------+--------+
|  1|  Italy|20200101|
|  2| France|20200101|
|  3|  Japan|20200102|
|  4|    USA|20200103|
+---+-------+--------+
I am trying to load this table into Redshift doing something like this:
create table my_table
(
    id BIGINT ENCODE az64,
    country VARCHAR(256) ENCODE lzo,
    date VARCHAR(256) ENCODE lzo
);
copy my_table
from 's3://my_bucket/path/my_table/'
iam_role 'my-role'
format as parquet
partition by 'date';
Is there a way to load this data into Redshift using the COPY command, or by some other method?
*I already tried using a manifest file, but could not find a way to load the partition column as well.
I see 2 ways of doing this:
Perform N COPYs (one per date) and manually set the date column to the correct value with each COPY. Since the S3 key contains the date value, it would be fairly easy to script this up: COPY all the data for one date into the table, leaving the "date" column NULL, then set the date column to the correct value for that COPY with WHERE date IS NULL. Repeat. You'll need to vacuum the table afterwards, so it's not ideal.
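A rough sketch of one iteration, assuming your cluster accepts a column list with FORMAT AS PARQUET (if it doesn't, COPY into a two-column staging table and INSERT from it instead); the bucket, role, and the 20200101 prefix are taken from the question and are purely illustrative:

-- load one partition's files, leaving the date column NULL
copy my_table (id, country)
from 's3://my_bucket/path/my_table/date=20200101/'
iam_role 'my-role'
format as parquet;

-- backfill the partition value for the rows just loaded
update my_table
set date = '20200101'
where date is null;

-- repeat for each date=* prefix, then reclaim space left by the updates
vacuum my_table;
analyze my_table;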
Set up an external table (Spectrum) pointing at these S3 objects, partitioned by date. Because Spectrum exposes the partition key as a regular column of the external table, you just need an INSERT into the local Redshift table that includes "date". This is cleaner, but you need to go through the effort of setting up the external schema and table with partitions. If this is a one-time thing, the down-and-dirty script might be the way to go.
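A minimal sketch of the Spectrum route, assuming a Glue data catalog and made-up names for the external schema and database (spectrum_schema, my_spectrum_db); each partition is registered explicitly here, though a Glue crawler can do that for you:

create external schema spectrum_schema
from data catalog
database 'my_spectrum_db'
iam_role 'my-role'
create external database if not exists;

create external table spectrum_schema.my_table_ext (
    id BIGINT,
    country VARCHAR(256)
)
partitioned by (date VARCHAR(256))
stored as parquet
location 's3://my_bucket/path/my_table/';

-- register each date=* prefix as a partition (repeat per date)
alter table spectrum_schema.my_table_ext
add partition (date = '20200101')
location 's3://my_bucket/path/my_table/date=20200101/';

-- the partition column behaves like any other column, so it carries over
insert into my_table
select id, country, date
from spectrum_schema.my_table_ext;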
Both should work, just with different pros and cons.