I have a bunch of s3 files I want to copy into Redshift (using AWS Data Pipelines and RedshiftCopyActivity). The challenge lies in the fact that my s3 files have one column less then target Redshift table. Table itself has "id" column - an IDENTITY column which values are auto-generated during insert.
I understand that I should/could be using transformSql property of RedshiftCopyActivity but I am failing in constructing helpful query. Execution always returns me an error:
Exception ERROR: cannot set an identity column to a value
Some more details: Identity column is the first column of the table.
Data is successfully inserted into table called staging, as it should be. Also, I see my transformSQL was run and data is inserted into table called staging2. Logs show:
create temporary table staging2 as select myField1, myField2, ..., myFieldN from staging
but after that comes:
INSERT INTO target_table SELECT * FROM staging2
which causes error to happen.
So, how can I approach this and make Redshift ignore the fact that I am offering one column less? Maybe solution could be to make "id" column as the last one, I still didn't try this one. To be honest, I don't like how it sounds - like very fragile approach.
At the end, I wasn't able to make this working using RedshiftCopyActivity. It was always complaining about how value cannot be set to identity column. Event transformSQL parameter didn't help.
The solution that fits my needs utilizes ShellCommandActivity which runs a simple shell script. Basically, the idea is to install PSQL on EC2 running that is running mentioned shell script, connect to Redshift using PSQL and trigger COPY command that copies data from S3 to Redshift tables.
There are no problems with identity column using COPY command.