Postgresql COPY encoding, how to?

34.7k views Asked by At

I am importing a .txt file that contains imdb information(such as moviename, movieid, actors, directors, rating votes etc) I imported it by using the COPY Statement. I am using Ubuntu 64 bit. The problem is, that there are actors having different names, such as Jonas Åkerlund. That is why postgresql throws an error:

ERROR: missing data for column "actors" CONTEXT: COPY movies, line 3060: "tt0283003 Spun 2002 6.8 30801 101 mins. Jonas Ã" ********** Error **********

ERROR: missing data for column "actors" SQL state: 22P04 Context: COPY movies, line 3060: "tt0283003 Spun 2002 6.8 30801 101 mins. Jonas Ã"

My copy statement looks like this:

COPY movie FROM '/home/max/Schreibtisch/imdb_top100t.txt' (DELIMITER E'\t', FORMAT CSV, NULL '');

I do not exactly know, how to use the collation statement. Could you help me please? As always, thank you.

1

There are 1 answers

3
Nick Barnes On

Collation only determines how strings are sorted. The important thing when loading and saving them is the encoding.

By default, Postgres uses your client_encoding setting for COPY commands; if it doesn't match the encoding of the file, you'll run into problems like this.

You can see from the message that while trying to read the "Å", Postgres first read an "Ã", and then encountered some kind of error. The UTF8 byte sequence for "Å" is C3 85. C3 happens to be "Ã" in the LATIN1 codepage, while 85 is undefined*. So it's highly likely that the file is UTF8, but being read as if it were LATIN1.

It should be as simple as specifying the appropriate encoding in the COPY command:

COPY movie FROM '/home/max/Schreibtisch/imdb_top100t.txt'
(DELIMITER E'\t', FORMAT CSV, NULL '', ENCODING 'UTF8');

*I believe Postgres actually maps these "gaps" in LATIN1 to the corresponding Unicode code points. 85 becomes U+0085, a.k.a. "NEXT LINE", which explains why it was treated as a CSV row terminator.