Why was the Python Unicode internal format implemented as described in PEP 100?


http://www.python.org/dev/peps/pep-0100/

PEP 100 states that the internal format, Python Unicode, holds UTF-16 encodings but addresses the values as UCS-2 (or UCS-4 when compiled with the flag --enable-unicode=ucs4).

Why wasn't UTF-16 (a variable-length format) chosen instead of UCS-2 (a fixed-length format)?

Though the two encodings are largely the same, UTF-16 was already 4 years old when PEP 100 was published (March 2000). Was Python Unicode meant to address backwards compatibility issues?

I'm really just curious why Python's internal format uses this (seemingly) hybrid approach to storing encoded data.

A better way to ask my question might be: does anyone have a citation or link, with a quote from an official document, that specifically states why PEP 100 chose to treat UTF-16 as UCS-2 instead of using UTF-16 proper?


There is 1 answer

John Machin

Read on a little further: "UCS-2 and UTF-16 are the same for all currently defined Unicode character points" ... and that was true in 2000, when the PEP was written. The initial implementation covered only the BMP (Basic Multilingual Plane, the first 64K code points).
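
To make the distinction concrete, here is a minimal sketch for a current Python 3 (not the 2000-era implementation). It shows that a BMP character needs a single 16-bit unit whose value equals the code point, which is exactly what UCS-2 would store, while a character outside the BMP needs two units (a surrogate pair) that a UCS-2 view treats as two separate values. The helper name utf16_code_units is made up for this illustration.

```python
def utf16_code_units(text):
    """Return the 16-bit UTF-16 code units that encode text (illustrative helper)."""
    data = text.encode("utf-16-be")  # big-endian, no byte order mark
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


bmp_char = "\u20ac"          # EURO SIGN, inside the BMP
astral_char = "\U0001d11e"   # MUSICAL SYMBOL G CLEF, outside the BMP

bmp_units = utf16_code_units(bmp_char)
astral_units = utf16_code_units(astral_char)

# One unit, numerically equal to the code point: UCS-2 and UTF-16 agree here.
print([hex(u) for u in bmp_units])

# Two units forming a surrogate pair: UTF-16 can represent this, UCS-2 cannot.
print([hex(u) for u in astral_units])
```

On the old narrow (UCS-2) builds, len() of a string holding such a non-BMP character reported 2, because each surrogate was addressed as its own element; that is the "UTF-16 data addressed as UCS-2" behavior the question describes.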