SQL Server: set character set (not collation)

53.8k views Asked by At

How does one set the default character set for fields when creating tables in SQL Server? In MySQL one does this:

CREATE TABLE tableName (
    name VARCHAR(128) CHARACTER SET utf8
) DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci;

Note that I set the character set twice here. It is redundant, I added both ways just to demonstrate.

I set the collation also to demonstrate that collation is something different. I am not asking about setting the collation. Most questions asking about character sets and encodings in SQL Server are answered with collation, which is not the same thing.

2

There are 2 answers

5
Martin Smith On BEST ANSWER

As stated in BOL

Each SQL Server collation specifies three properties:

  • The sort order to use for Unicode data types (nchar, nvarchar, and ntext). A sort order defines the sequence in which characters are sorted, and the way characters are evaluated in comparison operations.
  • The sort order to use for non-Unicode character data types (char, varchar, and text).
  • The code page used to store non-Unicode character data.

The quote above is from 2000 docs. See also this 2008 link. The below also demonstrates this.

DECLARE @T TABLE 
(
     code TINYINT PRIMARY KEY,
     Arabic_CS_AS CHAR(1) COLLATE Arabic_CS_AS NULL,
     Cyrillic_General_CS_AS CHAR(1) COLLATE Cyrillic_General_CS_AS NULL,
     Latin1_General_CS_AS CHAR(1) COLLATE Latin1_General_CS_AS NULL
);

INSERT INTO @T(code) VALUES (200),(201),(202),(203),(204),(205)

UPDATE @T 
  SET Arabic_CS_AS=CAST(code AS BINARY(1)),
      Cyrillic_General_CS_AS=CAST(code AS BINARY(1)),
      Latin1_General_CS_AS=CAST(code AS BINARY(1))

SELECT * 
FROM @T   

Results

code Arabic_CS_AS Cyrillic_General_CS_AS Latin1_General_CS_AS
---- ------------ ---------------------- --------------------
200  ب            И                      È
201  ة            Й                      É
202  ت            К                      Ê
203  ث            Л                      Ë
204  ج            М                      Ì
205  ح            Н                      Í
1
Solomon Rutzky On

To expand on @Martin's answer:

How you set a "character set" in SQL Server depends on the datatype that you are using. If you are using:

  • NVARCHAR, NCHAR, and NTEXT (NTEXT is deprecated and shouldn't be used as of SQL Server 2005) all use the Unicode character set and this cannot be changed. These datatypes are all encoded as UTF-16 LE (Little Endian) – a 16-bit encoding with each "character" being either 2 or 4 bytes – and this too cannot be changed. For these datatypes, the Collation being used only affects the locale (as determined by the LCID of the Collation) which determines the set of rules used for sorting and comparison.

  • XML, like the N-prefixed types, uses the Unicode character set and is encoded as UTF-16 LE (Little Endian), and neither of those can be changed. But unlike the other string datatypes, there is no Collation associated with XML data as it cannot be sorted or compared (at least not without first converting it to NVARCHAR(MAX) [preferred] or VARCHAR(MAX) ).

  • VARCHAR, CHAR, and TEXT (TEXT is deprecated and shouldn't be used as of SQL Server 2005) are all 8-bit encodings with each "character" being either 1 or 2 bytes. The character set is determined by the Code Page associated with each Collation. The sorting and comparison rules depend on the type of Collation being used:

    • SQL Server Collations: These all have names starting with SQL_ and have been deprecated since SQL Server 2000, though are (unfortunately) still in wide use today. These use simple rules indicated as the "SQL Server Sort Order" number as found in the description field returned by sys.fn_helpcollations().
    • Windows Collations: These all have names that do not start with SQL_. These Collations allow the non-Unicode string data to use the Unicode sorting and comparison rules indicated by the LCID of the Collation.

That being said, to find out which character set (for CHAR, VARCHAR, and TEXT – i.e. non-Unicode – data) is being used, run the following query and pay close attention to the CodePage field. The LCID field indicates the locale used for sorting and comparison rules for the N-prefixed – i.e. Unicode – types as well as the non-Unicode types if using a Windows Collation:

SELECT *,
       COLLATIONPROPERTY(col.[name], 'CodePage') AS [CodePage],
       COLLATIONPROPERTY(col.[name], 'LCID') AS [LCID]
FROM   sys.fn_helpcollations() col
ORDER BY col.[name];

The Code Page IDs can be translated into something more meaningful via the MSDN page for Code Page Identifiers.


Regarding the O.P.'s comment on @Martin's answer:

It is unfortunate that they chose the misleading/incomplete term "collation" which clearly refers to sort order: collate definition.

While it is true that Microsoft could have done better when choosing a name, there is unfortunately a general, industry-wide confusion over terms such as "encoding", "character set", "collation", etc. Microsoft's use (or misuse) of "Collation" has merely contributed to the mass confusion. But, that confusion is also evident in MySQL as shown in this question, given that "utf8" is specifically not a character set ;-).

UTF-8 is one of several encodings for the Unicode character set. UTF-16 and UTF-32 are the other two encodings. All three of those encodings represent the exact same Unicode character set, just in different ways. Looking at the list of MySQL character sets – 11.1.10 Supported Character Sets and Collations – the "ucs2", "utf8", "utf8mb4", "utf16", "utf16le", "utf32" charsets are not actually character sets, per se, but various representations of the Unicode character set. But, given the overlap between the concepts of "character set" and "encoding", it would be difficult to not have this confusion. The 11.1.10.1 Unicode Character Sets page indicates that the "utf8mb4", "utf16", "utf16le", and "utf32" charsets are the full Unicode character sets while "ucs2" and "utf8" are subsets of the Unicode character set, specifically the first 65,536 code points (a.k.a. Basic Multilingual Plane (BMP)).

For more info regarding Collation across various RDBMS's, please see my answer to the following question on DBA.StackExchange:

Does any DBMS have a collation that is both case-sensitive and accent-insensitive?


UPDATE 2018-10-02

While this is not a viable option yet, SQL Server 2019 introduces native support for UTF-8 in VARCHAR / CHAR datatypes. There are currently too many bugs with it for it to be used, but if they are fixed, then this is an option for some scenarios. Please see my post, "Native UTF-8 Support in SQL Server 2019: Savior or False Prophet?", for a detailed analysis of this new feature.