Can MongoDB store and manipulate UTF-8 strings with code points outside the Basic Multilingual Plane?


In MongoDB 2.0.6, when I try to store or query documents with string fields whose values include characters outside the BMP, I get a raft of errors such as "Not proper UTF-16: 55357" or "buffer too small".

What settings, changes, or recommendations are there to permit storage and query of multi-lingual strings in Mongo, particularly ones that include these characters above 0xFFFF?

Thanks.


There is 1 answer

William Z (best answer)

There are several issues here:

1) MongoDB stores all documents in the BSON format, and the BSON spec refers to a UTF-8 string encoding, not a UTF-16 encoding.

Ref: http://bsonspec.org/#/specification
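To make the UTF-8 versus UTF-16 distinction concrete, here is a minimal sketch using only the Python standard library (no MongoDB driver involved). The helper name bson_string_payload is hypothetical; it follows the string layout described in the BSON spec (a little-endian int32 byte count that includes the trailing NUL, then the UTF-8 bytes, then NUL). The sample character U+1F600 is my own choice of a code point above 0xFFFF.

```python
import struct

def bson_string_payload(text):
    """Hypothetical helper: lay out a string value as the BSON spec describes.

    int32 length (little-endian, counting the trailing NUL) + UTF-8 bytes + NUL.
    """
    data = text.encode("utf-8")
    return struct.pack("<i", len(data) + 1) + data + b"\x00"

emoji = "\U0001F600"  # one code point outside the BMP

# UTF-8 needs four bytes for this code point and no surrogates at all.
utf8 = emoji.encode("utf-8")

# UTF-16 needs two 16-bit code units (a surrogate pair) for the same code point.
utf16_units = struct.unpack("<2H", emoji.encode("utf-16-le"))

payload = bson_string_payload(emoji)

print(utf8.hex())                       # f09f9880
print([hex(u) for u in utf16_units])    # ['0xd83d', '0xde00']
print(utf16_units[0])                   # 55357
```

Note that the first UTF-16 code unit, 0xD83D, is 55357 in decimal, the same number that appears in the "Not proper UTF-16: 55357" error from the question. That value is a high surrogate, which is only meaningful when followed by a low surrogate.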

2) All of the drivers, including the JavaScript driver in the mongo shell, should properly handle strings that are encoded as UTF-8. (If they don't, it's a bug!) Many of the drivers happen to handle UTF-16 properly as well, although as far as I know UTF-16 isn't officially supported.

3) When I tested this with the Python driver, MongoDB could successfully load and return a string value that contained a broken UTF-16 code pair (surrogate pair). However, I couldn't load a broken code pair using the mongo shell, nor could I store a string containing a broken code pair in a JavaScript variable in the shell.
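One way to catch such strings on the client before they reach the database is to scan for stray surrogate code points. The following is a rough sketch in Python 3 using only the standard library; the function names are hypothetical, and it reflects Python 3's own string handling rather than the behavior of any particular MongoDB driver version.

```python
def find_lone_surrogates(text):
    """Return the indexes of surrogate code points in a Python 3 str.

    A well-formed non-BMP character is a single code point in Python 3,
    so any code point in U+D800..U+DFFF here is an unpaired surrogate.
    """
    return [i for i, ch in enumerate(text) if 0xD800 <= ord(ch) <= 0xDFFF]

def is_strict_utf8_encodable(text):
    """Return True if the string can be encoded as strict UTF-8."""
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False

good = "smile \U0001F600"   # contains a complete non-BMP character
broken = "abc\ud83d"        # ends with half of a surrogate pair

print(find_lone_surrogates(good), is_strict_utf8_encodable(good))
print(find_lone_surrogates(broken), is_strict_utf8_encodable(broken))
```

Rejecting or repairing the broken value at this point keeps ill-formed data out of the collection, so later consumers such as the shell never have to deal with it.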

4) mapReduce() runs correctly on string data that contains a correct UTF-16 code pair, but it generates an error on string data that contains a broken code pair.

It appears that mapReduce() fails at the point where MongoDB converts the BSON to a JavaScript variable for use by the JavaScript engine.
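As a loose analogy for that conversion step (not MongoDB's actual code), the sketch below strictly decodes stored UTF-8 bytes and re-expresses them as UTF-16 code units, which is the form a JavaScript engine works with. The function name utf8_to_utf16_units and the sample byte strings are assumptions for illustration; the second sample is the three-byte sequence you get when the lone surrogate 0xD83D is forced into a UTF-8-like encoding.

```python
import struct

def utf8_to_utf16_units(raw):
    """Strictly decode UTF-8 bytes, then return the UTF-16 code units."""
    text = raw.decode("utf-8")  # raises UnicodeDecodeError on ill-formed input
    encoded = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(encoded) // 2), encoded))

well_formed = b"\xf0\x9f\x98\x80"  # U+1F600 as proper UTF-8
ill_formed = b"\xed\xa0\xbd"       # lone surrogate 0xD83D in UTF-8-like bytes

units = utf8_to_utf16_units(well_formed)

try:
    utf8_to_utf16_units(ill_formed)
    conversion_failed = False
except UnicodeDecodeError:
    conversion_failed = True

print([hex(u) for u in units], conversion_failed)
```

The well-formed value converts to a complete surrogate pair, while the ill-formed one is rejected by the strict decoder, which is consistent with the pattern described above: correct pairs work, broken pairs raise an error during conversion.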

5) I've filed Jira issue SERVER-6747 for this problem. Feel free to follow it and vote it up.