I need to filter out illegal Unicode characters from a string, as outlined in a guide for preparing data for Amazon CloudSearch.
Both JSON and XML batches can only contain UTF-8 characters that are valid in
XML. Valid characters are the control characters tab (0009), carriage return
(000D), and line feed (000A), and the legal characters of Unicode and ISO/IEC
10646. FFFE, FFFF, and the surrogate blocks D800–DBFF and DC00–DFFF are
invalid and will cause errors. (For more information, see Extensible Markup
Language (XML) 1.0 (Fifth Edition).)
You can use the following regular expression to match invalid characters
so you can remove them: /[^\u0009\u000a\u000d\u0020-\uD7FF\uE000-\uFFFD]/ .
I am trying to write tests for the success and failure cases, but I am having trouble writing Unicode characters that are in the prohibited range.
Edit2: JavaScript is the language I am trying to write the tests in.
Edit1: Link for Amazon Cloudsearch documentation: http://docs.aws.amazon.com/cloudsearch/latest/developerguide/preparing-data.html
In JavaScript you can use Unicode escape sequences to produce those invalid characters as strings, like so: "\uFFFE", "\uFFFF", "\uD800", and so on. Beware, though: "\uD83C\uDF4C" is a JavaScript string that represents "🍌", the banana character, Unicode code point 1F34C. What the Amazon API forbids are lone surrogates directly encoded in UTF-8. The banana character (1F34C) encoded as UTF-8 is valid (as bytes F0 9F 8D 8C), and therefore that surrogate pair is valid. What would be invalid would be the UTF-8 encoding of D83C itself, i.e., the bytes ED A0 BC.
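To make this concrete, here is a minimal test sketch rather than a definitive implementation. The helper names strip and utf8Hex are hypothetical, and CODE_POINT_INVALID is an assumed variant of the guide's regex, not something taken from the Amazon documentation. One thing to watch for: JavaScript strings are sequences of UTF-16 code units, so the guide's regex as quoted also removes both halves of a well-formed surrogate pair such as the banana. The variant uses the u flag and an extra range so that supplementary characters (which XML 1.0 allows) survive while lone surrogates are still removed.

```javascript
// Regex quoted from the guide; the `g` flag is added so replace() removes every match.
const GUIDE_INVALID = /[^\u0009\u000a\u000d\u0020-\uD7FF\uE000-\uFFFD]/g;

// Assumed variant: with the `u` flag the class works on code points, and the extra
// range keeps supplementary characters (U+10000 to U+10FFFF).
const CODE_POINT_INVALID =
  /[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\u{10000}-\u{10FFFF}]/gu;

// Hypothetical helper, not part of any AWS SDK: remove everything the pattern matches.
function strip(text, pattern) {
  return text.replace(pattern, "");
}

// Hypothetical helper: the UTF-8 bytes of a string as uppercase hex, e.g. "F0 9F 8D 8C".
function utf8Hex(text) {
  return Array.from(new TextEncoder().encode(text), (b) =>
    b.toString(16).toUpperCase().padStart(2, "0")
  ).join(" ");
}

// Failure cases: each of these should be removed by both patterns.
const forbidden = ["\uFFFE", "\uFFFF", "\uD800", "\uDBFF", "\uDC00", "\uDFFF", "\u0000"];
// Success cases: each of these should survive both patterns.
const allowed = ["\u0009", "\u000A", "\u000D", " ", "\uD7FF", "\uE000", "\uFFFD"];
// U+1F34C written as a well-formed surrogate pair.
const banana = "\uD83C\uDF4C";

for (const ch of forbidden) {
  console.assert(strip("a" + ch + "b", GUIDE_INVALID) === "ab", "guide regex kept a forbidden character");
  console.assert(strip("a" + ch + "b", CODE_POINT_INVALID) === "ab", "variant kept a forbidden character");
}
for (const ch of allowed) {
  console.assert(strip(ch, GUIDE_INVALID) === ch, "guide regex removed an allowed character");
  console.assert(strip(ch, CODE_POINT_INVALID) === ch, "variant removed an allowed character");
}

// The guide's regex sees two surrogate code units and removes both;
// the code-point variant sees one supplementary character and keeps it.
console.assert(strip(banana, GUIDE_INVALID) === "", "expected the pair to be stripped");
console.assert(strip(banana, CODE_POINT_INVALID) === banana, "expected the pair to be kept");

// The pair encodes to the four valid UTF-8 bytes mentioned above.
console.assert(utf8Hex(banana) === "F0 9F 8D 8C", "unexpected UTF-8 bytes for the banana");
// TextEncoder never emits ED A0 BC: a lone surrogate is replaced with U+FFFD first.
console.assert(utf8Hex("\uD83C") === "EF BF BD", "unexpected UTF-8 bytes for a lone surrogate");
```

The last assertion shows why the bytes ED A0 BC are hard to produce by accident from JavaScript: the standard TextEncoder substitutes U+FFFD for a lone surrogate before encoding. Which of the two patterns is appropriate depends on whether supplementary characters should be kept in the data sent to the service, which is worth confirming against the service itself.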