Test for filtering illegal characters from a string

I need to filter out illegal Unicode characters from a string, as outlined in a guide for preparing data for Amazon CloudSearch:

Both JSON and XML batches can only contain UTF-8 characters that are valid in 
XML. Valid characters are the control characters tab (0009), carriage return 
(000D), and line feed (000A), and the legal characters of Unicode and ISO/IEC 
10646. FFFE, FFFF, and the surrogate blocks D800–DBFF and DC00–DFFF are 
invalid and will cause errors. (For more information, see Extensible Markup 
Language (XML) 1.0 (Fifth Edition).) 

You can use the following regular expression to match invalid characters 
so you can remove them: /[^\u0009\u000a\u000d\u0020-\uD7FF\uE000-\uFFFD]/ .
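As a rough sketch of how that expression might be applied in JavaScript (the helper name `stripInvalidXmlChars` is made up for illustration and is not part of any Amazon SDK; the `g` flag is added so that every match is removed, not just the first):

```javascript
// Character class quoted from the guide, plus the `g` flag (an addition here)
// so that replace() removes every match instead of only the first one.
const INVALID_XML_CHARS = /[^\u0009\u000a\u000d\u0020-\uD7FF\uE000-\uFFFD]/g;

// Hypothetical helper: returns the input with all matching characters removed.
function stripInvalidXmlChars(text) {
  return text.replace(INVALID_XML_CHARS, "");
}

// NUL and U+FFFE are removed; the tab is kept.
console.log(JSON.stringify(stripInvalidXmlChars("a\u0000b\uFFFEc\tD"))); // "abc\tD"
```

Note that without the `u` flag a JavaScript regular expression tests UTF-16 code units one at a time, so this expression also removes both halves of a valid surrogate pair (that is, any character above U+FFFF).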

I am trying to write tests for the success and failure cases, but I am having trouble writing Unicode characters that are in the prohibited ranges.

Edit2: JavaScript is the language I am trying to write the tests in.

Edit1: Link to the Amazon CloudSearch documentation: http://docs.aws.amazon.com/cloudsearch/latest/developerguide/preparing-data.html

There is 1 answer

Answer by R. Martinho Fernandes

In JavaScript you can use Unicode escape sequences to produce those invalid characters as strings, like so: "\uFFFE", "\uFFFF", "\uD800" and so on.

Beware, though: "\uD83C\uDF4C" is a JavaScript string that represents "🍌", the banana character, Unicode code point U+1F34C. What the Amazon API forbids are lone surrogates directly encoded in UTF-8. The banana character (U+1F34C) encoded as UTF-8 is valid (the bytes F0 9F 8D 8C), and therefore that surrogate pair is valid. What would be invalid is the UTF-8 encoding of D83C itself, i.e., the bytes ED A0 BC.