Test for filtering illegal characters from a string

Question

Asked by Koder on 10 June 2015 at 12:54

I need to filter out illegal Unicode characters from a string, as outlined in a guide for preparing data for Amazon CloudSearch:

Both JSON and XML batches can only contain UTF-8 characters that are valid in 
XML. Valid characters are the control characters tab (0009), carriage return 
(000D), and line feed (000A), and the legal characters of Unicode and ISO/IEC 
10646. FFFE, FFFF, and the surrogate blocks D800–DBFF and DC00–DFFF are 
invalid and will cause errors. (For more information, see Extensible Markup 
Language (XML) 1.0 (Fifth Edition).) 

You can use the following regular expression to match invalid characters 
so you can remove them: /[^\u0009\u000a\u000d\u0020-\uD7FF\uE000-\uFFFD]/ .

I am trying to write tests for the success and failure cases, but I am having trouble writing Unicode characters that are in the prohibited range.

Edit2: JavaScript is the language I am trying to write the tests in.

Edit1: Link to the Amazon CloudSearch documentation: http://docs.aws.amazon.com/cloudsearch/latest/developerguide/preparing-data.html

Original Q&A

There is 1 answer

**R. Martinho Fernandes** · Answer 1 · 2015-06-10T13:26:10+00:00

In JavaScript you can use Unicode escape sequences to produce those invalid characters as strings, like so: "\uFFFE", "\uFFFF", "\uD800" and so on.

Beware, though: "\uD83C\uDF4C" is a JavaScript string that represents "🍌", the banana character, Unicode code point 1F34C. What the Amazon API forbids are lone surrogates directly encoded in UTF-8. The banana character (1F34C) encoded as UTF-8 is valid (as bytes F0 9F 8D 8C), and therefore that surrogate pair is valid. What would be invalid would be the UTF-8 encoding of D83C itself, i.e., the bytes ED A0 BC.
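As a rough illustration of the points above, here is a minimal sketch of test inputs built from escape sequences. The names `stripInvalid`, `amazonRegex`, `codePointRegex`, and `utf8Hex` are hypothetical and not part of any Amazon API. One thing to watch for: the quoted regex, used as-is in JavaScript, works on UTF-16 code units and therefore also strips both halves of a valid surrogate pair. The variant with the `u` flag is an assumption on my part, based on the XML 1.0 `Char` production, which additionally allows 10000 to 10FFFF.

```javascript
// Regex quoted in the question, plus the `g` flag so replace() removes every match.
// Without the `u` flag it matches UTF-16 code units, so paired surrogates are stripped too.
const amazonRegex = /[^\u0009\u000a\u000d\u0020-\uD7FF\uE000-\uFFFD]/g;

// Assumed variant: the `u` flag makes the regex match whole code points, and the
// extra range keeps astral characters (valid surrogate pairs) such as U+1F34C.
const codePointRegex =
  /[^\u0009\u000a\u000d\u0020-\uD7FF\uE000-\uFFFD\u{10000}-\u{10FFFF}]/gu;

// Hypothetical helper: remove everything the chosen pattern considers invalid.
function stripInvalid(text, pattern = codePointRegex) {
  return text.replace(pattern, "");
}

// Hypothetical helper: UTF-8 bytes of a string as lowercase hex.
// Note that TextEncoder replaces lone surrogates with U+FFFD (EF BF BD),
// so it never emits the invalid bytes ED A0 BC.
function utf8Hex(text) {
  return Array.from(new TextEncoder().encode(text), (b) =>
    b.toString(16).padStart(2, "0")
  ).join(" ");
}

// Failure cases: each of these should be stripped.
console.log(stripInvalid("a\uFFFEb") === "ab"); // noncharacter FFFE
console.log(stripInvalid("a\uFFFFb") === "ab"); // noncharacter FFFF
console.log(stripInvalid("a\uD800b") === "ab"); // lone high surrogate
console.log(stripInvalid("a\uDC00b") === "ab"); // lone low surrogate
console.log(stripInvalid("a\u0000b") === "ab"); // disallowed control character

// Success cases: these should pass through unchanged.
console.log(stripInvalid("tab\tCR\rLF\n") === "tab\tCR\rLF\n");
console.log(stripInvalid("\uD83C\uDF4C") === "\uD83C\uDF4C"); // banana, valid pair

// The regex exactly as quoted removes the banana as well.
console.log(stripInvalid("\uD83C\uDF4C", amazonRegex) === "");

console.log(utf8Hex("\uD83C\uDF4C")); // f0 9f 8d 8c
```

Whether you want the code-unit or the code-point behaviour depends on whether your documents may contain characters outside the Basic Multilingual Plane; the tests should pin down whichever one you choose.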
