Case insensitive search in Unicode in C++ on Windows


I asked a similar question yesterday, but I realize that I need to rephrase it.

In short: in C++ on Windows, how do I do a case-insensitive search for a string inside another string when the strings are in Unicode (wide char, wchar_t) and I don't know their language? I just want to know whether the needle exists in the haystack; its location isn't relevant to me.

Background: I have a repository containing a lot of email bodies. The messages are in different languages (Japanese, German, Russian, Finnish; you name it). All the data is in Unicode, and I load it into wide strings (wchar_t) in my C++ application (the bodies have been MIME decoded, so in my debugger I can see the actual Japanese or German characters). I don't know the language of the messages, since email messages don't contain that detail; a single email body may also contain characters from several languages.

I'm looking for something like wcsstr, but with the ability to search in a case-insensitive manner. I know that it's not possible to do a 100% proper conversion from upper case to lower case without knowing the language of the text. I want a solution that works in the 99% of cases where it is possible.

I'm using Visual Studio 2008 with C++, STL and Boost.

4

There are 4 answers

2
Ferruccio On BEST ANSWER

Boost String Algorithms has an icontains() function template which may do what you need.

0
Serge Wautier On

You could convert both needle and haystack to lowercase (or uppercase) and then call wcsstr().
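This idea could be sketched roughly as follows, using only the standard library. The helper names to_lower_copy and contains_lowercased are hypothetical. std::towlower follows the current C locale (see std::setlocale) and maps one wchar_t at a time, so it cannot handle mappings that change the string length, and on Windows it sees characters outside the Basic Multilingual Plane as two separate UTF-16 code units.

```cpp
#include <cwchar>
#include <cwctype>
#include <string>

// Hypothetical helper: returns a copy of `text` with every wchar_t
// lowercased by std::towlower (behaviour depends on the C locale).
inline std::wstring to_lower_copy(std::wstring text)
{
    for (std::wstring::size_type i = 0; i < text.size(); ++i) {
        text[i] = static_cast<wchar_t>(
            std::towlower(static_cast<std::wint_t>(text[i])));
    }
    return text;
}

// Hypothetical helper: lowercases both strings, then searches with wcsstr.
// wcsstr stops at the first L'\0', so embedded nulls end the search early.
inline bool contains_lowercased(const std::wstring& haystack,
                                const std::wstring& needle)
{
    const std::wstring h = to_lower_copy(haystack);
    const std::wstring n = to_lower_copy(needle);
    return std::wcsstr(h.c_str(), n.c_str()) != NULL;
}
```

Since the haystack is copied and converted on every call, it may be worth caching the lowercased bodies when the same messages are searched repeatedly.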

2
Mark Thornton On

You have to specify the language to do a case-insensitive comparison. For example, in Turkish, 'i' is NOT the lower case letter corresponding to 'I'. If no language appears to be specified, then the comparison is being done with an implicitly selected language.
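To make the Turkish example concrete, here is a small sketch of a per-character lowercase function with a language switch. The name lower_char and the turkic flag are hypothetical, chosen only for illustration. The two special mappings are the Turkish/Azeri rules from Unicode's SpecialCasing data; everything else defers to std::towlower.

```cpp
#include <cwctype>

// Hypothetical helper: lowercases one character.
// With turkic == true:  'I' (U+0049) -> dotless i (U+0131)
//                       dotted capital I (U+0130) -> 'i' (U+0069)
// Otherwise the result comes from std::towlower, which maps 'I' to 'i'.
inline wchar_t lower_char(wchar_t c, bool turkic)
{
    if (turkic) {
        if (c == L'I') {
            return static_cast<wchar_t>(0x0131);
        }
        if (c == static_cast<wchar_t>(0x0130)) {
            return L'i';
        }
    }
    return static_cast<wchar_t>(std::towlower(static_cast<std::wint_t>(c)));
}
```

The same input character therefore produces different lowercase results depending on the flag, which is why a search that never asks for the language is silently picking one set of rules.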

6
Michael Dillon On

You should use the ICU library, which supports Unicode regular expressions that follow the Unicode rules for case-insensitive matching. ICU is available for C/C++ and Java, and many other languages, such as Python, have wrappers for it.