RE2 and UTF16 (or UCS-2)


RE2 is great. Fast and deterministic.

However, it supports only UTF8. My strings are natively UTF16, and converting back and forth would kill performance.

How difficult would it be to implement native UTF16 capability in RE2?

How difficult would it be to implement native UCS-2 capability in RE2? (this should be easier)

That is, roughly how many hours would a typical programmer need to do this?

This has been bothering me for a couple of weeks, so I thought I would ask!


There are 2 answers

MustafaM (best answer)

Russ Cox, the creator of RE2, was kind enough to post the patch for UCS-2 support. Some assertions, however, are not supported for UCS-2. Reply from Russ is posted verbatim:

Hi. RE2 had a UCS-2 mode before I open sourced it, but it could not support assertions like ^, $, and \b, which limited its utility. If you don't need those operators, then it would probably work for you. I don't plan to re-add UCS-2 mode to the RE2 sources, but I did just publish the diff for the change that removed it. You should be able to reverse the diff in a local copy to get the UCS-2 support back. The file is ucs2.diff in the root of the Mercurial repository.

Enjoy.

Link to code: http://code.google.com/p/re2/source/list
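One caveat if you go the UCS-2 route: UCS-2 has no surrogate pairs, so it only covers the Basic Multilingual Plane. A quick way to see whether your UTF-16 data would survive that is to scan it for surrogate code units. The sketch below is only an illustration; `IsPlainUcs2` is a hypothetical helper name, not something from RE2 or from the patch.

```cpp
#include <string>

// Hypothetical helper (not part of RE2): returns true when the UTF-16
// string contains no surrogate code units (0xD800..0xDFFF), meaning every
// character fits in a single 16-bit unit and the data is also valid UCS-2.
bool IsPlainUcs2(const std::u16string& s) {
  for (char16_t unit : s) {
    if (unit >= 0xD800 && unit <= 0xDFFF) {
      return false;  // part of a surrogate pair (or a stray surrogate)
    }
  }
  return true;
}
```

If this returns false for real inputs, a UCS-2-only matcher would see each half of a surrogate pair as a separate character.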

tchrist

Have you asked Russ Cox what his opinion might be about the answer to your question? I bet it is much too long to contemplate.

I really think you are overvaluing the cost of converting from ugly UTF-16 into normal UTF-8, and undervaluing the cost of recoding a very highly tuned library.
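To make the conversion cost concrete, here is a minimal sketch of a single-pass UTF-16 to UTF-8 transcoder. It assumes the input is a `std::u16string` in native byte order and replaces unpaired surrogates with U+FFFD; `Utf16ToUtf8` is a hypothetical helper name, not an RE2 function. It is meant to show how little work is involved per code unit, not to serve as a tuned implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not part of RE2): transcodes UTF-16 to UTF-8 in one
// pass. Unpaired surrogates are replaced with U+FFFD (an assumption; other
// policies, such as rejecting the input, are equally reasonable).
std::string Utf16ToUtf8(const std::u16string& in) {
  std::string out;
  out.reserve(in.size() * 3);  // worst case: 3 UTF-8 bytes per code unit
  for (std::size_t i = 0; i < in.size(); ++i) {
    std::uint32_t cp = in[i];
    if (cp >= 0xD800 && cp <= 0xDBFF) {
      // High surrogate: combine with the following low surrogate, if any.
      if (i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
        cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
        ++i;
      } else {
        cp = 0xFFFD;
      }
    } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
      cp = 0xFFFD;  // stray low surrogate
    }
    if (cp < 0x80) {
      out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
      out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
      out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
      out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
      out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
      out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
      out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
      out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
      out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
      out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
  }
  return out;
}
```

The resulting `std::string` can be handed straight to RE2 (for example to `RE2::PartialMatch`). Whether this linear pass matters next to the matching itself is something to measure on your own data.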

Just bite the bullet and use UTF-8 like the rest of us.

I’m a big RE2 fan myself, but it never occurred to me to want to use it on UTF-16. UTF-16 is just not part of my world. Just like any other legacy encoding, anything we get in UTF-16 is immediately upgraded to UTF-8 so that the whole toolchain can work with it, because we run a pure-UTF8 toolchain.

Perhaps you live in the opposite world?