Making Data.Text.ICU.Convert.toUnicode report decoding failures

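{-# LANGUAGE OverloadedStrings #-}
import Data.Text.IO
import Data.Text.ICU.Convert
import Prelude hiding (putStrLn)

main :: IO ()
main = do
    -- Open a UTF-8 converter with the default fallback setting.
    conv <- open "utf8" Nothing
    -- The byte 0xFF can never appear in valid UTF-8.
    putStrLn $ toUnicode conv "h\xffzzah"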

This program attempts to decode an invalid UTF-8 string; it prints "h�zzah", the converter having replaced the invalid byte with U+FFFD REPLACEMENT CHARACTER. I would rather it threw an exception (say, Data.Text.ICU.Error.ICUError). Is there a way to make it do so, or to otherwise report that the decoding didn't actually succeed?

Alternatively, is there a different way of doing character decoding in Haskell which reports errors of this type?

There is 1 answer

jsalvata (best answer)

Beyond my comment above, here's a solution: count the number of occurrences of U+FFFD in the input UTF-8 byte stream (this is a safe operation because UTF-8 is substring-safe; see http://research.swtch.com/utf8), then count the occurrences in the converted string. If the counts differ, the extra replacement characters were introduced by your conversion, which means a decoding error occurred.
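
The counting check described above could be sketched roughly as follows. This is only an illustration under some assumptions: `decodeChecked`, `countSubstring` and `replacementBytes` are hypothetical names, not part of text-icu, and to keep the sketch runnable with only the bytestring and text packages it uses `decodeUtf8With lenientDecode` as a stand-in for `toUnicode conv`. Since the helper takes the decoder as an argument, `toUnicode conv` should be usable in its place.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf8With)
import Data.Text.Encoding.Error (lenientDecode)

-- UTF-8 encoding of U+FFFD REPLACEMENT CHARACTER.
replacementBytes :: B.ByteString
replacementBytes = B.pack [0xEF, 0xBF, 0xBD]

-- Count non-overlapping occurrences of a non-empty pattern.
countSubstring :: B.ByteString -> B.ByteString -> Int
countSubstring pat = go 0
  where
    go n bs =
      let (_, rest) = B.breakSubstring pat bs
      in if B.null rest
           then n
           else go (n + 1) (B.drop (B.length pat) rest)

-- Hypothetical helper: run a decoder that substitutes U+FFFD for
-- invalid input, then compare the U+FFFD counts before and after.
-- Returns Nothing if the decoder introduced replacement characters.
decodeChecked :: (B.ByteString -> T.Text) -> B.ByteString -> Maybe T.Text
decodeChecked decode input
  | before == after = Just output
  | otherwise       = Nothing
  where
    output = decode input
    before = countSubstring replacementBytes input
    after  = T.count (T.singleton '\xFFFD') output

main :: IO ()
main = do
  -- Assumption: stand-in for `toUnicode conv` from the question.
  let decode = decodeUtf8With lenientDecode
  -- "h\xffzzah": the invalid byte 0xFF adds a U+FFFD, so this is Nothing.
  print (decodeChecked decode (B.pack [0x68, 0xFF, 0x7A, 0x7A, 0x61, 0x68]))
  -- A U+FFFD that was already in the input does not count as an error.
  print (decodeChecked decode (B.pack [0x68, 0xEF, 0xBF, 0xBD, 0x7A]))
```

The check costs one extra pass over the input and one over the output. If only UTF-8 is needed, `Data.Text.Encoding.decodeUtf8'` from the text package may be simpler, since it returns `Left` with a `UnicodeException` on invalid input instead of substituting.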