{-# LANGUAGE OverloadedStrings #-}
import Data.Text.IO
import Data.Text.ICU.Convert
import Prelude hiding (putStrLn)
main = do
  conv <- open "utf8" Nothing
  putStrLn $ toUnicode conv "h\xffzzah"
This program attempts to decode an invalid UTF-8 string; it prints "h�zzah", the converter having replaced the invalid byte with U+FFFD REPLACEMENT CHARACTER. I would rather it threw an exception (say, Data.Text.ICU.Error.ICUError). Is there a way to make it do so, or to otherwise report that the decoding didn't actually succeed?
Alternatively, is there a different way of doing character decoding in Haskell which reports errors of this type?
Beyond my comment above, here's a solution: count the number of occurrences of U+FFFD in the input UTF-8 byte stream, i.e. of the byte sequence EF BF BD (this is a safe operation because UTF-8 is substring-safe; see http://research.swtch.com/utf8), then count the occurrences in the converted string. If the counts differ, the converter inserted replacement characters, so an encoding error occurred during your conversion.
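A minimal sketch of that counting check, assuming the converter substitutes U+FFFD for every invalid sequence, as it does in the question. The names replacementBytes, countOccurrences and decodeChecked are hypothetical helpers made up for illustration, not part of text-icu; only open and toUnicode come from Data.Text.ICU.Convert.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.IO as TIO
import Data.Text.ICU.Convert (Converter, open, toUnicode)

-- UTF-8 encoding of U+FFFD REPLACEMENT CHARACTER.
replacementBytes :: B.ByteString
replacementBytes = B.pack [0xEF, 0xBF, 0xBD]

-- Hypothetical helper: count non-overlapping occurrences of a
-- non-empty needle in a ByteString.
countOccurrences :: B.ByteString -> B.ByteString -> Int
countOccurrences needle haystack =
  case B.breakSubstring needle haystack of
    (_, rest)
      | B.null rest -> 0
      | otherwise   ->
          1 + countOccurrences needle (B.drop (B.length needle) rest)

-- Hypothetical helper: decode, then compare the number of U+FFFD
-- occurrences before and after. A mismatch means the converter
-- replaced at least one invalid sequence.
decodeChecked :: Converter -> B.ByteString -> Either String T.Text
decodeChecked conv bytes
  | before == after = Right decoded
  | otherwise       =
      Left ("converter inserted " ++ show (after - before)
            ++ " replacement character(s)")
  where
    decoded = toUnicode conv bytes
    before  = countOccurrences replacementBytes bytes
    after   = T.count (T.singleton '\xFFFD') decoded

main :: IO ()
main = do
  conv <- open "utf8" Nothing
  -- Same bytes as "h\xffzzah" in the question.
  case decodeChecked conv (B.pack [0x68, 0xFF, 0x7A, 0x7A, 0x61, 0x68]) of
    Right t  -> TIO.putStrLn t
    Left err -> putStrLn ("decode error: " ++ err)
```

The check returns an Either rather than throwing, so the caller decides whether a failed decode should become an exception. It decodes the whole input before detecting the error, which may matter for very large inputs.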