Reading Chinese text characters using iTextSharp in C#


I am using iTextSharp to read a PDF file. I can read the English text, but for Chinese text I get question marks. How can I read Chinese characters using iTextSharp?

    coverNoteFilePath = @"D:\Temp\cc8a12e6-399a-4146-81ac-e49eb67e7e1b\CoverNote.pdf";
    try
    {
        PdfReader reader = new PdfReader(coverNoteFilePath);

        for (int page = 1; page <= reader.NumberOfPages; page++)
        {
            ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy();
            String s = PdfTextExtractor.GetTextFromPage(reader, page, its);

            s = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(s)));
            coverNoteContent = coverNoteContent + s;
        }
        reader.Close();
        Response.Write(coverNoteContent);
    }

There is 1 answer

Phil Gan

Try replacing ASCIIEncoding with one of the other encoding classes (UTF8Encoding, for example). I imagine PDF documents record which encoding they use, so you might be able to find the correct one in the PdfReader object. It is worth checking.
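
As a rough sketch of where the encoding step could go wrong, assuming iTextSharp 5.x on the .NET Framework: GetTextFromPage already returns an ordinary .NET string, so converting it to bytes and back should not be necessary. In the question's code, ASCIIEncoding.Convert resolves to the static Encoding.Convert method, so the step most likely to produce question marks is Encoding.Default.GetBytes(s), which substitutes "?" for any character the machine's ANSI code page cannot represent. The class and method names below (PdfTextSketch, ExtractAllText) are made up for illustration and are not part of iTextSharp.

```csharp
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static class PdfTextSketch
{
    // Hypothetical helper: returns the text of every page, concatenated.
    // Same loop as in the question, minus the byte-level re-encoding step.
    public static string ExtractAllText(string pdfPath)
    {
        var content = new StringBuilder();
        PdfReader reader = new PdfReader(pdfPath);
        try
        {
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                // A fresh strategy per page, as in the question's loop.
                ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
                string pageText = PdfTextExtractor.GetTextFromPage(reader, page, strategy);

                // pageText is already a Unicode string; append it untouched.
                content.Append(pageText);
            }
        }
        finally
        {
            // Release the file even if extraction throws.
            reader.Close();
        }
        return content.ToString();
    }
}
```

If the string returned by GetTextFromPage already contains question marks, the problem lies before this code runs (for example in how the PDF's fonts map glyphs to Unicode), and this sketch does not address that case.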

From MSDN:

ASCIIEncoding corresponds to the Windows code page 20127. Because ASCII is a 7-bit encoding, ASCII characters are limited to the lowest 128 Unicode characters, from U+0000 to U+007F. If you use the default encoder returned by the Encoding.ASCII property or the ASCIIEncoding constructor, characters outside that range are replaced with a question mark (?) before the encoding operation is performed. Because the ASCIIEncoding class supports only a limited character set, the UTF8Encoding, UnicodeEncoding, and UTF32Encoding classes are better suited for globalized applications.