Error trying to convert RTF to HTML using TIKA


I am new to Java and trying to do content format changes.

This is for email bodies and not attachments.

The following code handles the case where the body has been identified as RTF and needs to be converted to HTML:

    case "R":
        // Tika handler that serializes the parsed document as HTML
        ContentHandler handler = new ToHTMLContentHandler();
        AutoDetectParser parser = new AutoDetectParser();
        Metadata metadata = new Metadata();

        try {
            // Wrap the RTF text in a stream so Tika can detect and parse it
            InputStream stream = new ByteArrayInputStream(TEXT.getBytes(StandardCharsets.UTF_8));

            parser.parse(stream, handler, metadata);
            TEXT = handler.toString();
            System.out.println("temp = " + TEXT);
        } catch (Exception f) {
            f.printStackTrace();
            RCOut[0] = "42";
        }

        // setContent takes a MIME type as its second argument (assuming tp is a JavaMail part)
        tp.setContent(TEXT, "text/html; charset=UTF-8");
        message.setHeader("Content-Type", "text/html; charset=UTF-8");
        tp.setHeader("Content-Type", "text/html; charset=UTF-8");
        message.setContent(TEXT, "text/html; charset=UTF-8");
        break;
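
The branch above only runs once the body has been identified as RTF. A minimal sketch of such a check, assuming the body is available as a String and relying only on the fact that an RTF document starts with `{\rtf`. The names `RtfDetect` and `looksLikeRtf` are hypothetical and are not taken from the code above:

```java
final class RtfDetect {
    private RtfDetect() {}

    /**
     * Returns true when the text begins with the RTF signature,
     * an opening brace followed by the control word "rtf".
     * Leading whitespace is ignored; null is treated as "not RTF".
     */
    static boolean looksLikeRtf(String body) {
        if (body == null) {
            return false;
        }
        // In a Java literal, "\\" is one backslash at run time
        return body.trim().startsWith("{\\rtf");
    }

    public static void main(String[] args) {
        System.out.println(looksLikeRtf("{\\rtf1\\ansi Hello}"));
        System.out.println(looksLikeRtf("<html><body>Hello</body></html>"));
    }
}
```

This is only a signature check; it does not validate that the rest of the body is well-formed RTF.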

I get an HTML document back, but the output does not seem to reflect the RTF formatting (for example, the red text and the separate paragraphs).

The input is a string that looks like this:

    {\\rtf1\\ansi\\deff0 {\\fonttbl {\\f0 Courier;}}\r\n {\\colortbl;\\red0\\green0\\blue0;\\red255\\green0\\blue0;}\r\n This line is the default color\\line\r\n \\cf2\r\n \\tab This line is red and has a tab before it\\line\r\n \\cf1\r\n \\page This line is the default color and the first line on page 2\r\n}
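
The doubled backslashes in the sample are how it is written as a Java string literal; at run time each `\\` is a single backslash, which is what RTF control words such as `\rtf1` and `\cf2` require. If the text instead reaches the parser with two real backslashes per control word (for example because it was escaped twice somewhere upstream), those control words would no longer be valid RTF. This may not be the cause here, but it is cheap to rule out. A small sketch, where `RtfEscapeCheck` and its methods are hypothetical names:

```java
final class RtfEscapeCheck {
    private RtfEscapeCheck() {}

    /** Counts the backslash characters actually present in the run-time string. */
    static int countBackslashes(String s) {
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == '\\') {
                n++;
            }
        }
        return n;
    }

    /** True when the run-time text starts with a brace and TWO real backslashes before "rtf". */
    static boolean isDoubleEscaped(String s) {
        // Four backslashes in the literal are two backslashes at run time
        return s != null && s.startsWith("{\\\\rtf");
    }

    public static void main(String[] args) {
        // Written as in the question: one real backslash per control word
        String fromLiteral = "{\\rtf1\\ansi\\deff0 {\\fonttbl {\\f0 Courier;}}}";
        // What the text would look like if it had been escaped twice
        String escapedTwice = "{\\\\rtf1\\\\ansi";

        System.out.println(isDoubleEscaped(fromLiteral));
        System.out.println(isDoubleEscaped(escapedTwice));
        System.out.println(countBackslashes(fromLiteral));
    }
}
```

Printing the first few characters of `TEXT` just before the `parser.parse(...)` call would show which of the two forms is actually being parsed.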

I would have expected something like:

    <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    <meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.DefaultParser">
    <meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.microsoft.rtf.RTFParser">
    <meta name="Content-Type" content="application/rtf">
    <title></title>
    </head>
    <body><p>This line is the default color </p>
    <p style="color:red;">This line is red and has a tab before it </p>

    <p>This line is the default color and the first line on page 2</p>
    </body></html>

What I get is:

    <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    <meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.DefaultParser">
    <meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.microsoft.rtf.RTFParser">
    <meta name="Content-Type" content="application/rtf">
    <title></title>
    </head>
    <body><p>This line is the default color
            This line is red and has a tab before it

    This line is the default color and the first line on page 2</p>
    </body></html>

Any help would be appreciated.

Thanks!
