LibreOffice (via jodconverter) converts some RTF files to a PDF that shows the raw RTF markup


I'm using jodconverter:

import org.jodconverter.core.document.DefaultDocumentFormatRegistry;
import org.jodconverter.core.office.OfficeException;
import org.jodconverter.core.office.OfficeUtils;
import org.jodconverter.local.JodConverter;
import org.jodconverter.local.office.LocalOfficeManager;

It uses LibreOffice to convert RTF files to PDF, and it converts the majority of them correctly. I'll detail every step of the conversion:

First I encode the RTF file (held as a string) to Base64, because the original application invokes another one where the actual conversion happens:

    // Encode the RTF string to Base64
    // (getBytes() without an argument uses the platform default charset)
    String base64Document = DatatypeConverter
                            .printBase64Binary(stDocument.getBytes());
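Since the RTF text crosses two applications as Base64, it can help to confirm that the bytes survive the round trip unchanged. The following is only a minimal sketch using `java.util.Base64`: the class `RtfBase64` and its methods are hypothetical names, and UTF-8 is an assumption (an RTF file with raw 8-bit characters may need the code page declared in its own header instead).

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper, not part of the original application.
final class RtfBase64 {

    // Encode the RTF text using an explicit charset (UTF-8 is an assumption)
    // so that the caller and the converter agree on the exact bytes.
    static String encode(String rtf) {
        return Base64.getEncoder().encodeToString(rtf.getBytes(StandardCharsets.UTF_8));
    }

    // Decode the Base64 payload back to text with the same charset.
    static String decode(String base64) {
        return new String(Base64.getDecoder().decode(base64), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String rtf = "{\\rtf1\\ansi Hello}";
        String encoded = encode(rtf);
        System.out.println(encoded);
        // Prints whether the round trip preserved the text.
        System.out.println(decode(encoded).equals(rtf));
    }
}
```

Naming the charset on both sides removes the dependency on each machine's default encoding, which is one of the few things that can differ silently between the two applications.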

Then the invoked application converts the string to PDF:

    try {
        // Start a local LibreOffice instance managed by jodconverter
        officeManager = LocalOfficeManager.builder()
                .officeHome(LIBRE_OFFICE_PATH)
                .install()
                .build();
        officeManager.start();

        // Decode the Base64 payload back into the RTF bytes
        byte[] inputBytes = Base64.getDecoder().decode(fileInBase64);
        ByteArrayInputStream input = new ByteArrayInputStream(inputBytes);

        // Convert to PDF; note that the source format is not declared here
        ByteArrayOutputStream output = new ByteArrayOutputStream();
        JodConverter
                .convert(input)
                .to(output)
                .as(DefaultDocumentFormatRegistry.PDF)
                .execute();

        // Return the resulting PDF as Base64
        pdfEnBase64 = Base64.getEncoder().encodeToString(output.toByteArray());

    } catch (OfficeException oe) {

Then the calling application decodes the Base64 result back to bytes:

    // Decode the Base64 PDF into a byte array
    byte[] byteDocument = Base64.decodeBase64(
            stDocumentPdf.getBytes(java.nio.charset.StandardCharsets.UTF_8));

Then it downloads it in the browser:

    var uintBytes = new Uint8Array(fileContent);
    var blob = new Blob([uintBytes], { type: 'application/pdf' });
    var link = document.createElement('a');
    link.href = window.URL.createObjectURL(blob);
    link.download = fileName;
    document.body.appendChild(link);
    link.click();
    document.body.removeChild(link);

But some files are converted to a PDF whose pages show the raw RTF markup. This is what the converted file looks like when I open it in a PDF reader; it is as if jodconverter had copied and pasted the source of the RTF file into the PDF. I can't figure out why:

[Screenshot: the converted PDF, showing raw RTF markup as text]

As seen, Adobe PDF Reader considers it a legitimate PDF file, which is why it opens it without any problem.

There is no error whatsoever. A helping hand would be nice.

EDIT:

I just ran some tests, and many RTF documents convert correctly. I looked at the ones that convert badly, as described in the question, and saw that removing the document header solves the problem. I'm investigating why the header causes the conversion to fail.
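While investigating, one inexpensive check is whether the decoded input really begins with the RTF signature `{\rtf`, and whether the output begins with `%PDF-`. This is only a diagnostic sketch: `PayloadSniffer` and its methods are hypothetical names, and the idea that LibreOffice falls back to treating an unrecognized stream as plain text is an assumption, not something confirmed here.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical diagnostic helper, not part of jodconverter or the original code.
final class PayloadSniffer {

    // An RTF file starts with "{\rtf"; a PDF file starts with "%PDF-".
    private static final byte[] RTF_MAGIC = "{\\rtf".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // True when data begins with exactly the bytes in prefix.
    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data == null || data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    static boolean looksLikeRtf(byte[] data) {
        return startsWith(data, RTF_MAGIC);
    }

    static boolean looksLikePdf(byte[] data) {
        return startsWith(data, PDF_MAGIC);
    }

    public static void main(String[] args) {
        byte[] sample = "{\\rtf1\\ansi Hello}".getBytes(StandardCharsets.US_ASCII);
        System.out.println("RTF signature present: " + looksLikeRtf(sample));
    }
}
```

Calling `looksLikeRtf(inputBytes)` just before the conversion would show whether the failing documents carry anything (a byte order mark, whitespace, other content) ahead of `{\rtf`. If the signature is present and the result is still wrong, it may be worth declaring the source format explicitly with `.as(DefaultDocumentFormatRegistry.RTF)` right after `convert(input)`, since a stream carries no file extension to hint at its type.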
