Manipulating ODT documents with PHP (basic search and replace)


With LibreOffice, I have designed and written a text document (ODT format). Now I want to find certain placeholders programmatically and replace them with text from a database.

I know there are some ODT libraries for PHP, but as ODT files are just ZIP files containing XML files (among others), I think this should be possible with basic PHP and without any libraries, shouldn't it?

So I've written a short script which unzips the ODT file, modifies the content.xml and then zips the folder again. You can see the full code below.

While I can do the unzip, replace, zip manually, it does not work when I let the PHP script below do the work. LibreOffice will tell me that it cannot open the document and that it could try to repair it (which does not work, either).

Are there any special requirements that I need to pay attention to? Do I have to modify any meta files apart from the content.xml?

if (unzipFolder('Template.odt', 'temp')) {
    $source = file_get_contents('temp'.DIRECTORY_SEPARATOR.'content.xml');
    $source = str_replace('XXXplaceholder1XXX', 'Example Value #1', $source);
    $source = str_replace('XXXplaceholder2XXX', 'Example Value #2', $source);
    file_put_contents('temp'.DIRECTORY_SEPARATOR.'content.xml', $source);

    zipFolder('temp', 'output/Document.odt');
}

function unzipFolder($zipInputFile, $outputFolder) {
    $zip = new ZipArchive;
    $res = $zip->open($zipInputFile);
    if ($res === true) {
        $zip->extractTo($outputFolder);
        $zip->close();
        return true;
    }
    else {
        return false;
    }
}

function zipFolder($inputFolder, $zipOutputFile) {
    if (!extension_loaded('zip') || !file_exists($inputFolder)) {
        return false;
    }

    $zip = new ZipArchive();
    if (!$zip->open($zipOutputFile, ZIPARCHIVE::CREATE)) {
        return false;
    }

    $inputFolder = str_replace('\\', DIRECTORY_SEPARATOR, realpath($inputFolder));

    if (is_dir($inputFolder) === true) {
        $files = new RecursiveIteratorIterator(new RecursiveDirectoryIterator($inputFolder), RecursiveIteratorIterator::SELF_FIRST);

        foreach ($files as $file) {
            $file = str_replace('\\', DIRECTORY_SEPARATOR, $file);

            if (in_array(substr($file, strrpos($file, '/')+1), array('.', '..'))) {
                continue;
            }

            $file = realpath($file);

            if (is_dir($file) === true) {
                $dirName = str_replace($inputFolder.DIRECTORY_SEPARATOR, '', $file.DIRECTORY_SEPARATOR);
                $zip->addEmptyDir($dirName);
            }
            else if (is_file($file) === true) {
                $fileName = str_replace($inputFolder.DIRECTORY_SEPARATOR, '', $file);
                $zip->addFromString($fileName, file_get_contents($file));
            }
        }
    }
    else if (is_file($inputFolder) === true) {
        $zip->addFromString(basename($inputFolder), file_get_contents($inputFolder));
    }

    return $zip->close();
}

Edit #1: The code above does not even work if you just unzip and re-zip the contents of the ODT file, i.e. if you comment out all the data manipulation. Is something wrong with the format of PHP's ZipArchive output?

Edit #2: More specifically, it is the zipFolder(...) method that breaks everything. You can let PHP do the unzipping, the string manipulation works fine as well (str_replace(...)), but when the zipFolder(...) function creates the archive, it cannot be opened, while it works fine if you create the archive manually (with 7-Zip, e.g.).

Edit #3: I even got it working just by replacing the re-zipping part in PHP with a call to 7-Zip via exec(...). So the problem is definitely creating a proper ZIP archive here. For better portability and fewer dependencies, it would be better, of course, if the solution with PHP's ZipArchive worked and we didn't need 7-Zip.

Best answer, by user555:

There are a number of problems with your zipFolder() function that break the .odt file. The file loader used in LibreOffice is not very forgiving. This might also apply to OpenOffice, since the former is a fork of the latter.

Thanks to PHP bug report #48763 I've managed to narrow down the problem. The bug report mostly deals with a problem in ZipArchive::addFromString(), which was fixed in PHP 5.2.11. However, user "Lars" gives an insight into a limitation of the LibreOffice file loader:

"When using windows filesystem separators the .ods zip archive is broken, even though extracting the archive is working."

1. "." and ".." are still included in the archive

You have an if statement like this:

if (in_array(substr($file, strrpos($file, '/')+1), array('.', '..'))) {
    continue;
}

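I assume the intention was to filter out . and .., but it doesn't do the trick. On Windows the paths contain backslashes, so strrpos($file, '/') finds nothing and returns false, the substr() call then returns almost the whole path, and the comparison with '.' and '..' never matches. Since .. is therefore still included, and realpath() resolves it to the parent directory, you end up adding entries from outside the folder and break the .odt file.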
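One way to avoid the manual check altogether is to let the iterator drop the dot entries. The following is a minimal sketch, assuming the FilesystemIterator::SKIP_DOTS flag is acceptable for your setup; listEntries() is a hypothetical helper used only for illustration and is not part of the original script.

```php
<?php
// Hypothetical helper: collect the names of all files and directories
// below $folder, without the "." and ".." entries.
function listEntries(string $folder): array
{
    $iterator = new RecursiveIteratorIterator(
        // SKIP_DOTS tells the directory iterator to leave out "." and "..".
        new RecursiveDirectoryIterator($folder, FilesystemIterator::SKIP_DOTS),
        RecursiveIteratorIterator::SELF_FIRST
    );

    $entries = [];
    foreach ($iterator as $file) {
        // $file is an SplFileInfo; getFilename() is the last path segment.
        $entries[] = $file->getFilename();
    }
    return $entries;
}
```

With this flag there is no need to inspect the path string at all, so the check no longer depends on which directory separator the platform uses.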

2. All directory separators must be forward slashes (unix style)

On Windows, the directory separator is the backslash (\). This explains why your script works on Linux (as tested by user CrazySabbath) but not on Windows (XAMPP). As the bug report mentioned above points out, you must use forward slashes (/) as directory separators for LibreOffice to open your files.

Note also that realpath() on Windows converts Unix-style paths to Windows-style ones, so any conversion to forward slashes has to happen after the realpath() call.

The ZIP file format specification states that all slashes MUST be forward slashes; however, ZipArchive lets you violate the standard because it does not do the conversion for you.

4.4.17.1 The name of the file, with optional relative path. The path stored MUST not contain a drive or device letter, or a leading slash. All slashes MUST be forward slashes '/' as opposed to backwards slashes '\' for compatibility with Amiga and UNIX file systems etc.
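To illustrate the rule quoted above, here is a small sketch that turns a filesystem path into an archive entry name. zipEntryName() and the example paths are hypothetical; the sketch also assumes that no file name contains a literal backslash, which can legitimately happen on *nix but not on Windows.

```php
<?php
// Hypothetical helper: build the name stored in the archive from the
// absolute path of a file and the absolute path of the folder being zipped.
function zipEntryName(string $baseDir, string $path): string
{
    // Normalise both paths to forward slashes first, so the prefix
    // comparison behaves the same on Windows and on *nix.
    $baseDir = rtrim(str_replace('\\', '/', $baseDir), '/') . '/';
    $path    = str_replace('\\', '/', $path);

    // Strip the base folder; what remains is a relative path without a
    // drive letter or a leading slash.
    if (strncmp($path, $baseDir, strlen($baseDir)) === 0) {
        $path = substr($path, strlen($baseDir));
    }
    return ltrim($path, '/');
}

echo zipEntryName('C:\\xampp\\htdocs\\temp', 'C:\\xampp\\htdocs\\temp\\META-INF\\manifest.xml'), "\n";
// prints: META-INF/manifest.xml
```

If the path does not start with the base folder, this sketch only normalises the slashes, so a drive letter would survive; a real implementation would probably want to treat that case as an error.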

3. DIRECTORY_SEPARATOR is not necessary

Not a problem with your code, just a general tip. There is no need to use the constant DIRECTORY_SEPARATOR when building paths; simply use forward slashes (/), which work on *nix and Windows systems alike.

However, DIRECTORY_SEPARATOR is still useful for things like exploding or replacing a path.
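Putting the points above together, the re-zip step could look roughly like the sketch below. It is not a drop-in fix verified against LibreOffice, only an illustration under the assumption that the dot entries and the backslashes are what break the file. zipFolderPortable() is a hypothetical name, and the use of ZipArchive::OVERWRITE (so that an existing output file is replaced rather than extended) is my own choice, not something the original script did.

```php
<?php
// Sketch of a re-zip step that skips "." and ".." and stores every entry
// name with forward slashes. The function name is hypothetical.
function zipFolderPortable(string $inputFolder, string $zipOutputFile): bool
{
    $base = realpath($inputFolder);
    if (!extension_loaded('zip') || $base === false || !is_dir($base)) {
        return false;
    }

    $zip = new ZipArchive();
    // open() returns true on success and an integer error code on failure,
    // so compare strictly against true.
    if ($zip->open($zipOutputFile, ZipArchive::CREATE | ZipArchive::OVERWRITE) !== true) {
        return false;
    }

    $files = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($base, FilesystemIterator::SKIP_DOTS),
        RecursiveIteratorIterator::SELF_FIRST
    );

    foreach ($files as $file) {
        // Path relative to $base (base plus one separator removed),
        // converted to forward slashes after all filesystem calls are done.
        $name = str_replace('\\', '/', substr($file->getPathname(), strlen($base) + 1));

        if ($file->isDir()) {
            $zip->addEmptyDir($name);
        } elseif ($file->isFile()) {
            $zip->addFile($file->getPathname(), $name);
        }
    }

    return $zip->close();
}
```

It would be called in place of the original function, e.g. zipFolderPortable('temp', 'output/Document.odt'), assuming the output directory already exists.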