How do I extract the text content from a word document with PHP?

14.8k views Asked by At

I want to extract the text content from the word document with PHP.

I have created a new word document in Microsoft Word for Mac 2011. Edit: have also tested by creating the same document in Microsoft Word under Windows 7.

The contents of the document is

The quick brown fox jumps over the lazy dog

I have saved it to disk as a Word 97-2004 Document (.doc).

I'm using phpoffice/phpword and this code to extract the text:

<?php

$source = "word.doc";

$phpWord = \PhpOffice\PhpWord\IOFactory::load($source, 'MsDoc');

$text = '';

$sections = $phpWord->getSections();

foreach ($sections as $s) {
    $els = $s->getElements();
    foreach ($els as $e) {
        if (get_class($e) === 'PhpOffice\PhpWord\Element\Text') {
            $text .= $e->getText();
        } elseif (get_class($e) === 'PhpOffice\PhpWord\Section\TextBreak') {
            $text .= " \n";
        } else {
            throw new Exception('Unknown class type ' . get_class($e));
        }
    }
}

print $text;

The output of this code is only parts of the text:

The quick brown fox j

Is there a problem with the code, or is it some kind of compatibility issue?

Edit:

If I add a var_dump($els); before foreach ($els as $e) { the output is this:

array(1) {
  [0]=>
  object(PhpOffice\PhpWord\Element\Text)#1265 (14) {
    ["text":protected]=>
    string(21) "The quick brown fox j"
    ["fontStyle":protected]=>
    object(PhpOffice\PhpWord\Style\Font)#1267 (25) {
      ["aliases":protected]=>
      array(1) {
        ["line-height"]=>
        string(10) "lineHeight"
      }
      ["type":"PhpOffice\PhpWord\Style\Font":private]=>
      string(4) "text"
      ["name":"PhpOffice\PhpWord\Style\Font":private]=>
      NULL
      ["hint":"PhpOffice\PhpWord\Style\Font":private]=>
      NULL
      ["size":"PhpOffice\PhpWord\Style\Font":private]=>
      NULL
      ["color":"PhpOffice\PhpWord\Style\Font":private]=>
      NULL
      ["bold":"PhpOffice\PhpWord\Style\Font":private]=>
      bool(false)
      ["italic":"PhpOffice\PhpWord\Style\Font":private]=>
      bool(false)
      ["underline":"PhpOffice\PhpWord\Style\Font":private]=>
      string(4) "none"
      ["superScript":"PhpOffice\PhpWord\Style\Font":private]=>
      bool(false)
      ["subScript":"PhpOffice\PhpWord\Style\Font":private]=>
      bool(false)
      ["strikethrough":"PhpOffice\PhpWord\Style\Font":private]=>
      bool(false)
      ["doubleStrikethrough":"PhpOffice\PhpWord\Style\Font":private]=>
      bool(false)
      ["smallCaps":"PhpOffice\PhpWord\Style\Font":private]=>
      bool(false)
      ["allCaps":"PhpOffice\PhpWord\Style\Font":private]=>
      bool(false)
      ["fgColor":"PhpOffice\PhpWord\Style\Font":private]=>
      NULL
      ["scale":"PhpOffice\PhpWord\Style\Font":private]=>
      NULL
      ["spacing":"PhpOffice\PhpWord\Style\Font":private]=>
      NULL
      ["kerning":"PhpOffice\PhpWord\Style\Font":private]=>
      NULL
      ["paragraph":"PhpOffice\PhpWord\Style\Font":private]=>
      object(PhpOffice\PhpWord\Style\Paragraph)#1266 (26) {
        ["aliases":protected]=>
        array(1) {
          ["line-height"]=>
          string(10) "lineHeight"
        }
        ["basedOn":"PhpOffice\PhpWord\Style\Paragraph":private]=>
        string(6) "Normal"
        ["next":"PhpOffice\PhpWord\Style\Paragraph":private]=>
        NULL
        ["alignment":"PhpOffice\PhpWord\Style\Paragraph":private]=>
        string(0) ""
        ["indentation":"PhpOffice\PhpWord\Style\Paragraph":private]=>
        NULL
        ["spacing":"PhpOffice\PhpWord\Style\Paragraph":private]=>
        NULL
        ["lineHeight":"PhpOffice\PhpWord\Style\Paragraph":private]=>
        NULL
        ["widowControl":"PhpOffice\PhpWord\Style\Paragraph":private]=>
        bool(true)
        ["keepNext":"PhpOffice\PhpWord\Style\Paragraph":private]=>
        bool(false)
        ["keepLines":"PhpOffice\PhpWord\Style\Paragraph":private]=>
        bool(false)
        ["pageBreakBefore":"PhpOffice\PhpWord\Style\Paragraph":private]=>
        bool(false)
        ["numStyle":"PhpOffice\PhpWord\Style\Paragraph":private]=>
        NULL
        ["numLevel":"PhpOffice\PhpWord\Style\Paragraph":private]=>
        int(0)
        ["tabs":"PhpOffice\PhpWord\Style\Paragraph":private]=>
        array(0) {
        }
        ["shading":"PhpOffice\PhpWord\Style\Paragraph":private]=>
        NULL
        ["borderTopSize":protected]=>
        NULL
        ["borderTopColor":protected]=>
        NULL
        ["borderLeftSize":protected]=>
        NULL
        ["borderLeftColor":protected]=>
        NULL
        ["borderRightSize":protected]=>
        NULL
        ["borderRightColor":protected]=>
        NULL
        ["borderBottomSize":protected]=>
        NULL
        ["borderBottomColor":protected]=>
        NULL
        ["styleName":protected]=>
        NULL
        ["index":protected]=>
        NULL
        ["isAuto":"PhpOffice\PhpWord\Style\AbstractStyle":private]=>
        bool(false)
      }
      ["shading":"PhpOffice\PhpWord\Style\Font":private]=>
      NULL
      ["rtl":"PhpOffice\PhpWord\Style\Font":private]=>
      bool(false)
      ["styleName":protected]=>
      NULL
      ["index":protected]=>
      NULL
      ["isAuto":"PhpOffice\PhpWord\Style\AbstractStyle":private]=>
      bool(false)
    }
    ["paragraphStyle":protected]=>
    object(PhpOffice\PhpWord\Style\Paragraph)#1266 (26) {
      ["aliases":protected]=>
      array(1) {
        ["line-height"]=>
        string(10) "lineHeight"
      }
      ["basedOn":"PhpOffice\PhpWord\Style\Paragraph":private]=>
      string(6) "Normal"
      ["next":"PhpOffice\PhpWord\Style\Paragraph":private]=>
      NULL
      ["alignment":"PhpOffice\PhpWord\Style\Paragraph":private]=>
      string(0) ""
      ["indentation":"PhpOffice\PhpWord\Style\Paragraph":private]=>
      NULL
      ["spacing":"PhpOffice\PhpWord\Style\Paragraph":private]=>
      NULL
      ["lineHeight":"PhpOffice\PhpWord\Style\Paragraph":private]=>
      NULL
      ["widowControl":"PhpOffice\PhpWord\Style\Paragraph":private]=>
      bool(true)
      ["keepNext":"PhpOffice\PhpWord\Style\Paragraph":private]=>
      bool(false)
      ["keepLines":"PhpOffice\PhpWord\Style\Paragraph":private]=>
      bool(false)
      ["pageBreakBefore":"PhpOffice\PhpWord\Style\Paragraph":private]=>
      bool(false)
      ["numStyle":"PhpOffice\PhpWord\Style\Paragraph":private]=>
      NULL
      ["numLevel":"PhpOffice\PhpWord\Style\Paragraph":private]=>
      int(0)
      ["tabs":"PhpOffice\PhpWord\Style\Paragraph":private]=>
      array(0) {
      }
      ["shading":"PhpOffice\PhpWord\Style\Paragraph":private]=>
      NULL
      ["borderTopSize":protected]=>
      NULL
      ["borderTopColor":protected]=>
      NULL
      ["borderLeftSize":protected]=>
      NULL
      ["borderLeftColor":protected]=>
      NULL
      ["borderRightSize":protected]=>
      NULL
      ["borderRightColor":protected]=>
      NULL
      ["borderBottomSize":protected]=>
      NULL
      ["borderBottomColor":protected]=>
      NULL
      ["styleName":protected]=>
      NULL
      ["index":protected]=>
      NULL
      ["isAuto":"PhpOffice\PhpWord\Style\AbstractStyle":private]=>
      bool(false)
    }
    ["phpWord":protected]=>
    object(PhpOffice\PhpWord\PhpWord)#1247 (3) {
      ["sections":"PhpOffice\PhpWord\PhpWord":private]=>
      array(1) {
        [0]=>
        object(PhpOffice\PhpWord\Element\Section)#1261 (16) {
          ["container":protected]=>
          string(7) "Section"
          ["style":"PhpOffice\PhpWord\Element\Section":private]=>
          object(PhpOffice\PhpWord\Style\Section)#1262 (28) {
            ["orientation":"PhpOffice\PhpWord\Style\Section":private]=>
            string(8) "portrait"
            ["paper":"PhpOffice\PhpWord\Style\Section":private]=>
            object(PhpOffice\PhpWord\Style\Paper)#1263 (8) {
              ["sizes":"PhpOffice\PhpWord\Style\Paper":private]=>
              array(6) {
                ["A3"]=>
                array(3) {
                  [0]=>
                  int(297)
                  [1]=>
                  int(420)
                  [2]=>
                  string(2) "mm"
                }
                ["A4"]=>
                array(3) {
                  [0]=>
                  int(210)
                  [1]=>
                  int(297)
                  [2]=>
                  string(2) "mm"
                }
                ["A5"]=>
                array(3) {
                  [0]=>
                  int(148)
                  [1]=>
                  int(210)
                  [2]=>
                  string(2) "mm"
                }
                ["Folio"]=>
                array(3) {
                  [0]=>
                  float(8.5)
                  [1]=>
                  int(13)
                  [2]=>
                  string(2) "in"
                }
                ["Legal"]=>
                array(3) {
                  [0]=>
                  float(8.5)
                  [1]=>
                  int(14)
                  [2]=>
                  string(2) "in"
                }
                ["Letter"]=>
                array(3) {
                  [0]=>
                  float(8.5)
                  [1]=>
                  int(11)
                  [2]=>
                  string(2) "in"
                }
              }
              ["size":"PhpOffice\PhpWord\Style\Paper":private]=>
              string(2) "A4"
              ["width":"PhpOffice\PhpWord\Style\Paper":private]=>
              int(11870)
              ["height":"PhpOffice\PhpWord\Style\Paper":private]=>
              int(16787)
              ["styleName":protected]=>
              NULL
              ["index":protected]=>
              NULL
              ["aliases":protected]=>
              array(0) {
              }
              ["isAuto":"PhpOffice\PhpWord\Style\AbstractStyle":private]=>
              bool(false)
            }
            ["pageSizeW":"PhpOffice\PhpWord\Style\Section":private]=>
            int(11906)
            ["pageSizeH":"PhpOffice\PhpWord\Style\Section":private]=>
            int(16838)
            ["marginTop":"PhpOffice\PhpWord\Style\Section":private]=>
            int(1417)
            ["marginLeft":"PhpOffice\PhpWord\Style\Section":private]=>
            int(1417)
            ["marginRight":"PhpOffice\PhpWord\Style\Section":private]=>
            int(1417)
            ["marginBottom":"PhpOffice\PhpWord\Style\Section":private]=>
            int(1417)
            ["gutter":"PhpOffice\PhpWord\Style\Section":private]=>
            int(0)
            ["headerHeight":"PhpOffice\PhpWord\Style\Section":private]=>
            int(720)
            ["footerHeight":"PhpOffice\PhpWord\Style\Section":private]=>
            int(720)
            ["pageNumberingStart":"PhpOffice\PhpWord\Style\Section":private]=>
            NULL
            ["colsNum":"PhpOffice\PhpWord\Style\Section":private]=>
            int(1)
            ["colsSpace":"PhpOffice\PhpWord\Style\Section":private]=>
            int(720)
            ["breakType":"PhpOffice\PhpWord\Style\Section":private]=>
            NULL
            ["lineNumbering":"PhpOffice\PhpWord\Style\Section":private]=>
            NULL
            ["borderTopSize":protected]=>
            NULL
            ["borderTopColor":protected]=>
            NULL
            ["borderLeftSize":protected]=>
            NULL
            ["borderLeftColor":protected]=>
            NULL
            ["borderRightSize":protected]=>
            NULL
            ["borderRightColor":protected]=>
            NULL
            ["borderBottomSize":protected]=>
            NULL
            ["borderBottomColor":protected]=>
            NULL
            ["styleName":protected]=>
            NULL
            ["index":protected]=>
            NULL
            ["aliases":protected]=>
            array(0) {
            }
            ["isAuto":"PhpOffice\PhpWord\Style\AbstractStyle":private]=>
            bool(false)
          }
          ["headers":"PhpOffice\PhpWord\Element\Section":private]=>
          array(0) {
          }
          ["footers":"PhpOffice\PhpWord\Element\Section":private]=>
          array(0) {
          }
          ["elements":protected]=>
          array(1) {
            [0]=>
            *RECURSION*
          }
          ["phpWord":protected]=>
          *RECURSION*
          ["sectionId":protected]=>
          int(1)
          ["docPart":protected]=>
          string(7) "Section"
          ["docPartId":protected]=>
          int(1)
          ["elementIndex":protected]=>
          int(1)
          ["elementId":protected]=>
          NULL
          ["relationId":protected]=>
          NULL
          ["nestedLevel":"PhpOffice\PhpWord\Element\AbstractElement":private]=>
          int(0)
          ["parentContainer":"PhpOffice\PhpWord\Element\AbstractElement":private]=>
          NULL
          ["mediaRelation":protected]=>
          bool(false)
          ["collectionRelation":protected]=>
          bool(false)
        }
      }
      ["collections":"PhpOffice\PhpWord\PhpWord":private]=>
      array(5) {
        ["Bookmarks"]=>
        object(PhpOffice\PhpWord\Collection\Bookmarks)#1248 (1) {
          ["items":"PhpOffice\PhpWord\Collection\AbstractCollection":private]=>
          array(0) {
          }
        }
        ["Titles"]=>
        object(PhpOffice\PhpWord\Collection\Titles)#1249 (1) {
          ["items":"PhpOffice\PhpWord\Collection\AbstractCollection":private]=>
          array(0) {
          }
        }
        ["Footnotes"]=>
        object(PhpOffice\PhpWord\Collection\Footnotes)#1250 (1) {
          ["items":"PhpOffice\PhpWord\Collection\AbstractCollection":private]=>
          array(0) {
          }
        }
        ["Endnotes"]=>
        object(PhpOffice\PhpWord\Collection\Endnotes)#1251 (1) {
          ["items":"PhpOffice\PhpWord\Collection\AbstractCollection":private]=>
          array(0) {
          }
        }
        ["Charts"]=>
        object(PhpOffice\PhpWord\Collection\Charts)#1252 (1) {
          ["items":"PhpOffice\PhpWord\Collection\AbstractCollection":private]=>
          array(0) {
          }
        }
      }
      ["metadata":"PhpOffice\PhpWord\PhpWord":private]=>
      array(3) {
        ["DocInfo"]=>
        object(PhpOffice\PhpWord\Metadata\DocInfo)#1253 (12) {
          ["creator":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
          string(0) ""
          ["lastModifiedBy":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
          string(0) ""
          ["created":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
          int(1483515248)
          ["modified":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
          int(1483515248)
          ["title":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
          string(0) ""
          ["description":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
          string(0) ""
          ["subject":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
          string(0) ""
          ["keywords":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
          string(0) ""
          ["category":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
          string(0) ""
          ["company":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
          string(0) ""
          ["manager":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
          string(0) ""
          ["customProperties":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
          array(0) {
          }
        }
        ["Protection"]=>
        object(PhpOffice\PhpWord\Metadata\Protection)#1254 (1) {
          ["editing":"PhpOffice\PhpWord\Metadata\Protection":private]=>
          NULL
        }
        ["Compatibility"]=>
        object(PhpOffice\PhpWord\Metadata\Compatibility)#1255 (1) {
          ["ooxmlVersion":"PhpOffice\PhpWord\Metadata\Compatibility":private]=>
          int(12)
        }
      }
    }
    ["sectionId":protected]=>
    NULL
    ["docPart":protected]=>
    string(7) "Section"
    ["docPartId":protected]=>
    int(1)
    ["elementIndex":protected]=>
    int(1)
    ["elementId":protected]=>
    string(6) "5d531b"
    ["relationId":protected]=>
    NULL
    ["nestedLevel":"PhpOffice\PhpWord\Element\AbstractElement":private]=>
    int(0)
    ["parentContainer":"PhpOffice\PhpWord\Element\AbstractElement":private]=>
    string(7) "Section"
    ["mediaRelation":protected]=>
    bool(false)
    ["collectionRelation":protected]=>
    bool(false)
  }
}
3

There are 3 answers

2
infinigrove On

You can extract txt from a word document using catdoc http://www.wagner.pp.ru/~vitus/software/catdoc/

It can be installed on Ubuntu using

sudo apt-get install catdoc

Once you have catdoc working on your system you can call it from php using shell_exec()

<?php

$text = shell_exec('/(fullpath)/catdoc /(fullpath)/word.doc');

print $text;

?>

Be sure to substitute (fullpath) with the actual path to catdoc and your word doc.

EDIT ---- Addition

If you can save your files as .docx rather than .doc it is a little bit easier. You can use unzip rather than catdoc.

Simply replace:

$text = shell_exec('/(fullpath)/catdoc /(fullpath)/word.doc');

with

$text = shell_exec("/(fullpath)/unzip -p /(fullpath)/word.docx word/document.xml | sed -e 's/<[^>]\{1,\}>//g; s/[^[:print:]]\{1,\}//g'");

You could use this same technique with most other command line document to text converters. Just replace the command in the shell_exec() with the command that works on your system. You can check How to extract just plain text from .doc & .docx files? (unix) for other unix/linux alternatives

For other PHP alternatives check out How to extract text from word file .doc,docx,.xlsx,.pptx php

7
AudioBubble On

Try to create your reader before

$source = "word.doc";
// create your reader object
$phpWordReader = \PhpOffice\PhpWord\IOFactory::createReader('MsDoc');
// read source
if($phpWordReader->canRead($source)) {
$phpWord = $phpWordReader->load($source);
... // rest of your code
}

Answer is based on this example and API documentation

2
Tac Tacelosky On

Rather than check each class for text, you can use

$sections = $phpWord->getSections();

foreach ($sections as $s) {
    $els = $s->getElements();
    /** @var ElementTest $e */
    foreach ($els as $e) {
        $class = get_class($e);
        if (method_exists($class, 'getText')) {
            $text .= $e->getText();
        } else {
            $text .= "\n";
        }
    }
}