I want to extract the text content from a Word document with PHP.
I created a new Word document in Microsoft Word for Mac 2011. Edit: I have also tested by creating the same document in Microsoft Word under Windows 7.
The contents of the document are:
The quick brown fox jumps over the lazy dog
I have saved it to disk as a Word 97-2004 Document (.doc).
I'm using phpoffice/phpword and this code to extract the text:
<?php
require 'vendor/autoload.php'; // Composer autoloader (assumes phpoffice/phpword was installed via Composer)

$source = "word.doc";
$phpWord = \PhpOffice\PhpWord\IOFactory::load($source, 'MsDoc');
$text = '';
$sections = $phpWord->getSections();
foreach ($sections as $s) {
    $els = $s->getElements();
    foreach ($els as $e) {
        if (get_class($e) === 'PhpOffice\PhpWord\Element\Text') {
            // Plain text run: append its content
            $text .= $e->getText();
        } elseif (get_class($e) === 'PhpOffice\PhpWord\Element\TextBreak') {
            // Line break element (lives in the Element namespace, like Text)
            $text .= " \n";
        } else {
            throw new Exception('Unknown class type ' . get_class($e));
        }
    }
}
print $text;
The output of this code is only part of the text:
The quick brown fox j
Is there a problem with the code, or is it some kind of compatibility issue?
Edit:
If I add a var_dump($els);
before foreach ($els as $e) {
the output is this:
array(1) {
[0]=>
object(PhpOffice\PhpWord\Element\Text)#1265 (14) {
["text":protected]=>
string(21) "The quick brown fox j"
["fontStyle":protected]=>
object(PhpOffice\PhpWord\Style\Font)#1267 (25) {
["aliases":protected]=>
array(1) {
["line-height"]=>
string(10) "lineHeight"
}
["type":"PhpOffice\PhpWord\Style\Font":private]=>
string(4) "text"
["name":"PhpOffice\PhpWord\Style\Font":private]=>
NULL
["hint":"PhpOffice\PhpWord\Style\Font":private]=>
NULL
["size":"PhpOffice\PhpWord\Style\Font":private]=>
NULL
["color":"PhpOffice\PhpWord\Style\Font":private]=>
NULL
["bold":"PhpOffice\PhpWord\Style\Font":private]=>
bool(false)
["italic":"PhpOffice\PhpWord\Style\Font":private]=>
bool(false)
["underline":"PhpOffice\PhpWord\Style\Font":private]=>
string(4) "none"
["superScript":"PhpOffice\PhpWord\Style\Font":private]=>
bool(false)
["subScript":"PhpOffice\PhpWord\Style\Font":private]=>
bool(false)
["strikethrough":"PhpOffice\PhpWord\Style\Font":private]=>
bool(false)
["doubleStrikethrough":"PhpOffice\PhpWord\Style\Font":private]=>
bool(false)
["smallCaps":"PhpOffice\PhpWord\Style\Font":private]=>
bool(false)
["allCaps":"PhpOffice\PhpWord\Style\Font":private]=>
bool(false)
["fgColor":"PhpOffice\PhpWord\Style\Font":private]=>
NULL
["scale":"PhpOffice\PhpWord\Style\Font":private]=>
NULL
["spacing":"PhpOffice\PhpWord\Style\Font":private]=>
NULL
["kerning":"PhpOffice\PhpWord\Style\Font":private]=>
NULL
["paragraph":"PhpOffice\PhpWord\Style\Font":private]=>
object(PhpOffice\PhpWord\Style\Paragraph)#1266 (26) {
["aliases":protected]=>
array(1) {
["line-height"]=>
string(10) "lineHeight"
}
["basedOn":"PhpOffice\PhpWord\Style\Paragraph":private]=>
string(6) "Normal"
["next":"PhpOffice\PhpWord\Style\Paragraph":private]=>
NULL
["alignment":"PhpOffice\PhpWord\Style\Paragraph":private]=>
string(0) ""
["indentation":"PhpOffice\PhpWord\Style\Paragraph":private]=>
NULL
["spacing":"PhpOffice\PhpWord\Style\Paragraph":private]=>
NULL
["lineHeight":"PhpOffice\PhpWord\Style\Paragraph":private]=>
NULL
["widowControl":"PhpOffice\PhpWord\Style\Paragraph":private]=>
bool(true)
["keepNext":"PhpOffice\PhpWord\Style\Paragraph":private]=>
bool(false)
["keepLines":"PhpOffice\PhpWord\Style\Paragraph":private]=>
bool(false)
["pageBreakBefore":"PhpOffice\PhpWord\Style\Paragraph":private]=>
bool(false)
["numStyle":"PhpOffice\PhpWord\Style\Paragraph":private]=>
NULL
["numLevel":"PhpOffice\PhpWord\Style\Paragraph":private]=>
int(0)
["tabs":"PhpOffice\PhpWord\Style\Paragraph":private]=>
array(0) {
}
["shading":"PhpOffice\PhpWord\Style\Paragraph":private]=>
NULL
["borderTopSize":protected]=>
NULL
["borderTopColor":protected]=>
NULL
["borderLeftSize":protected]=>
NULL
["borderLeftColor":protected]=>
NULL
["borderRightSize":protected]=>
NULL
["borderRightColor":protected]=>
NULL
["borderBottomSize":protected]=>
NULL
["borderBottomColor":protected]=>
NULL
["styleName":protected]=>
NULL
["index":protected]=>
NULL
["isAuto":"PhpOffice\PhpWord\Style\AbstractStyle":private]=>
bool(false)
}
["shading":"PhpOffice\PhpWord\Style\Font":private]=>
NULL
["rtl":"PhpOffice\PhpWord\Style\Font":private]=>
bool(false)
["styleName":protected]=>
NULL
["index":protected]=>
NULL
["isAuto":"PhpOffice\PhpWord\Style\AbstractStyle":private]=>
bool(false)
}
["paragraphStyle":protected]=>
object(PhpOffice\PhpWord\Style\Paragraph)#1266 (26) {
["aliases":protected]=>
array(1) {
["line-height"]=>
string(10) "lineHeight"
}
["basedOn":"PhpOffice\PhpWord\Style\Paragraph":private]=>
string(6) "Normal"
["next":"PhpOffice\PhpWord\Style\Paragraph":private]=>
NULL
["alignment":"PhpOffice\PhpWord\Style\Paragraph":private]=>
string(0) ""
["indentation":"PhpOffice\PhpWord\Style\Paragraph":private]=>
NULL
["spacing":"PhpOffice\PhpWord\Style\Paragraph":private]=>
NULL
["lineHeight":"PhpOffice\PhpWord\Style\Paragraph":private]=>
NULL
["widowControl":"PhpOffice\PhpWord\Style\Paragraph":private]=>
bool(true)
["keepNext":"PhpOffice\PhpWord\Style\Paragraph":private]=>
bool(false)
["keepLines":"PhpOffice\PhpWord\Style\Paragraph":private]=>
bool(false)
["pageBreakBefore":"PhpOffice\PhpWord\Style\Paragraph":private]=>
bool(false)
["numStyle":"PhpOffice\PhpWord\Style\Paragraph":private]=>
NULL
["numLevel":"PhpOffice\PhpWord\Style\Paragraph":private]=>
int(0)
["tabs":"PhpOffice\PhpWord\Style\Paragraph":private]=>
array(0) {
}
["shading":"PhpOffice\PhpWord\Style\Paragraph":private]=>
NULL
["borderTopSize":protected]=>
NULL
["borderTopColor":protected]=>
NULL
["borderLeftSize":protected]=>
NULL
["borderLeftColor":protected]=>
NULL
["borderRightSize":protected]=>
NULL
["borderRightColor":protected]=>
NULL
["borderBottomSize":protected]=>
NULL
["borderBottomColor":protected]=>
NULL
["styleName":protected]=>
NULL
["index":protected]=>
NULL
["isAuto":"PhpOffice\PhpWord\Style\AbstractStyle":private]=>
bool(false)
}
["phpWord":protected]=>
object(PhpOffice\PhpWord\PhpWord)#1247 (3) {
["sections":"PhpOffice\PhpWord\PhpWord":private]=>
array(1) {
[0]=>
object(PhpOffice\PhpWord\Element\Section)#1261 (16) {
["container":protected]=>
string(7) "Section"
["style":"PhpOffice\PhpWord\Element\Section":private]=>
object(PhpOffice\PhpWord\Style\Section)#1262 (28) {
["orientation":"PhpOffice\PhpWord\Style\Section":private]=>
string(8) "portrait"
["paper":"PhpOffice\PhpWord\Style\Section":private]=>
object(PhpOffice\PhpWord\Style\Paper)#1263 (8) {
["sizes":"PhpOffice\PhpWord\Style\Paper":private]=>
array(6) {
["A3"]=>
array(3) {
[0]=>
int(297)
[1]=>
int(420)
[2]=>
string(2) "mm"
}
["A4"]=>
array(3) {
[0]=>
int(210)
[1]=>
int(297)
[2]=>
string(2) "mm"
}
["A5"]=>
array(3) {
[0]=>
int(148)
[1]=>
int(210)
[2]=>
string(2) "mm"
}
["Folio"]=>
array(3) {
[0]=>
float(8.5)
[1]=>
int(13)
[2]=>
string(2) "in"
}
["Legal"]=>
array(3) {
[0]=>
float(8.5)
[1]=>
int(14)
[2]=>
string(2) "in"
}
["Letter"]=>
array(3) {
[0]=>
float(8.5)
[1]=>
int(11)
[2]=>
string(2) "in"
}
}
["size":"PhpOffice\PhpWord\Style\Paper":private]=>
string(2) "A4"
["width":"PhpOffice\PhpWord\Style\Paper":private]=>
int(11870)
["height":"PhpOffice\PhpWord\Style\Paper":private]=>
int(16787)
["styleName":protected]=>
NULL
["index":protected]=>
NULL
["aliases":protected]=>
array(0) {
}
["isAuto":"PhpOffice\PhpWord\Style\AbstractStyle":private]=>
bool(false)
}
["pageSizeW":"PhpOffice\PhpWord\Style\Section":private]=>
int(11906)
["pageSizeH":"PhpOffice\PhpWord\Style\Section":private]=>
int(16838)
["marginTop":"PhpOffice\PhpWord\Style\Section":private]=>
int(1417)
["marginLeft":"PhpOffice\PhpWord\Style\Section":private]=>
int(1417)
["marginRight":"PhpOffice\PhpWord\Style\Section":private]=>
int(1417)
["marginBottom":"PhpOffice\PhpWord\Style\Section":private]=>
int(1417)
["gutter":"PhpOffice\PhpWord\Style\Section":private]=>
int(0)
["headerHeight":"PhpOffice\PhpWord\Style\Section":private]=>
int(720)
["footerHeight":"PhpOffice\PhpWord\Style\Section":private]=>
int(720)
["pageNumberingStart":"PhpOffice\PhpWord\Style\Section":private]=>
NULL
["colsNum":"PhpOffice\PhpWord\Style\Section":private]=>
int(1)
["colsSpace":"PhpOffice\PhpWord\Style\Section":private]=>
int(720)
["breakType":"PhpOffice\PhpWord\Style\Section":private]=>
NULL
["lineNumbering":"PhpOffice\PhpWord\Style\Section":private]=>
NULL
["borderTopSize":protected]=>
NULL
["borderTopColor":protected]=>
NULL
["borderLeftSize":protected]=>
NULL
["borderLeftColor":protected]=>
NULL
["borderRightSize":protected]=>
NULL
["borderRightColor":protected]=>
NULL
["borderBottomSize":protected]=>
NULL
["borderBottomColor":protected]=>
NULL
["styleName":protected]=>
NULL
["index":protected]=>
NULL
["aliases":protected]=>
array(0) {
}
["isAuto":"PhpOffice\PhpWord\Style\AbstractStyle":private]=>
bool(false)
}
["headers":"PhpOffice\PhpWord\Element\Section":private]=>
array(0) {
}
["footers":"PhpOffice\PhpWord\Element\Section":private]=>
array(0) {
}
["elements":protected]=>
array(1) {
[0]=>
*RECURSION*
}
["phpWord":protected]=>
*RECURSION*
["sectionId":protected]=>
int(1)
["docPart":protected]=>
string(7) "Section"
["docPartId":protected]=>
int(1)
["elementIndex":protected]=>
int(1)
["elementId":protected]=>
NULL
["relationId":protected]=>
NULL
["nestedLevel":"PhpOffice\PhpWord\Element\AbstractElement":private]=>
int(0)
["parentContainer":"PhpOffice\PhpWord\Element\AbstractElement":private]=>
NULL
["mediaRelation":protected]=>
bool(false)
["collectionRelation":protected]=>
bool(false)
}
}
["collections":"PhpOffice\PhpWord\PhpWord":private]=>
array(5) {
["Bookmarks"]=>
object(PhpOffice\PhpWord\Collection\Bookmarks)#1248 (1) {
["items":"PhpOffice\PhpWord\Collection\AbstractCollection":private]=>
array(0) {
}
}
["Titles"]=>
object(PhpOffice\PhpWord\Collection\Titles)#1249 (1) {
["items":"PhpOffice\PhpWord\Collection\AbstractCollection":private]=>
array(0) {
}
}
["Footnotes"]=>
object(PhpOffice\PhpWord\Collection\Footnotes)#1250 (1) {
["items":"PhpOffice\PhpWord\Collection\AbstractCollection":private]=>
array(0) {
}
}
["Endnotes"]=>
object(PhpOffice\PhpWord\Collection\Endnotes)#1251 (1) {
["items":"PhpOffice\PhpWord\Collection\AbstractCollection":private]=>
array(0) {
}
}
["Charts"]=>
object(PhpOffice\PhpWord\Collection\Charts)#1252 (1) {
["items":"PhpOffice\PhpWord\Collection\AbstractCollection":private]=>
array(0) {
}
}
}
["metadata":"PhpOffice\PhpWord\PhpWord":private]=>
array(3) {
["DocInfo"]=>
object(PhpOffice\PhpWord\Metadata\DocInfo)#1253 (12) {
["creator":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
string(0) ""
["lastModifiedBy":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
string(0) ""
["created":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
int(1483515248)
["modified":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
int(1483515248)
["title":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
string(0) ""
["description":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
string(0) ""
["subject":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
string(0) ""
["keywords":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
string(0) ""
["category":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
string(0) ""
["company":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
string(0) ""
["manager":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
string(0) ""
["customProperties":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
array(0) {
}
}
["Protection"]=>
object(PhpOffice\PhpWord\Metadata\Protection)#1254 (1) {
["editing":"PhpOffice\PhpWord\Metadata\Protection":private]=>
NULL
}
["Compatibility"]=>
object(PhpOffice\PhpWord\Metadata\Compatibility)#1255 (1) {
["ooxmlVersion":"PhpOffice\PhpWord\Metadata\Compatibility":private]=>
int(12)
}
}
}
["sectionId":protected]=>
NULL
["docPart":protected]=>
string(7) "Section"
["docPartId":protected]=>
int(1)
["elementIndex":protected]=>
int(1)
["elementId":protected]=>
string(6) "5d531b"
["relationId":protected]=>
NULL
["nestedLevel":"PhpOffice\PhpWord\Element\AbstractElement":private]=>
int(0)
["parentContainer":"PhpOffice\PhpWord\Element\AbstractElement":private]=>
string(7) "Section"
["mediaRelation":protected]=>
bool(false)
["collectionRelation":protected]=>
bool(false)
}
}
You can extract text from a Word document using catdoc: http://www.wagner.pp.ru/~vitus/software/catdoc/
It can be installed on Ubuntu from the package repositories, for example with sudo apt-get install catdoc.
Once you have catdoc working on your system, you can call it from PHP using shell_exec().
Be sure to substitute (fullpath) with the actual path to catdoc and your Word document.
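A minimal sketch of calling catdoc through shell_exec() might look like the following. The helper names buildCatdocCommand() and extractDocText() and the example paths are hypothetical, chosen only for illustration. Escaping both paths with escapeshellarg() is an addition here, so that paths containing spaces are passed as single arguments.

```php
<?php
// Hypothetical helper: build the catdoc command line with both paths escaped.
function buildCatdocCommand(string $catdocPath, string $docPath): string
{
    return escapeshellarg($catdocPath) . ' ' . escapeshellarg($docPath);
}

// Hypothetical helper: run catdoc and return the text it prints,
// or an empty string if the command produced no output.
function extractDocText(string $catdocPath, string $docPath): string
{
    $output = shell_exec(buildCatdocCommand($catdocPath, $docPath));
    return is_string($output) ? $output : '';
}

// Example call; both paths are assumptions, adjust them to your system.
// echo extractDocText('/usr/bin/catdoc', '/var/www/uploads/word.doc');
```

If shell_exec() is disabled in php.ini, or catdoc is not found at the given path, the sketch returns an empty string, so it is worth checking the result before using it.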
EDIT ---- Addition
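If you can save your files as .docx rather than .doc, it is a little easier: a .docx file is a zip archive, so you can use unzip rather than catdoc.
Simply replace the catdoc command in the shell_exec() call with an unzip command that prints word/document.xml from the archive, then strip the XML tags from the result.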
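One possible way to sketch the .docx variant in PHP is shown below. It assumes the unzip command is on the PATH and that the body text lives in word/document.xml inside the archive. The function names docxXmlToText() and extractDocxText() are hypothetical, and stripping tags with strip_tags() is a simple stand-in rather than a full WordprocessingML parser.

```php
<?php
// Hypothetical helper: turn the raw XML of word/document.xml into plain text.
function docxXmlToText(string $xml): string
{
    // End each paragraph with a newline before the tags are removed.
    $xml = str_replace('</w:p>', "</w:p>\n", $xml);
    // Drop all markup, then decode XML entities such as &amp;.
    $text = html_entity_decode(strip_tags($xml), ENT_QUOTES | ENT_XML1, 'UTF-8');
    return trim($text);
}

// Hypothetical wrapper: stream document.xml out of the .docx with unzip -p.
function extractDocxText(string $docxPath): string
{
    $command = 'unzip -p ' . escapeshellarg($docxPath) . ' word/document.xml';
    $xml = shell_exec($command);
    return is_string($xml) ? docxXmlToText($xml) : '';
}

// Example call; the path is an assumption.
// echo extractDocxText('/var/www/uploads/word.docx');
```

This only covers the main document body. Headers, footers, and footnotes are stored in separate files inside the archive and would need the same treatment.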
You could use this same technique with most other command-line document-to-text converters. Just replace the command in the shell_exec() call with the command that works on your system. For other Unix/Linux alternatives, see "How to extract just plain text from .doc & .docx files? (unix)".
For other PHP alternatives, see "How to extract text from word file .doc,docx,.xlsx,.pptx php".