Match paragraph in OpenXML SDK to interop paragraph in Word document


Word interop is extremely slow when I parse the text of a document with 100+ pages, so I rewrote my code to use the OpenXML SDK, which is much faster. My problem is that once I have found the information in the OpenXML document, I have to locate it in the Word document and scroll the main window to it. To do this I have to somehow match each OpenXML paragraph to an interop paragraph. I assumed that interop paragraphs match OpenXML paragraphs one-to-one, but I was wrong: interop usually reports more paragraphs than OpenXML. Is there any trick or piece of information that could help me match them? For example, I have figured out that interop usually has one extra empty paragraph after every table row. I could account for that, but I am afraid there are more cases than the one I have found myself.

UPDATE

Below are screenshots of a simple add-in I created to demonstrate the difference between interop and OpenXML paragraphs in a Word document with simple content like this:

MS Word Document Sample

The add-in then retrieves the list of interop paragraphs and the list of OpenXML paragraphs and shows them side by side:

Side-by-side comparison

Here is the code I used:

// Namespaces needed at the top of the file:
// using System.Linq;
// using DocumentFormat.OpenXml.Packaging;
// using Microsoft.Office.Interop.Word;

var document = Globals.ThisAddIn.Application.ActiveDocument;

if (document == null)
    return;

// Interop view: the text of every paragraph in the main text story.
var interopParagraphs = document
    .StoryRanges
    .Cast<Range>()
    .Single(r => r.StoryType == WdStoryType.wdMainTextStory)
    .Paragraphs
    .Cast<Paragraph>()
    .Select(p => p.Range.Text)
    .ToList();

// OpenXML view: the same content parsed from the Flat OPC XML that Word exposes.
using (var openXmlDocument = WordprocessingDocument.FromFlatOpcString(document.Content.WordOpenXML))
{
    var openXmlParagraphs = openXmlDocument
        .MainDocumentPart
        .Document
        .Body
        .Descendants<DocumentFormat.OpenXml.Wordprocessing.Paragraph>()
        .Select(p => p.InnerText)
        .ToList(); // materialize before the package is disposed

    var compareDialog = new CompareForm(interopParagraphs, openXmlParagraphs);
    compareDialog.ShowDialog();
}
There is 1 answer

Answer by Chris

Turning my comment into an answer.


For the case of table rows, you can check to see whether you are looking at an end-of-row paragraph using Range.IsEndOfRowMark.

This property returns True if the specified range is collapsed and is located at the end-of-row mark in a table, and False if not.

You can also use Range.Information[WdInformation.wdAtEndOfRowMarker].

Returns True if the specified selection or range is at the end-of-row mark in a table

Despite the slight difference in the documentation, the range must be collapsed for this property as well. AFAIK, they are equivalent.

I also noticed that this doesn't work if you access a paragraph directly, e.g. Document.Paragraphs[4]. You have to iterate through the paragraphs for it to work. This does not seem to be documented.
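
To make the idea concrete, here is a minimal sketch of how the end-of-row flag could be used to line the two paragraph lists up. It is not tested against Word and makes several assumptions: InteropParagraphInfo and ParagraphAligner are hypothetical names invented for this illustration, the IsEndOfRowMark flag is expected to have been read from a collapsed copy of each paragraph's range while iterating over the interop paragraphs, and end-of-row marks are assumed to be the only extra paragraphs on the interop side (the question suspects there may be others). Interop paragraph text carries a trailing paragraph mark (\r) and, inside tables, a cell or row mark (\a), so both are trimmed before comparing with the OpenXML InnerText.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical holder for what was read from one interop paragraph.
// In an add-in, IsEndOfRowMark would come from Range.IsEndOfRowMark
// on a collapsed copy of the paragraph's range, read while iterating.
public sealed class InteropParagraphInfo
{
    public InteropParagraphInfo(string text, bool isEndOfRowMark)
    {
        Text = text;
        IsEndOfRowMark = isEndOfRowMark;
    }

    public string Text { get; }
    public bool IsEndOfRowMark { get; }
}

// Hypothetical helper: pairs OpenXML paragraphs with interop paragraphs
// by position after skipping end-of-row marks.
public static class ParagraphAligner
{
    // Removes the trailing paragraph mark (\r) and cell/row mark (\a)
    // that interop text carries but OpenXML InnerText does not.
    public static string Normalize(string text)
    {
        return (text ?? string.Empty).TrimEnd('\r', '\a');
    }

    // Returns map[i] = index of the interop paragraph paired with OpenXML
    // paragraph i, or -1 when the paired texts differ or interop runs out.
    public static int[] MapOpenXmlToInterop(
        IReadOnlyList<InteropParagraphInfo> interop,
        IReadOnlyList<string> openXml)
    {
        var map = new int[openXml.Count];
        int j = 0;

        for (int i = 0; i < openXml.Count; i++)
        {
            // Skip interop-only paragraphs that are end-of-row marks.
            while (j < interop.Count && interop[j].IsEndOfRowMark)
                j++;

            bool matches = j < interop.Count
                && Normalize(interop[j].Text) == openXml[i];

            map[i] = matches ? j : -1;

            // Advance even on a mismatch so later paragraphs stay paired
            // by position; the -1 flags the pair for manual inspection.
            j++;
        }

        return map;
    }
}
```

For a document with an intro paragraph, a one-row table with cells "A" and "B", and a closing paragraph, the interop list has five entries (the fourth being the end-of-row mark) and the OpenXML list has four, so the resulting map is 0, 1, 2, 4. Any -1 in the map points at a spot where the two views differ for some other reason and deserves a closer look.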