How to replace text in XML document with special chars?


See the end of this post for an addition covering the same problem with textboxes!

With this method I want to open a document, replace some text and then leave it alone. It works, and that's something to be proud of. :D

public static void replaceInOpenXMLDocument(string pfad, string zuErsetzen, string neuerString)
{
    using (WordprocessingDocument doc = WordprocessingDocument.Open(pfad, true))
    {
        var res = from bm in doc.MainDocumentPart.Document.Body.Descendants()
                  where bm.InnerText != string.Empty && bm.InnerText.Contains(zuErsetzen) && bm.HasChildren == false
                  select bm;

        foreach (var item in res)
        {
            item.InsertAfterSelf(new Text(item.InnerText.Replace(zuErsetzen, neuerString)));
            item.Remove();
        }
        doc.Close();
    }
}

But it only works when the text to replace contains no special characters. For example:

OS will be replaced with Windows over 9000

[OS] will be left as it is.

CASE 1:

In the document:

You use os for whatever purpose you've got.

replaceInOpenXMLDocument(@"C:\NSA\suspects.docx", "os", "Win 2000");

Will result in this:

You use Win 2000 for whatever purpose you've got.

CASE 2:

With special chars ...

You use [os] for whatever purpose you've got.

replaceInOpenXMLDocument(@"C:\NSA\suspects.docx", "[os]", "Win 2000");

... it just ignores me:

You use [os] for whatever purpose you've got.

I tried several special characters ()[]{} etc., but they're never replaced.

Is there something I forgot to do? Or is this method simply unable to replace text that contains special characters? If so, I just need a simple workaround.

Is there anybody out there who can help with my desperation? :)

SOLUTION / ADDITION 1:

Thanks to Flowerking for that! This is the code I'm using right now:

public static void replaceInOpenXMLDocument(string pfad, string zuErsetzen, string neuerString)
{
    using (WordprocessingDocument doc = WordprocessingDocument.Open(pfad, true))
    {
        SimplifyMarkupSettings settings = new SimplifyMarkupSettings
        {
            NormalizeXml = true, // Merges runs in a paragraph that have similar formatting
        };
        MarkupSimplifier.SimplifyMarkup(doc, settings);

        // Leaf elements whose text contains the search string
        var res = from bm in doc.MainDocumentPart.Document.Body.Descendants()
                  where bm.InnerText != string.Empty && bm.InnerText.Contains(zuErsetzen) && bm.HasChildren == false
                  select bm;

        // ToList() takes a snapshot so the tree can be modified while looping
        foreach (var item in res.ToList())
        {
            item.InsertAfterSelf(new Text(item.InnerText.Replace(zuErsetzen, neuerString)));
            item.Remove();
        }

        doc.Close();
    }
}

(This code will work for normal documents with normal text in it!)

SOLUTION / ADDITION 2: To replace text in textboxes, I had to use a little workaround. Textboxes are declared as pictures, so the code above won't touch them.

I found an additional class (link) that searches through textboxes as well. The ZIP download includes an example program that is easy to understand.

Flowerking (best answer)

This happens because the Open XML markup that Word creates for text containing special characters often looks like this:

<w:p>
  <w:r w:rsidRPr="00316587">
    <w:rPr>
      <w:rFonts w:ascii="Consolas" w:hAnsi="Consolas" w:eastAsia="Times New Roman" w:cs="Consolas" />
      <w:color w:val="823125" />
      <w:sz w:val="20" />
      <w:szCs w:val="20" />
      <w:lang w:eastAsia="en-GB" />
    </w:rPr>
    <w:t>[</w:t>
  </w:r>
  <w:proofErr w:type="gramStart" />
  <w:r w:rsidRPr="00316587">
    <w:rPr>
      <w:rFonts w:ascii="Consolas" w:hAnsi="Consolas" w:eastAsia="Times New Roman" w:cs="Consolas" />
      <w:color w:val="823125" />
      <w:sz w:val="20" />
      <w:szCs w:val="20" />
      <w:lang w:eastAsia="en-GB" />
    </w:rPr>
    <w:t>text-to-replace</w:t>
  </w:r>
  <w:proofErr w:type="gramEnd" />
  <w:r w:rsidRPr="00316587">
    <w:rPr>
      <w:rFonts w:ascii="Consolas" w:hAnsi="Consolas" w:eastAsia="Times New Roman" w:cs="Consolas" />
      <w:color w:val="823125" />
      <w:sz w:val="20" />
      <w:szCs w:val="20" />
      <w:lang w:eastAsia="en-GB" />
    </w:rPr>
    <w:t>]</w:t>
  </w:r>
</w:p>

The above shows the Open XML created for the text [text-to-replace]. (Please note that this might not always be the case; it may depend on the client you are using.)

Judging by your code, doc.MainDocumentPart.Document.Body.Descendants() takes every OpenXmlElement descendant of the document body, and you then try to replace the text by checking those elements one by one. In the markup above, however, the actual text sits in one run and the two special characters sit in two separate runs, so no single element ever contains [text-to-replace] as a whole. Hence the code fails to achieve what you want.

There might be different ways to work around this.
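To make the effect concrete, here is a minimal sketch that needs no Open XML at all. It assumes the paragraph from CASE 2 has been split into runs the way the markup above suggests; the class and method names (RunSplitDemo, FoundPerRun, FoundInParagraph) are made up for this illustration.

```csharp
using System;
using System.Linq;

public static class RunSplitDemo
{
    // Mirrors the query in the question: each piece of text is tested on its own.
    public static bool FoundPerRun(string[] runTexts, string search) =>
        runTexts.Any(t => t.Contains(search));

    // Joins the pieces first, which is how a reader sees the paragraph.
    public static bool FoundInParagraph(string[] runTexts, string search) =>
        string.Concat(runTexts).Contains(search);

    public static void Main()
    {
        // Assumed split: the brackets end up in runs of their own.
        string[] runs = { "You use ", "[", "os", "]", " for whatever purpose you've got." };

        Console.WriteLine(FoundPerRun(runs, "os"));         // True
        Console.WriteLine(FoundPerRun(runs, "[os]"));       // False
        Console.WriteLine(FoundInParagraph(runs, "[os]"));  // True
    }
}
```

The search for "os" succeeds because one run holds it completely, while "[os]" only exists once the runs are put back together.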

Solution:

A nice solution (and my preferred one) is to normalize the XML using MarkupSimplifier from Open XML PowerTools. It normalizes the Open XML markup by concatenating the text in a paragraph, which makes the document simpler to work with programmatically.

Example code:

using (WordprocessingDocument doc =
    WordprocessingDocument.Open("Test.docx", true))
{
    SimplifyMarkupSettings settings = new SimplifyMarkupSettings
    {
        NormalizeXml = true, // Merges runs in a paragraph that have similar formatting
    };
    MarkupSimplifier.SimplifyMarkup(doc, settings);
}

Please refer to my answer here for more info on using MarkupSimplifier.
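As one of the other possible workarounds, here is a rough sketch that skips PowerTools and works per paragraph using only the Open XML SDK. It is an assumption-laden illustration, not a tested drop-in: the names ParagraphReplacer and ReplaceAcrossRuns are invented here, and the whole paragraph text is moved into its first text element, so a paragraph with mixed formatting ends up with the formatting of its first run.

```csharp
using System.Linq;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

public static class ParagraphReplacer
{
    // Hypothetical helper: replaces 'search' even when Word split it over several runs.
    public static void ReplaceAcrossRuns(string path, string search, string replacement)
    {
        using (WordprocessingDocument doc = WordprocessingDocument.Open(path, true))
        {
            var paragraphs = doc.MainDocumentPart.Document.Body.Descendants<Paragraph>();
            foreach (Paragraph paragraph in paragraphs)
            {
                var texts = paragraph.Descendants<Text>().ToList();
                if (texts.Count == 0) continue;

                // Rebuild the paragraph text as the reader sees it.
                string joined = string.Concat(texts.Select(t => t.Text));
                if (!joined.Contains(search)) continue;

                // Put the replaced text into the first element and empty the rest.
                texts[0].Text = joined.Replace(search, replacement);
                texts[0].Space = SpaceProcessingModeValues.Preserve; // keep leading/trailing spaces
                foreach (Text rest in texts.Skip(1))
                {
                    rest.Text = string.Empty;
                }
            }
            doc.MainDocumentPart.Document.Save();
        }
    }
}
```

A call would look like ParagraphReplacer.ReplaceAcrossRuns("Test.docx", "[os]", "Win 2000"). If keeping per-run formatting matters, the MarkupSimplifier route above is probably the safer choice.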

Hope this helps :)