WPF: find all regex matches in an XPS document


I need to search for an expression inside an XPS document and then list all matches (with the page number of each match).

I searched Google, but found no reference or sample that addresses this issue.

So: how can I search an XPS document and get this information?


There are 2 answers

1
codekaizen On BEST ANSWER

The first thing to note is that an XPS file is an Open Packaging Conventions (OPC) package. It can be opened, and its contents accessed, via the System.IO.Packaging.Package class. This makes any operation on the contents much easier.

Here's an example of how to search the page content with a given regex, while also tracking which page the match occurs on.

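// System.IO.Packaging lives in the WindowsBase assembly on .NET Framework.
using System;
using System.IO;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

var regex = new Regex(@"th\w+", RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.Multiline);

// Open read-only: Package.Open(path) on its own opens the file for read/write.
using (var xps = System.IO.Packaging.Package.Open(@"C:\path\to\regex.oxps", FileMode.Open, FileAccess.Read))
{
    // Each page of the document is stored as a FixedPage part.
    var pages = xps.GetParts()
        .Where(p => p.ContentType == "application/vnd.ms-package.xps-fixedpage+xml")
        .ToList();

    for (var i = 0; i < pages.Count; i++)
    {
        var page = pages[i];
        using (var reader = new StreamReader(page.GetStream()))
        {
            // s is the page's raw XML, so the regex can also match markup
            // (for example attribute names), not only the visible text.
            var s = reader.ReadToEnd();
            var matches = regex.Matches(s);

            if (matches.Count > 0)
            {
                var matchText = matches
                    .Cast<Match>()
                    .Aggregate(new StringBuilder(), (agg, m) => agg.AppendFormat("{0} ", m.Value));
                // i + 1 assumes GetParts() returns pages in document order;
                // the authoritative order is listed in the FixedDocument (.fdoc) part.
                Console.WriteLine("Found matches on page {0}: {1}", i + 1, matchText);
            }
        }
    }
}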
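
Because the loop above runs the regex over each page's raw XML, a pattern such as th\w+ can also hit markup (for instance the StrokeThickness attribute name). The following is a minimal sketch, not part of the original answer, of restricting the search to the text runs, which XPS stores in the UnicodeString attribute of Glyphs elements. The class and method names (FixedPageText, FindInPageText) are hypothetical, and the sketch assumes the page XML has already been read into a string.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

// Hypothetical helper, for illustration only.
public static class FixedPageText
{
    // Returns the regex matches found in the UnicodeString attributes of
    // <Glyphs> elements, i.e. in the text runs of a single FixedPage part.
    public static List<string> FindInPageText(string fixedPageXml, Regex regex)
    {
        XDocument doc = XDocument.Parse(fixedPageXml);

        return doc.Descendants()
            // Compare local names only, so the XML namespace does not matter.
            .Where(e => e.Name.LocalName == "Glyphs")
            .Select(e => (string)e.Attribute("UnicodeString"))
            .Where(text => !string.IsNullOrEmpty(text))
            // Each run is searched on its own.
            .SelectMany(text => regex.Matches(text).Cast<Match>())
            .Select(m => m.Value)
            .ToList();
    }
}
```

Inside the page loop this could be called as FixedPageText.FindInPageText(s, regex) in place of regex.Matches(s). One limitation to keep in mind: a match that spans two Glyphs elements (text split across runs) is not found, because each run is searched separately.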
0
dotNET On

It is not going to be as simple as you might have thought. XPS files are ZIP archives with a somewhat complex folder structure that holds all the text, fonts, graphics and other items. You can use a compression tool such as 7-Zip or WinZip to extract the entire folder structure from an XPS file.

Having said that, you can use the following sequence of steps to do what you want:

  1. Extract the contents of your XPS file programmatically into a temp folder. You can use the ZipFile class (System.IO.Compression) for this purpose if you're using .NET 4.5 or later.

  2. The extracted folder will have the following folder structure:

    • _rels
    • Documents
      • 1
        • _rels
        • MetaData
        • Pages
          • _rels
        • Resources
          • Fonts
    • MetaData

    Go to the Documents\1\Pages\ subfolder. Here you'll find one or more .fpage files, one for each page of your document. These files are in XML format and contain all the text of the page in a structured manner.

  3. Use a simple loop to iterate through all the .fpage files, opening each of them with an XML reader such as XDocument or XmlDocument, and search for the required text using Regex.IsMatch() (or Regex.Matches() if you need every match). The text itself is stored in the UnicodeString attributes of the Glyphs elements. If a match is found, note the page number in a List and move on.
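
The steps above can be sketched roughly as follows. This is an illustration under stated assumptions rather than a definitive implementation: it reads the .fpage entries directly from the archive with ZipFile.OpenRead instead of extracting them to a temp folder first, it assumes page files are named 1.fpage, 2.fpage, and so on (the authoritative page order lives in the .fdoc file), and it assumes the archive holds a single document. XpsZipSearch, FindMatches and PageNumber are hypothetical names. On .NET Framework, ZipFile needs a reference to System.IO.Compression.FileSystem.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

// Hypothetical helper, for illustration only.
public static class XpsZipSearch
{
    // Returns (page number, matched text) pairs for every regex match
    // found in the text runs of the .fpage entries.
    public static List<Tuple<int, string>> FindMatches(string xpsPath, Regex regex)
    {
        var results = new List<Tuple<int, string>>();

        using (ZipArchive zip = ZipFile.OpenRead(xpsPath))
        {
            var pages = zip.Entries
                .Where(e => e.FullName.EndsWith(".fpage", StringComparison.OrdinalIgnoreCase))
                .Select(e => new { Entry = e, Number = PageNumber(e.Name) })
                .Where(p => p.Number > 0)   // skip entries that are not named <number>.fpage
                .OrderBy(p => p.Number);    // numeric sort, so 10 comes after 2

            foreach (var page in pages)
            {
                using (Stream stream = page.Entry.Open())
                {
                    XDocument doc = XDocument.Load(stream);

                    // Page text is stored in UnicodeString attributes.
                    foreach (XAttribute attr in doc.Descendants().Attributes("UnicodeString"))
                        foreach (Match m in regex.Matches(attr.Value))
                            results.Add(Tuple.Create(page.Number, m.Value));
                }
            }
        }

        return results;
    }

    // Assumption: the file name (without extension) is the page number.
    private static int PageNumber(string fileName)
    {
        int n;
        return int.TryParse(Path.GetFileNameWithoutExtension(fileName), out n) ? n : 0;
    }
}
```

A call such as XpsZipSearch.FindMatches(path, new Regex(@"th\w+")) then yields one entry per match, each carrying its page number, which is the listing the question asks for.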