I'm trying to refactor an existing tool to decrease memory usage. The tool processes an XML file which starts off like this:
<?xml version="1.0" encoding="utf-8"?>
<XQ xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="my.xsd" SchemaVersion="2.0" SoftwareVersion="2.10.6.195" ExportMode="StrongReferences" System="foo" Database="bar" Description="descriptio of bar database on foo system" Created="2021-11-10T15:14:57.8590869Z" id="9632241b-2b2b-46a4-81b0-fb9bd65c2ef5" ParentKey="a743efc8-7095-4791-b44c-da70bb01f075" ExportedObject="wibble" ExportedType="baz" Identity="bif" Persist="142 {9895150E-085D-4fcb-A16D-5EF5D2527196} 2\{a743efc8-7095-4791-b44c-da70bb01f075}\{9632241b-2b2b-46a4-81b0-fb9bd65c2ef5}*foo\bar">
<APDatabase>
<id>11111111-2222-3333-4444-555555555555</id>
<Name>foo</Name>
<Description>foo database</Description>
<APAttCat>
<id>22222222-2222-2222-2222-222222222222</id>
<Name>just a name</Name>
</APAttCat>
<APElemTemp>
<id>6012ede0-c202-4474-a13a-d9cc349c638e</id>
<Name>name of this elem temp</Name>
<Description>description of this elem temp</Description>
<BaseTemplateOnly>false</BaseTemplateOnly>
<Type>None</Type>
<InstanceType>Elem</InstanceType>
<AllowElemToExtend>true</AllowElemToExtend>
<APAttTemp>
<id>33333333-3333-3333-3333-333333333333</id>
<Name>Name of this att temp</Name>
<Description>Description of this att temp</Description>
<Type>String</Type>
<Value type="String"></Value>
<AttCatRef id="44444444-4444-4444-4444-444444444444">!Configuration</AttCatRef>
</APAttTemp>
</APElemTemp>
...
There is a lot more in these files and they can end up being massive. The important aspect is that the first child element of each <AP...> XML element is an <id> element containing the GUID for that parent element. The current program loads the whole file into an XDocument, adds a delete="true" attribute to every <AP...> element whose child <id> value is not in a separate list of GUIDs, and then saves the result to another file.
For example, I would need to write <APAttTemp delete="true"> if the GUID for APAttTemp (33333333-3333-3333-3333-333333333333) is not in my separate GUID list.
But loading the whole document chews up memory and causes issues. I want to do the same thing without loading the entire XML into memory. Can I do this with XmlReader/XmlWriter? Is there a better way?
I am new to XML processing but so far I have got a reader and writer to open the source XML and make a duplicate of it.
As @dbc says, there is a need to look ahead at the child <id> element to discover whether the current element needs amending. I am thinking that maybe I can cache any <AP...> element and read the next <id> element before then writing them both to the output?
Since you know the first child element is the <id> element, and all you need is to output an attribute on the parent if the value is not in a list, you should be able to do that in a streaming manner, for instance using XSLT 3 with streaming (supported on .NET Framework with Saxon EE and on .NET Core with SaxonCS).
It might also be possible to combine XmlReader and XmlWriter to copy everything through, but write out that attribute whenever an id is encountered that is not in the list of ids.