The following code unescapes the entities from the XML:
<xsl:stylesheet version='3.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>
<xsl:output method="xml" omit-xml-declaration="no" use-character-maps="mdash" />
<xsl:character-map name="mdash">
<xsl:output-character character="—" string="&amp;mdash;"/>
<xsl:output-character character="&amp;" string="&amp;amp;" />
<xsl:output-character character="&quot;" string="&amp;quot;" />
<xsl:output-character character="'" string="&amp;apos;" />
<xsl:output-character character="§" string="&amp;sect;"/>
<xsl:output-character character="$" string="&amp;dollar;" />
<xsl:output-character character="/" string="&amp;sol;" />
<xsl:output-character character="-" string="&amp;hyphen;" />
</xsl:character-map>
<xsl:mode on-no-match="shallow-copy"/>
</xsl:stylesheet>
In the case of &hyphen;, every hyphen gets converted to &hyphen;, wherever it occurs.
Also, one special user-defined entity, &userdefined;, gets converted to &amp;userdefined;.
Now, for the input XML below:
<name id="123-24">abc&hyphen;pqr &userdefined;</name>
the output is generated as:
<name id="123&hyphen;24">abc&hyphen;pqr &amp;userdefined;</name>
In the above output, a hyphen should be written as &hyphen; only if it was written as that entity in the input. In this case, 123-24 got converted into 123&hyphen;24, whereas it should have remained 123-24.
Also, the special entity &userdefined; should remain &userdefined; rather than becoming &amp;userdefined;.
Remember the XSLT processing model consists of three stages:
1. Parsing the source XML to build a tree of nodes
2. Transforming the source tree into a result tree
3. Serializing the result tree as lexical XML

The xsl:output, xsl:character-map, and xsl:output-character declarations affect the way step 3 works; this is possible in Saxon because Saxon includes both the XSLT transformer and the serializer. But there's no similar way of influencing what step 1 does; this is outside the control of the XSLT processor, and in the case of Saxon, it's done in a third-party product over which Saxon has no control.

So the XSLT processor has no idea whether a "‐" character in the node tree was originally written as "‐" or as "&hyphen;", and it has no way of finding out.

When I've had this problem myself, I've solved it by replacing all occurrences of & by § before the XML is parsed, and then converting back after it is serialized. This means of course that the transformation has to bear in mind that it will be seeing §hyphen; rather than ‐.
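
The pre- and post-processing steps could be sketched roughly as follows. This is only an illustration in Python, not the code used in practice: the function names are hypothetical, ElementTree's parse-and-serialize round trip stands in for the real XSLT transformation, and it assumes the § character does not otherwise occur in the input.

```python
import xml.etree.ElementTree as ET

# Assumption: the placeholder never appears in the real input.
PLACEHOLDER = "§"


def protect_entities(xml_text: str) -> str:
    """Replace every '&' so the parser never sees an entity reference."""
    if PLACEHOLDER in xml_text:
        raise ValueError("placeholder character already present in input")
    return xml_text.replace("&", PLACEHOLDER)


def restore_entities(xml_text: str) -> str:
    """Turn the placeholder back into '&' after serialization."""
    return xml_text.replace(PLACEHOLDER, "&")


def round_trip(xml_text: str) -> str:
    """Protect, parse, serialize, restore (hypothetical helper)."""
    protected = protect_entities(xml_text)
    # Stand-in for steps 1-3: a real pipeline would run the XSLT here.
    root = ET.fromstring(protected)
    serialized = ET.tostring(root, encoding="unicode")
    return restore_entities(serialized)


source = '<name id="123-24">abc&hyphen;pqr &userdefined;</name>'
result = round_trip(source)
print(result)
```

Because the parser only ever sees §hyphen; and §userdefined; as ordinary text, neither reference is expanded or escaped, and the plain hyphen in the id attribute is left alone. With this approach the character map would no longer need entries for & or the hyphen.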