entity translation to customized entity


There are some user-defined entities in the XML data. To write those characters back out as entity references, we are using the code below:

<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" omit-xml-declaration="no" use-character-maps="mdash"/>
  <!-- Map single characters to entity references during serialization -->
  <xsl:character-map name="mdash">
    <xsl:output-character character="&#x2014;" string="&amp;mdash;"/>
    <xsl:output-character character="&amp;" string="&amp;amp;"/>
    <xsl:output-character character="&quot;" string="&amp;quot;"/>
    <xsl:output-character character="&apos;" string="&amp;apos;"/>
    <xsl:output-character character="&#167;" string="&amp;sect;"/>
    <xsl:output-character character="&#36;" string="&amp;dollar;"/>
    <xsl:output-character character="&#47;" string="&amp;sol;"/>
    <xsl:output-character character="&#45;" string="&amp;hyphen;"/>
  </xsl:character-map>
  <!-- Identity template: copy every attribute and node unchanged -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>

But there is a special case where &sect; appears twice in a row in the data, for example:

Input: The number &sect;&sect; 1234

The above example should be converted to a special user-defined entity, i.e.

Output: The number &multisect; 1234

That is, &sect;&sect; should be converted to &multisect;.


There are 2 answers

1
Martin Honnen (BEST ANSWER)

If you want to use a character map, you first need to process the text nodes where the two sect characters may occur and replace each pair with a single character that you don't expect to be used anywhere else. The character map can then convert that character to the string &multisect;. For example, the stylesheet

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:fn="http://www.w3.org/2005/xpath-functions"
    exclude-result-prefixes="#all"
    expand-text="yes"
    version="3.0">
  
  <xsl:param name="multisect-sub" static="yes" as="xs:string" select="'«'"/>
  
  <xsl:character-map name="sub">
    <xsl:output-character _character="{$multisect-sub}" string="&amp;multisect;"/>
  </xsl:character-map>

  <xsl:mode on-no-match="shallow-copy"/>

  <xsl:output method="xml" indent="yes" use-character-maps="sub"/>
  
  <xsl:template match="text()">
    <xsl:apply-templates mode="analyze" select="analyze-string(., '&#xA7;&#xA7;')"/>
  </xsl:template>
  
  <xsl:template mode="analyze" match="fn:match">
    <xsl:text>{$multisect-sub}</xsl:text>
  </xsl:template>

</xsl:stylesheet>

transforms the input

<!DOCTYPE text [
  <!ENTITY sect "&#xA7;">
]>
<text>&sect;&sect; 1234</text>

into the output

<?xml version="1.0" encoding="UTF-8"?>
<text>&multisect; 1234</text>

Note that I used '«' purely as an example; you may want to use a private-use character or some other character you are sure doesn't occur in your input or output data.
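
As a hedged illustration, assuming U+E000 from the private use area never occurs in the data (an assumption to check against your own input), the parameter declaration in the stylesheet above could be changed to:

```xml
<!-- Assumption: U+E000 (private use area) is not used anywhere in the input or output -->
<xsl:param name="multisect-sub" static="yes" as="xs:string" select="'&#xE000;'"/>
```

The rest of the stylesheet can stay as it is, because both the character map and the fn:match template refer to $multisect-sub.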

If you want the result to be well-formed, you would also need to add a doctype to the output, e.g. with xsl:output doctype-system="some.dtd", where you ensure that some.dtd declares e.g. <!ENTITY multisect "&#xA7;&#xA7;">.
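
A minimal sketch of that change, keeping the placeholder file name some.dtd; the DTD content shown in the comment is an assumption about what such a file would contain:

```xml
<!-- Replaces the xsl:output declaration of the stylesheet above -->
<xsl:output method="xml" indent="yes" use-character-maps="sub"
            doctype-system="some.dtd"/>

<!-- some.dtd would then need at least this declaration:
     <!ENTITY multisect "&#xA7;&#xA7;">
-->
```

With this setting the serializer writes a DOCTYPE declaration that names the root element and points to some.dtd, so a parser reading the result can resolve &multisect;.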

0
Michael Kay

You can't achieve this directly in the serializer, as you can with single characters. You will either have to recognise "§§" in the transformation proper (perhaps converting it to some private-use-area character, which is then picked up by xsl:output-character), or you could do it by post-processing the output at the character-stream level.
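
A minimal sketch of the first option, assuming U+E000 does not occur in the data; the character map name "entities" is made up for this example, and only the &sect; mapping from the question is carried over:

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Assumption: U+E000 (private use area) never occurs in the data -->
  <xsl:character-map name="entities">
    <xsl:output-character character="&#xE000;" string="&amp;multisect;"/>
    <xsl:output-character character="&#xA7;" string="&amp;sect;"/>
  </xsl:character-map>

  <xsl:output method="xml" use-character-maps="entities"/>

  <!-- Copy everything that has no more specific template -->
  <xsl:mode on-no-match="shallow-copy"/>

  <!-- Replace each pair of section signs with the placeholder character -->
  <xsl:template match="text()">
    <xsl:value-of select="replace(., '&#xA7;&#xA7;', '&#xE000;')"/>
  </xsl:template>

</xsl:stylesheet>
```

A single § that is not part of a pair is left alone by replace() and still serialized as &sect; by the map. Only text nodes are handled in this sketch; attribute values containing §§ would need a similar template.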