We have a set of XSLT stylesheets that output to "text". Each stylesheet declares its own output encoding, and the encodings differ between stylesheets, e.g.:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text" indent="no" encoding="windows-1252"/>
...
</xsl:stylesheet>
The stylesheets are fed with various XML data files, some of which may occasionally contain a character that is not representable in the encoding the template declares.
When that happens, an exception occurs upon transformation:
Unable to translate Unicode character \uXXXX at index N to specified code page.
To quickly reproduce:
XDocument schema = XDocument.Parse(
@"<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>
<xsl:output method='text' encoding='windows-1252'/>
<xsl:template match='root'>
<xsl:value-of select='.' />
</xsl:template>
</xsl:stylesheet>"
);
XDocument data = XDocument.Parse(
@"<root>Ψ</root>"
);
XslCompiledTransform transformator = new XslCompiledTransform();
using (var xr = schema.CreateReader())
{
transformator.Load(xr); // use the reader opened (and disposed) by the using statement
}
using (var output_stream = new System.IO.MemoryStream())
using (var xr = data.CreateReader())
{
transformator.Transform(xr, null, output_stream);
// Error: Unable to translate Unicode character \u03A8 at index 0 to specified code page.
}
We are happy to replace the occasional offending character with a fallback character (usually that is a ?). The problem is that the transformator appears to ignore ReplacementFallbacks in the passed Encoding and raises the exception anyway:
var xml_writer_settings = transformator.OutputSettings.Clone();
var original_encoding = xml_writer_settings.Encoding;
xml_writer_settings.Encoding = System.Text.Encoding.GetEncoding(
original_encoding.CodePage,
System.Text.EncoderReplacementFallback.ReplacementFallback,
System.Text.DecoderReplacementFallback.ReplacementFallback
);
using (var output_stream = new System.IO.MemoryStream())
using (var xr = data.CreateReader())
using (var xw = XmlWriter.Create(output_stream, xml_writer_settings))
{
transformator.Transform(xr, xw);
// Same error anyway
}
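As a sanity check, the fallback-configured Encoding does substitute characters when it is used directly, which suggests that the writer, not the Encoding object, is what raises the error. A minimal sketch (FallbackCheck and EncodeWithFallback are hypothetical names made up for this illustration; on .NET Core / .NET 5+, windows-1252 is only available after registering the code pages encoding provider):

```csharp
using System.Text;

static class FallbackCheck // hypothetical helper, for illustration only
{
    // Encodes text with the named encoding, replacing every
    // non-representable character with the default replacement ("?").
    public static byte[] EncodeWithFallback(string text, string encodingName)
    {
        Encoding enc = Encoding.GetEncoding(
            encodingName,
            EncoderFallback.ReplacementFallback,
            DecoderFallback.ReplacementFallback);
        return enc.GetBytes(text);
    }
}
```

Calling FallbackCheck.EncodeWithFallback("Ψ", "windows-1252") returns a replacement byte here instead of throwing.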
What does work is transforming the template into Unicode, regardless of what it initially requested, and then re-encoding it to its requested encoding:
var xml_writer_settings = transformator.OutputSettings.Clone();
var original_encoding = xml_writer_settings.Encoding;
var sb = new StringBuilder();
using (var output_stream = new System.IO.MemoryStream())
using (var xr = data.CreateReader())
using (var xw = XmlWriter.Create(sb, xml_writer_settings)) // When transforming to StringBuilder, it's always UTF-16
{
transformator.Transform(xr, xw);
var b = original_encoding.GetBytes(sb.ToString()); // Default fallback character is used automatically
output_stream.Write(b, 0, b.Length);
}
but it looks like double work.
Is there a way to make the XslCompiledTransform directly use a fallback character for non-representable characters without the intermediate Unicode step?
I think the error is in line with the spec: https://www.w3.org/TR/xslt-10/#section-Text-Output-Method says "If the result tree contains a character that cannot be represented in the encoding that the XSLT processor is using for output, the XSLT processor should signal an error".
The stack trace shows a fairly complex interaction between an XmlWriter implementation and the text encoding classes. Looking up the classes involved in the online .NET Framework reference source, it seems that this XmlWriter implementation raises the exception on purpose. So unless you implement your own XmlWriter that handles the case differently, I think you will not be able to write directly with a particular xsl:output encoding to a TextWriter of an encoding that doesn't contain a used output character.
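If the intermediate string is acceptable, the two-step approach from the question can at least be wrapped once and reused for every stylesheet. A minimal sketch, assuming the stylesheet has already been loaded; TransformHelper and TransformWithFallback are hypothetical names, not part of .NET, and this does not avoid the double work, it only hides it:

```csharp
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Xsl;

static class TransformHelper // hypothetical helper, not part of .NET
{
    // Transforms to an in-memory UTF-16 string first, then re-encodes it
    // with the encoding declared in xsl:output, replacing characters that
    // the encoding cannot represent with the default replacement ("?").
    public static void TransformWithFallback(
        XslCompiledTransform transform, XmlReader input, Stream output)
    {
        // Encoding requested by the stylesheet's xsl:output element.
        Encoding declared = transform.OutputSettings.Encoding;

        // Same code page, but with an explicit replacement fallback.
        Encoding lossy = Encoding.GetEncoding(
            declared.CodePage,
            EncoderFallback.ReplacementFallback,
            DecoderFallback.ReplacementFallback);

        using (var text = new StringWriter())
        {
            // Writing to a TextWriter keeps the result as UTF-16 in memory.
            transform.Transform(input, null, text);

            byte[] bytes = lossy.GetBytes(text.ToString());
            output.Write(bytes, 0, bytes.Length);
        }
    }
}
```

With the repro from the question this would be called as TransformHelper.TransformWithFallback(transformator, xr, output_stream). No byte order mark or preamble is written, which matches re-encoding with GetBytes in the question.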