Two XML files. Roughly the same number of lines. One is twice the size of the other. How?


I have been generating an XML sitemap using Access and VBA. I asked our developers to implement a server-side solution so that it can be run every night without me having to remember to do it.

I generate the file by writing text to a file. Very simple. My file is around 1800KB.

The developer's solution also writes text to a file (using the XmlTextWriter class in VB.NET). His file is around 900KB.

When he first showed me this, I assumed he was missing a lot of data from the sitemap. But when I checked the number of lines in each file, there was a difference of only 38 lines (out of around 22,500 lines of text).

How can this be?

I'm not sure if this is the correct Stack Exchange site to post this on, but I don't know of a more appropriate one.

Edit

Here is an example of the file

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url>
        <loc>http://www.bodykind.com/index.aspx</loc>
    </url>
    <url>
        <loc>http://www.bodykind.com/category/3-Supplements.aspx</loc>
    </url>
    <url>
        <loc>http://www.bodykind.com/category/4-Wellbeing.aspx</loc>
    </url>
    ...

Both files are almost exactly the same, but the <url> elements are in a different order and one has about 36 more lines than the other.

Edit 2

I have just checked the document properties. It seems the code set of the 900KB file is UTF-8 but the codeset of the 1800KB file is Unicode. I am assuming this is why there is such a big difference?
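
One way to double-check what the properties dialog reports is to look at the first bytes of each file. This is a minimal sketch in Python and is not part of either solution. The function names are made up for illustration, and the zero-byte check is only a heuristic that assumes mostly-ASCII content such as this sitemap:

```python
import codecs


def guess_encoding(head: bytes) -> str:
    """Guess a text encoding from the first bytes of a file (heuristic)."""
    # A byte order mark, when present, identifies the encoding directly.
    if head.startswith(codecs.BOM_UTF8):
        return "utf-8 (with BOM)"
    if head.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    if head.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    # No BOM: mostly-ASCII text stored as UTF-16 has a zero byte in every
    # other position, which does not occur in ordinary UTF-8 XML.
    if len(head) >= 4 and all(b == 0 for b in head[1::2]):
        return "utf-16-le (no BOM, guessed)"
    if len(head) >= 4 and all(b == 0 for b in head[0::2]):
        return "utf-16-be (no BOM, guessed)"
    return "utf-8 or another 8-bit encoding"


def guess_file_encoding(path: str) -> str:
    """Read the first 64 bytes of the file at `path` and guess its encoding."""
    with open(path, "rb") as f:
        return guess_encoding(f.read(64))
```

Calling `guess_file_encoding` on each sitemap file (the paths are whatever the two files are called locally) should show whether the larger one really is stored as UTF-16.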

Edit 3

Since it is on the verge of being closed, here is the code for both

My VBA

Private Sub Class_Initialize()
    pIndent = True
    Set objADO = CreateObject("ADODB.Stream")
    objADO.Type = 2             ' adTypeText
    objADO.Charset = "utf-8"
    objADO.LineSeparator = 10   ' adLF
    objADO.Open
    objADO.WriteText "<?xml version=""1.0"" encoding=""UTF-8""?>", 1   ' 1 = adWriteLine
End Sub

... some code which writes the text to the file

Public Sub SaveToFile(ByVal PATH As String)
    ' Skip the BOM
    objADO.Position = 3

    Dim BinaryStream As Object
    Set BinaryStream = CreateObject("ADODB.stream")
    BinaryStream.Type = 1   ' adTypeBinary
    BinaryStream.Mode = 3   ' adModeReadWrite (literal, since the named constant needs a reference to the ADO library)
    BinaryStream.Open

    'Strips BOM (first 3 bytes)
    objADO.CopyTo BinaryStream
    objADO.flush
    objADO.Close

    BinaryStream.SaveToFile PATH, 2   ' adSaveCreateOverWrite
    BinaryStream.flush
    BinaryStream.Close

    Set BinaryStream = Nothing
    Set objADO = Nothing
End Sub

The developer's solution

Using writer As New XmlTextWriter(Server.MapPath(filename), Encoding.UTF8)
    writer.WriteStartDocument()
    writer.WriteStartElement("urlset")
    writer.WriteAttributeString("xmlns", "http://www.sitemaps.org/schemas/sitemap/0.9")
    writer.Formatting = Formatting.Indented

    writer.WriteStartElement("url")
    writer.WriteElementString("loc", domain + "/index.aspx")
    writer.WriteEndElement()

    writer.WriteStartElement("url")
    writer.WriteElementString("loc", domain + "/aboutus.aspx")
    writer.WriteEndElement()

    ... and so on....

Answer by marmarta (best answer)

If one file is twice the size of the other, then the smaller one is UTF-8 and the bigger one is UTF-16. In UTF-16, every ASCII character takes two bytes, which is twice the single byte it takes in UTF-8.

(On Windows, "Unicode" usually means UTF-16.)
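
The arithmetic can be checked on a single line from the sample sitemap. This is a small sketch in Python rather than the VBA or VB.NET from the question, and `encoded_sizes` is a name invented here for illustration:

```python
def encoded_sizes(text: str) -> tuple[int, int]:
    """Return the byte length of `text` in UTF-8 and in UTF-16 (no BOM)."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


# One line taken from the sitemap example in the question (ASCII only).
line = "        <loc>http://www.bodykind.com/category/3-Supplements.aspx</loc>\n"

utf8_size, utf16_size = encoded_sizes(line)
print(utf8_size, utf16_size, utf16_size / utf8_size)  # the ratio is 2.0
```

Since a sitemap like this one is almost entirely ASCII, the same factor of two applies to the file as a whole, which matches 900KB against 1800KB.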