C# ServiceStack.Text: analyze a stream of JSON


I am creating a JSON deserializer. I am deserializing a pretty big JSON file (25 MB) that contains a lot of information: an array of words, with a lot of duplicates. With Newtonsoft.Json, I can deserialize the input as a stream:

using Newtonsoft.Json;
using Newtonsoft.Json.Linq;

using (var fs = new FileStream(@"myfile.json", FileMode.Open, FileAccess.Read))
using (var sr = new StreamReader(fs))
using (var reader = new JsonTextReader(sr))
{
    while (reader.Read())
    {
        // Read until I find the narrow subset I need, then parse and analyze
        // it directly. Depth > 0 skips the root object, so JObject.Load only
        // materializes one small object at a time instead of the whole file.
        if (reader.TokenType == JsonToken.StartObject && reader.Depth > 0)
        {
            var obj = JObject.Load(reader); // Analyze this object
        }
    }
}

This allows me to keep reading small parts of the JSON, analyze them as they arrive, and check for duplicates on the fly; a sketch of what I mean follows below.
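
For example, a minimal sketch of the kind of on-the-fly de-duplication I mean. The HashSet and the assumption that the W property holds the word text are illustrative, not my exact code:

var seen = new HashSet<string>();

using (var fs = new FileStream(@"myfile.json", FileMode.Open, FileAccess.Read))
using (var sr = new StreamReader(fs))
using (var reader = new JsonTextReader(sr))
{
    while (reader.Read())
    {
        // Skip everything except the small per-word objects
        if (reader.TokenType != JsonToken.StartObject || reader.Depth == 0)
            continue;

        var obj = JObject.Load(reader);
        var word = (string)obj["W"]; // assumes "W" is the word text (see Word below)

        // Duplicates are dropped here, before they ever accumulate in memory
        if (seen.Add(word))
        {
            // ... analyze obj ...
        }
    }
}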

If I want to do the same with ServiceStack.Text, I am doing something like:

using (var fs = new FileStream(@"myfile.json", FileMode.Open, FileAccess.Read))
using (var sr = new StreamReader(fs))
{
    // Deserializes the entire file into a single object graph in one call
    var result = ServiceStack.Text.JsonSerializer.DeserializeFromReader<MyObject>(sr);
}

MyObject contains only the subset of the JSON I am interested in, but this still creates massive overhead: I get one big array that contains a lot of duplicates.

With the first method I can filter the duplicates out immediately and thus avoid keeping them in memory.

The memory footprints of the two approaches (including the console program overhead) are:

  • Newtonsoft.Json: 30 MB
  • ServiceStack.Text: 215 MB

And the running times are:

  • Newtonsoft.Json: 2.5 s
  • ServiceStack.Text: 1.5 s

The memory footprint is quite important, as I will be processing a lot of these.

I do understand that the ServiceStack method gives me type safety, but the memory footprint matters more to me.

Since ServiceStack.Text is clearly a lot faster, I would like to know whether I can recreate the Newtonsoft.Json streaming example with ServiceStack.Text. My closest attempt so far is sketched below.
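
The closest I have found is ServiceStack.Text's untyped JsonObject API, but as far as I can tell it still needs the whole document in memory as a string first. A sketch, using JsonObject.Parse and Get<T> as I understand them:

using ServiceStack.Text;

// The whole 25 MB file has to be read into a string before parsing,
// so nothing is streamed the way JsonTextReader streams it.
var json = File.ReadAllText(@"myfile.json");
var jsonObj = JsonObject.Parse(json);
var words = jsonObj.Get<List<List<Word>>>("Words");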

Edit (added the object I am trying to parse):

public class MyObject
{
    public List<List<Word>> Words { get; set; }
}

public class Word
{
    public string B { get; set; }
    public string W { get; set; }
    public string E { get; set; }
    public string P { get; set; }
}

My test file (which is representative of the use case) contains 29,000 words, but only around 8,500 unique ones. I am only analyzing this data, so I cannot change its structure: it is a file containing arrays of arrays of words. A sketch of the de-duplication step follows below.
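
For completeness, this is roughly how I collapse the duplicates once the words are materialized. The comparer is illustrative; it assumes two words are duplicates when all four fields match:

// Illustrative value-equality comparer: treats two Word instances as
// duplicates when all four fields match.
public class WordComparer : IEqualityComparer<Word>
{
    public bool Equals(Word x, Word y) =>
        x.B == y.B && x.W == y.W && x.E == y.E && x.P == y.P;

    public int GetHashCode(Word w) =>
        (w.B, w.W, w.E, w.P).GetHashCode();
}

// Usage: collapses ~29,000 Word entries down to the ~8,500 unique ones.
// myObject is a hypothetical deserialized MyObject instance.
var unique = new HashSet<Word>(new WordComparer());
foreach (var inner in myObject.Words)
    foreach (var word in inner)
        unique.Add(word);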
