Using sparse columns to reduce repetitive data in log files

I am building an application that logs automotive performance metrics to a table of 100 columns x 500,000 rows per session. Some of the columns are very repetitive (e.g. coolant temperature in whole degrees Fahrenheit). Other columns change continuously (spark advance, manifold pressure, air/fuel ratio).

Columnstore indexes and page compression are out of the question, since the project is targeted at an open-source audience: it needs to support MS SQL Express Edition or another free database platform that scales well to large table sizes.

My initial solution is to allow NULL in some of the columns and insert a value only when it changes. Not storing the repeating values should dramatically reduce the storage footprint, and it allows me to increase the log resolution to a higher frame rate.
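
For reference, a cut-down sketch of the table (column names match the query below; the real table has around 100 columns, and the identity key is just one possible choice):

CREATE TABLE [values] (
    id   bigint IDENTITY(1,1) PRIMARY KEY,
    val1 bigint NULL,  -- sparse: written only when the reading changes
    val2 bigint NULL,
    val3 bigint NULL
);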

However, this introduces an obstacle when selecting discrete rows, because certain columns will be 99% NULL. It is therefore necessary to create a view (or a computed column?) that fills in each sparse column from the last row that contained a value in that field. My approach is to use a correlated subquery for each sparse column, as below. This seems grossly inelegant. Is there a more efficient approach I should consider?

SELECT ISNULL(
           val1,
           (
               SELECT TOP 1 val1
               FROM   [values] subv
               WHERE  subv.id <= v.id
                      AND subv.val1 IS NOT NULL
               ORDER BY
                      subv.id DESC
           )
       ) AS val1,
       ISNULL(
           val2,
           (
               SELECT TOP 1 val2
               FROM   [values] subv
               WHERE  subv.id <= v.id
                      AND subv.val2 IS NOT NULL
               ORDER BY
                      subv.id DESC
           )
       ) AS val2,
       ISNULL(
           val3,
           (
               SELECT TOP 1 val3
               FROM   [values] subv
               WHERE  subv.id <= v.id
                      AND subv.val3 IS NOT NULL
               ORDER BY
                      subv.id DESC
           )
       ) AS val3
FROM   [values] v

There are 2 answers

4
Serg

Maybe it's better to create a new table instead, containing only three columns, e.g. MetricType, MetricValue, MeasurementTime. Insert a new row only when the value changes for a particular metric.
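
A minimal sketch of that design (the table name, types, and primary key here are hypothetical choices):

CREATE TABLE MetricLog (
    MetricType      int       NOT NULL,  -- e.g. 1 = coolant temp, 2 = spark advance
    MetricValue     bigint    NOT NULL,
    MeasurementTime datetime2 NOT NULL,
    CONSTRAINT PK_MetricLog PRIMARY KEY (MetricType, MeasurementTime)
);

-- Reconstruct the last known value of every metric as of a given instant:
DECLARE @at datetime2 = '2015-06-01T12:00:00';

SELECT m.MetricType, v.MetricValue, v.MeasurementTime
FROM   (SELECT DISTINCT MetricType FROM MetricLog) m
CROSS APPLY
       (SELECT TOP 1 MetricValue, MeasurementTime
        FROM   MetricLog
        WHERE  MetricType = m.MetricType
               AND MeasurementTime <= @at
        ORDER BY MeasurementTime DESC) v;

With the primary key on (MetricType, MeasurementTime), each TOP 1 lookup is an index seek rather than a scan.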

0
Mark

I came up with a partial solution.

My first idea was to find a way to persist values within a scalar function between calls. This proved impossible because T-SQL parameters are copy-in/copy-out. Table-valued parameters are the exception, but they are read-only, so that does no good.

Instead I opted to go with SQL CLR. I wrote a .NET class in C# that persists the values between calls, and mapped it to a SQL scalar function. The code is unsigned, so for this to run:

  1. The database must be marked as trustworthy (TRUSTWORTHY ON)
  2. The CLR assembly must be created with PERMISSION_SET = UNSAFE in order to allow static fields
  3. Also worth noting, this is not a multi-threaded database application, and it has not been tested and is not expected to work as one
  4. Also noteworthy, the fatal flaw with this solution is that it only works for sequential, ordered selects, so I guess it is not much of a solution.

Here is the deploy.sql code (my assembly is named SqlClr):

sp_configure 'show advanced options', 1
GO
RECONFIGURE
GO
sp_configure 'clr enabled', 1
GO
RECONFIGURE
GO

USE Test

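-- TRUSTWORTHY is required because the assembly below is unsigned (note 1 above).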
ALTER DATABASE Test SET TRUSTWORTHY ON
GO

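-- Drop any existing versions so the script can be re-run.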
IF OBJECT_ID ('cached_value') is not null
BEGIN

    DROP FUNCTION cached_value
END
GO

IF EXISTS (SELECT * FROM sys.assemblies WHERE name = 'SqlClr') 
BEGIN
    DROP ASSEMBLY SqlClr
END
GO

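-- UNSAFE is required because the function keeps state in static fields (note 2 above).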
CREATE ASSEMBLY SqlClr 
FROM 'C:\SqlClr\bin\Debug\SqlClr.dll'
WITH PERMISSION_SET = UNSAFE 
GO 

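-- Map the static .NET method to a T-SQL scalar function.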
CREATE FUNCTION cached_value(@rowid bigint, @col int, @current_value bigint) 
RETURNS bigint
AS EXTERNAL NAME SqlClr.[SqlClr.StoredProcedures].[cached_value]

Here is the C#.NET code:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using Microsoft.SqlServer.Server;
using System.Diagnostics;
using System.Data.SqlClient;

namespace SqlClr
{
    public class StoredProcedures
    {
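        // Static fields hold state across calls for the lifetime of the loaded
        // assembly; mutable static data is why UNSAFE is required.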
        public static long last_rowid;
        public static Dictionary<int, long> values = new Dictionary<int, long>();

        [SqlFunction(DataAccess = DataAccessKind.Read, IsDeterministic = false)]
        public static Nullable<long> cached_value(long rowid, int col, Nullable<long> current_value)
        {
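            // A rowid lower than the last one seen means a new scan has
            // started, so discard the cached column values.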
            if (rowid < last_rowid)
            {
                values.Clear();
            }

            last_rowid = rowid;

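            // A non-NULL value arrived: cache it for this column and return it.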
            if (current_value.HasValue)
            {
                values[col] = current_value.Value;
                return current_value.Value;
            }

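            // NULL input: fall back to the most recently cached value, if any.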
            if (values.ContainsKey(col))
                return values[col];
            else
                return null;
        }
    }
}

Notice that the .NET function should be decorated with IsDeterministic = false, since it can return different results for the same arguments.

Finally, here is an example call utilizing the function with 25 columns:

SELECT 
dbo.cached_value(id, 1, val1) AS val1,
dbo.cached_value(id, 2, val2) AS val2,
......
dbo.cached_value(id, 25, val25) AS val25
FROM LotsOfValues
ORDER BY id -- required: the function's cache assumes rows are processed in id order

With 25 columns x 500,000 rows populated randomly with 64-bit integers and NULLs, a SELECT * FROM LotsOfValues takes 7 seconds to finish. Using the dbo.cached_value(id, n, valn) call for all 25 columns, the same query takes 19 seconds. I can get by with a minimum of 50 columns, 25 of which will be sparse, and I think this will scale to my ceiling of 100. This is an acceptable performance hit for the application, and the goal of saving space has been achieved.