Faster selects on a large table with a shredded/flattened XML field


Given an MS SQL Express 2008 R2 table with records such as those below,

ArchiveId   ScheduleId  Data
391063      62          <Data>....</Data>
391064      62          <Data>....</Data>
391065      63          <Data>....</Data>

where the XML in the Data field is structured like this:

<Data>
  <Value>1.0</Value>
  <Value>2.0</Value>
  <Value>3.0</Value>
  <Value>4.0</Value>
</Data>

what is the fastest way to do a select with the following catches:

  • the query will need to return upwards of 2M rows!
  • rows are selected by ScheduleId
  • the Data column's Value nodes must be returned shredded into columns
  • the id fields ArchiveId and ScheduleId must be included for each row returned
  • the number of Value elements is variable per ScheduleId but always the same for a given ScheduleId
  • there are only ever <Value> nodes within the <Data> node and they are always numbers
  • the table can have upwards of 50M rows and is currently only indexed on ScheduleId (non-unique, non-clustered) and ArchiveId (PK, clustered)

Ideally I'm looking for data in the following format:

ArchiveId   ScheduleId  Value1  Value2  etc
391063      62          1.0     2.0
391064      62          1.1     2.1

I have tried to use

select 
    ArchiveId, 
    ScheduleId , 
    v.value('(Value/text())[1]', 'float') as value1 , 
    v.value('(Value/text())[2]', 'float') as value2 , 
    v.value('(Value/text())[3]', 'float') as value3 , 
    v.value('(Value/text())[4]', 'float') as value4  
from 
    Archives 
    cross apply [data].nodes('//Data') x(v) 
where 
    ScheduleId = 2499

and straight .value() calls:

select 
    ArchiveId,
    ScheduleId,
    Data.value('(/Data/Value/text())[1]', 'float') as value1,
    Data.value('(/Data/Value/text())[2]', 'float') as value2,
    Data.value('(/Data/Value/text())[3]', 'float') as value3,
    Data.value('(/Data/Value/text())[4]', 'float') as value4
from 
    Archives 
where 
    ScheduleId = 2499
order by 
    ArchiveId asc

Both work but are really slow on large datasets, and I was wondering if there is a quicker way to do this sort of thing on very large numbers of rows. I realise that no matter what it will take a while, but what are my best bets when doing this?

There are numerous examples here, but they all involve more complex or dynamic data structures, or have some kind of complex selection requirement based on the XML content itself.

The data I have always has the same structure (one Data node and x Value nodes), and the selection criteria are not in the XML at all.

I'm just looking for the simplest way to pull a lot of records back while flattening the XML to columns.

EDIT: Essentially we store graphing data in the XML to later plot a line graph. Importantly, whilst the number of elements for the same ScheduleId will always be the same, different ScheduleIds will have a different number of Value elements, e.g.

  • All ScheduleId=1 have 3 value elements (time_X, var1_Y, var2_Y)
  • All ScheduleId=2 have 2 value elements (time_X, var1_Y)
  • All ScheduleId=3 have 33 value elements (time_X, var1_Y,......) etc

There are 2 answers

Brian Pressler (BEST ANSWER)

What might be the best solution is to allow your system to add the data with the XML column into the Archive table as it is, but then move the data over on a schedule to a normalized table structure for reporting. You could either set up a SQL Agent Job or create some service program to move or copy the data to your reporting database. Once the data is in your reporting tables, you could either:

  • Purge the records from the Archive table
  • Move Archive table records to another table/database with the same structure.
  • Create a field that flags when the record has been added to the report table/database.

You could choose whichever one would meet your requirements.

Your reporting tables would basically be split into two or three tables. There's the ArchiveTable made of (ArchiveId, ScheduleId). Then you have your ArchiveDataPointTable made of (ArchiveId, ValueId, DataPointValue). You could also create a ValuesTable of (ValueId, ValueDescription) if you wanted labels for your data points. Then your chart reports could run off pivot queries with only the data points that you need. Because there would be no string parsing and all the values are numeric, it should be very fast.
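
A minimal sketch of what those reporting tables and the scheduled copy might look like. The table and column names follow the description above; the data types, keys, the shredding query, and the example pivot are illustrative assumptions:

create table ArchiveTable
(
    ArchiveId  int not null primary key clustered,
    ScheduleId int not null
)
GO
create table ArchiveDataPointTable
(
    ArchiveId      int   not null,
    ValueId        int   not null,
    DataPointValue float not null,
    CONSTRAINT PK_ArchiveDataPoint
    PRIMARY KEY CLUSTERED (ArchiveId, ValueId)
)
GO
-- Optional labels for the data points.
create table ValuesTable
(
    ValueId          int          not null primary key,
    ValueDescription varchar(100) not null
)
GO
-- The scheduled job copies the id columns over as-is...
insert into ArchiveTable (ArchiveId, ScheduleId)
select ArchiveId, ScheduleId from Archives
GO
-- ...and shreds the XML into one row per <Value>. The ordinal expression counts
-- the preceding <Value> siblings so the values are numbered in document order.
insert into ArchiveDataPointTable (ArchiveId, ValueId, DataPointValue)
select
    a.ArchiveId,
    v.value('for $i in . return count(../*[. << $i]) + 1', 'int') as ValueId,
    v.value('.', 'float') as DataPointValue
from Archives a
cross apply a.[Data].nodes('/Data/Value') x(v)
GO
-- Example pivot back to the wide shape for charting (the column count is fixed per ScheduleId).
select
    a.ArchiveId,
    a.ScheduleId,
    max(case when d.ValueId = 1 then d.DataPointValue end) as Value1,
    max(case when d.ValueId = 2 then d.DataPointValue end) as Value2,
    max(case when d.ValueId = 3 then d.DataPointValue end) as Value3,
    max(case when d.ValueId = 4 then d.DataPointValue end) as Value4
from ArchiveTable a
join ArchiveDataPointTable d on d.ArchiveId = a.ArchiveId
where a.ScheduleId = 2499
group by a.ArchiveId, a.ScheduleId
order by a.ArchiveId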

Brian Pressler

If you can add fields to the Archives table, you could create persisted computed columns as functions of the XML data. So, for example, you create a field value1, set it equal to Data.value('(/Data/Value/text())[1]', 'float'), and then set the persisted flag on the column. This way it is parsed once when the record is added or updated, and then it has its own data field that you can select for output.

It's a bit clunky to put into practice, though, because you can't use the XML methods directly in a computed column; you have to go through a UDF. But it looks something like this:

GO
create table TempArchive
(
  ArchiveId integer not null,
  ScheduleId integer not null,
  [Data] xml not null,
  CONSTRAINT PK_Archive
  PRIMARY KEY CLUSTERED (ArchiveId)
  WITH (IGNORE_DUP_KEY = OFF)
)
GO
create function udf_getdatacolumn
(
    @data xml,
    @index as int
) RETURNS float
with schemabinding
as
begin
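-- sql:variable() picks the Nth <Value> text node; the trailing [1] keeps the
-- expression a statically typed singleton, which .value() requires.
-- SCHEMABINDING above lets SQL Server treat the function as deterministic,
-- which is needed before the computed columns that call it can be PERSISTED.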
return @data.value('(/Data/Value/text())[sql:variable("@index")][1]', 'float')
end
GO
alter table TempArchive add value1 as (dbo.udf_getdatacolumn(data, 1)) persisted
alter table TempArchive add value2 as (dbo.udf_getdatacolumn(data, 2)) persisted
alter table TempArchive add value3 as (dbo.udf_getdatacolumn(data, 3)) persisted
alter table TempArchive add value4 as (dbo.udf_getdatacolumn(data, 4)) persisted
GO
insert into TempArchive (ArchiveId, ScheduleId, [Data]) values (1, 2, '<Data>
  <Value>1.0</Value>
  <Value>2.0</Value>
  <Value>3.0</Value>
  <Value>4.0</Value>
</Data>')
GO
select ArchiveId, ScheduleId, Value1, Value2, Value3, Value4 
from TempArchive
GO

Returns:

ArchiveId   ScheduleId  Value1     Value2     Value3     Value4
----------- ----------- ---------- ---------- ---------- ----------
1           2           1          2          3          4

(1 row(s) affected)

Keep in mind that with a lot of data, it's going to take a long time when you first add those computed columns. I'd recommend testing it before putting it into production. It will also increase the size of your table.
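
As a rough usage sketch, assuming the same approach is applied to the real Archives table with matching value1..value4 column names, the original slow query then becomes a plain column select, because the XML was already parsed when each row was written:

select ArchiveId, ScheduleId, value1, value2, value3, value4
from Archives
where ScheduleId = 2499
order by ArchiveId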