Given an MS SQL Express 2008 R2 table with records such as the below:

ArchiveId  ScheduleId  Data
391063     62          <Data>....</Data>
391064     62          <Data>....</Data>
391065     63          <Data>....</Data>

where the XML in the Data field is structured like this:
<Data>
    <Value>1.0</Value>
    <Value>2.0</Value>
    <Value>3.0</Value>
    <Value>4.0</Value>
</Data>
What is the fastest way to do a select, given the following catches:
- the query will need to return upwards of 2M+ rows!
- results are selected by ScheduleId
- the Data column's Value nodes must be shredded into columns, with the id fields ArchiveId and ScheduleId included for each row returned
- the number of Value elements is variable per ScheduleId, but always the same for a given ScheduleId
- there are only ever <Value> nodes within the <Data> node, and they are always numbers
- the table can have upwards of 50M rows, and is currently only indexed on ScheduleId (non-unique, non-clustered) and ArchiveId (PK, clustered)
Ideally I'm looking for data of the format:

ArchiveId  ScheduleId  Value1  Value2  etc
391063     62          1.0     2.0
391064     62          1.1     2.1
I have tried to use .nodes() with a cross apply:
select
    ArchiveId,
    ScheduleId,
    v.value('(Value/text())[1]', 'float') as value1,
    v.value('(Value/text())[2]', 'float') as value2,
    v.value('(Value/text())[3]', 'float') as value3,
    v.value('(Value/text())[4]', 'float') as value4
from
    Archives
    cross apply [Data].nodes('//Data') x(v)
where
    ScheduleId = 2499
and a straight .value() query:
select
    ArchiveId,
    ScheduleId,
    Data.value('(/Data/Value/text())[1]', 'float') as value1,
    Data.value('(/Data/Value/text())[2]', 'float') as value2,
    Data.value('(/Data/Value/text())[3]', 'float') as value3,
    Data.value('(/Data/Value/text())[4]', 'float') as value4
from
    Archives
where
    ScheduleId = 2499
order by
    ArchiveId asc
Both work, but they are really slow on large datasets, and I was wondering if there is a quicker way to do this sort of thing on very large numbers of rows. I realise it will take a while no matter what, but what are my best bets here?
There are numerous examples around, but they all involve more complex or dynamic data structures, or some kind of complex selection requirement based on the XML content itself.
The data I have is always the same structure (one Data node and x Value nodes), and the selection criteria are not in the XML at all.
I'm just looking for the simplest way to pull a LOT of records back while flattening the XML to columns.
EDIT: Essentially we store graphing data in the XML to later plot a line graph. Importantly, whilst the number of elements for the same ScheduleId will always be the same, different ScheduleIds will have a different number of Value elements, e.g.:
- all ScheduleId=1 rows have 3 Value elements (time_X, var1_Y, var2_Y)
- all ScheduleId=2 rows have 2 Value elements (time_X, var1_Y)
- all ScheduleId=3 rows have 33 Value elements (time_X, var1_Y, ...) etc.
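Because of that, any flattening query has to have its column list generated per ScheduleId. Roughly how I build it with dynamic SQL (the @ValueCount here is illustrative; it would really come from our own metadata about each schedule):

-- Illustrative: builds "value1..valueN" for one ScheduleId via dynamic SQL.
DECLARE @ScheduleId INT = 2499;
DECLARE @ValueCount INT = 4;  -- would come from our own schedule metadata
DECLARE @cols NVARCHAR(MAX) = N'';
DECLARE @sql  NVARCHAR(MAX);
DECLARE @i    INT = 1;

WHILE @i <= @ValueCount
BEGIN
    SET @cols = @cols + N',
    Data.value(''(/Data/Value/text())[' + CAST(@i AS NVARCHAR(10))
              + N']'', ''float'') as value' + CAST(@i AS NVARCHAR(10));
    SET @i = @i + 1;
END;

SET @sql = N'select ArchiveId, ScheduleId' + @cols + N'
from Archives
where ScheduleId = @ScheduleId
order by ArchiveId;';

EXEC sp_executesql @sql, N'@ScheduleId INT', @ScheduleId = @ScheduleId;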
What might be the best solution is to let your system keep adding the data with the XML column into the Archive table as it is, but then move the data over on a schedule into a normalized table structure for reporting. You could either set up a SQL Agent job or create a service program to move or copy the data to your reporting database. Once the data is in your reporting tables, you can run your reports against them in whichever way meets your requirements.
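A rough sketch of what that scheduled move might look like, assuming reporting tables shaped like the ones described in the next paragraph (the watermark-based incremental copy and all object names here are illustrative, not a definitive implementation):

-- Illustrative reporting tables (shapes as described below).
CREATE TABLE dbo.ArchiveTable (
    ArchiveId  INT NOT NULL PRIMARY KEY,
    ScheduleId INT NOT NULL
);

CREATE TABLE dbo.ArchiveDataPointTable (
    ArchiveId      INT   NOT NULL,
    ValueId        INT   NOT NULL,
    DataPointValue FLOAT NOT NULL,
    PRIMARY KEY (ArchiveId, ValueId)
);

-- Incremental copy, e.g. run from a SQL Agent job.
-- Assumes ArchiveId only ever grows, so MAX() can act as a watermark.
DECLARE @LastArchiveId INT;
SELECT @LastArchiveId = ISNULL(MAX(ArchiveId), 0) FROM dbo.ArchiveTable;

INSERT INTO dbo.ArchiveTable (ArchiveId, ScheduleId)
SELECT ArchiveId, ScheduleId
FROM dbo.Archives
WHERE ArchiveId > @LastArchiveId;

-- Shred each new row's <Value> elements into one row per value.
-- ROW_NUMBER over the .nodes() output is the common idiom for the
-- ordinal; strictly speaking, document order is not guaranteed.
INSERT INTO dbo.ArchiveDataPointTable (ArchiveId, ValueId, DataPointValue)
SELECT
    a.ArchiveId,
    ROW_NUMBER() OVER (PARTITION BY a.ArchiveId ORDER BY (SELECT 0)) AS ValueId,
    v.value('.', 'float') AS DataPointValue
FROM dbo.Archives a
CROSS APPLY a.Data.nodes('/Data/Value') x(v)
WHERE a.ArchiveId > @LastArchiveId;

Shredding each row once during the copy means the expensive XML parsing happens only on new rows, not on every report run.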
Your reporting tables would basically be split into two or three tables. There's the ArchiveTable, made of (ArchiveId, ScheduleId). Then you have your ArchiveDataPointTable, made of (ArchiveId, ValueId, DataPointValue). You could also create a ValuesTable of (ValueId, ValueDescription) if you wanted labels for your data points. Then your chart reports could run off pivot queries with only the data points that you need. Because there would be no string parsing and all the values are numeric, it should be very fast.
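For example, the pivot could be an ordinary cross-tab over the tables sketched above; this sketch uses MAX(CASE ...) rather than the PIVOT operator, and Value1..Value3 assume a three-element schedule:

-- Flatten one ScheduleId back into columns for charting.
SELECT
    a.ArchiveId,
    a.ScheduleId,
    MAX(CASE WHEN d.ValueId = 1 THEN d.DataPointValue END) AS Value1,
    MAX(CASE WHEN d.ValueId = 2 THEN d.DataPointValue END) AS Value2,
    MAX(CASE WHEN d.ValueId = 3 THEN d.DataPointValue END) AS Value3
FROM dbo.ArchiveTable a
JOIN dbo.ArchiveDataPointTable d
    ON d.ArchiveId = a.ArchiveId
WHERE a.ScheduleId = 2499
GROUP BY a.ArchiveId, a.ScheduleId
ORDER BY a.ArchiveId;

With a non-clustered index on ArchiveTable(ScheduleId), this touches only numeric columns at report time, with no XML parsing at all.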