I am scraping messages about power plant unavailability and converting them into timeseries and storing them in a SQL Server database.
My current structure is the following:
Messages: publicationDate datetime, messageSeriesID nvarchar, version int, messageId identityThe primary key is on
(messageSeriesId, version)Units: messageId int, area nvarchar, fueltype nvarchar, unitname nvarchar tsId identityThe primary key is on
tsId. There is a foreign key relation on tsId between this table andMessages. The main reason for this table is that one message can contain information about multiple power plants.Timeseries: tsId int, delivery datetime, value decimalI have a partition scheme based on delivery, each partition contains a month of data. The primary key is on
(tsId, delivery)and it's partitioned along the monthly partition scheme. There is a foreign key ontsIdtotsIdin theUnitstable.
The Messages and Units tables contain around a million rows each. The Timeseries table contains about 500 million rows.
Now, every time I insert a new batch of data, one row goes into the Messages table, between one and a few (4) go into the Units table, and a lot (up to 100.000s) go into the Timeseries table.
The problem I'm encountering is that inserts into the Timeseries table are too slow (100.000 rows take up to a minute). I already made some improvements on this by setting the fillfactor to 80 instead of 100 when rebuilding the index there. However its still too slow.
And I am a bit puzzled, because the way I understand it is this: every partition contains all rows with delivery in that month, but the primary key is on tsId first and delivery second. So to insert data in this partition, it should simply be placed at the end of the partition (since tsId is the identity column and thus increasing by one every transaction).
The time series that I am trying to insert spans 3 years and therefore 36 partitions. If I, however, create a time series with the same length that falls within a single partition the insert is notable faster (around 1.5 second). Likewise if I create an empty time series table (timeseries_test) with the same structure as the original one, then inserts are also very fast (also for inserting data that spans 3 years). However, querying is done based mainly on delivery, so I don't think partitioning by tsId is a good idea.
If anyone has a suggestion on the structure or methods to improve inserts it would be greatly appreciated.
Create table statements (I changed the order of the primary key on the timeseries table, but it didn't make any difference, in fact it seemed to slow down inserts):
CREATE TABLE [dbo].[remit_messages]
(
[publicationDate] [datetime2](0) NOT NULL,
[version] [int] NOT NULL,
[messageId] [int] IDENTITY(1,1) NOT NULL,
[messageSeriesId] [nvarchar](36) NOT NULL,
PRIMARY KEY CLUSTERED ([messageSeriesId] ASC, [version] ASC)
WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF,
IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON,
ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
/****** Object: Index [dbo_remit_messages_messageId] Script Date: 2024-03-30 13:26:36 ******/
CREATE UNIQUE NONCLUSTERED INDEX [dbo_remit_messages_messageId]
ON [dbo].[remit_messages] ([messageId] ASC)
WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF,
SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF,
DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON,
ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
GO
CREATE TABLE [dbo].[remit_units]
(
[tsId] [int] IDENTITY(1,1) NOT NULL,
[fuelTypeId] [int] NOT NULL,
[areaId] [int] NOT NULL,
[messageId] [int] NOT NULL,
[unitName] [nvarchar](200) NULL,
PRIMARY KEY CLUSTERED ([messageId] ASC, [tsId] ASC)
WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF,
IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON,
ALLOW_PAGE_LOCKS = ON) ON [PRIMARY],
CONSTRAINT [dbo_remit_tsId] UNIQUE NONCLUSTERED ([tsId] ASC)
WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF,
IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON,
ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
/****** Object: Index [dbo_remit_units_tsid] Script Date: 2024-03-30 13:30:39 ******/
CREATE NONCLUSTERED INDEX [dbo_remit_units_tsid]
ON [dbo].[remit_units] ([tsId] ASC)
WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF,
SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF,
ONLINE = OFF, ALLOW_ROW_LOCKS = ON,
ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
GO
ALTER TABLE [dbo].[remit_units] WITH CHECK
ADD FOREIGN KEY([messageId])
REFERENCES [dbo].[remit_messages] ([messageId])
ON UPDATE CASCADE
ON DELETE CASCADE
GO
CREATE TABLE [dbo].[remit_ts]
(
[tsId] [int] NOT NULL,
[delivery] [datetime2](0) NOT NULL,
[available] [decimal](11, 3) NULL,
[unavailable] [decimal](11, 3) NULL,
PRIMARY KEY CLUSTERED ([delivery] ASC, [tsId] ASC)
WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF,
IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON,
ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 80) ON [MonthlyPartitionScheme]([delivery])
) ON [MonthlyPartitionScheme]([delivery])
GO
/****** Object: Index [idx_remit_ts_delivery_inc] Script Date: 2024-03-30 13:33:34 ******/
CREATE NONCLUSTERED INDEX [idx_remit_ts_delivery_inc]
ON [dbo].[remit_ts] ([delivery] ASC)
INCLUDE([tsId], [unavailable], [available])
WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF,
SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF,
ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 80) ON [MonthlyPartitionScheme]([delivery])
GO
ALTER TABLE [dbo].[remit_ts] WITH CHECK
ADD FOREIGN KEY([tsId])
REFERENCES [dbo].[remit_units] ([tsId])
ON UPDATE CASCADE
ON DELETE CASCADE
GO
Actual execution plan: