Millions of rows table optimization SQL Server


I have a table with millions of rows:

CREATE TABLE [dbo].[RequestIdentities]
(
    [Id] [bigint] IDENTITY(1,1) NOT NULL,
    [UniqueKey] [nvarchar](256) NULL,
    [Timestamp] [datetime] NULL,

    CONSTRAINT [PK_RequestIdentities] 
        PRIMARY KEY CLUSTERED ([Id] ASC)
                    WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, 
                          IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, 
                          ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 80) ON [PRIMARY]
) ON [PRIMARY]
GO

ALTER TABLE [dbo].[RequestIdentities] 
    ADD CONSTRAINT [DF_RequestIdentities_Timestamp]  
        DEFAULT (GETDATE()) FOR [Timestamp]
GO 

A Web API manipulates the database using ADO.NET and does:

  1. Executes this query:

     SELECT 1 
     FROM RequestIdentities WITH (nolock) 
     WHERE UniqueKey = @key
    
  2. If exists:

     if(reader.HasRows)
    
  3. Returns a http response.

  4. Otherwise, it inserts the id into the table:

     INSERT INTO RequestIdentities(UniqueKey) 
     VALUES(@key)
    

There are hundreds of inserts/updates every minute. Is there something I can do to optimize the table, such as custom statistics or indexes?


There are 3 answers

SQLpro On
  1. A FILLFACTOR = 80 on a very big table, and especially on a clustered IDENTITY index, is nonsense. You lose 20% of the space!
  2. If the UniqueKey column is meant to be unique, add a UNIQUE constraint.
  3. [Timestamp] is a reserved word; don't use it as a column name.
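As a sketch, points 2 and 3 could be applied like this (the constraint name and the replacement column name RequestDate are examples, not part of the original schema):

```sql
-- Rename the column to avoid the reserved word (RequestDate is an example name)
EXEC sp_rename 'dbo.RequestIdentities.Timestamp', 'RequestDate', 'COLUMN';

-- Enforce uniqueness of UniqueKey
ALTER TABLE dbo.RequestIdentities
    ADD CONSTRAINT UQ_RequestIdentities_UniqueKey UNIQUE (UniqueKey);
```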

After doing this, you can use the following query:

INSERT INTO dbo.RequestIdentities(UniqueKey)
OUTPUT inserted.* INTO ...
SELECT @key
EXCEPT
SELECT UniqueKey
FROM   dbo.RequestIdentities
WHERE  UniqueKey = @Key;

and return the OUTPUT clause's result set to your client application.

gotqn On

You can add a column containing the hash value of the field you are searching.

First, add new column to the table:

ALTER TABLE [...]
ADD [UniqueKeyHash] VARBINARY(64);

Then, add index on it:

CREATE INDEX IX_..._UniqueKeyHash ON [...]
(
    [UniqueKeyHash] 
);

Populate the values:

UPDATE [...]
SET [UniqueKeyHash] =  HASHBYTES('SHA2_512', UniqueKey);

Modify the CRUD routines to calculate the HASH, too.
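For instance, an INSERT could compute the hash inline (a sketch, assuming the table and column names used above):

```sql
INSERT INTO RequestIdentities (UniqueKey, UniqueKeyHash)
VALUES (@key, HASHBYTES('SHA2_512', @key));
```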

Then, in search:

DECLARE @UniqueKeyHash VARBINARY(64);

SET @UniqueKeyHash = HASHBYTES('SHA2_512', 'some value');

SELECT *
FROM [...]
WHERE [UniqueKeyHash] = @UniqueKeyHash;

Alternatively, you can add the column as computed and persisted in order to skip modifying the CRUD routines.
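A minimal sketch of that alternative (HASHBYTES is deterministic, so the computed column can be PERSISTED and indexed):

```sql
ALTER TABLE RequestIdentities
    ADD [UniqueKeyHash] AS HASHBYTES('SHA2_512', UniqueKey) PERSISTED;
```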

I use this kind of search in various places - one of them is an IP addresses table that contains millions of records and is searched on each user login.

If this seems like too much, you can start by creating an index on UniqueKey.

Panagiotis Kanavos On

1M rows isn't a lot of data for a modern database on machines that use SSDs. A few hundred inserts per minute aren't a lot either. While you can optimize this with e.g. in-memory tables, you have to remove the existing problems first.

In-memory tables can lead to simpler maintenance in some scenarios too.

Problems

This code contains several issues that harm performance.

  • WITH (NOLOCK) is a very bad idea that doesn't improve performance at all. It actually takes more extensive (schema-level) locks, reads dirty, uncommitted data, can return the same data twice, and can even throw an error.
  • The code executes two remote calls, resulting in twice the latency, to insert a single row. Apart from the delay, this means the locks acquired during the SELECT operation need to be held for far longer than needed, potentially blocking other connections that try to use the same table.
  • TOP 1 is a no-op at best. If UniqueKey is really unique, only one result will be returned.

Fix

You can improve this by removing the hints and the separate existence check. An INSERT ... SELECT statement can include a WHERE clause, which means you can write a single query that only inserts a new row if the key doesn't already exist. You can use the OUTPUT clause to return the ID of the new row.

First, you need to add either a UNIQUE index or constraint on the UniqueKey column. Without one of those, this simply isn't unique. Anyone can insert duplicate values. A UNIQUE constraint actually creates a UNIQUE index:

CREATE UNIQUE INDEX IX_RequestIdentities_UniqueKey   
   ON RequestIdentities (UniqueKey);

After that, you can conditionally insert and retrieve the new ID with:

INSERT INTO RequestIdentities (UniqueKey)
    OUTPUT inserted.Id
SELECT @key
WHERE NOT EXISTS (SELECT * 
                  FROM RequestIdentities
                  WHERE UniqueKey = @key)

The query optimizer knows it doesn't need to produce any columns for EXISTS (SELECT * ...), so that doesn't affect performance.

This query will insert the new row and return the new ID. This operation is atomic (it either succeeds or rolls back) so there's no need for an explicit transaction.

You can execute this query through a SqlCommand by using ExecuteScalar(). This will return the new ID, or null if no row was inserted and therefore there are no results:

using (var connection = new SqlConnection(connString))
using (var cmd = new SqlCommand(query, connection))
{
    cmd.Parameters.Add("@key", SqlDbType.NVarChar, 256).Value = key;
    connection.Open();
    var result = cmd.ExecuteScalar();
    if (result != null)
    {
        var newID = (long)result;
        //Use the ID
        ...
    }
}

You can get fancy using C#'s pattern matching syntax:

if(result is long newId)
{
   //Use the ID
}

If this code gets too verbose, you can use a micro-ORM like Dapper:

using (var connection = new SqlConnection(connString))
{
    var result = connection.ExecuteScalar(query, new { key = keyValue });
    if (result is long newId)
    {
        ...
    }
}

Dapper is used by Stack Overflow itself, so its performance is well proven.

Other optimizations

If you find there are too many locks on this table, a possible optimization is to use memory-optimized tables. Note that database servers already buffer data aggressively, so the benefit isn't simply keeping data in RAM.

The real benefit of a memory optimized table is the different logging, locking and access model. Instead of locks, lightweight in-memory latch objects are used. Since the data is already in memory, the server can use different operators and different types of indexes to retrieve and modify objects.

This documentation example uses in-memory tables for two high-traffic tables:

  • The Cart is a durable in-memory table, whose data is persisted to disk. If the server goes down, the carts are preserved.
  • User sessions is a non-durable in-memory table. If the server goes down, we don't care about the sessions.

In this case, the table could be:

CREATE TABLE [dbo].[RequestIdentities]
(
    [Id] [bigint] IDENTITY(1,1) NOT NULL 
        PRIMARY KEY NONCLUSTERED,
    [UniqueKey] [nvarchar](256) NULL,
    [Timestamp] [datetime] NULL
)  
WITH (  
    MEMORY_OPTIMIZED = ON,  
    -- Assuming we want to retain the data
    DURABILITY = SCHEMA_AND_DATA);  
go  

ALTER TABLE RequestIdentities  
    ADD CONSTRAINT RequestIdentities_UniqueKey  
    UNIQUE NONCLUSTERED (UniqueKey);  
go