I currently have a service in C# that uses dapper to call a stored procedure that does 2 things: if the customer exists, it grabs the customer GUID and adds it to the CustomerInformations table; if the customer doesn't exist, then it inserts the customer and then returns the GUID and adds it to the CustomerInformations table.
Previously, inserts were taking around 1.75 million records an hour. Now it can barely get 200,000 records an hour. There's around 75 million records in my CustomerInformations table and I am looking to fixing the bottleneck.
For each Customer property, it's iteratively calling a stored procedure. Each stored procedure call could have 2 inserts into the database. First, adding the customer into the Customers table and then adding the property into the CustomerInformations table. I understand that this may not be the most ideal way to store the data but it's not something I can change.
C# service
foreach (var info request.Data)
{
string sql = "add_one_by_customer";
object parameters = new
{
p_customer_first_name = info.FirstName,
p_customer_last_name = info.LastName,
p_customer_property_name = info.PropertyName,
p_customer_property_value = info.PropertyValue
};
try
{
await db.ExecuteAsync(sql, parameters, transaction: transaction, commandType: CommandType.StoredProcedure);
}
catch (Exception e)
{
throw new Exception($"Failed to insert");
}
}
Postgres stored procedure:
CREATE OR REPLACE PROCEDURE add_one_by_customer(
p_customer_first_name VARCHAR,
p_customer_last_name VARCHAR,
p_customer_property_name VARCHAR,
p_customer_property_value VARCHAR,
)
LANGUAGE plpgsql
AS $procedure$
DECLARE p_customer_id uuid;
p_current_item_value varchar;
begin
SELECT INTO p_customer_id,
customer_id
FROM customers
WHERE customer_first_name = p_customer_first_name AND
customer_last_name = p_customer_last_name
limit 1;
IF (p_customer_id IS NULL) THEN
begin
INSERT INTO customers(customer_first_name, customer_last_name)
VALUES (p_customer_first_name, p_customer_last_name) RETURNING customer_id into p_customer_id;
EXCEPTION WHEN unique_violation THEN
p_customer_id = (SELECT custmomer_id
FROM customers
WHERE customer_first_name = p_customer_first_name AND
customer_last_name = p_customer_last_name
END;
end if;
p_current_item_value := (select property_value
from customer_informations
where customer_id = p_customer_id AND
customer_property_name = p_customer_property_name);
if (p_current_item_value is NULL) THEN
INSERT INTO customer_informations(customer_id, customer_property_name, customer_property_value)
VALUES (p_customer_id, p_customer_property_name, p_customer_property_value);
elseif (p_current_item_value is not null AND p_current_item_value != p_item_value) then
UPDATE customer_informations
SET customer_property_value = p_current_item_value
WHERE customer_id = p_customer_id ;
end if;
end; $procedure$;
Currently my CustomerInformations table has a unique constraint on Customer_Id, Customer_property_name.
Things I've tried to enhance:
- Parallelize in the service (which is why you see the unique violation exception line in the stored procedure) this does speed it up but not enough.
- I am considering dropping the unique constraint and the index but I am not sure how easy it'll be to clean up duplicates (other individuals interact with the database)
Any tips or suggestions would be greatly appreciated.
Customer_information unique constraint:
CONSTRAINT ux_customer_informations UNIQUE (customer_id, customer_property_name)
Customers unique constraint:
CONSTRAINT ux_customers UNIQUE (customer_firstname, customer_lastname)
Your current procedure is hugely inefficient. See:
Avoid a nested code block with error handling, that's very expensive. Can be done properly with the "SELECT or INSERT" technique I use below. See:
The second part is an UPSERT in disguise. Also much cheaper now:
This requires a
UNIQUEconstraint on each of the two tables - exactly the ones you declared (ux_customer_informations&ux_customers).If neither
customer_property_valuenorp_current_item_valuecan benull, simplify the finalWHEREclause to: