I have a table in which data from sensors (such as temperature) are collected at certain intervals.
Some of this data changes infrequently, so that successive rows differ only in ID.
I would like to be able to find these rows and delete the duplicates leaving only the oldest and youngest of the rows with same data.
At the moment I do it like this:
SELECT
m.name
, c.id
FROM statistics_meta AS m
INNER JOIN statistics AS c ON c.metadata_id = m.id
INNER JOIN statistics AS p ON p.id = (SELECT MAX(t.id) FROM statistics AS t WHERE t.metadata_id = m.id AND t.id < c.id)
INNER JOIN statistics AS n ON n.id = (SELECT MIN(t.id) FROM statistics AS t WHERE t.metadata_id = m.id AND t.id > c.id)
WHERE IFNULL(c.state, 0) = IFNULL(p.state, 0)
AND IFNULL(c.state, 0) = IFNULL(n.state, 0)
* c
- current row, p
- previous, n
- next
Unfortunately, this query takes a long time and as new rows appear, it will take even longer.
How can this be done better?
sample data:
CREATE TABLE statistics_meta (
id integer primary key,
name varchar(10)
);
CREATE TABLE statistics (
id integer primary key,
metadata_id integer,
state integer
);
INSERT INTO statistics_meta (name) values ('temp1'), ('temp2');
INSERT INTO statistics (metadata_id, state) values
(1, 22),
(2, 23),
(1, 23),
(2, 21),
(1, 23),
(2, 22),
(1, 23),
(2, 21),
(1, 23),
(1, 22),
(2, 21),
(2, 21);
SELECT * FROM statistics ORDER BY metadata_id, id;
You talk about ids and temperatures but I find your SQL is difficult to relate to these quantities because you have tables
statistics
andstatistics_meta
and various columnsid
,statistic_id
,metadata_id
andstate
with no description of what these tables and columns are. See Tips for asking a good Structured Query Language (SQL) question. So I will address something I can understand and perhaps you can relate this back to your problem:Assuming you had a table
measurement
with two columns,id
andtemperature
, then the SQL below might be more efficient than what you have achieved; you will have to try it and see. The idea is that if you have a row with someid
value ID andtemperature
value T, then if the previous and next rows (i.e.id
values ID-1 and ID+1) have the sametemperature
value T you can safely delete the row withid
value ID. So we create a subquery that will compute the rows to be deleted and then do the deletion as follows:Displays:
Now delete the duplicate rows:
Displays: