Find duplicates in SQLite table

47 views Asked by At

I have a table in which data from sensors (such as temperature) are collected at certain intervals.
Some of this data changes infrequently, so that successive rows differ only in ID.
I would like to be able to find these rows and delete the duplicates leaving only the oldest and youngest of the rows with same data.

At the moment I do it like this:

SELECT
    m.name
  , c.id
FROM statistics_meta AS m
INNER JOIN statistics AS c ON c.metadata_id = m.id
INNER JOIN statistics AS p ON p.id = (SELECT MAX(t.id) FROM statistics AS t WHERE t.metadata_id = m.id AND t.id < c.id)
INNER JOIN statistics AS n ON n.id = (SELECT MIN(t.id) FROM statistics AS t WHERE t.metadata_id = m.id AND t.id > c.id)
WHERE IFNULL(c.state, 0) = IFNULL(p.state, 0)
  AND IFNULL(c.state, 0) = IFNULL(n.state, 0)

* c - current row, p - previous, n - next

Unfortunately, this query takes a long time and as new rows appear, it will take even longer.
How can this be done better?

sample data:

CREATE TABLE statistics_meta (
  id integer primary key,
  name varchar(10)
);
CREATE TABLE statistics (
  id integer primary key,
  metadata_id integer,
  state integer
);

INSERT INTO statistics_meta (name) values ('temp1'), ('temp2');
INSERT INTO statistics (metadata_id, state) values
  (1, 22),
  (2, 23),
  (1, 23),
  (2, 21),
  (1, 23),
  (2, 22),
  (1, 23),
  (2, 21),
  (1, 23),
  (1, 22),
  (2, 21),
  (2, 21);

SELECT * FROM statistics ORDER BY metadata_id, id;
1

There are 1 answers

1
Booboo On

You talk about ids and temperatures but I find your SQL is difficult to relate to these quantities because you have tables statistics and statistics_meta and various columns id, statistic_id, metadata_id and state with no description of what these tables and columns are. See Tips for asking a good Structured Query Language (SQL) question. So I will address something I can understand and perhaps you can relate this back to your problem:

Assuming you had a table measurement with two columns, id and temperature, then the SQL below might be more efficient than what you have achieved; you will have to try it and see. The idea is that if you have a row with some id value ID and temperature value T, then if the previous and next rows (i.e. id values ID-1 and ID+1) have the same temperature value T you can safely delete the row with id value ID. So we create a subquery that will compute the rows to be deleted and then do the deletion as follows:

/* create table */
create table measurement (
    id integer primary key,
    temperature integer
);
 
/* create some sample data */
insert into measurement(temperature) values
    (72),
    (75),
    (75),
    (77),
    (77),
    (77),
    (77),
    (78),
    (78),
    (78),
    (79)
;

/* display table */
select * from measurement;

Displays:

id    temperature
--    -----------
1     72
2     75
3     75
4     77
5     77
6     77
7     77
8     78
9     78
10    78
11    79

Now delete the duplicate rows:

/* delete the duplicate rows */
delete from measurement where id in (
    select m1.id from
    measurement m1 join measurement m2 on m1.id+1 = m2.id and m1.temperature = m2.temperature
    join measurement m3 on m1.id-1 = m3.id and m1.temperature = m3.temperature
);


/* redisplay table after deletion */
select * from measurement;

Displays:

id    temperature
--    -----------
1     72
2     75
3     75
4     77
7     77
8     78
10    78
11    79