All living databases has its duplicates. Depending on their nature or number, it may be interesting to dedupe them quickly. For this goal SQL is our friend.
Below a simple example in order to describe the main concepts of the method. Then you will be able to do this in more complex situations. Because the final formula is a bit long, we will proceed iteratively, to understand the different steps.
Import the attached SQL table into a database.
In this anonymized file we can easily view and count duplicates with a GROUP BY
:
SELECT firstname, lastname, email, COUNT(id) AS total FROM my_users GROUP BY firstname, lastname, email HAVING total > 1 ORDER BY total DESC ;
firstname | lastname | total | |
David | ROSS | Cette adresse e-mail est protégée contre les robots spammeurs. Vous devez activer le JavaScript pour la visualiser. | 6 |
Jay | SHORR | Cette adresse e-mail est protégée contre les robots spammeurs. Vous devez activer le JavaScript pour la visualiser. | 6 |
Allan | KOK | Cette adresse e-mail est protégée contre les robots spammeurs. Vous devez activer le JavaScript pour la visualiser. | 5 |
... | ... | ... | ... |
This total field counts the duplicates by each record using the unique identifiers (id field). So SQL is able to group the identifiers from duplicates.
Well, now imagine we would able to list these identifiers, to see them, and give them an order of importance. We would then be able to say which ones should be deleted, and which ones should be kept.
Let's start by listing these identifiers beside our duplicates, with a GROUP_CONCAT
instruction. It will groups our ids textually, concatenated:
SELECT firstname, lastname, email, COUNT(id) AS total, GROUP_CONCAT(id) AS all_id_related FROM my_users GROUP BY firstname, lastname, email HAVING total > 1 ORDER BY total DESC ;
firstname | lastname | total | all_id_related | |
David | ROSS | Cette adresse e-mail est protégée contre les robots spammeurs. Vous devez activer le JavaScript pour la visualiser. | 6 | 440,441,442,443,444,445 |
Jay | SHORR | Cette adresse e-mail est protégée contre les robots spammeurs. Vous devez activer le JavaScript pour la visualiser. | 6 | 867,868,869,870,871,872 |
Allan | KOK | Cette adresse e-mail est protégée contre les robots spammeurs. Vous devez activer le JavaScript pour la visualiser. | 5 | 71,72,73,74,75 |
... | ... | ... | ... | ... |
OK, now assume that the last record in a series of duplicates is the one to keep (typically for a users table, the last record from a same user will be the user's reference record, with the most up-to-date information). We keep here the highest id from a duplicate series.
Let's start by just displaying it beside our duplicates, thanks to a MAX
statement. With an ORDER BY
in the GROUP_CONCAT
.
SELECT firstname, lastname, email, COUNT(id) AS total, GROUP_CONCAT(id ORDER BY id ASC) AS all_id_related, MAX(id) AS id_to_keep FROM my_users GROUP BY firstname, lastname, email HAVING total > 1 ORDER BY total DESC ;
firstname | lastname | total | all_id_related | id_to_keep | |
David | ROSS | Cette adresse e-mail est protégée contre les robots spammeurs. Vous devez activer le JavaScript pour la visualiser. | 6 | 440,441,442,443,444,445 | 445 |
Jay | SHORR | Cette adresse e-mail est protégée contre les robots spammeurs. Vous devez activer le JavaScript pour la visualiser. | 6 | 867,868,869,870,871,872 | 872 |
Allan | KOK | Cette adresse e-mail est protégée contre les robots spammeurs. Vous devez activer le JavaScript pour la visualiser. | 5 | 71,72,73,74,75 | 75 |
... | ... | ... | ... | ... | ... |
Well, now let's try to separate our id_to_keep from other id_related, with a simple SQL REPLACE
above the GROUP_CONCAT
:
SELECT firstname, lastname, email, COUNT(id) AS total, REPLACE(GROUP_CONCAT(id ORDER BY id ASC), MAX(id), '') AS id_to_delete, MAX(id) AS id_to_keep FROM my_users GROUP BY firstname, lastname, email HAVING total > 1 ORDER BY total DESC ;
firstname | lastname | total | id_to_delete | id_to_keep | |
David | ROSS | Cette adresse e-mail est protégée contre les robots spammeurs. Vous devez activer le JavaScript pour la visualiser. | 6 | 440,441,442,443,444, | 445 |
Jay | SHORR | Cette adresse e-mail est protégée contre les robots spammeurs. Vous devez activer le JavaScript pour la visualiser. | 6 | 867,868,869,870,871, | 872 |
Allan | KOK | Cette adresse e-mail est protégée contre les robots spammeurs. Vous devez activer le JavaScript pour la visualiser. | 5 | 71,72,73,74, | 75 |
... | ... | ... | ... | ... | ... |
Caution: in this simple case we keep the MAX
id record. This protects us from errors if the sequence of number composing the id_to_keep was present in the id_to_delete (the REPLACE
could cause errors).
OK we progress. Remove now the last comma, if not it will bother us for the following of the operations. This with a SUBSTRING
and its CHAR_LENGTH
:
SELECT firstname, lastname, email, COUNT(id) AS total, SUBSTR(REPLACE(GROUP_CONCAT(id ORDER BY id ASC), MAX(id), ''), 1, CHAR_LENGTH(REPLACE(GROUP_CONCAT(id ORDER BY id ASC), MAX(id) , '')) - 1) AS id_to_delete, MAX(id) AS id_to_keep FROM my_users GROUP BY firstname, lastname, email HAVING total > 1 ORDER BY total DESC ;
firstname | lastname | total | id_to_delete | id_to_keep | |
David | ROSS | Cette adresse e-mail est protégée contre les robots spammeurs. Vous devez activer le JavaScript pour la visualiser. | 6 | 440,441,442,443,444 | 445 |
Jay | SHORR | Cette adresse e-mail est protégée contre les robots spammeurs. Vous devez activer le JavaScript pour la visualiser. | 6 | 867,868,869,870,871 | 872 |
Allan | KOK | Cette adresse e-mail est protégée contre les robots spammeurs. Vous devez activer le JavaScript pour la visualiser. | 5 | 71,72,73,74 | 75 |
... | ... | ... | ... | ... | ... |
Great! All that remains is to generate DELETE
queries with an IN
clause using our id_to_delete:
SELECT COUNT(id) AS total, CONCAT('DELETE FROM my_users WHERE id IN (', SUBSTR(REPLACE(GROUP_CONCAT(id ORDER BY id ASC), MAX(id), ''), 1, CHAR_LENGTH(REPLACE(GROUP_CONCAT(id ORDER BY id ASC), MAX(id) , '')) - 1), ') ;') AS delete_query FROM my_users GROUP BY firstname, lastname, email HAVING total > 1 ORDER BY total DESC ;
Working with MySQL, we are forced to display a HAVING
in a GROUP BY
... Rrrrrr!
No matter, use a subquery to make it disappear from our results:
SELECT delete_query FROM ( SELECT COUNT(id) AS total, CONCAT('DELETE FROM my_users WHERE id IN (', SUBSTR(REPLACE(GROUP_CONCAT(id ORDER BY id ASC), MAX(id), ''), 1, CHAR_LENGTH(REPLACE(GROUP_CONCAT(id ORDER BY id ASC), MAX(id) , '')) - 1), ') ;') AS delete_query FROM my_users GROUP BY firstname, lastname, email HAVING total > 1 ORDER BY total DESC) AS x ;
delete_query |
DELETE FROM my_users WHERE id IN (440,441,442,443,444) ; |
DELETE FROM my_users WHERE id IN (867,868,869,870,871) ; |
DELETE FROM my_users WHERE id IN (71,72,73,74) ; |
... |
Recover your results. Separated by a semicolon, it forms a SQL script that just needs to be executed.
DELETE FROM my_users WHERE id IN (440,441,442,443,444) ; DELETE FROM my_users WHERE id IN (867,868,869,870,871) ; DELETE FROM my_users WHERE id IN (71,72,73,74) ; ...
Perfect! You can now focus you on more important things ...
Note that with a previous standardization on a temporary table (correction, harmonisation...), with the exact same ids of course, your final script will be still more efficient (finding still more duplicates).
But our example just eliminates duplicates. Indeed it was only perfect duplicates (exact same name, first name and email).
The next step will be to merge the duplicates. That is, to repatriate the richest information on a single record.
This in a relational database, with potentially phones or addresses located in joined tables. We will see that later ...