SQL Automatic deduplication

Détails: 17 janvier 2021; 4530

Database
SQL

All living databases has its duplicates. Depending on their nature or number, it may be interesting to dedupe them quickly. For this goal SQL is our friend.

Below a simple example in order to describe the main concepts of the method. Then you will be able to do this in more complex situations. Because the final formula is a bit long, we will proceed iteratively, to understand the different steps.

Import the attached SQL table into a database.

In this anonymized file we can easily view and count duplicates with a GROUP BY:

SELECT firstname, lastname, email, COUNT(id) AS total
FROM my_users
GROUP BY firstname, lastname, email
HAVING total > 1
ORDER BY total DESC ;

firstname	lastname	email	total
David	ROSS	Cette adresse e-mail est protégée contre les robots spammeurs. Vous devez activer le JavaScript pour la visualiser.	6
Jay	SHORR	Cette adresse e-mail est protégée contre les robots spammeurs. Vous devez activer le JavaScript pour la visualiser.	6
Allan	KOK	Cette adresse e-mail est protégée contre les robots spammeurs. Vous devez activer le JavaScript pour la visualiser.	5
...	...	...	...

This total field counts the duplicates by each record using the unique identifiers (id field). So SQL is able to group the identifiers from duplicates.

Well, now imagine we would able to list these identifiers, to see them, and give them an order of importance. We would then be able to say which ones should be deleted, and which ones should be kept.

Let's start by listing these identifiers beside our duplicates, with a GROUP_CONCAT instruction. It will groups our ids textually, concatenated:

SELECT firstname, lastname, email, COUNT(id) AS total,
GROUP_CONCAT(id) AS all_id_related
FROM my_users
GROUP BY firstname, lastname, email
HAVING total > 1
ORDER BY total DESC ;

firstname	lastname	email	total	all_id_related
David	ROSS	Cette adresse e-mail est protégée contre les robots spammeurs. Vous devez activer le JavaScript pour la visualiser.	6	440,441,442,443,444,445
Jay	SHORR	Cette adresse e-mail est protégée contre les robots spammeurs. Vous devez activer le JavaScript pour la visualiser.	6	867,868,869,870,871,872
Allan	KOK	Cette adresse e-mail est protégée contre les robots spammeurs. Vous devez activer le JavaScript pour la visualiser.	5	71,72,73,74,75
...	...	...	...	...

OK, now assume that the last record in a series of duplicates is the one to keep (typically for a users table, the last record from a same user will be the user's reference record, with the most up-to-date information). We keep here the highest id from a duplicate series.

Let's start by just displaying it beside our duplicates, thanks to a MAX statement. With an ORDER BY in the GROUP_CONCAT.

SELECT firstname, lastname, email, COUNT(id) AS total,
GROUP_CONCAT(id ORDER BY id ASC) AS all_id_related,
MAX(id) AS id_to_keep
FROM my_users
GROUP BY firstname, lastname, email
HAVING total > 1
ORDER BY total DESC ;

firstname	lastname	email	total	all_id_related	id_to_keep
David	ROSS	Cette adresse e-mail est protégée contre les robots spammeurs. Vous devez activer le JavaScript pour la visualiser.	6	440,441,442,443,444,445	445
Jay	SHORR	Cette adresse e-mail est protégée contre les robots spammeurs. Vous devez activer le JavaScript pour la visualiser.	6	867,868,869,870,871,872	872
Allan	KOK	Cette adresse e-mail est protégée contre les robots spammeurs. Vous devez activer le JavaScript pour la visualiser.	5	71,72,73,74,75	75
...	...	...	...	...	...

Well, now let's try to separate our id_to_keep from other id_related, with a simple SQL REPLACE above the GROUP_CONCAT:

SELECT firstname, lastname, email, COUNT(id) AS total,
REPLACE(GROUP_CONCAT(id ORDER BY id ASC), MAX(id), '') AS id_to_delete,
MAX(id) AS id_to_keep
FROM my_users
GROUP BY firstname, lastname, email
HAVING total > 1
ORDER BY total DESC ;

firstname	lastname	email	total	id_to_delete	id_to_keep
David	ROSS	Cette adresse e-mail est protégée contre les robots spammeurs. Vous devez activer le JavaScript pour la visualiser.	6	440,441,442,443,444,	445
Jay	SHORR	Cette adresse e-mail est protégée contre les robots spammeurs. Vous devez activer le JavaScript pour la visualiser.	6	867,868,869,870,871,	872
Allan	KOK	Cette adresse e-mail est protégée contre les robots spammeurs. Vous devez activer le JavaScript pour la visualiser.	5	71,72,73,74,	75
...	...	...	...	...	...

Caution: in this simple case we keep the MAX id record. This protects us from errors if the sequence of number composing the id_to_keep was present in the id_to_delete (the REPLACE could cause errors).

OK we progress. Remove now the last comma, if not it will bother us for the following of the operations. This with a SUBSTRING and its CHAR_LENGTH:

SELECT firstname, lastname, email, COUNT(id) AS total,
SUBSTR(REPLACE(GROUP_CONCAT(id ORDER BY id ASC), MAX(id), ''), 1,
CHAR_LENGTH(REPLACE(GROUP_CONCAT(id ORDER BY id ASC), MAX(id) , '')) - 1) AS id_to_delete,
MAX(id) AS id_to_keep
FROM my_users
GROUP BY firstname, lastname, email
HAVING total > 1
ORDER BY total DESC ;

firstname	lastname	email	total	id_to_delete	id_to_keep
David	ROSS	Cette adresse e-mail est protégée contre les robots spammeurs. Vous devez activer le JavaScript pour la visualiser.	6	440,441,442,443,444	445
Jay	SHORR	Cette adresse e-mail est protégée contre les robots spammeurs. Vous devez activer le JavaScript pour la visualiser.	6	867,868,869,870,871	872
Allan	KOK	Cette adresse e-mail est protégée contre les robots spammeurs. Vous devez activer le JavaScript pour la visualiser.	5	71,72,73,74	75
...	...	...	...	...	...

Great! All that remains is to generate DELETE queries with an IN clause using our id_to_delete:

SELECT COUNT(id) AS total,
CONCAT('DELETE FROM my_users WHERE id IN (',
SUBSTR(REPLACE(GROUP_CONCAT(id ORDER BY id ASC), MAX(id), ''), 1,
CHAR_LENGTH(REPLACE(GROUP_CONCAT(id ORDER BY id ASC), MAX(id) , '')) - 1),
') ;') AS delete_query
FROM my_users
GROUP BY firstname, lastname, email
HAVING total > 1
ORDER BY total DESC ;

Working with MySQL, we are forced to display a HAVING in a GROUP BY ... Rrrrrr!

No matter, use a subquery to make it disappear from our results:

SELECT delete_query FROM (
SELECT COUNT(id) AS total,
CONCAT('DELETE FROM my_users WHERE id IN (',
SUBSTR(REPLACE(GROUP_CONCAT(id ORDER BY id ASC), MAX(id), ''), 1,
CHAR_LENGTH(REPLACE(GROUP_CONCAT(id ORDER BY id ASC), MAX(id) , '')) - 1),
') ;') AS delete_query
FROM my_users
GROUP BY firstname, lastname, email
HAVING total > 1
ORDER BY total DESC) AS x ;

delete_query

DELETE FROM my_users WHERE id IN (440,441,442,443,444) ;

DELETE FROM my_users WHERE id IN (867,868,869,870,871) ;

DELETE FROM my_users WHERE id IN (71,72,73,74) ;

...

Recover your results. Separated by a semicolon, it forms a SQL script that just needs to be executed.

DELETE FROM my_users WHERE id IN (440,441,442,443,444) ;
DELETE FROM my_users WHERE id IN (867,868,869,870,871) ;
DELETE FROM my_users WHERE id IN (71,72,73,74) ;
...

Perfect! You can now focus you on more important things ...

Note that with a previous standardization on a temporary table (correction, harmonisation...), with the exact same ids of course, your final script will be still more efficient (finding still more duplicates).

But our example just eliminates duplicates. Indeed it was only perfect duplicates (exact same name, first name and email).

The next step will be to merge the duplicates. That is, to repatriate the richest information on a single record.

This in a relational database, with potentially phones or addresses located in joined tables. We will see that later ...

Liens ou pièces jointes
Test table	[ ]	104 Ko

Menu

SQL Automatic deduplication