How MySQL Delete Duplicate Rows

MySQL is a relational dataset that stores data in tables that have rows and columns. However, the data stored in the database may contain duplicate values caused by errors in the applications or users.

In this tutorial, we shall learn how to remove duplicate rows in a MySQL database to reduce database size and help to increase the server’s performance.

Before we continue, we are assuming:

  1. You have MySQL installed and running on your system
  2. You have root access to the database.
  3. You have access to a database for experimenting or testing

NOTE: If you need a sample database to try out the concepts provided in this guide, please consider the Sakila database or download a copy of the database used in this guide.

Resources are provided below:

Basic Usage

Before we begin, we will intentionally create a table containing duplicate values for testing purposes. The SQL queries to perform this action are below:

USE world;
DROP TABLE IF EXISTS users;
CREATE TABLE users (id INT PRIMARY KEY NOT NULL AUTO_INCREMENT, username VARCHAR(10) NOT NULL, full_name VARCHAR(20), email VARCHAR(255) NOT NULL);
INSERT INTO users (username, full_name, email) VALUES
    ("virgo", "Claude M. Mori", "claude@gmail.com"),
    ("pulsa", "Tiffany G. Bailey", "tiffany.g@hotmail.com"),
    ("rocket", "Christopher S. Payton", "chris.p@llc.io"),
    ("darkmatter", "Patricia J. Fox", "foxg@yahoo.com"),
    ("pwnc", "Faye H. Hartley", "hartley@outlook.com"),
    ("darkmatter", "Patricia J. Fox", "foxg@yahoo.com"),
    ("rocket", "Christopher S. Payton", "chris.p@llc.io"),
    ("artemis", "Wesley C. Dillard", "Wes@dillard.org");

Feel free to modify the query above to fit your needs. You should also ensure that you have the database (world) created to avoid errors.

Now, if we get all the data inside the table and in order by username, we will see the duplicates we have as shown:

mysql> use world;
Database changed
mysql> SELECT * FROM users ORDER BY username;
+----+------------+-----------------------+-----------------------+
| id | username   | full_name             | email                 |
+----+------------+-----------------------+-----------------------+
|  8 | artemis    | Wesley C. Dillard     | Wes@dillard.org       |
|  4 | darkmatter | Patricia J. Fox       | foxg@yahoo.com        |
|  6 | darkmatter | Patricia J. Fox       | foxg@yahoo.com        |
|  2 | pulsa      | Tiffany G. Bailey     | tiffany.g@hotmail.com |
|  5 | pwnc       | Faye H. Hartley       | hartley@outlook.com   |
|  3 | rocket     | Christopher S. Payton | chris.p@llc.io        |
|  7 | rocket     | Christopher S. Payton | chris.p@llc.io        |
|  1 | virgo      | Claude M. Mori        | claude@gmail.com      |
+----+------------+-----------------------+-----------------------+

As you can see from the table above, we have two duplicate values that make the database bigger for no reason and cause slow speeds.

Let us now learn how we can remove these values.

#1 – DELETE JOIN

One way to remove duplicate rows in a database is to use the MySQL DELETE JOIN statement. The query, however, uses ids to remove duplicate values.

For example, to remove the duplicate values in the users table above, we can input:

DELETE  table1 FROM users table1 INNER JOIN users table2 WHERE table1.id < table2.id AND table1.email = table2.email;

Once you execute the query above, you will remove the duplicate values as shown in the output below:

mysql> DELETE table1 FROM users table1 INNER JOIN users table2 WHERE table1.id < table2.id AND table1.email = table2.email;
Query OK, 2 rows affected (0.01 sec)
 
mysql> SELECT * FROM users ORDER BY username;
+----+------------+-----------------------+-----------------------+
| id | username   | full_name             | email                 |
+----+------------+-----------------------+-----------------------+
|  8 | artemis    | Wesley C. Dillard     | Wes@dillard.org       |
|  6 | darkmatter | Patricia J. Fox       | foxg@yahoo.com        |
|  2 | pulsa      | Tiffany G. Bailey     | tiffany.g@hotmail.com |
|  5 | pwnc       | Faye H. Hartley       | hartley@outlook.com   |
|  7 | rocket     | Christopher S. Payton | chris.p@llc.io        |
|  1 | virgo      | Claude M. Mori        | claude@gmail.com      |
+----+------------+-----------------------+-----------------------+

#2 – Row_Number() Function

The second method we can implement is to use MySQL row_number() function. This function is supported in MySQL version 8 and higher.

It works by assigning a sequential int value to each row, with rows containing duplicate values getting a value higher than 1.

To learn more about this function, use the resource provided below:

https://dev.mysql.com/doc/refman/8.0/en/window-function-descriptions.html#function_row-number

Consider the query below that returns the id of the rows with duplicate values:

SELECT id FROM (SELECT id, ROW_NUMBER() OVER (PARTITION BY username ORDER BY username) AS row_var FROM users) t1 WHERE row_var > 1;

Once you execute the above query, you should get the list of ids as shown in the output below:

 +----+
| id |
+----+
|  6 |
|  7 |
+----+
2 rows in set (0.01 sec)

If you want to remove the values, simply replace the SELECT statement with the DELETE statement as shown below:

DELETE FROM users WHERE id IN (SELECT id FROM ( SELECT id, ROW_NUMBER() OVER (PARTITION BY username ORDER BY username) AS row_var FROM users) t1 WHERE row_var > 1);

Finally, you can verify that the duplicate values are removed using the SELECT statement.

mysql> SELECT * from users ORDER BY username;
+----+------------+-----------------------+-----------------------+
| id | username   | full_name             | email                 |
+----+------------+-----------------------+-----------------------+
|  8 | artemis    | Wesley C. Dillard     | Wes@dillard.org       |
|  4 | darkmatter | Patricia J. Fox       | foxg@yahoo.com        |
|  2 | pulsa      | Tiffany G. Bailey     | tiffany.g@hotmail.com |
|  5 | pwnc       | Faye H. Hartley       | hartley@outlook.com   |
|  3 | rocket     | Christopher S. Payton | chris.p@llc.io        |
|  1 | virgo      | Claude M. Mori        | claude@gmail.com      |
+----+------------+-----------------------+-----------------------+

Conclusion

In this tutorial, we discussed the two methods of removing duplicate values from a database. Large databases, especially those in common use, may contain many duplicate values from external imports and other errors. Therefore, there is a need to keep purging duplicate values to ensure that applications perform optimally.



from Linux Hint https://ift.tt/2RtCDDR

Post a Comment

0 Comments