How To Delete Duplicate Records in SQL? The Complete Guide
Did you know that you can easily delete duplicate records in SQL without affecting the whole database? There are over 7 million SQL developers worldwide, but not all of them are aware of this.
In this blog article, we will explore how to delete duplicate rows in SQL. We will look at various methods available to remove duplicate data from an SQL database, which can help optimize performance and ensure consistency.
What Does the Duplicate Record Delete in SQL Imply?
Deleting duplicate records in SQL is about removing records from a table that show up more than once. In SQL Server, there are several ways to delete duplicate records in a table. If the table has an index unique for each record, you can use it together with self-joins or NOT IN logic to identify and remove duplicates.
However, if the table does not have such an index, one may choose to utilize the ROW NUMBER() function along with CTE (Common Table Expression), allowing the sorting of data before removing any possible duplicated rows from the said database. To learn more about SQL, we would suggest that you take an online SQL course.
Eliminating Duplicate Entries in SQL with INNER JOIN Syntax
When using a database such as SQL, it is important to avoid having the same information stored multiple times. One way to make sure there are no duplicate entries is by using an INNER JOIN statement. This connects two or more tables so that their content can be compared side-by-side and any duplicates can be spotted easily.
To do this, we perform what’s known as a self-join on our table. This gives us access to all of its rows at once for comparison purposes to remove unnecessary repetitions from the data set without causing damage elsewhere in the system.
This is an example of how the INNER JOIN statement can be used to get rid of any duplicate entries from a table.
SELECT a.column_name
FROM table_name a
INNER JOIN table_name b ON a.column_name = b.column_name
WHERE a.primary_key > b.primary_key;
For example, if we have columns named “product_id”, “product_name”, and “category_id”, then we could execute a query that utilizes an inner join like this:
SELECT a.category_id
FROM products a
INNER JOIN products b ON a.category_id = b.category_id
WHERE a.product_id > b.product_id;
This would allow us to only retain unique values from the ‘categories’ column since it returns all records having matching values in both tables. This effectively eliminates any duplicates coming out on the other side.
How to Delete Duplicate Rows in SQL with Group By and Having Clause?
Let’s understand the procedure.
- Group By: This clause is used to group the rows from a table on one or more columns. For example, if you want to delete duplicate records in your “users” based on their first name and last name, you will use groups for this purpose – GROUP BY `first_name`,`last_name`.
- Having Clause: The Having clauses allow us to specify conditions that filter out returned results after applying the Group By statement. In other words, it filters out records with identical values on column(s) specified using “group by”, leaving behind only distinct (unique) values.
For Example:
1. To begin, we can generate a fresh table containing varying data. Follow these instructions to create a sample table.
BEGIN TRANSACTION;
CREATE TABLE students(ID integer, StudentName text, Grade integer);
/* Add some records to the table */
INSERT INTO students VALUES(1,'Alice',87);
INSERT INTO students VALUES(2,'Benjamin',92);
INSERT INTO students VALUES(3,'Charlie',76);
INSERT INTO students VALUES(4,'Diana',95);
INSERT INTO students VALUES(5,'Ethan',83);
INSERT INTO students VALUES(6,'Fiona',78);
INSERT INTO students VALUES(7,'Gavin',91);
INSERT INTO students VALUES(8,'Hannah',88);
INSERT INTO students VALUES(3,'Charlie',76);
INSERT INTO students VALUES(5,'Ethan',83);
INSERT INTO students VALUES(4,'Diana',95);
COMMIT;
2. Show all the entries contained in the table.
/* Display all the records from the table */
SELECT * FROM students;
3. To delete duplicate records in SQL using the Group By and Having clause, you can follow these steps under SQL.
SELECT StudentName, Grade, COUNT() AS cnt
FROM students
GROUP BY StudentName, Grade
HAVING COUNT() > 1;
How to Delete Duplicates in SQL Using Common Table Expressions (CTE)?
A common table expression, or CTE, is a tool used when writing a SQL query. A CTE creates what’s known as a “temporary result set”. It takes the results of one part of your query and uses them as input into another section. It differs from using regular temporary tables because with those you have to create them first manually before being able to use their contents in your main query.
With Common Table Expressions, these can be automatically created within the scope (the range) of where you specify. The SELECT, INSERT, UPDATE, or DELETE commands are all valid scopes here. It also makes complex recursive queries much easier.
The syntax for CTE is as follows:
Syntax
WITH [CTEName]
As
( Select col1,col2,col3 from [tablename] where [condition]
Select col1,col2,col3 from [CTEName]
1. Begin by creating a table named “Sales_DB” in your SQL database.
This table should include columns, such as “order_id”, “customer_name”, “product_name”, and “order date” to store sales data.
For example, the Sales DB would contain information about each of an organization’s individual transactions.
CREATE TABLE Sales_DB (
order_id INT(10),
customer_name VARCHAR(20),
product_name VARCHAR(25),
order_date DATE
);
You can use the INSERT INTO statement to add records, including duplicates, to the
“Sales_DB” table. For instance:
INSERT INTO Sales_DB (order_id, customer_name, product_name, order_date)
VALUES (1, 'John Doe', 'Product A', '2023-05-01');
INSERT INTO Sales_DB (order_id, customer_name, product_name, order_date)
VALUES (2, 'Jane Smith', 'Product B', '2023-05-02');
INSERT INTO Sales_DB (order_id, customer_name, product_name, order_date)
VALUES (1, 'John Doe', 'Product A', '2023-05-01');
In this code, we are trying to insert three records into the “Sales_DB” table. The first two entries contain unique data for their fields. However, the third record has identical values in its key columns as those already found in the original entry, which would make it duplicate.
2. After executing the script and adding data to the “Sales_DB” table, you can employ this query to find and show information in an orderly manner according to customer name and product.
SELECT * FROM Sales_DB ORDER BY customer_name, product_name;
3. To eliminate duplicate rows in the “Sales_DB” table, one can employ a CTE (Common Table Expression) with the ROW_NUMBER() function, which gives each row within a result set a unique sequential number.
Start by creating and naming the CTE using the WITH keyword to retrieve the required columns from “Sales_DB”. The query should also include assigning unique numbers to every entry based on their order_id column value.
WITH CTE AS (
SELECT order_id, customer_name, product_name, order_date,
ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY order_id) AS row_num
FROM Sales_DB
)
SELECT order_id, customer_name, product_name, order_date
FROM CTE
WHERE row_num > 1
ORDER BY order_id;
4. Once you have verified that the chosen rows are in fact duplicates that need to be removed, you can utilize a DELETE statement instead of a SELECT statement. Change your previous query accordingly.
DELETE FROM CTE WHERE row_num > 1;
Executing this command will delete all duplicate entries from “Sales_DB”, leaving unique records intact.
Here’s a full code snippet:
-- Step 1: Create the Employ_DB table
CREATE TABLE Employ_DB (
emp_no INT,
emp_name VARCHAR(20),
emp_address VARCHAR(25),
emp_eoj DATE
);
-- Step 2: Insert records into the Employ_DB table (including duplicates)
INSERT INTO Employ_DB (emp_no, emp_name, emp_address, emp_eoj)
VALUES
(11, 'Mohith', 'tokyo', '2000-05-12'),
(12, 'John', 'london', '2001-09-28'),
(11, 'Mohith', 'tokyo', '2000-05-12'),
(13, 'Emma', 'paris', '2003-12-10'),
(14, 'David', 'new york', '2004-07-03'),
(12, 'John', 'london', '2001-09-28');
-- Step 3: Delete duplicate rows using CTE
WITH CTE AS (
SELECT emp_no, emp_name, emp_address, emp_eoj,
ROW_NUMBER() OVER (PARTITION BY emp_no, emp_name, emp_address, emp_eoj ORDER BY emp_no) AS row_num
FROM Employ_DB
)
DELETE FROM CTE
WHERE row_num > 1;
-- Step 4: Verify the remaining unique rows in the Employ_DB table
SELECT * FROM Employ_DB;
Using SQL’s Rank Function to Eliminate Duplicate Rows
When dealing with large amounts of data stored in SQL, sometimes the same information can be repeated multiple times. To reduce these duplicate rows and bring unique values to one row, we use what is known as a “Rank function”. Rank functions are also called Window Functions.
There are four types of rank functions that you can use, these are:
- ROW_NUMBER()
- RANK()
- DENSE_RANK()
- NTILE()
These all work slightly differently but they each allow us to gather specific pieces of data from various places into one place without repeating any content unnecessarily. With these functions, we need to include something called a PARTITION BY Clause which divides our databases into subsections so it knows where the relevant information should go within this new single output row.
Here’s an example:
SELECT *,
RANK() OVER (PARTITION BY Category_id, Category_name ORDER BY entry_id DESC) AS rank
FROM Categories AS C
INNER JOIN (
SELECT DISTINCT Category_id, Category_name
FROM Categories
) AS T ON C.entry_id = T.entry_id
WHERE T.rank > 1;
ALTER TABLE Categories
DROP COLUMN entry_id;
SELECT * FROM Categories;
Here is a complete example, beginning with a table and values.
-- Create the Animals table
CREATE TABLE Animals (
entry_id INT,
Animal_id INT,
Animal_name VARCHAR(50)
);
-- Insert sample values into the Animals table
INSERT INTO Animals (entry_id, Animal_id, Animal_name)
VALUES
(1, 1, 'Cat'),
(2, 2, 'Dog'),
(3, 3, 'Cat'),
(4, 1, 'Cat'),
(5, 2, 'Dog');
-- Display the initial contents of the Animals table
SELECT * FROM Animals;
-- Remove duplicate rows using RANK function
DELETE FROM Animals
WHERE (Animal_id, Animal_name) IN (
SELECT Animal_id, Animal_name
FROM (
SELECT Animal_id, Animal_name,
RANK() OVER (PARTITION BY Animal_id, Animal_name ORDER BY entry_id DESC) AS rank
FROM Animals
) AS RankedAnimals
WHERE rank > 1
);
-- Display the final contents of the Animals table after removing duplicate rows
SELECT * FROM Animals;
This code creates a new table called “Animals” with columns for entry ID, animal ID, and name. Sample values are then added to the table. The RANK function is used to identify any duplicate rows based on both their animal IDs and names. These duplicates will be removed from the initial contents of the Animals table, which can be viewed as part of this process before they’re deleted. This leaves us with an output displaying only non-duplicate results in our Animals column.
Conclusion
Several methods can be employed to delete duplicate records in SQL databases. The Group By and Having clause enables one to group the rows according to specified columns and filter out duplicates. Using Common Table Expressions (CTE) along with ROW_NUMBER() will assign each row its own unique number, which can be used for deleting any repeated data.
You can also check out this comprehensive list of SQL interview questions to excel at your next job interview.
FAQs
Removing duplicate records in SQL is important because of the following reasons:
a. It maintains data accuracy.
b. It removes the possibility of incorrect results when performing data analysis.
c. It is crucial for faster query performance as the large size of the database leads to slower query performance.
To delete duplicate rows in SQL using Rownum, you need to use a Select statement with the Rownum function, along with the Group By and Having clauses. For example,
SELECT Column1, Column2, ROW_NUMBER()
OVER (PARTITION BY Column1, Column2 ORDER BY Column1)
AS RowNumber FROM TableName
GROUP BY Column1, Column2 HAVING COUNT(*) > 1
This statement will group data by Column1 and Column2 and count the number of duplicates. Finally, it will return all records with a count greater than 1.
There are two clauses through which duplicate rows can be found in SQL.
GROUP BY: It is used to arrange identical data into groups. This means that if a certain column contains the same values in multiple rows, it will group those rows together.
HAVING: It allows you to filter records based on aggregate functions. By using the HAVING clause in conjunction with the COUNT() function, you can easily find duplicate rows in a table.