Delete all rows per group except one using CTE
Introduction
The problem of deleting all duplicate rows from a table while keeping the most recent or oldest row is a common requirement in data management. In this article, we will explore different approaches to solve this problem, including the use of Common Table Expressions (CTEs) and subqueries.
MariaDB, a popular open-source relational database management system, provides several ways to achieve this goal. However, one method has proven to be particularly efficient: using CTEs with row numbering. In this article, we will delve into the world of MariaDB queries and explore how to delete duplicate rows per group except one using CTE.
Background
Before we dive into the solution, let’s briefly discuss the concepts involved:
- Row Numbering: A feature that assigns a unique number to each row within a partition of a result set.
- Common Table Expressions (CTEs): Temporary result sets that are defined within a query. They can be used for recursive queries or as a way to simplify complex queries.
- Subqueries: Queries nested inside another query, often used to perform calculations or comparisons.
Query Syntax
The original query provided in the Stack Overflow question attempts to delete duplicate rows per group except one using CTE:
WITH CTE AS (
SELECT asin, ROW_NUMBER() OVER (PARTITION BY asin ORDER BY created_at) AS n
FROM asin_list
)
DELETE
FROM CTE
WHERE n > 1;
Unfortunately, this query returns an error due to the syntax used in the ROW_NUMBER() function.
Corrected Query Syntax
To solve the problem using CTE, we need to modify the original query. The corrected syntax is as follows:
WITH CTE AS (
SELECT asin, ROW_NUMBER() OVER (PARTITION BY asin ORDER BY created_at DESC) AS n
FROM asin_list
)
DELETE FROM asin_list
WHERE asin IN (
SELECT asin
FROM CTE
WHERE n = 1
);
Here’s an explanation of the changes:
- We use
ROW_NUMBER()withPARTITION BYto assign a unique number to each row within a partition. The order is reversed by usingDESC. - In the
DELETEclause, we select only the rows withn = 1, which correspond to the most recent record for each group.
Inserting Duplicate Rows into Temp Table
As an alternative approach, you can insert duplicate rows from asin_list into a temporary table (temp1) and then delete the excess records. Here’s how to do it:
CREATE TEMPORARY TABLE temp1 AS
SELECT *
FROM asin_list
WHERE created_at = (SELECT MAX(created_at) FROM asin_list);
DELETE FROM asin_list
WHERE asin NOT IN (
SELECT asin
FROM temp1
);
This approach works, but it’s less efficient than the CTE-based solution.
Using Existence and Subqueries
Another way to solve this problem is by using existence subqueries:
SELECT * -- delete
FROM asin_list AS newer
WHERE EXISTS (
SELECT *
FROM asin_list AS older
WHERE older.asin = newer.asin AND (
older.created_at < newer.created_at OR
older.created_at = newer.created_at AND older.pri_key < newer.pri_key
)
);
This query works by finding rows in newer where a matching row exists in older with an earlier created_at or lower pri_key.
Conclusion
In this article, we explored different methods to delete duplicate rows from a table per group except one. We discovered that using CTEs is the most efficient approach and presented the corrected query syntax.
Additionally, we discussed alternative solutions such as inserting duplicate rows into a temporary table and using existence subqueries.
By understanding these concepts and techniques, you can effectively manage your data and delete duplicate records in your database while keeping the most recent or oldest record.
Example Use Case
Suppose we have an asin_list table with the following structure:
+--------+------------+-----------+
| asin | created_at | pri_key |
+--------+------------+-----------+
| A | 2022-01-01 | 10 |
| B | 2022-01-02 | 20 |
| C | 2022-01-03 | 30 |
| A | 2021-12-31 | 40 |
+--------+------------+-----------+
We want to delete all duplicate rows per group except one, keeping the most recent record.
Using the CTE-based solution:
WITH CTE AS (
SELECT asin, ROW_NUMBER() OVER (PARTITION BY asin ORDER BY created_at DESC) AS n
FROM asin_list
)
DELETE FROM asin_list
WHERE asin IN (
SELECT asin
FROM CTE
WHERE n = 1
);
The result will be:
+--------+------------+-----------+
| asin | created_at | pri_key |
+--------+------------+-----------+
| A | 2022-01-01 | 10 |
| B | 2022-01-02 | 20 |
| C | 2022-01-03 | 30 |
+--------+------------+-----------+
As you can see, the duplicate records have been deleted, and only the most recent record remains.
Last modified on 2025-04-01