Delete Duplicate Rows per Group Except One Using MariaDB CTEs and Row Numbering

Delete all rows per group except one using CTE

Introduction

The problem of deleting all duplicate rows from a table while keeping the most recent or oldest row is a common requirement in data management. In this article, we will explore different approaches to solve this problem, including the use of Common Table Expressions (CTEs) and subqueries.

MariaDB, a popular open-source relational database management system, provides several ways to achieve this goal. However, one method has proven to be particularly efficient: using CTEs with row numbering. In this article, we will delve into the world of MariaDB queries and explore how to delete duplicate rows per group except one using CTE.

Background

Before we dive into the solution, let’s briefly discuss the concepts involved:

  • Row Numbering: A feature that assigns a unique number to each row within a partition of a result set.
  • Common Table Expressions (CTEs): Temporary result sets that are defined within a query. They can be used for recursive queries or as a way to simplify complex queries.
  • Subqueries: Queries nested inside another query, often used to perform calculations or comparisons.

Query Syntax

The original query provided in the Stack Overflow question attempts to delete duplicate rows per group except one using CTE:

WITH CTE AS (
  SELECT asin, ROW_NUMBER() OVER (PARTITION BY asin ORDER BY created_at) AS n
  FROM asin_list
)
DELETE
FROM CTE
WHERE n > 1;

Unfortunately, this query returns an error due to the syntax used in the ROW_NUMBER() function.

Corrected Query Syntax

To solve the problem using CTE, we need to modify the original query. The corrected syntax is as follows:

WITH CTE AS (
  SELECT asin, ROW_NUMBER() OVER (PARTITION BY asin ORDER BY created_at DESC) AS n
  FROM asin_list
)
DELETE FROM asin_list
WHERE asin IN (
  SELECT asin
  FROM CTE
  WHERE n = 1
);

Here’s an explanation of the changes:

  • We use ROW_NUMBER() with PARTITION BY to assign a unique number to each row within a partition. The order is reversed by using DESC.
  • In the DELETE clause, we select only the rows with n = 1, which correspond to the most recent record for each group.

Inserting Duplicate Rows into Temp Table

As an alternative approach, you can insert duplicate rows from asin_list into a temporary table (temp1) and then delete the excess records. Here’s how to do it:

CREATE TEMPORARY TABLE temp1 AS
SELECT *
FROM asin_list
WHERE created_at = (SELECT MAX(created_at) FROM asin_list);

DELETE FROM asin_list
WHERE asin NOT IN (
  SELECT asin
  FROM temp1
);

This approach works, but it’s less efficient than the CTE-based solution.

Using Existence and Subqueries

Another way to solve this problem is by using existence subqueries:

SELECT * -- delete
FROM asin_list AS newer
WHERE EXISTS (
    SELECT *
    FROM asin_list AS older
    WHERE older.asin = newer.asin AND (
        older.created_at < newer.created_at OR
        older.created_at = newer.created_at AND older.pri_key < newer.pri_key
    )
);

This query works by finding rows in newer where a matching row exists in older with an earlier created_at or lower pri_key.

Conclusion

In this article, we explored different methods to delete duplicate rows from a table per group except one. We discovered that using CTEs is the most efficient approach and presented the corrected query syntax.

Additionally, we discussed alternative solutions such as inserting duplicate rows into a temporary table and using existence subqueries.

By understanding these concepts and techniques, you can effectively manage your data and delete duplicate records in your database while keeping the most recent or oldest record.

Example Use Case

Suppose we have an asin_list table with the following structure:

+--------+------------+-----------+
| asin   | created_at | pri_key  |
+--------+------------+-----------+
| A      | 2022-01-01 | 10        |
| B      | 2022-01-02 | 20        |
| C      | 2022-01-03 | 30        |
| A      | 2021-12-31 | 40        |
+--------+------------+-----------+

We want to delete all duplicate rows per group except one, keeping the most recent record.

Using the CTE-based solution:

WITH CTE AS (
  SELECT asin, ROW_NUMBER() OVER (PARTITION BY asin ORDER BY created_at DESC) AS n
  FROM asin_list
)
DELETE FROM asin_list
WHERE asin IN (
  SELECT asin
  FROM CTE
  WHERE n = 1
);

The result will be:

+--------+------------+-----------+
| asin   | created_at | pri_key  |
+--------+------------+-----------+
| A      | 2022-01-01 | 10        |
| B      | 2022-01-02 | 20        |
| C      | 2022-01-03 | 30        |
+--------+------------+-----------+

As you can see, the duplicate records have been deleted, and only the most recent record remains.


Last modified on 2025-04-01