Duplicate detection and removal (DISTINCT, GROUP BY, ROW_NUMBER)

In the beginner course, you learned how to retrieve and summarize data using SELECT and GROUP BY. Now we focus on identifying and removing duplicate records to keep your data clean and accurate.

Duplicates occur when the same data appears more than once. SQL provides multiple ways to detect and handle them.

This pattern is perfect for:
  • Identifying duplicate records
  • Removing repeated values
  • Cleaning datasets before analysis
  • Ensuring accurate reporting

Basic Patterns

Using DISTINCT (remove duplicates)

SELECT DISTINCT column
FROM table_name;

Using GROUP BY (find duplicates)

SELECT column,
       COUNT(*) AS CountRows
FROM table_name
GROUP BY column
HAVING COUNT(*) > 1;

Using ROW_NUMBER (identify duplicates)

SELECT *,
       ROW_NUMBER() OVER (PARTITION BY column ORDER BY column) AS row_num
FROM table_name;

Example 1: Remove duplicate programs

SELECT DISTINCT program
FROM students;

This returns each program only once.

Example 2: Find duplicate departments in courses

SELECT department,
       COUNT(*) AS CountRows
FROM courses
GROUP BY department
HAVING COUNT(*) > 1;

This shows departments that appear more than once.

Example 3: Identify duplicate enrollments by student

SELECT student_id,
       course_id,
       ROW_NUMBER() OVER (PARTITION BY student_id ORDER BY course_id) AS row_num
FROM enrollments;

This assigns a number to each row within a group.

Example 4: Keep only the first record per student

SELECT *
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY student_id ORDER BY enrollment_id) AS row_num
    FROM enrollments
) AS t
WHERE row_num = 1;

This removes duplicate rows by keeping only the first one.

Why Use Duplicate Detection and Removal?

  • It ensures data accuracy
  • It prevents double counting
  • It cleans messy datasets
  • It prepares data for analysis

Duplicate Detection and Removal Summary

Feature Description
DISTINCT Removes duplicate values
GROUP BY Identifies duplicates using counts
ROW_NUMBER Labels rows to isolate duplicates
Best for Data cleaning and validation

Duplicate handling is like checking a list for repeated names and deciding which ones to keep.

Practice Tasks (Your Turn!)

Task 1: Students Table

Show each program only once.

Task 2: Courses Table

Find departments that appear more than once.

Task 3: Enrollments Table

Assign a row number to each enrollment per student.

SQL Practice Lab: Advanced Querying
Select a Task
Click a task from Practice Tasks to begin.
Results will appear here…