Duplicate detection and removal (DISTINCT, GROUP BY, ROW_NUMBER)
In the beginner course, you learned how to retrieve and summarize data using SELECT and GROUP BY. Now we focus on identifying and removing duplicate records to keep your data clean and accurate.
Duplicates occur when the same data appears more than once. SQL provides multiple ways to detect and handle them.
This pattern is perfect for:
- Identifying duplicate records
- Removing repeated values
- Cleaning datasets before analysis
- Ensuring accurate reporting
Basic Patterns
Using DISTINCT (remove duplicates)
SELECT DISTINCT column FROM table_name;
Using GROUP BY (find duplicates)
SELECT column,
COUNT(*) AS CountRows
FROM table_name
GROUP BY column
HAVING COUNT(*) > 1;
Using ROW_NUMBER (identify duplicates)
SELECT *,
ROW_NUMBER() OVER (PARTITION BY column ORDER BY column) AS row_num
FROM table_name;
Example 1: Remove duplicate programs
SELECT DISTINCT program FROM students;
This returns each program only once.
Example 2: Find duplicate departments in courses
SELECT department,
COUNT(*) AS CountRows
FROM courses
GROUP BY department
HAVING COUNT(*) > 1;
This shows departments that appear more than once.
Example 3: Identify duplicate enrollments by student
SELECT student_id,
course_id,
ROW_NUMBER() OVER (PARTITION BY student_id ORDER BY course_id) AS row_num
FROM enrollments;
This assigns a number to each row within a group.
Example 4: Keep only the first record per student
SELECT *
FROM (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY student_id ORDER BY enrollment_id) AS row_num
FROM enrollments
) AS t
WHERE row_num = 1;
This removes duplicate rows by keeping only the first one.
Why Use Duplicate Detection and Removal?
- It ensures data accuracy
- It prevents double counting
- It cleans messy datasets
- It prepares data for analysis
Duplicate Detection and Removal Summary
| Feature | Description |
|---|---|
| DISTINCT | Removes duplicate values |
| GROUP BY | Identifies duplicates using counts |
| ROW_NUMBER | Labels rows to isolate duplicates |
| Best for | Data cleaning and validation |
Duplicate handling is like checking a list for repeated names and deciding which ones to keep.
Practice Tasks (Your Turn!)
Task 1: Students Table
Show each program only once.
Task 2: Courses Table
Find departments that appear more than once.
Task 3: Enrollments Table
Assign a row number to each enrollment per student.
Click a task from Practice Tasks to begin.

