Finding duplicate rows in a table can be done easily by using ROW_NUMBER() function. You can number each similar rows by using PARTITION BY clause.
For example, consider the below table:
EmployeeID FirstName MiddleName LastName
———– ———- ———– ———
1 Ken J Sánchez
2 Terri Lee Duffy
3 Gail A Erickson
4 Ken J Sánchez
5 Terri Lee Duffy
6 Gail A Erickson
(6 row(s) affected)
Selecting using DISTINCT does not help here, as we have an identity column [EmployeeID] which id unique for all rows. To find duplicates here we need to compare only common columns [FirstName], [MiddleName] and [LastName]. We can partition data based on these columns for row numbering as below:
SELECT EmployeeID, FirstName, MiddleName, LastName,
ROW_NUMBER() OVER
(PARTITION BY FirstName, MiddleName, LastName
ORDER BY EmployeeID)
AS 'Row Number'
FROM dbo.Table_Employees
Result Set:
EmployeeID FirstName MiddleName LastName Row Number
———– ———- ———– ——— ———–
3 Gail A Erickson 1
6 Gail A Erickson 2
1 Ken J Sánchez 1
4 Ken J Sánchez 2
2 Terri Lee Duffy 1
5 Terri Lee Duffy 2
(6 row(s) affected)
Now we have both unique and duplicate rows from dbo.Table_Employees. All rows with [Row Number] = 1 are unique, and all other rows are duplicate rows. Still we cannot identify that if a row is duplicate, which is a corresponding unique row for the same (except by looking at it , which may not be feasible if there are a large number of rows).
To easily identify a unique row corresponding to a duplicate row, we would like to add another column which gives us the [EmployeeID] of the unique row. In the above example, row with [EmployeeID] = 6 is duplicate row, and we would like to add a column which contains value 3 ([EmployeeID] of corresponding unique row). So the row should look like below:
EmployeeID FirstName MiddleName LastName Row Number Duplicate Of
———– ———- ———– ——— ———– ————-
6 Gail A Erickson 2 3
This can be achieved by:
1. Move all unique rows to TableA (all rows where [RowNumber] = 1),
2. Move all duplicate rows to TableB (all rows where [RowNumber] <> 1),
3. JOIN TableA with TableB to get value for [Duplicate Of].
We don’t need to create actual separate tables here, we can use CTE instead. Final query will be as follows:
WITH CTE_Employees
(EmployeeID, FirstName, MiddleName, LastName, RowNumber)
AS
(
SELECT EmployeeID, FirstName, MiddleName, LastName,
ROW_NUMBER() OVER
(PARTITION BY FirstName, MiddleName, LastName
ORDER BY EmployeeID)
FROM Table_Employees
)
SELECT TableB.EmployeeID, TableB.FirstName, TableB.MiddleName,
TableB.LastName, TableA.EmployeeID AS 'Duplicate Of'
FROM CTE_Employees TableB
INNER JOIN CTE_Employees TableA
ON TableB.FirstName = TableA.FirstName
AND TableB.MiddleName = TableA.MiddleName
AND TableB.LastName = TableA.LastName
WHERE TableB.RowNumber > 1 – Duplicate rows
AND TableA.RowNumber = 1 – Unique rows
Result Set:
EmployeeID FirstName MiddleName LastName Duplicate Of
———– ———- ———– ——— ————-
6 Gail A Erickson 3
4 Ken J Sánchez 1
5 Terri Lee Duffy 2
(3 row(s) affected)
Hope This Helps!
Vishal
If you like this post, do like my Facebook Page -> SqlAndMe
EMail me your questions -> Vishal@SqlAndMe.com
Follow me on Twitter -> @SqlAndMe
Filed under: Common Table Expressions, SQLServer