SQL Server – Identifying unique and duplicate rows in a table

Finding duplicate rows in a table can be done easily by using ROW_NUMBER() function. You can number each similar rows by using PARTITION BY clause.

For example, consider the below table:

EmployeeID FirstName MiddleName LastName

———– ———- ———– ———

1 Ken J Sánchez

2 Terri Lee Duffy

3 Gail A Erickson

4 Ken J Sánchez

5 Terri Lee Duffy

6 Gail A Erickson

(6 row(s) affected)

Selecting using DISTINCT does not help here, as we have an identity column [EmployeeID] which id unique for all rows. To find duplicates here we need to compare only common columns [FirstName], [MiddleName] and [LastName]. We can partition data based on these columns for row numbering as below:

SELECT EmployeeID, FirstName, MiddleName, LastName,

ROW_NUMBER() OVER

(PARTITION BY FirstName, MiddleName, LastName

ORDER BY EmployeeID)

AS 'Row Number'

FROM dbo.Table_Employees

Result Set:

EmployeeID FirstName MiddleName LastName Row Number

———– ———- ———– ——— ———–

3 Gail A Erickson 1

6 Gail A Erickson 2

1 Ken J Sánchez 1

4 Ken J Sánchez 2

2 Terri Lee Duffy 1

5 Terri Lee Duffy 2

(6 row(s) affected)

Now we have both unique and duplicate rows from dbo.Table_Employees. All rows with [Row Number] = 1 are unique, and all other rows are duplicate rows. Still we cannot identify that if a row is duplicate, which is a corresponding unique row for the same (except by looking at it , which may not be feasible if there are a large number of rows).

To easily identify a unique row corresponding to a duplicate row, we would like to add another column which gives us the [EmployeeID] of the unique row. In the above example, row with [EmployeeID] = 6 is duplicate row, and we would like to add a column which contains value 3 ([EmployeeID] of corresponding unique row). So the row should look like below:

EmployeeID FirstName MiddleName LastName Row Number Duplicate Of

———– ———- ———– ——— ———– ————-

6 Gail A Erickson 2 3

This can be achieved by:

1. Move all unique rows to TableA (all rows where [RowNumber] = 1),

2. Move all duplicate rows to TableB (all rows where [RowNumber] <> 1),

3. JOIN TableA with TableB to get value for [Duplicate Of].

We don’t need to create actual separate tables here, we can use CTE instead. Final query will be as follows:

WITH CTE_Employees

(EmployeeID, FirstName, MiddleName, LastName, RowNumber)

(

SELECT EmployeeID, FirstName, MiddleName, LastName,

ROW_NUMBER() OVER

(PARTITION BY FirstName, MiddleName, LastName

ORDER BY EmployeeID)

FROM Table_Employees

)

SELECT TableB.EmployeeID, TableB.FirstName, TableB.MiddleName,

TableB.LastName, TableA.EmployeeID AS 'Duplicate Of'

FROM CTE_Employees TableB

INNER JOIN CTE_Employees TableA

ON TableB.FirstName = TableA.FirstName

AND TableB.MiddleName = TableA.MiddleName

AND TableB.LastName = TableA.LastName

WHERE TableB.RowNumber > 1 – Duplicate rows

AND TableA.RowNumber = 1 – Unique rows

Result Set:

EmployeeID FirstName MiddleName LastName Duplicate Of

———– ———- ———– ——— ————-

6 Gail A Erickson 3

4 Ken J Sánchez 1

5 Terri Lee Duffy 2

(3 row(s) affected)

Hope This Helps!

Vishal

If you like this post, do like my Facebook Page -> SqlAndMe
EMail me your questions -> Vishal@SqlAndMe.com
Follow me on Twitter -> @SqlAndMe

Filed under: Common Table Expressions, SQLServer

Latest Images

Trending Articles

Latest Images