Thursday, 27 March 2014

Using Rank() to Remove Duplicates

In a recent post on SQL Server Window Functions I looked at "scoring" rows of data based on the value of specified columns.

The same approach can be used to help cleanse data prior to, or as part of ETL processes.  Data issues should be corrected at source, but in the real world this isn't always possible.  The source may be external, there may be issues of data ownership, or there simply may not be time to carry out the process.

It often falls to the SQL Developer to handle these issues. Here I will look at handling the issue of two records in a table for the same entity but with slightly different spelling (I refer to this as  "duplicate" records, though the records are not exact copies).

I've encountered this problem several times at different organisations.  If the two records were identical, it would simply be a case of using a SELECT DISTINCT. You might also consider imposing a unique constraint to ensure there are no actual duplicates.  But the spelling difference means that they are regarded as two completely different entities by the code.

When presenting this to business users, the response has often been "Just take the first one".  But how do you actually just take the first one? And first one based on what exactly?

First let's create some data:   


CREATE TABLE Stones
(
ID          INT IDENTITY(1,1),
FirstName   VARCHAR(10),
LastName    VARCHAR(20)
)

INSERT INTO Stones(FirstName, LastName) 
VALUES ('Mick','Jagger')
INSERT INTO Stones (FirstName, LastName) 
VALUES ('Michael','Jagger')
INSERT INTO Stones(FirstName, LastName) 
VALUES ('Kieth','Richards')
INSERT INTO Stones(FirstName, LastName) 
VALUES ('Ronnie','Wood')
INSERT INTO Stones(FirstName, LastName) 
VALUES ('Charlie','Watts')



SELECT * FROM Stones

ID          FirstName  LastName
----------- ---------- --------------------
1           Mick       Jagger
2           Michael    Jagger
3           Kieth      Richards
4           Ronnie     Wood
5           Charlie    Watts

(5 row(s) affected)

Note IDs 1 and 2 for Mick and Michael Jagger.  These are actually the same person, and we only want to pick up the first one.  We only want 4 rows in our result.

We can use the RANK() function to now score these records, partitioning by LastName (see here for more on how to do this):

  SELECT ID, 
       FirstName, 
       LastName,
       RANK() OVER (PARTITION BY LastName ORDER BY ID ASC) AS Score
  FROM Stones ORDER BY 1

ID          FirstName  LastName             Score
----------- ---------- -------------------- --------------------
1           Mick       Jagger               1
2           Michael    Jagger               2
3           Kieth      Richards             1
4           Ronnie     Wood                 1
5           Charlie    Watts                1

(5 row(s) affected)


I have ordered by ID, on the assumption that "Take the first one" means the first one to enter the table (the seeding of the table I have therefore used as the indicator of "first". If there are timestamps, they can be used instead.) 

Examining the results of the query above shows that we now only require records with a score of 1.  It is then quite a simple process to use the above code in a sub-query with a WHERE clause to restrict to the rows we want:

SELECT * FROM
(
  SELECT ID, 
       FirstName, 
       LastName,
       RANK() OVER (PARTITION BY LastName ORDER BY ID ASC) AS Score
  FROM Stones
)a
WHERE Score = 1

ID          FirstName  LastName             Score
----------- ---------- -------------------- --------------------
1           Mick       Jagger               1
3           Kieth      Richards             1
5           Charlie    Watts                1
4           Ronnie     Wood                 1

(4 row(s) affected)


We are now left only with the four rows we want. All duplicates have been excluded.


No comments:

Post a Comment