Thursday 27 March 2014

Using Rank() to Remove Duplicates

In a recent post on SQL Server Window Functions I looked at "scoring" rows of data based on the value of specified columns.

The same approach can be used to help cleanse data prior to, or as part of ETL processes.  Data issues should be corrected at source, but in the real world this isn't always possible.  The source may be external, there may be issues of data ownership, or there simply may not be time to carry out the process.

It often falls to the SQL Developer to handle these issues. Here I will look at handling the issue of two records in a table for the same entity but with slightly different spelling (I refer to this as  "duplicate" records, though the records are not exact copies).

I've encountered this problem several times at different organisations.  If the two records were identical, it would simply be a case of using a SELECT DISTINCT. You might also consider imposing a unique constraint to ensure there are no actual duplicates.  But the spelling difference means that they are regarded as two completely different entities by the code.

When presenting this to business users, the response has often been "Just take the first one".  But how do you actually just take the first one? And first one based on what exactly?

First let's create some data:   


CREATE TABLE Stones
(
ID          INT IDENTITY(1,1),
FirstName   VARCHAR(10),
LastName    VARCHAR(20)
)

INSERT INTO Stones(FirstName, LastName) 
VALUES ('Mick','Jagger')
INSERT INTO Stones (FirstName, LastName) 
VALUES ('Michael','Jagger')
INSERT INTO Stones(FirstName, LastName) 
VALUES ('Kieth','Richards')
INSERT INTO Stones(FirstName, LastName) 
VALUES ('Ronnie','Wood')
INSERT INTO Stones(FirstName, LastName) 
VALUES ('Charlie','Watts')



SELECT * FROM Stones

ID          FirstName  LastName
----------- ---------- --------------------
1           Mick       Jagger
2           Michael    Jagger
3           Kieth      Richards
4           Ronnie     Wood
5           Charlie    Watts

(5 row(s) affected)

Note IDs 1 and 2 for Mick and Michael Jagger.  These are actually the same person, and we only want to pick up the first one.  We only want 4 rows in our result.

We can use the RANK() function to now score these records, partitioning by LastName (see here for more on how to do this):

  SELECT ID, 
       FirstName, 
       LastName,
       RANK() OVER (PARTITION BY LastName ORDER BY ID ASC) AS Score
  FROM Stones ORDER BY 1

ID          FirstName  LastName             Score
----------- ---------- -------------------- --------------------
1           Mick       Jagger               1
2           Michael    Jagger               2
3           Kieth      Richards             1
4           Ronnie     Wood                 1
5           Charlie    Watts                1

(5 row(s) affected)


I have ordered by ID, on the assumption that "Take the first one" means the first one to enter the table (the seeding of the table I have therefore used as the indicator of "first". If there are timestamps, they can be used instead.) 

Examining the results of the query above shows that we now only require records with a score of 1.  It is then quite a simple process to use the above code in a sub-query with a WHERE clause to restrict to the rows we want:

SELECT * FROM
(
  SELECT ID, 
       FirstName, 
       LastName,
       RANK() OVER (PARTITION BY LastName ORDER BY ID ASC) AS Score
  FROM Stones
)a
WHERE Score = 1

ID          FirstName  LastName             Score
----------- ---------- -------------------- --------------------
1           Mick       Jagger               1
3           Kieth      Richards             1
5           Charlie    Watts                1
4           Ronnie     Wood                 1

(4 row(s) affected)


We are now left only with the four rows we want. All duplicates have been excluded.


Wednesday 5 March 2014

The database principal owns a schema in the database, and cannot be dropped (Error: 15138)

On a recent audit of a database I noticed that there were a lot of logins for a database that were not required.  I set about removing the logins one by one, but soon hit the following error:

Error: 15138 The database principal owns a schema in the database, and cannot be dropped.

Well this threw me a bit, and I could not figure out what was going on and why, this one particular login could not be removed.  As the message indicates, this account owns a schema on the DB. So I opened the user properties from SSMS and sure enough, the user  in question owned three schemas: db_datareader, db_datawriter, db_owner.  However, SSMS will not simply allow you to unselect the checkboxes here, so it is still not so easy to remove this link and allow the user to be deleted!

The schemas owned by the user can also be displayed by running the following query, within the context of the database under scrutiny:

SELECT s.name FROM sys.schemas s
WHERE s.principal_id = USER_ID('UserInQuestion'');

For a more complete picture of all schemas and users associated with the database:

SELECT *, USER_NAME(principal_id) as username
FROM sys.schemas;


In order to remove this user we have to first transfer ownership of these schemas to a different user.  Typically this would be dbo.  This has to be done once for each schema owned, using the ALTER AUTHORISATION ON SCHEMA command:

 ALTER AUTHORISATION ON SCHEMA::db_datareader TO dbo
 ALTER AUTHORISATION ON SCHEMA::db_datawriter TO dbo
 ALTER AUTHORISATION ON SCHEMA::db_owner TO dbo

(note the double colon!)

Re-run the previous scripts to ensure that the user in question no longer owns any schemas.  Now if we again try the delete again, this user will be removed successfully.