SQL Book

SQL script for data cleaning peoples names to be the correct case

Introduction

When writing Extract, Transform and Load (ETL) routines for a data warehouse there may be times when you are importing data about customers, employees, subscribers etc which includes their name. People may have entered the names into the data sources in different formats, e.g. all upper case, all lower case, capitalised first letter, mixed case etc.

This article demonstrates a data cleaning SQL script that 'repairs' these different name formats into a uniform format of first letter of each part of the name capitalised as well as capitalised letters following a dash (for hyphenated names), spaces (for complete names or middle names) and apostrophes (for O'Neil, O'Murphy etc).

The script uses a number of SQL String Functions to achieve this.

As this script will be used in our data transformation routines we will wrap it in a User Defined Function (UDF) so that we can easily apply it from a SELECT statement. Placing the code in a UDF has the additional benefits of the code being pre-compiled and in a centralised location enabling easy maintenance / debugging of the code.

The complete function definition can be found below. The UDF is called CleanNameCase and takes one input parameter (the name to be cleaned) and returns the cleaned name as a varchar(30).

CREATE FUNCTION [dbo].[CleanNameCase] (
 @Name varchar(30)  
)
RETURNS varchar(30)
AS  
BEGIN 
 -- first trim the name
 SET @Name = LTRIM(RTRIM(@Name))
 -- declare variable to hold the reset name
 DECLARE @Reset varchar(30)
 SET @Reset = ''
 
 If @Name <> '' 
 BEGIN
  -- declare and assign variables that will be used to
  -- loop through each character in the name 
  DECLARE @CharCount int, @LoopCount int
  SET @LoopCount = 1
  SET @CharCount = Len(@Name)
  
  -- should the next character we append be upper case?
  -- first character is always upper case.
  DECLARE @MakeUpper bit 
  SET @MakeUpper = 1

  WHILE @LoopCount <= @CharCount
  BEGIN
   DECLARE @Character char
   SET @Character = Substring(@Name, @LoopCount, 1)
   
   -- append this character to the value we will return
   IF @MakeUpper  = 1 
    SET @Reset  = @Reset + UPPER(@Character)
   ELSE
    SET @Reset = @Reset + LOWER(@Character)
   
   -- work out if the next character should be upper case
   SELECT @MakeUpper = CASE 
    WHEN @Character = '-' THEN  1 
    WHEN @Character = ' ' THEN  1
    WHEN @Character = '''' THEN 1
    ELSE 0   
   END
   
   -- increment the loop counter
   SET @LoopCount = @LoopCount + 1
  END
 END 
 ELSE
 BEGIN
  SET @Reset = ''
 END

 RETURN @Reset
END

A key point of the above script is that a zero length string must be assigned to the @Reset variable before entering the loop. This is because otherwise the line SET @Reset = @Reset + UPPER(@Character) would leave @Reset with a value of NULL as NULL + a character = NULL.

The CASE statement is used to determine whethe the next letter in the name should be upper case or not, you could add more characters to this list if you can think of any others that might require a following upper case letter.

To call this user function in your ETL routine you could use something similar to below. Note you must use the 2 part name when calling a UDF i.e. [owner].[functionname]

INSERT INTO Stage_CleanedCustomers (Title, FirstName,
LastName, CustomerID, DataSource)
SELECT [Title], dbo.CleanNameCase(FirstName),
dbo.CleanNameCase(LastName), CustomerID, 'WebSite' 
FROM WebSiteCustomers

Conclusion

In this article we have demonstrated a script that can be applied to name data from has been extracted from different sources that may / may not use standardised data entry techniques. We hope this SQL script proves useful in your ETL routines and gives you ideas for other such 'data cleaning' scripts that may be required due to inconsistencies in approaches to data entry.

Join our newsletter

Subscribe to our free weekly newsletter and get exclusive SQL tips and articles straight to your inbox. 

 Your email address is only used to send you our newsletter.