Show Menu
Cheatography

Cleaning Data with SQL Cheat Sheet (DRAFT) by

Summarizing the key SQL statements Missing data: WHERE IS NOT NULL WHERE IS NULL COUNT(*) GROUP BY ORDER BY String Functions: SUBSTRING(), LTRIM(), RTRIM(), LEFT(), RIGHT(), CHARINDEX(), LEN(), ISNUMERIC(), ISDATE() Set Operations EXCEPT­/MINUS INTERSECT UNION Dropping Duplicates row_number() OVER ( PARTITION BY WHERE > 1 Replace Through Regular Expressions CASE with LIKE

This is a draft cheat sheet. It is a work in progress and is not finished yet.

Regular Expres­sions

\d
A digit
[]
Any item in brackets
?
previous character 0 or 1 times
+
Previous character 1 or more times
*
Previous character 0 or more times

Identi­fying Missing Data

Missing Completely at Random
Missing Not at Random
SELECT * FROM restau­ran­t_i­nsp­ection WHERE score IS NULL;
SELECT inspec­tio­n_type, COUNT(*) as count FROM restau­ran­t_i­nsp­ection WHERE score IS NULL
SELECT COUNT(*) FROM restau­ran­t_i­nsp­ection WHERE score IS NULL;
GROUP BY inspec­tio­n_type ORDER BY count D

Replacing null values with an average

UPDATE TABLE Patient
SET Income = (SELECT avg(Income ) FROM Patient)
WHERE Income IS NULL;

Dropping Values where a column is null

SELECT count(*)
FROM Patient
WHERE Weight IS NULL;

Dealing with Duplicates

WITH  DuplicateWithNumber   AS  
 ( SELECT  Column1, Column2 
  , row_number() OVER ( PARTITION BY Column1, Column2 
ORDER BY Column3) AS NthAppearance 
  FROM  YourTableName 
  ) 
  DELETE  FROM DuplicateWithNumber  
  WHERE  NthAppearance> 1

Cleaning with Case

UPDATE Project..housing
SET SoldAsVacant=CASE
	WHEN SoldAsVacant = 'Y' THEN 'Yes'
	WHEN SoldAsVacant = 'N' THEN 'No'
	ELSE SoldAsVacant
END
 

Replacing missing values with COALESCE()

The COALESCE() functiin substi­tutes null valuees
SELECT name, COALES­CE(­score, -1), inspec­tio­n_type FROM restau­ran­t_i­nsp­ection

SQL Set Operations

EXCEPT­/MINUS
INTERSECT
UNION
Return the rows that are found in one relation but not the other.
the rows that are found in both relations
e rows that are found in either relation.
Note that input relations must have the same number of columns and compatible data types for the respective columns.

Detecting out of range values

SELECT * FROM series
WHERE rating NOT BETWEEN 0 AND 10

SELECT * FROM series
WHERE rating < 0 OR rating > 10

Replicate: Repeats string x no of times

REPLICATE (string, integer)

REPLICATE('0', 9 - LEN(registration_code))

Checking if a value is a website

select
  case
     when email LIKE '%.com' then email
     else null
  end AS email
from pension_funds

Handy Numeric Functions

IS NUMERIC
Returns true if the expression has a numeric value, false if not
CONVERT
Converts a character string into a number
SELECT CONVER­T(int, 25.65);
TRUNCATE
n truncated to d decimal places. If you skip d, then n is truncated to 0 decimal places. If d is a negative number, the function truncates the number n to d digits left to the decimal point.
SELECT TRUNCA­TE(­123.45­67,2); 100
 

Column values must translate to type to stand

ALTER TABLE
tableName
ALTER COLUMN
columnName
TYPE smallint
USING column­Nam­e::­sma­llint

String Operations

-- SUBSTR

 syntax is:
 SUBSTR(string, starting character position, # of characters):
SELECT incidnt_num,
       date,
       SUBSTR(date, 4, 2) AS day
  FROM tutorial.sf_crime_incidents_2014_01

# Extracting date and time specifically from a column

SELECT incidnt_num,
       date,
       LEFT(date, 10) AS cleaned_date,
       RIGHT(date, 17) AS cleaned_time
  FROM tutorial.sf_crime_incidents_2014_01

# TRIM, to remove characters
 from the beginning and end of a string.
SELECT location,
       TRIM(both '()' FROM location)
  FROM tutorial.sf_crime_incidents_2014_01

# POSITION allows you to specify a substring,
 then returns a numerical value equal to the 
character number (counting from left) where that 
substring first appears in the target string.

SELECT incidnt_num,
       descript,
       POSITION('A' IN descript) AS a_position
  FROM tutorial.sf_crime_incidents_2014_01

# COALESCE can replace the null values

SELECT incidnt_num,
       descript,
       COALESCE(descript, 'No Description')
  FROM tutorial.sf_crime_incidents_cleandate
 ORDER BY descript DESC
Left for time
Right for date

Trim to remove chars at beginning and end

Position to return the location of a character!

COALESCE to replace the nulls!