Data Cleaning¶
Detailed techniques for the Process phase. Make data trustworthy before analyzing it.
Why clean¶
- Dirty data = incomplete, incorrect, irrelevant
- Clean data + business alignment = accurate conclusions
- Dirty data poisons every downstream chart, model, and decision
Common issues¶
| Issue | Description |
|---|---|
| Duplicate data | Same record appears more than once |
| Outdated data | Old values that should be replaced |
| Incomplete data | Missing fields |
| Inconsistent data | Same thing in different formats |
| Null values | Empty fields |
| Truncated data | Cut off mid-value |
| Mistyped numbers | Stored as text, wrong precision |
| Inconsistent dates | Mix of MM/DD/YYYY and YYYY-MM-DD |
| Misspellings | "USA", "U.S.A.", "United States" |
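Misspelled variants like the country names above are usually collapsed with an explicit mapping. A minimal pandas sketch (the sample data and canonical dictionary are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'country': ['USA', 'U.S.A.', 'United States', 'Canada']})

# Map every known variant to one canonical label
canonical = {'USA': 'United States', 'U.S.A.': 'United States'}
df['country'] = df['country'].replace(canonical)

print(df['country'].unique())  # ['United States' 'Canada']
```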
Tidy data principles (Wickham)¶
- Each variable forms a column
- Each observation forms a row
- Each type of observational unit forms a table
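A wide table that violates these principles can be reshaped into tidy form with `melt`. A small sketch with made-up quarterly sales columns:

```python
import pandas as pd

# Wide: one row per product, one column per quarter (violates "each variable = column")
wide = pd.DataFrame({
    'product': ['A', 'B'],
    'q1_sales': [100, 200],
    'q2_sales': [110, 190],
})

# Tidy: each (product, quarter) observation becomes its own row
tidy = wide.melt(id_vars='product', var_name='quarter', value_name='sales')
```

The result has columns `product`, `quarter`, `sales` and one row per observation, which is the shape groupby and plotting tools expect.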
IQR — outlier detection (1.5 rule)¶
The most common outlier rule.
- Compute Q1 (25th percentile) and Q3 (75th percentile)
- IQR = Q3 − Q1
- Lower bound = Q1 − 1.5 × IQR
- Upper bound = Q3 + 1.5 × IQR
- Anything outside is an outlier
```python
import pandas as pd

q1, q3 = df['salary'].quantile([0.25, 0.75])
iqr = q3 - q1
mask = (df['salary'] < q1 - 1.5 * iqr) | (df['salary'] > q3 + 1.5 * iqr)
outliers = df[mask]
clean = df[~mask]
```
```sql
WITH bounds AS (
    SELECT
        PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY salary) AS q1,
        PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY salary) AS q3
    FROM employees
)
SELECT *
FROM employees, bounds
WHERE salary < q1 - 1.5 * (q3 - q1)
   OR salary > q3 + 1.5 * (q3 - q1);
```
Z-score outlier detection¶
Standardize each value: Z = (x − mean) / standard deviation. Flag rows where |Z| > 3; under a normal distribution, fewer than 0.3% of values fall that far from the mean.
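A minimal pandas sketch of this rule, using a synthetic salary column with one planted outlier:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# 500 plausible salaries plus one planted outlier
df = pd.DataFrame({'salary': np.append(rng.normal(50_000, 5_000, 500), 500_000)})

# Z-score: distance from the mean in standard deviations
z = (df['salary'] - df['salary'].mean()) / df['salary'].std()
outliers = df[z.abs() > 3]  # flags only the planted 500k value
```

Note that a single extreme value inflates the standard deviation, so in very small samples |Z| can never exceed roughly √n and the rule silently misses outliers; the IQR rule above is more robust there.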
Missing data strategies¶
| Strategy | When to use |
|---|---|
| Drop rows | Few missing, MCAR (missing completely at random) |
| Drop columns | Column is mostly empty |
| Mean / median impute | Numeric, low % missing |
| Mode impute | Categorical, low % missing |
| Forward / backward fill | Time series |
| Predictive imputation | High value, large dataset (KNN, MICE) |
| Flag with sentinel | When missingness is informative |
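A few of these strategies in pandas, on a toy frame with hypothetical `age`, `city`, and `temp` columns:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'age': [25, np.nan, 40, 35],
    'city': ['NY', 'NY', None, 'LA'],
    'temp': [20.0, np.nan, np.nan, 23.0],
})

# Numeric, low % missing: median impute
df['age'] = df['age'].fillna(df['age'].median())

# Categorical: mode impute
df['city'] = df['city'].fillna(df['city'].mode()[0])

# Time series: forward fill carries the last observation forward
df['temp'] = df['temp'].ffill()
```

Median and mode are computed from the non-missing values only; forward fill assumes the rows are already in time order.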
Frequently used cleaning functions¶
```python
import pandas as pd

df.drop_duplicates(subset=['email'])
df.dropna(subset=['email'])
df['email'] = df['email'].str.lower().str.strip()
df['amount'] = df['amount'].astype(float)
df['date'] = pd.to_datetime(df['date'], errors='coerce')
df['country'] = df['country'].fillna('unknown')
df['phone'] = df['phone'].str.replace(r'\D', '', regex=True)
df = df.rename(columns={'old': 'new'})
```
SQL cleaning patterns¶
Remove duplicates¶
```sql
WITH ranked AS (
    SELECT id,
           ROW_NUMBER() OVER (PARTITION BY email ORDER BY created_at DESC) AS rn
    FROM users
)
DELETE FROM users WHERE id IN (SELECT id FROM ranked WHERE rn > 1);
```
Normalize strings¶
```sql
UPDATE customers
SET email = LOWER(TRIM(email)),
    name = TRIM(REGEXP_REPLACE(name, '\s+', ' ', 'g'));
```
Handle NULLs¶
```sql
SELECT COALESCE(phone, 'unknown') FROM users;
SELECT NULLIF(TRIM(phone), '') FROM users;
SELECT * FROM users WHERE email IS NOT NULL;
```
Audit before destructive changes¶
```sql
BEGIN;
-- Count what you'll affect
SELECT COUNT(*) FROM users WHERE email IS NULL;
-- Sample
SELECT * FROM users WHERE email IS NULL LIMIT 10;
-- Apply
UPDATE users SET email = 'unknown' WHERE email IS NULL;
-- Verify
SELECT COUNT(*) FROM users WHERE email = 'unknown';
-- COMMIT; or ROLLBACK;
```
Python (pandas) cleaning patterns¶
Inspect first¶
```python
df.info()
df.describe(include='all')
df.isna().sum().sort_values(ascending=False)
df.duplicated().sum()
df.head()
df.sample(5)
```
Strings¶
```python
df['name'] = df['name'].str.strip().str.title()
df['email'] = df['email'].str.lower()
df['phone'] = df['phone'].str.replace(r'\D', '', regex=True)
df = df[df['email'].str.contains('@', na=False)]
```
Types¶
```python
import pandas as pd

df['amount'] = df['amount'].astype(float)
df['date'] = pd.to_datetime(df['date'], errors='coerce')
df['cat'] = df['cat'].astype('category')
```
Outliers (IQR)¶
```python
q1, q3 = df['price'].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[(df['price'] >= q1 - 1.5 * iqr) & (df['price'] <= q3 + 1.5 * iqr)]
```
Group + aggregate¶
```python
df.groupby('country').agg(
    revenue=('amount', 'sum'),
    customers=('user_id', 'nunique'),
    avg_order=('amount', 'mean'),
)
```
Validation with Pandera¶
```python
import pandera as pa
from pandera.typing import Series

class Schema(pa.DataFrameModel):
    user_id: Series[int] = pa.Field(unique=True)
    email: Series[str] = pa.Field(str_matches=r'.+@.+\..+')
    age: Series[int] = pa.Field(ge=0, le=120, nullable=True)

Schema.validate(df)
```
Data-cleaning checklist¶
- Make a backup before cleaning
- Determine size — affects time/tools
- Categories/labels — understand diversity
- Identify missing data — plan remediation
- Identify unformatted data — ensure uniformity
- Explore data types — pick appropriate methods
- Check spelling, misfielded values, duplicates
- Document errors
Data verification checklist¶
- Sources of errors identified with the right tools
- Nulls scanned (filters, conditional formatting)
- Misspellings located
- Mistyped numbers double-checked
- Extra spaces / characters removed (TRIM)
- Duplicates removed
- Mismatched data types recast
- Inconsistent strings normalized
- Date formats consistent
- Column names meaningful
- Truncated data identified
- Business logic sanity check
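Several checklist items can be turned into plain assertions that fail loudly in a pipeline. A sketch with hypothetical columns (the toy frame here stands in for a cleaned dataset):

```python
import pandas as pd

df = pd.DataFrame({
    'email': ['a@x.com', 'b@y.com'],
    'amount': [10.0, 20.0],
    'date': pd.to_datetime(['2024-01-01', '2024-01-02']),
})

# Duplicates removed, nulls handled
assert df['email'].is_unique
assert df['email'].notna().all()

# Mismatched types recast
assert df['amount'].dtype == 'float64'
assert pd.api.types.is_datetime64_any_dtype(df['date'])

# Extra spaces removed
assert (df['email'] == df['email'].str.strip()).all()
```

Running these checks after every cleaning step catches regressions that a one-time visual scan would miss.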
Workflow automation¶
- Great Expectations — data validation framework
- dbt tests — schema and data tests in SQL
- Pandera — DataFrame schema validation
- Towards Data Science — Automating analysis