Data Cleaning¶
Detailed techniques for the Process phase. Make data trustworthy before analyzing it.
Why clean¶
- Dirty data = incomplete, incorrect, irrelevant
- Clean data + business alignment = accurate conclusions
- Dirty data poisons every downstream chart, model, and decision
Common issues¶
| Issue | Description |
|---|---|
| Duplicate data | Same record appears more than once |
| Outdated data | Old values that should be replaced |
| Incomplete data | Missing fields |
| Inconsistent data | Same thing in different formats |
| Null values | Empty fields |
| Truncated data | Cut off mid-value |
| Mistyped numbers | Stored as text, wrong precision |
| Inconsistent dates | Mix of MM/DD/YYYY and YYYY-MM-DD |
| Misspellings | "USA", "U.S.A.", "United States" |
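Misspelled variants like the country names above are usually collapsed with an explicit mapping. A minimal pandas sketch (the sample data and canonical dictionary are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'country': ['USA', 'U.S.A.', 'United States', 'Canada']})

# Map every known variant to one canonical label
canonical = {'USA': 'United States', 'U.S.A.': 'United States'}
df['country'] = df['country'].replace(canonical)

print(df['country'].unique())  # ['United States' 'Canada']
```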
Tidy data principles (Wickham)¶
- Each variable forms a column
- Each observation forms a row
- Each type of observational unit forms a table
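A wide table that violates these principles can be reshaped into tidy form with `melt`. A small sketch with made-up quarterly sales columns:

```python
import pandas as pd

# Wide: one row per product, one column per quarter (violates "each variable = column")
wide = pd.DataFrame({
    'product': ['A', 'B'],
    'q1_sales': [100, 200],
    'q2_sales': [110, 190],
})

# Tidy: each (product, quarter) observation becomes its own row
tidy = wide.melt(id_vars='product', var_name='quarter', value_name='sales')
```

The result has columns `product`, `quarter`, `sales` and one row per observation, which is the shape groupby and plotting tools expect.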
IQR — outlier detection (1.5 rule)¶
The most common outlier rule.
- Compute Q1 (25th percentile) and Q3 (75th percentile)
- IQR = Q3 − Q1
- Lower bound = Q1 − 1.5 × IQR
- Upper bound = Q3 + 1.5 × IQR
- Anything outside is an outlier
```python
import pandas as pd

q1, q3 = df['salary'].quantile([0.25, 0.75])
iqr = q3 - q1
mask = (df['salary'] < q1 - 1.5 * iqr) | (df['salary'] > q3 + 1.5 * iqr)
outliers = df[mask]
clean = df[~mask]
```
```sql
WITH bounds AS (
    SELECT
        PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY salary) AS q1,
        PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY salary) AS q3
    FROM employees
)
SELECT *
FROM employees, bounds
WHERE salary < q1 - 1.5 * (q3 - q1)
   OR salary > q3 + 1.5 * (q3 - q1);
```
Z-score outlier detection¶
Standardize each value: Z = (x − mean) / standard deviation. Flag rows where |Z| > 3; under a normal distribution, fewer than 0.3% of values fall that far from the mean.
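A minimal pandas sketch of this rule, using a synthetic salary column with one planted outlier:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# 500 plausible salaries plus one planted outlier
df = pd.DataFrame({'salary': np.append(rng.normal(50_000, 5_000, 500), 500_000)})

# Z-score: distance from the mean in standard deviations
z = (df['salary'] - df['salary'].mean()) / df['salary'].std()
outliers = df[z.abs() > 3]  # flags only the planted 500k value
```

Note that a single extreme value inflates the standard deviation, so in very small samples |Z| can never exceed roughly √n and the rule silently misses outliers; the IQR rule above is more robust there.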
Missing data strategies¶
| Strategy | When to use |
|---|---|
| Drop rows | Few missing, MCAR (missing completely at random) |
| Drop columns | Column is mostly empty |
| Mean / median impute | Numeric, low % missing |
| Mode impute | Categorical, low % missing |
| Forward / backward fill | Time series |
| Predictive imputation | High value, large dataset (KNN, MICE) |
| Flag with sentinel | When missingness is informative |
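A few of these strategies in pandas, on a toy frame with hypothetical `age`, `city`, and `temp` columns:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'age': [25, np.nan, 40, 35],
    'city': ['NY', 'NY', None, 'LA'],
    'temp': [20.0, np.nan, np.nan, 23.0],
})

# Numeric, low % missing: median impute
df['age'] = df['age'].fillna(df['age'].median())

# Categorical: mode impute
df['city'] = df['city'].fillna(df['city'].mode()[0])

# Time series: forward fill carries the last observation forward
df['temp'] = df['temp'].ffill()
```

Median and mode are computed from the non-missing values only; forward fill assumes the rows are already in time order.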
Frequently used cleaning functions¶
```python
import pandas as pd

df.drop_duplicates(subset=['email'])
df.dropna(subset=['email'])
df['email'] = df['email'].str.lower().str.strip()
df['amount'] = df['amount'].astype(float)
df['date'] = pd.to_datetime(df['date'], errors='coerce')
df['country'] = df['country'].fillna('unknown')
df['phone'] = df['phone'].str.replace(r'\D', '', regex=True)
df = df.rename(columns={'old': 'new'})
```
SQL cleaning patterns¶
Remove duplicates¶
```sql
WITH ranked AS (
    SELECT id,
           ROW_NUMBER() OVER (PARTITION BY email ORDER BY created_at DESC) AS rn
    FROM users
)
DELETE FROM users WHERE id IN (SELECT id FROM ranked WHERE rn > 1);
```
Normalize strings¶
```sql
UPDATE customers
SET email = LOWER(TRIM(email)),
    name = TRIM(REGEXP_REPLACE(name, '\s+', ' ', 'g'));
```
Handle NULLs¶
```sql
SELECT COALESCE(phone, 'unknown') FROM users;
SELECT NULLIF(TRIM(phone), '') FROM users;
SELECT * FROM users WHERE email IS NOT NULL;
```
Audit before destructive changes¶
```sql
BEGIN;
-- Count what you'll affect
SELECT COUNT(*) FROM users WHERE email IS NULL;
-- Sample
SELECT * FROM users WHERE email IS NULL LIMIT 10;
-- Apply
UPDATE users SET email = 'unknown' WHERE email IS NULL;
-- Verify
SELECT COUNT(*) FROM users WHERE email = 'unknown';
-- COMMIT; or ROLLBACK;
```
Python (pandas) cleaning patterns¶
Inspect first¶
```python
df.info()
df.describe(include='all')
df.isna().sum().sort_values(ascending=False)
df.duplicated().sum()
df.head()
df.sample(5)
```
Strings¶
```python
df['name'] = df['name'].str.strip().str.title()
df['email'] = df['email'].str.lower()
df['phone'] = df['phone'].str.replace(r'\D', '', regex=True)
df = df[df['email'].str.contains('@', na=False)]
```
Types¶
```python
import pandas as pd

df['amount'] = df['amount'].astype(float)
df['date'] = pd.to_datetime(df['date'], errors='coerce')
df['cat'] = df['cat'].astype('category')
```
Outliers (IQR)¶
```python
q1, q3 = df['price'].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[(df['price'] >= q1 - 1.5 * iqr) & (df['price'] <= q3 + 1.5 * iqr)]
```
Group + aggregate¶
```python
df.groupby('country').agg(
    revenue=('amount', 'sum'),
    customers=('user_id', 'nunique'),
    avg_order=('amount', 'mean'),
)
```
Validation with Pandera¶
```python
import pandera as pa
from pandera.typing import Series

class Schema(pa.DataFrameModel):
    user_id: Series[int] = pa.Field(unique=True)
    email: Series[str] = pa.Field(str_matches=r'.+@.+\..+')
    age: Series[int] = pa.Field(ge=0, le=120, nullable=True)

Schema.validate(df)
```
Data-cleaning checklist¶
- Make a backup before cleaning
- Determine size — affects time/tools
- Categories/labels — understand diversity
- Identify missing data — plan remediation
- Identify unformatted data — ensure uniformity
- Explore data types — pick appropriate methods
- Check spelling, misfielded values, duplicates
- Document errors
Data verification checklist¶
- Sources of errors identified with the right tools
- Nulls scanned (filters, conditional formatting)
- Misspellings located
- Mistyped numbers double-checked
- Extra spaces / characters removed (TRIM)
- Duplicates removed
- Mismatched data types recast
- Inconsistent strings normalized
- Date formats consistent
- Column names meaningful
- Truncated data identified
- Business logic sanity check
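Several checklist items can be turned into plain assertions that fail loudly in a pipeline. A sketch with hypothetical columns (the toy frame here stands in for a cleaned dataset):

```python
import pandas as pd

df = pd.DataFrame({
    'email': ['a@x.com', 'b@y.com'],
    'amount': [10.0, 20.0],
    'date': pd.to_datetime(['2024-01-01', '2024-01-02']),
})

# Duplicates removed, nulls handled
assert df['email'].is_unique
assert df['email'].notna().all()

# Mismatched types recast
assert df['amount'].dtype == 'float64'
assert pd.api.types.is_datetime64_any_dtype(df['date'])

# Extra spaces removed
assert (df['email'] == df['email'].str.strip()).all()
```

Running these checks after every cleaning step catches regressions that a one-time visual scan would miss.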
Workflow automation¶
- Great Expectations — data validation framework
- dbt tests — schema and data tests in SQL
- Pandera — DataFrame schema validation
- Towards Data Science — Automating analysis