Process¶
Clean, transform, ensure integrity. Prepare data for analysis.
Data integrity is critical: watch for accidental data replication (duplicate records introduced when data is copied or merged). Transform the data so it is more organized and easier to read. The data engineering team owns the data warehouse.
See also
Detailed cleaning patterns live in Data Cleaning — the same phase, more depth.
Why this phase matters¶
Most analyst time is spent here, not in modeling. Cleaning is unglamorous but it's where trust comes from. A model trained on dirty data misleads with confidence.
Pre-cleaning steps¶
- Backup the raw data — never edit the original (a minimal backup sketch follows this list)
- Determine data integrity — accuracy, consistency, completeness
- Connect objectives to data — how do business goals map to the dataset?
- Set thresholds — what counts as "clean enough"?
- Know when to stop collecting
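A minimal sketch of the backup step, assuming the raw data is a local CSV; the paths and file name are illustrative, not from this project:

```python
import shutil
from datetime import date
from pathlib import Path

RAW = Path("data/raw/orders_2026Q1.csv")   # illustrative path
BACKUP_DIR = Path("data/backup")           # illustrative path

def backup_raw(raw_path: Path, backup_dir: Path) -> Path:
    """Copy the raw file, untouched, into a dated backup before any cleaning."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / f"{date.today():%Y%m%d}_{raw_path.name}"
    shutil.copy2(raw_path, target)         # copy2 preserves file metadata
    return target

backup_raw(RAW, BACKUP_DIR)
```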
Data integrity checks¶
| Check | What to verify |
|---|---|
| Completeness | All required fields populated? |
| Uniqueness | No duplicate primary keys? |
| Consistency | Same value formatted the same across rows? |
| Validity | Values within expected range / types? |
| Accuracy | Match source of truth? |
| Timeliness | Is data current? |
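A minimal pandas sketch of some of these checks; the file name, column names, and thresholds are assumptions for illustration:

```python
import pandas as pd

df = pd.read_csv("orders.csv", parse_dates=["created_at"])  # illustrative file

# Completeness: are required fields populated?
missing = df[["order_id", "total", "created_at"]].isna().sum()

# Uniqueness: any duplicate primary keys?
dup_keys = df["order_id"].duplicated().sum()

# Validity: values within expected range / types?
bad_totals = (~df["total"].between(0, 100_000)).sum()  # upper bound is assumed

# Timeliness: is the data current?
stale = df["created_at"].max() < pd.Timestamp.now() - pd.Timedelta(days=30)

print(missing, dup_keys, bad_totals, stale, sep="\n")
```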
Objective ↔ Data¶
- Clean data + business alignment = accurate conclusion
- Alignment + accounting for newly discovered variables and constraints = accurate conclusion
Spreadsheet helpers (pandas equivalents are sketched below):
- `VLOOKUP` — check column values against a reference
- `XLOOKUP` — modern replacement; the lookup column doesn't have to sit left of the return column
- `DATEDIF` — difference between two date columns
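Roughly equivalent lookups in pandas; the `products` reference table and the column names are assumptions:

```python
import pandas as pd

orders = pd.read_csv("orders.csv", parse_dates=["order_date", "ship_date"])
products = pd.read_csv("products.csv")          # assumed reference table

# VLOOKUP / XLOOKUP equivalent: pull reference values by key, then flag misses
merged = orders.merge(products[["sku", "list_price"]], on="sku", how="left")
unmatched = merged["list_price"].isna().sum()   # SKUs missing from the reference

# DATEDIF equivalent: difference between two date columns, in days
merged["days_to_ship"] = (merged["ship_date"] - merged["order_date"]).dt.days
```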
Insufficient Data¶
| Type | Solution |
|---|---|
| Single source | Talk to stakeholders; adjust objective |
| Updating data | Identify trends with what's available |
| Outdated | Find a newer dataset |
| Geographically limited | Seek additional sources covering more regions |
| Sample too small | Increase n; use bootstrap; widen CI |
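The last row mentions the bootstrap; a minimal sketch of a bootstrap confidence interval for a small sample (the numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
sample = np.array([12.1, 9.8, 14.3, 11.0, 10.5, 13.2, 9.9, 12.7])  # small illustrative sample

# Resample with replacement many times and look at the spread of the means
boot_means = [rng.choice(sample, size=len(sample), replace=True).mean()
              for _ in range(10_000)]
lo, hi = np.percentile(boot_means, [2.5, 97.5])  # 95% bootstrap CI for the mean
print(round(lo, 2), round(hi, 2))
```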
Sample Size¶
- Use ≥ 30 when possible (Central Limit Theorem)
- Larger n → narrower CI, smaller margin of error
- Larger n → higher cost
- Sampling bias — sample not representative
- Random sampling — equal chance → more representative
- Confidence level — usually 95% (sometimes 90%)
Calculators: sample-size calculators are widely available online, or compute the margin of error directly (sketch below).
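A minimal sketch of the standard margin-of-error formula for a proportion at a 95% confidence level (z = 1.96); it shows how a larger n narrows the interval:

```python
import math

def margin_of_error(n: int, p: float = 0.5, z: float = 1.96) -> float:
    """Margin of error for a proportion: z * sqrt(p * (1 - p) / n).
    p = 0.5 is the most conservative (widest) assumption."""
    return z * math.sqrt(p * (1 - p) / n)

for n in (30, 100, 400, 1000):
    print(n, round(margin_of_error(n), 3))
# larger n -> smaller margin of error, e.g. n=400 gives ~0.049 (about 5%)
```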
Choose your tools¶
| Volume / complexity | Tool |
|---|---|
| < 10K rows, ad-hoc | Google Sheets / Excel |
| < 1M rows, scripted | Python (pandas) |
| 1M – 100M rows | DuckDB, polars, SQL |
| > 100M rows | BigQuery, Snowflake, Spark |
| Repeat pipelines | dbt, Airflow, Prefect |
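For the mid-size tier, a minimal DuckDB sketch querying a parquet file in place; the file name reuses the cleaning-log output below, and the aggregation is illustrative:

```python
import duckdb

con = duckdb.connect()
result = con.sql("""
    SELECT country, count(*) AS orders, sum(total) AS revenue
    FROM 'orders_2026Q1_clean.parquet'   -- DuckDB reads parquet directly, no load step
    GROUP BY country
    ORDER BY revenue DESC
""").df()
print(result.head())
```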
Data-cleaning Checklist¶
See Data Cleaning for full techniques. Quick version (a profiling sketch follows this list):
- Determine size — large datasets have more issues; affects time/tools
- Categories/labels — diversity of dataset
- Missing data — find nulls; plan remediation
- Unformatted data — check formats
- Data types — pick appropriate methods
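A quick pandas profiling sketch covering these checks; the file name is illustrative:

```python
import pandas as pd

df = pd.read_csv("orders_2026Q1.csv")

print(df.shape)               # size: rows x columns
print(df.dtypes)              # data types drive which methods apply
print(df.isna().sum())        # missing data per column
print(df.nunique())           # diversity of categories / labels
print(df.duplicated().sum())  # fully duplicated rows
```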
Document the cleaning process¶
Keep a change log so you (and others) can reproduce results later.
```
# Cleaning log — orders dataset
2026-03-15
- Loaded raw `orders_2026Q1.csv` (152,433 rows)
- Removed 412 duplicate `order_id`s (kept latest by `created_at`)
- Cast `total` to numeric; 18 rows had a `'$'` prefix → stripped, recast
- Dropped 92 test orders where `email` ends with '@example.test'
- Filled `country` nulls with `'unknown'` (37 rows)
- Output: `orders_2026Q1_clean.parquet` (151,892 rows)
```
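A pandas sketch of the steps recorded in the log; the column names come from the log entries, while the exact parsing details are assumptions:

```python
import pandas as pd

df = pd.read_csv("orders_2026Q1.csv", parse_dates=["created_at"])

# Remove duplicate order_ids, keeping the latest row by created_at
df = df.sort_values("created_at").drop_duplicates("order_id", keep="last")

# Cast total to numeric: strip a leading '$', then recast
df["total"] = pd.to_numeric(df["total"].astype(str).str.lstrip("$"), errors="coerce")

# Drop test orders
df = df[~df["email"].str.endswith("@example.test", na=False)]

# Fill country nulls
df["country"] = df["country"].fillna("unknown")

df.to_parquet("orders_2026Q1_clean.parquet", index=False)
```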
Checklist¶
- Backup made
- Errors identified and documented
- Tools chosen
- Data transformed for analysis
- Cleaning log written
- Deliverable: documentation of any cleaning or manipulation