Process¶
Clean, transform, ensure integrity. Prepare data for analysis.
Data integrity is critical: watch for accidental data replication (duplicate records introduced when data is copied or merged). Transform the data so it is more organized and easier to read. The data engineering team owns the data warehouse.
See also
Detailed cleaning patterns live in Data Cleaning — the same phase, more depth.
Why this phase matters¶
Most analyst time is spent here, not in modeling. Cleaning is unglamorous but it's where trust comes from. A model trained on dirty data misleads with confidence.
Pre-cleaning steps¶
- Backup the raw data — never edit the original (a minimal backup sketch follows this list)
- Determine data integrity — accuracy, consistency, completeness
- Connect objectives to data — how do business goals map to the dataset?
- Set thresholds — what counts as "clean enough"?
- Know when to stop collecting
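A minimal sketch of the backup step, assuming the raw data is a local CSV; the paths and file name are illustrative, not from this project:

```python
import shutil
from datetime import date
from pathlib import Path

RAW = Path("data/raw/orders_2026Q1.csv")   # illustrative path
BACKUP_DIR = Path("data/backup")           # illustrative path

def backup_raw(raw_path: Path, backup_dir: Path) -> Path:
    """Copy the raw file, untouched, into a dated backup before any cleaning."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / f"{date.today():%Y%m%d}_{raw_path.name}"
    shutil.copy2(raw_path, target)         # copy2 preserves file metadata
    return target

backup_raw(RAW, BACKUP_DIR)
```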
Data integrity checks¶
| Check | What to verify |
|---|---|
| Completeness | All required fields populated? |
| Uniqueness | No duplicate primary keys? |
| Consistency | Same value formatted the same across rows? |
| Validity | Values within expected range / types? |
| Accuracy | Match source of truth? |
| Timeliness | Is data current? |
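A minimal pandas sketch of some of these checks; the file name, column names, and thresholds are assumptions for illustration:

```python
import pandas as pd

df = pd.read_csv("orders.csv", parse_dates=["created_at"])  # illustrative file

# Completeness: are required fields populated?
missing = df[["order_id", "total", "created_at"]].isna().sum()

# Uniqueness: any duplicate primary keys?
dup_keys = df["order_id"].duplicated().sum()

# Validity: values within expected range / types?
bad_totals = (~df["total"].between(0, 100_000)).sum()  # upper bound is assumed

# Timeliness: is the data current?
stale = df["created_at"].max() < pd.Timestamp.now() - pd.Timedelta(days=30)

print(missing, dup_keys, bad_totals, stale, sep="\n")
```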
Objective ↔ Data¶
- Clean data + business alignment = accurate conclusion
- Alignment + accounting for newly discovered variables and constraints = accurate conclusion
Spreadsheet helpers (pandas equivalents are sketched below):
- `VLOOKUP` — check column values against a reference
- `XLOOKUP` — modern replacement; the lookup column doesn't have to sit left of the return column
- `DATEDIF` — difference between two date columns
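Roughly equivalent lookups in pandas; the `products` reference table and the column names are assumptions:

```python
import pandas as pd

orders = pd.read_csv("orders.csv", parse_dates=["order_date", "ship_date"])
products = pd.read_csv("products.csv")          # assumed reference table

# VLOOKUP / XLOOKUP equivalent: pull reference values by key, then flag misses
merged = orders.merge(products[["sku", "list_price"]], on="sku", how="left")
unmatched = merged["list_price"].isna().sum()   # SKUs missing from the reference

# DATEDIF equivalent: difference between two date columns, in days
merged["days_to_ship"] = (merged["ship_date"] - merged["order_date"]).dt.days
```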
Insufficient Data¶
| Type | Solution |
|---|---|
| Single source | Talk to stakeholders; adjust objective |
| Updating data | Identify trends with what's available |
| Outdated | Find a newer dataset |
| Geographically limited | Seek additional sources covering more regions |
| Sample too small | Increase n; use bootstrap; widen CI |
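The last row mentions the bootstrap; a minimal sketch of a bootstrap confidence interval for a small sample (the numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
sample = np.array([12.1, 9.8, 14.3, 11.0, 10.5, 13.2, 9.9, 12.7])  # small illustrative sample

# Resample with replacement many times and look at the spread of the means
boot_means = [rng.choice(sample, size=len(sample), replace=True).mean()
              for _ in range(10_000)]
lo, hi = np.percentile(boot_means, [2.5, 97.5])  # 95% bootstrap CI for the mean
print(round(lo, 2), round(hi, 2))
```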
Sample Size¶
- Use ≥ 30 when possible (Central Limit Theorem)
- Larger n → narrower CI, smaller margin of error
- Larger n → higher cost
- Sampling bias — sample not representative
- Random sampling — equal chance → more representative
- Confidence level — usually 95% (sometimes 90%)
Calculators: sample-size calculators are widely available online, or compute the margin of error directly (sketch below).
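A minimal sketch of the standard margin-of-error formula for a proportion at a 95% confidence level (z = 1.96); it shows how a larger n narrows the interval:

```python
import math

def margin_of_error(n: int, p: float = 0.5, z: float = 1.96) -> float:
    """Margin of error for a proportion: z * sqrt(p * (1 - p) / n).
    p = 0.5 is the most conservative (widest) assumption."""
    return z * math.sqrt(p * (1 - p) / n)

for n in (30, 100, 400, 1000):
    print(n, round(margin_of_error(n), 3))
# larger n -> smaller margin of error, e.g. n=400 gives ~0.049 (about 5%)
```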
Choose your tools¶
| Volume / complexity | Tool |
|---|---|
| < 10K rows, ad-hoc | Google Sheets / Excel |
| < 1M rows, scripted | Python (pandas) |
| 1M – 100M rows | DuckDB, polars, SQL |
| > 100M rows | BigQuery, Snowflake, Spark |
| Repeat pipelines | dbt, Airflow, Prefect |
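For the mid-size tier, a minimal DuckDB sketch querying a parquet file in place; the file name reuses the cleaning-log output below, and the aggregation is illustrative:

```python
import duckdb

con = duckdb.connect()
result = con.sql("""
    SELECT country, count(*) AS orders, sum(total) AS revenue
    FROM 'orders_2026Q1_clean.parquet'   -- DuckDB reads parquet directly, no load step
    GROUP BY country
    ORDER BY revenue DESC
""").df()
print(result.head())
```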
Data-cleaning Checklist¶
See Data Cleaning for full techniques. Quick version (a profiling sketch follows this list):
- Determine size — large datasets have more issues; affects time/tools
- Categories/labels — diversity of dataset
- Missing data — find nulls; plan remediation
- Unformatted data — check formats
- Data types — pick appropriate methods
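A quick pandas profiling sketch covering these checks; the file name is illustrative:

```python
import pandas as pd

df = pd.read_csv("orders_2026Q1.csv")

print(df.shape)               # size: rows x columns
print(df.dtypes)              # data types drive which methods apply
print(df.isna().sum())        # missing data per column
print(df.nunique())           # diversity of categories / labels
print(df.duplicated().sum())  # fully duplicated rows
```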
Document the cleaning process¶
Keep a change log so you (and others) can reproduce results later.
```
# Cleaning log — orders dataset
2026-03-15
- Loaded raw `orders_2026Q1.csv` (152,433 rows)
- Removed 412 duplicate `order_id`s (kept latest by `created_at`)
- Cast `total` to numeric; 18 rows had a `'$'` prefix → stripped, recast
- Dropped 92 test orders where `email` ends with '@example.test'
- Filled `country` nulls with `'unknown'` (37 rows)
- Output: `orders_2026Q1_clean.parquet` (151,892 rows)
```
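A pandas sketch of the steps recorded in the log; the column names come from the log entries, while the exact parsing details are assumptions:

```python
import pandas as pd

df = pd.read_csv("orders_2026Q1.csv", parse_dates=["created_at"])

# Remove duplicate order_ids, keeping the latest row by created_at
df = df.sort_values("created_at").drop_duplicates("order_id", keep="last")

# Cast total to numeric: strip a leading '$', then recast
df["total"] = pd.to_numeric(df["total"].astype(str).str.lstrip("$"), errors="coerce")

# Drop test orders
df = df[~df["email"].str.endswith("@example.test", na=False)]

# Fill country nulls
df["country"] = df["country"].fillna("unknown")

df.to_parquet("orders_2026Q1_clean.parquet", index=False)
```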
Checklist¶
- Backup made
- Errors identified and documented
- Tools chosen
- Data transformed for analysis
- Cleaning log written
- Deliverable: documentation of any cleaning or manipulation