Prepare
Collect, organize, validate. Decide what data is needed and how to get it.
Why this phase matters
The Prepare phase controls everything downstream. Bad sourcing means bad analysis no matter how clever the model. Garbage in, garbage out.
Data Collection Considerations
| Question | Why it matters |
| --- | --- |
| How much data? | Affects sample size and analysis depth |
| What kind (qual/quant)? | Determines methods |
| First/second/third party? | Trust and access constraints |
| Time scope? | Recency vs. trend |
| Continuous or one-shot? | Pipeline vs. ad-hoc |
| Cost to acquire? | Budget vs. precision tradeoff |
| Refresh cadence? | Real-time vs. daily vs. monthly |
ROCCC — good data
| Letter | Meaning | Bad data |
| --- | --- | --- |
| Reliable | Accurate, complete, unbiased | Selection bias, misleading graphs |
| Original | First-party / verified source | Secondary or third-party |
| Comprehensive | Covers needed scope | Partial / missing fields |
| Current | Up to date | Outdated, irrelevant |
| Cited | From credible, citable source | Uncited |
Data parties
| Party | Description | Trust level |
| --- | --- | --- |
| First-party | Collected by your organization directly | Highest |
| Second-party | First-party data shared by a partner | Medium |
| Third-party | Aggregated or purchased | Lower; verify |
Data Ethics
- Ownership — who owns the data
- Transaction transparency — clear collection/use
- Consent — informed agreement
- Currency — up-to-date status
- Privacy — protect individual data subjects; they keep permission over and control of their data
- Openness — free access, usage, sharing
- Compliance — GDPR, CCPA, HIPAA, SOX, PCI-DSS as applicable
Data Types
| Type | Description | Example |
| --- | --- | --- |
| Numerical | Numbers, measurable | Quantity, price, age |
| Text / String | Letters, identifiers (no math) | Product name, address, ID |
| Date | Point in time | Order date, birth date |
| Categorical | Discrete groups | Color (red/blue), tier (gold/silver) |
| Quantitative | Measurable | Rating, upvote count |
| Qualitative | Not measurable | Review text, interview |
| Metadata | Data about data | Schema, lineage, ownership |
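As a quick sketch of how these types show up in practice, the hypothetical `orders` frame below maps each pandas dtype to a row of the table above (all column names and values are made up for illustration):

```python
import pandas as pd

# Hypothetical order records illustrating the data types above.
orders = pd.DataFrame({
    "quantity": [3, 1, 5],                               # numerical
    "product": ["mug", "pen", "mug"],                    # text / string
    "order_date": pd.to_datetime(
        ["2026-03-15", "2026-03-16", "2026-03-16"]),     # date
    "tier": pd.Categorical(["gold", "silver", "gold"]),  # categorical
})
print(orders.dtypes)  # one dtype per column
```

Checking `.dtypes` early catches columns that loaded as the wrong type (e.g. dates read as strings).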
| Long (tidy) | Wide |
| --- | --- |
| One observation per row | Multiple observations per row |
| One variable per column | Variables spread across columns |
| Easier for grouping, plotting | Easier for human reading |
| `pd.melt()` converts wide → long | `pd.pivot()` converts long → wide |
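The wide ↔ long round trip can be sketched with a tiny made-up sales frame (store and month names are hypothetical):

```python
import pandas as pd

# Wide: one row per store, one column per month.
wide = pd.DataFrame({
    "store": ["A", "B"],
    "jan_sales": [100, 150],
    "feb_sales": [110, 140],
})

# Wide -> long: each row becomes one (store, month) observation.
long = pd.melt(wide, id_vars="store", var_name="month", value_name="sales")

# Long -> wide: months become columns again.
wide_again = long.pivot(index="store", columns="month", values="sales")
print(long)
```

The long form is what `groupby` and most plotting libraries expect; the wide form reads better in a report.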
Data Life Cycle
- Plan — what data, who manages
- Capture — collect from sources
- Manage — store, secure, govern
- Analyze — derive insight
- Archive — long-term storage
- Destroy — securely delete all copies, including shared and backup copies
Databases (relational)
```mermaid
erDiagram
    CUSTOMER ||--o{ ORDER : places
    ORDER ||--|{ LINE-ITEM : contains
    CUSTOMER }|..|{ DELIVERY-ADDRESS : uses
```
| Concept | Description |
| --- | --- |
| Primary key | Unique value per row |
| Foreign key | Reference to a PK in another table |
| Index | Speeds lookups on a column |
| Constraint | Rule enforced on values (NOT NULL, UNIQUE, CHECK) |
| Schema | Description of structure (tables, columns, types) |
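The concepts above can be sketched with Python's built-in `sqlite3` module; the table and column names are hypothetical, and note that SQLite enforces foreign keys only after the pragma is enabled:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.execute("""
    CREATE TABLE customer (
        id   INTEGER PRIMARY KEY,   -- primary key: unique per row
        name TEXT NOT NULL          -- constraint: NOT NULL
    )""")
conn.execute("""
    CREATE TABLE "order" (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(id),  -- foreign key
        total       REAL CHECK (total >= 0)                    -- constraint: CHECK
    )""")
# Index: speeds lookups on the FK column.
conn.execute('CREATE INDEX idx_order_customer ON "order"(customer_id)')

conn.execute("INSERT INTO customer VALUES (1, 'Ada')")
conn.execute('INSERT INTO "order" VALUES (1, 1, 42.0)')

# A row pointing at a nonexistent customer is rejected by the FK constraint.
try:
    conn.execute('INSERT INTO "order" VALUES (2, 999, 10.0)')
except sqlite3.IntegrityError:
    print("foreign key violation rejected")
```

`order` is a reserved word in SQL, hence the double quotes; in a real schema a non-reserved name is simpler.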
File Organization
- Folder depth ≤ 5
- Pick a style and stick to it
- File names should include:
  - Project name
  - Creation date (YYYY_MM_DD)
  - Revision version
  - Consistent style and order
- Example: `SalesReport_2026_03_15_v02`
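Encoding the convention in a small helper keeps names consistent; this is a sketch, and the function name is made up:

```python
from datetime import date

def report_filename(project: str, created: date, version: int) -> str:
    """Build a name like SalesReport_2026_03_15_v02 (convention above)."""
    return f"{project}_{created:%Y_%m_%d}_v{version:02d}"

print(report_filename("SalesReport", date(2026, 3, 15), 2))
# -> SalesReport_2026_03_15_v02
```

Zero-padded dates and versions keep an alphabetical file listing in chronological order.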
Storage decisions
| Need | Format / system |
| --- | --- |
| Quick share / collaborative | Google Sheets, Airtable |
| Structured + queryable | Postgres, MySQL, BigQuery |
| Large columnar / analytical | Parquet, BigQuery, Snowflake, DuckDB |
| Raw + cheap | S3 / GCS / Azure Blob |
| Real-time | Kafka, Pub/Sub |
| Versioned files | Git LFS, DVC, lakeFS |
Open data sources
See Resources → Datasets for the full list.
Sample size guidelines
- ≥ 30 observations per group (rule of thumb from the Central Limit Theorem) for a stable sample mean
- Larger sample → narrower CI, smaller margin of error
- Stakes-driven: high-stakes decisions warrant larger samples
- For A/B tests, calculate required sample size up front using power analysis
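The power-analysis step can be sketched with the standard two-proportion sample-size formula, using only the standard library (the baseline and target rates below are hypothetical):

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(p1: float, p2: float,
                          alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate n per group for a two-sided test of p1 vs p2 (pooled-variance-free form)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # ~1.96 for alpha = 0.05
    z_beta = z.inv_cdf(power)            # ~0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

# Hypothetical A/B test: 10% baseline conversion vs. a hoped-for 12%.
print(sample_size_per_group(0.10, 0.12))
```

Note how a small absolute lift (2 percentage points) already demands thousands of users per arm, which is why the calculation belongs up front, before the test runs.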
Checklist
References