
Prepare

Collect, organize, validate. Decide what data is needed and how to get it.

Why this phase matters

The Prepare phase controls everything downstream: bad sourcing means bad analysis, no matter how clever the model. Garbage in, garbage out.

Data Collection Considerations

| Question | Why it matters |
|---|---|
| How much data? | Affects sample size and analysis depth |
| What kind (qual/quant)? | Determines methods |
| First-, second-, or third-party? | Trust and access constraints |
| Time scope? | Recency vs. trend |
| Continuous or one-shot? | Pipeline vs. ad-hoc |
| Cost to acquire? | Budget vs. precision tradeoff |
| Refresh cadence? | Real-time vs. daily vs. monthly |

ROCCC — good data

| Letter | Meaning | Bad data |
|---|---|---|
| **R**eliable | Accurate, complete, unbiased | Selection bias, misleading graphs |
| **O**riginal | First-party / verified source | Secondary or third-party, unverified |
| **C**omprehensive | Covers needed scope | Partial / missing fields |
| **C**urrent | Up to date | Outdated, irrelevant |
| **C**ited | From credible, citable source | Uncited |

Data parties

| Party | Description | Trust level |
|---|---|---|
| First-party | Collected by your organization directly | Highest |
| Second-party | First-party data shared by a partner | Medium |
| Third-party | Aggregated or purchased | Lower; verify |

Data Ethics

  • Ownership — who owns the data
  • Transaction transparency — clear collection/use
  • Consent — informed agreement
  • Currency — individuals should know about financial transactions that result from the use of their data
  • Privacy — protect data subjects' personal information; they retain permission and control
  • Openness — free access, usage, sharing
  • Compliance — GDPR, CCPA, HIPAA, SOX, PCI-DSS as applicable

Data Types

| Type | Description | Example |
|---|---|---|
| Numerical | Numbers, measurable | Quantity, price, age |
| Text / string | Letters, identifiers (no math) | Product name, address, ID |
| Date | Point in time | Order date, birth date |
| Categorical | Discrete groups | Color (red/blue), tier (gold/silver) |
| Quantitative | Measurable | Rating, upvote count |
| Qualitative | Not measurable | Review text, interview |
| Metadata | Data about data | Schema, lineage, ownership |
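In pandas these map onto dtypes; a quick sketch (column names are made up):

```python
import pandas as pd

# One column per type from the table above.
df = pd.DataFrame({
    "price": [9.99, 14.50],                                      # numerical
    "product": ["widget", "gadget"],                             # text / string
    "order_date": pd.to_datetime(["2026-03-15", "2026-03-16"]),  # date
    "tier": pd.Categorical(["gold", "silver"]),                  # categorical
})
print(df.dtypes)  # float64, object, datetime64[ns], category
```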

Long vs. wide format

| Long (tidy) | Wide |
|---|---|
| One observation per row | Multiple observations per row |
| One variable per column | Variables spread across columns |
| Easier for grouping, plotting | Easier for human reading |
| `pd.melt()` converts wide → long | `pd.pivot()` converts long → wide |
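A minimal pandas sketch of the round trip (the student/exam columns are illustrative):

```python
import pandas as pd

# Wide: one row per student, one column per exam.
wide = pd.DataFrame({
    "student": ["ana", "ben"],
    "exam1": [90, 78],
    "exam2": [85, 92],
})

# Wide -> long: each row becomes one (student, exam, score) observation.
long_df = wide.melt(id_vars="student", var_name="exam", value_name="score")

# Long -> wide: one column per exam again.
wide_again = long_df.pivot(index="student", columns="exam", values="score").reset_index()
```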

Data Life Cycle

  1. Plan — what data, who manages
  2. Capture — collect from sources
  3. Manage — store, secure, govern
  4. Analyze — derive insight
  5. Archive — long-term storage
  6. Destroy — securely delete the data and all shared copies

Databases (relational)

```mermaid
erDiagram
    CUSTOMER ||--o{ ORDER : places
    ORDER ||--|{ LINE-ITEM : contains
    CUSTOMER }|..|{ DELIVERY-ADDRESS : uses
```
| Concept | Description |
|---|---|
| Primary key | Unique value per row |
| Foreign key | Reference to a PK in another table |
| Index | Speeds lookups on a column |
| Constraint | NOT NULL, UNIQUE, CHECK |
| Schema | Description of structure (tables, columns, types) |
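A self-contained SQLite sketch of these concepts (table and column names are illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

con.execute("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,   -- primary key: unique per row
        name        TEXT NOT NULL          -- constraint: NOT NULL
    )
""")
con.execute("""
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL,
        FOREIGN KEY (customer_id) REFERENCES customer(customer_id)
    )
""")
con.execute("CREATE INDEX idx_orders_customer ON orders(customer_id)")  # index: speeds lookups

con.execute("INSERT INTO customer VALUES (1, 'Ana')")
con.execute("INSERT INTO orders VALUES (10, 1)")  # OK: customer 1 exists
```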

File Organization

  • Folder depth ≤ 5
  • Pick a naming style and stick to it
  • File names should include, in a consistent style and order:
      • Project name
      • Creation date (YYYY_MM_DD)
      • Revision version
  • Example: SalesReport_2026_03_15_v02
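A hypothetical helper that builds names in this convention:

```python
from datetime import date

def report_name(project: str, day: date, version: int) -> str:
    """Build a file name as <Project>_<YYYY_MM_DD>_v<NN>."""
    return f"{project}_{day:%Y_%m_%d}_v{version:02d}"

print(report_name("SalesReport", date(2026, 3, 15), 2))  # SalesReport_2026_03_15_v02
```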

Storage decisions

| Need | Format / system |
|---|---|
| Quick share / collaborative | Google Sheets, Airtable |
| Structured + queryable | Postgres, MySQL, BigQuery |
| Large columnar / analytical | Parquet, BigQuery, Snowflake, DuckDB |
| Raw + cheap | S3 / GCS / Azure Blob |
| Real-time | Kafka, Pub/Sub |
| Versioned files | Git LFS, DVC, lakeFS |
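For file-based targets the choice is often one line of pandas; a sketch (`to_parquet` assumes pyarrow or fastparquet is installed):

```python
import pandas as pd

df = pd.DataFrame({"order_id": [1, 2], "amount": [9.99, 14.50]})

df.to_csv("orders.csv", index=False)  # flat file: simple, human-readable, cheap
df.to_parquet("orders.parquet")       # columnar: compressed, fast for analytics
```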

Open data sources

See Resources → Datasets for the full list.

Sample size guidelines

  • ≥ 30 observations is the common rule of thumb for the CLT (Central Limit Theorem) to give an approximately normal sampling distribution of the mean
  • Larger sample → narrower CI, smaller margin of error
  • Stakes-driven: high-stakes decisions warrant larger samples
  • For A/B tests, calculate the required sample size up front using power analysis (see the sketch below)
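A sketch using statsmodels; the 10% → 12% conversion rates, 5% significance level, and 80% power are illustrative assumptions:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Effect size for detecting a lift from 10% to 12% conversion.
effect = proportion_effectsize(0.10, 0.12)

n = NormalIndPower().solve_power(
    effect_size=effect,
    alpha=0.05,   # significance level
    power=0.80,   # chance of detecting the effect if it is real
    ratio=1.0,    # equal group sizes
)
print(f"~{n:.0f} users per variant")
```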

Checklist

  • Data sources identified and accessible
  • Data downloaded and stored appropriately
  • Schema understood (long/wide, columns, types)
  • Sorted and filtered for relevance
  • Credibility verified (ROCCC)
  • Privacy / licensing constraints documented
  • Deliverable: description of all data sources used
