Data Modeling¶
Data modeling is the process of structuring data to support business analysis and reporting. As a data analyst, you will often query modeled data (like Star Schemas) or create your own transformed tables (using tools like dbt).
Key Architectures¶
| Architecture | Description |
|---|---|
| Data Warehouse | Central repository of structured, cleansed data optimized for analytics (e.g., Snowflake, BigQuery). Uses schemas like Star or Snowflake. |
| Data Lake | Repository for raw, unstructured, or semi-structured data (e.g., S3, Google Cloud Storage). |
| Data Lakehouse | Combines the flexibility of a Data Lake with the management and structure of a Data Warehouse. |
The Star Schema¶
The most common data modeling approach for analytics, developed by Ralph Kimball.
Fact Tables¶
- What it is: Stores quantitative data for analysis (metrics, measurements) and foreign keys.
- Characteristics: Very large, deeply granular, appended frequently.
- Examples:
orders,pageviews,transactions.
Dimension Tables¶
- What it is: Stores descriptive attributes related to the facts. Used for filtering and grouping (
WHEREandGROUP BY). - Characteristics: Smaller, wider (many columns), updated less frequently.
- Examples:
customers,products,date_dim.
erDiagram
dim_date ||--o{ fact_sales : "recorded on"
dim_customer ||--o{ fact_sales : "made by"
dim_product ||--o{ fact_sales : "includes"
fact_sales {
int order_id
int customer_id FK
int product_id FK
date date_id FK
float revenue
int quantity
}
dim_customer {
int customer_id PK
string name
string region
}
ETL vs. ELT¶
- ETL (Extract, Transform, Load): Data is transformed before it is loaded into the warehouse. Common in older, on-premise databases where storage was expensive.
- ELT (Extract, Load, Transform): Data is loaded raw into the warehouse, and then transformed using the processing power of the warehouse itself. This is the modern standard (e.g., using
dbtand BigQuery/Snowflake).
Slowly Changing Dimensions (SCD)¶
How do you handle a dimension that changes over time (e.g., a customer moves to a new city)?
| Type | How it works | Pros/Cons |
|---|---|---|
| Type 1 | Overwrite old data with new data. | Easy, but loses historical context. |
| Type 2 | Add a new row with the new data, and use valid_from and valid_to dates. |
Best for analytics; preserves history. |
| Type 3 | Add a new column to the existing row (e.g., previous_city). |
Rarely used; limited history. |