Analyze¶
Identify trends and relationships. Answer questions. Solve problems.
Why this phase matters¶
This is where curiosity turns into insight. Without disciplined exploration, you'll either confirm what you already believed (confirmation bias) or drown in numbers without finding the signal.
4 phases of analysis¶
- Organize data — sort, group, structure
- Format and adjust — types, units, granularity
- Get input from others — peer review, stakeholder sanity check
- Transform — pivot, reshape, derive new columns
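The Organize and Transform steps above can be sketched in pandas. All column names below (region, month, price, qty) are hypothetical, chosen only for illustration:

```python
import pandas as pd

# Hypothetical order data to illustrate organizing and transforming
df = pd.DataFrame({
    "region": ["EU", "EU", "US", "US"],
    "month": ["Jan", "Feb", "Jan", "Feb"],
    "price": [10.0, 12.0, 9.0, 11.0],
    "qty": [3, 1, 5, 2],
})

# Organize: sort into a predictable order
df = df.sort_values(["region", "month"])

# Transform: derive a new column, then reshape with a pivot
df["revenue"] = df["price"] * df["qty"]
wide = df.pivot_table(index="region", columns="month",
                      values="revenue", aggfunc="sum")
```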
EDA — Exploratory Data Analysis¶
The process of investigating and organizing datasets to summarize their main characteristics, typically through data wrangling and visualization.
Practices¶
| Practice | Description |
|---|---|
| Discovering | Familiarize with data; conceptualize use |
| Structuring | Organize/transform raw data — sort, extract, filter, slice, group, merge |
| Cleaning | Remove errors that distort data |
| Joining | Augment with values from other datasets |
| Validating | Verify consistency and quality |
| Presenting | Make cleaned data/visuals available to others |
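The Joining practice, for instance, augments one dataset with values from another. A minimal sketch with made-up frames (orders, users are not from the source):

```python
import pandas as pd

orders = pd.DataFrame({"user_id": [1, 2, 3], "total": [50, 20, 70]})
users = pd.DataFrame({"user_id": [1, 2], "country": ["DE", "US"]})

# Left join keeps every order; unmatched users come back as NaN,
# which the Validating practice should then surface
joined = orders.merge(users, on="user_id", how="left")
```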
Standard EDA workflow¶
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.read_csv('data.csv')
# 1. Shape, types, summary
df.shape; df.dtypes; df.info()
df.describe(include='all')
# 2. Missing & duplicates
df.isna().mean().sort_values(ascending=False)
df.duplicated().sum()
# 3. Distributions
df.hist(figsize=(12, 8), bins=30)
# 4. Correlations
sns.heatmap(df.select_dtypes('number').corr(), annot=True)
# 5. Relationships
sns.pairplot(df.sample(min(500, len(df))))
See Python → EDA for the full pattern.
Data validation¶
| Type | What to check |
|---|---|
| Data type | int vs string vs date |
| Data range | min/max bounds |
| Data constraints | NOT NULL, UNIQUE, FK |
| Data consistency | Same format across rows |
| Data structure | Schema correctness |
| Code validation | Logic checks in queries |
In Google Sheets: Data > Data validation to set rules and prevent mis-input.
Conditional formatting — visually surface outliers.
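Outside spreadsheets, the same validation types can be scripted. A minimal pandas sketch, assuming hypothetical age and email columns:

```python
import pandas as pd

df = pd.DataFrame({"age": [25, 40, -3],
                   "email": ["a@x.com", "b@x.com", "b@x.com"]})

# Data type: age should be an integer column
assert pd.api.types.is_integer_dtype(df["age"])

# Data range: collect out-of-bounds rows instead of assuming none exist
bad_age = df[(df["age"] < 0) | (df["age"] > 120)]

# Data constraints: UNIQUE-style check on email
dupes = df["email"].duplicated().sum()
```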
Data aggregation¶
Gather data and express it in summary form — totals, counts, averages.
| Tool | Pattern |
|---|---|
| Spreadsheet | VLOOKUP, XLOOKUP, SUMIFS, pivot tables |
| SQL | GROUP BY, SUM, AVG, COUNT, MAX, MIN, window functions |
| pandas | df.groupby('col').agg({...}), pivot_table, crosstab |
SELECT
country,
COUNT(*) AS orders,
COUNT(DISTINCT user_id) AS customers,
SUM(total) AS revenue,
AVG(total) AS avg_order
FROM orders
GROUP BY country
HAVING SUM(total) > 10000
ORDER BY revenue DESC;
df.groupby('country').agg(
orders=('id', 'count'),
customers=('user_id', 'nunique'),
revenue=('total', 'sum'),
avg_order=('total', 'mean'),
)
Common analytical patterns¶
| Pattern | When to use |
|---|---|
| Cohort analysis | Track behavior across signup cohorts (retention, monetization) |
| Funnel analysis | Find drop-off across sequential steps |
| Segmentation | Group users/customers by behavior or attribute |
| Time series decomposition | Trend + seasonality + residual |
| A/B comparison | Test vs. control |
| Cross-tab / pivot | Counts at the intersection of two categoricals |
| Top N per group | Best customers per region, etc. |
| Year-over-year / period-over-period | Growth rates |
Snippets in SQL → Snippets and Python → Snippets.
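As one example, Top N per group can be done in pandas with a single sort followed by a grouped head. Column names here (region, customer, revenue) are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["EU", "EU", "EU", "US", "US"],
    "customer": ["a", "b", "c", "d", "e"],
    "revenue": [100, 300, 200, 150, 50],
})

# Top 2 customers by revenue within each region:
# sort once globally, then keep the first N rows of each group
top2 = (df.sort_values("revenue", ascending=False)
          .groupby("region")
          .head(2))
```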
Temporary Tables (SQL)¶
Use cases:
- Pre-processing / staging
- Combine results from multiple queries
- Store a filtered subset (avoid re-filtering on every query)
-- CTE — preferred for one-off
WITH filtered AS (
SELECT * FROM orders WHERE status = 'paid'
)
SELECT customer_id, SUM(total) FROM filtered GROUP BY customer_id;
-- Temp table — when reused across queries in a session
CREATE TEMP TABLE staging AS
SELECT * FROM raw WHERE created_at >= '2026-01-01';
DROP TABLE staging;
Best practices:
- Global vs. local — a global temp table is visible to all sessions (dropped once every session using it disconnects); a local one exists only in the current session
- Always DROP TABLE after use
- Auto-deleted at SQL session end
Troubleshooting questions¶
- How should I prioritize these issues?
- In one sentence, what's the issue I'm facing?
- What resources can help me solve the problem?
- How can I prevent this issue in the future?
- What's the simplest version of this analysis that still answers the question?
Self-questions¶
- Does the data align with the Plan stage of PACE?
- Do I have enough data to follow through?
- What surprises did I find?
- What would change my mind about my hypothesis?
- Have I controlled for confounders?
- Am I confusing correlation with causation?
Bias to watch for¶
| Bias | Description |
|---|---|
| Confirmation | Seeing only evidence supporting your hypothesis |
| Survivorship | Analyzing only "survivors" (e.g., active users) |
| Selection | Sample isn't representative |
| Anchoring | First number heard biases later estimates |
| Recency | Overweighting recent data |
| Cherry-picking | Choosing the time window that supports the story |
Checklist¶
- Data aggregated, useful, accessible
- Organized and formatted
- Calculations performed
- Trends and relationships identified
- Findings peer-reviewed
- Bias considered explicitly
- Deliverable: summary of analysis