Analyze¶
Identify trends and relationships. Answer questions. Solve problems.
Why this phase matters¶
This is where curiosity turns into insight. Without disciplined exploration, you'll either confirm what you already believed (confirmation bias) or drown in numbers without finding the signal.
4 phases of analysis¶
- Organize data — sort, group, structure
- Format and adjust — types, units, granularity
- Get input from others — peer review, stakeholder sanity check
- Transform — pivot, reshape, derive new columns
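The Organize and Transform steps above can be sketched in pandas. All column names below (region, month, price, qty) are hypothetical, chosen only for illustration:

```python
import pandas as pd

# Hypothetical order data to illustrate organizing and transforming
df = pd.DataFrame({
    "region": ["EU", "EU", "US", "US"],
    "month": ["Jan", "Feb", "Jan", "Feb"],
    "price": [10.0, 12.0, 9.0, 11.0],
    "qty": [3, 1, 5, 2],
})

# Organize: sort into a predictable order
df = df.sort_values(["region", "month"])

# Transform: derive a new column, then reshape with a pivot
df["revenue"] = df["price"] * df["qty"]
wide = df.pivot_table(index="region", columns="month",
                      values="revenue", aggfunc="sum")
```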
EDA — Exploratory Data Analysis¶
The process of investigating and organizing datasets to summarize their main characteristics, typically through data wrangling and visualization.
Practices¶
| Practice | Description |
|---|---|
| Discovering | Familiarize with data; conceptualize use |
| Structuring | Organize/transform raw data — sort, extract, filter, slice, group, merge |
| Cleaning | Remove errors that distort data |
| Joining | Augment with values from other datasets |
| Validating | Verify consistency and quality |
| Presenting | Make cleaned data/visuals available to others |
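The Joining practice, for instance, augments one dataset with values from another. A minimal sketch with made-up frames (orders, users are not from the source):

```python
import pandas as pd

orders = pd.DataFrame({"user_id": [1, 2, 3], "total": [50, 20, 70]})
users = pd.DataFrame({"user_id": [1, 2], "country": ["DE", "US"]})

# Left join keeps every order; unmatched users come back as NaN,
# which the Validating practice should then surface
joined = orders.merge(users, on="user_id", how="left")
```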
Standard EDA workflow¶
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.read_csv('data.csv')
# 1. Shape, types, summary
df.shape; df.dtypes; df.info()
df.describe(include='all')
# 2. Missing & duplicates
df.isna().mean().sort_values(ascending=False)
df.duplicated().sum()
# 3. Distributions
df.hist(figsize=(12, 8), bins=30)
# 4. Correlations
sns.heatmap(df.select_dtypes('number').corr(), annot=True)
# 5. Relationships
sns.pairplot(df.sample(min(500, len(df))))
See Python → EDA for the full pattern.
Data validation¶
| Type | What to check |
|---|---|
| Data type | int vs string vs date |
| Data range | min/max bounds |
| Data constraints | NOT NULL, UNIQUE, FK |
| Data consistency | Same format across rows |
| Data structure | Schema correctness |
| Code validation | Logic checks in queries |
In Google Sheets: Data > Data validation to set rules and prevent mis-input.
Conditional formatting — visually surface outliers.
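Outside spreadsheets, the same validation types can be scripted. A minimal pandas sketch, assuming hypothetical age and email columns:

```python
import pandas as pd

df = pd.DataFrame({"age": [25, 40, -3],
                   "email": ["a@x.com", "b@x.com", "b@x.com"]})

# Data type: age should be an integer column
assert pd.api.types.is_integer_dtype(df["age"])

# Data range: collect out-of-bounds rows instead of assuming none exist
bad_age = df[(df["age"] < 0) | (df["age"] > 120)]

# Data constraints: UNIQUE-style check on email
dupes = df["email"].duplicated().sum()
```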
Data aggregation¶
Gather data and express it in summary form — totals, counts, averages.
| Tool | Pattern |
|---|---|
| Spreadsheet | VLOOKUP, XLOOKUP, SUMIFS, pivot tables |
| SQL | GROUP BY, SUM, AVG, COUNT, MAX, MIN, window functions |
| pandas | df.groupby('col').agg({...}), pivot_table, crosstab |
SELECT
country,
COUNT(*) AS orders,
COUNT(DISTINCT user_id) AS customers,
SUM(total) AS revenue,
AVG(total) AS avg_order
FROM orders
GROUP BY country
HAVING SUM(total) > 10000
ORDER BY revenue DESC;
df.groupby('country').agg(
orders=('id', 'count'),
customers=('user_id', 'nunique'),
revenue=('total', 'sum'),
avg_order=('total', 'mean'),
)
Common analytical patterns¶
| Pattern | When to use |
|---|---|
| Cohort analysis | Track behavior across signup cohorts (retention, monetization) |
| Funnel analysis | Find drop-off across sequential steps |
| Segmentation | Group users/customers by behavior or attribute |
| Time series decomposition | Trend + seasonality + residual |
| A/B comparison | Test vs. control |
| Cross-tab / pivot | Counts at the intersection of two categoricals |
| Top N per group | Best customers per region, etc. |
| Year-over-year / period-over-period | Growth rates |
Snippets in SQL → Snippets and Python → Snippets.
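As one example, Top N per group can be done in pandas with a single sort followed by a grouped head. Column names here (region, customer, revenue) are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["EU", "EU", "EU", "US", "US"],
    "customer": ["a", "b", "c", "d", "e"],
    "revenue": [100, 300, 200, 150, 50],
})

# Top 2 customers by revenue within each region:
# sort once globally, then keep the first N rows of each group
top2 = (df.sort_values("revenue", ascending=False)
          .groupby("region")
          .head(2))
```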
Temporary Tables (SQL)¶
Use cases:
- Pre-processing / staging
- Combine results from multiple queries
- Store a filtered subset (avoid re-filtering on every query)
-- CTE — preferred for one-off
WITH filtered AS (
SELECT * FROM orders WHERE status = 'paid'
)
SELECT customer_id, SUM(total) FROM filtered GROUP BY customer_id;
-- Temp table — when reused across queries in a session
CREATE TEMP TABLE staging AS
SELECT * FROM raw WHERE created_at >= '2026-01-01';
DROP TABLE staging;
Best practices:
- Global vs. local — a global temp table is visible to all sessions (dropped once every session using it disconnects); a local one exists only in the current session
- Always DROP TABLE after use
- Auto-deleted at SQL session end
Troubleshooting questions¶
- How should I prioritize these issues?
- In one sentence, what's the issue I'm facing?
- What resources can help me solve the problem?
- How can I prevent this issue in the future?
- What's the simplest version of this analysis that still answers the question?
Self-questions¶
- Does the data align with the Plan stage of PACE?
- Do I have enough data to follow through?
- What surprises did I find?
- What would change my mind about my hypothesis?
- Have I controlled for confounders?
- Am I confusing correlation with causation?
Bias to watch for¶
| Bias | Description |
|---|---|
| Confirmation | Seeing only evidence supporting your hypothesis |
| Survivorship | Analyzing only "survivors" (e.g., active users) |
| Selection | Sample isn't representative |
| Anchoring | First number heard biases later estimates |
| Recency | Overweighting recent data |
| Cherry-picking | Choosing the time window that supports the story |
Checklist¶
- Data aggregated, useful, accessible
- Organized and formatted
- Calculations performed
- Trends and relationships identified
- Findings peer-reviewed
- Bias considered explicitly
- Deliverable: summary of analysis