
SQL Optimization

How to write SQL that doesn't melt the database.

Safety checklist (production)

  • EXPLAIN ANALYZE: show plan + actual timing (use with caution on huge tables)
  • Dry run: run on a staging or dev DB with a smaller subset first
  • Count first: SELECT COUNT(*) WHERE ... to see how many rows you'll touch
  • Batching: update in batches of 5,000 to avoid locking
  • Check execution plan: look for missing indexes or Cartesian products
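
The "count first" step can be sketched against a throwaway SQLite database (the table and row counts here are invented for the demo):

```python
import sqlite3

# In-memory database standing in for production (schema is hypothetical).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany("INSERT INTO orders (status) VALUES (?)",
                 [("pending",)] * 12 + [("done",)] * 3)

# Count first: how many rows would the write touch?
(n,) = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE status = 'pending'").fetchone()
print(n)  # 12 — small enough to update in one go here
```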

1. Predict the outcome (without running)

You can't know exact values without running, but you can predict structure and scale.

EXPLAIN

Every major DB has EXPLAIN. It shows the execution plan — the database's intended approach.

  • Cost — arbitrary number representing effort
  • Rows — estimated rows scanned/returned
  • Width — estimated bytes per row
EXPLAIN SELECT * FROM orders WHERE user_id = 42;

EXPLAIN ANALYZE SELECT * FROM orders WHERE user_id = 42;
-- ANALYZE actually runs the query — careful on writes/big tables
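
SQLite's analogue, EXPLAIN QUERY PLAN, shows the same kind of information and is handy for experimenting; a sketch (the orders table is a stand-in):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER)")
conn.execute("CREATE INDEX idx_orders_user_id ON orders (user_id)")

# EXPLAIN QUERY PLAN returns the intended access path without running the query.
rows = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE user_id = 42").fetchall()
for row in rows:
    print(row[-1])  # e.g. a SEARCH on orders using idx_orders_user_id
```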

Mental mapping (before hitting Run)

  1. Grain: GROUP BY customer_id with 5,000 customers = 5,000 rows
  2. Filter: WHERE date > '2026-01-01' covers ~10% → 100,000 rows from 1M
  3. Join multiplier: customer joined to orders (50 each) = 50× row blow-up
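
The back-of-envelope arithmetic above is just multiplication; a sketch using the numbers from the list:

```python
total_rows = 1_000_000
filter_selectivity = 0.10   # WHERE date > '2026-01-01' keeps ~10%
orders_per_row = 50         # join multiplier: 50 orders per matched customer

after_filter = int(total_rows * filter_selectivity)
after_join = after_filter * orders_per_row

print(after_filter)  # 100000 rows survive the filter
print(after_join)    # 5000000 rows after the 50x join blow-up
```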

2. Selectivity — only what you need

Avoid SELECT *

-- Bad
SELECT * FROM orders WHERE user_id = 42;

-- Good
SELECT id, total, created_at FROM orders WHERE user_id = 42;

Wastes memory and I/O on unused columns.

Use LIMIT when testing

SELECT * FROM huge_table LIMIT 10;

The database stops after producing 10 rows. Note that without ORDER BY, which 10 rows you get is arbitrary.

3. Filter and index

Filter early

Reduce the row count as early as possible, before joins and aggregations. (Within a single WHERE clause the planner reorders predicates itself, so their textual order rarely matters; what matters is that restrictive filters exist and can use indexes.)

Sargable queries

Avoid functions on indexed columns in WHERE.

-- Bad — DB must compute YEAR for every row
WHERE YEAR(transaction_date) = 2026

-- Good — DB uses index directly
WHERE transaction_date >= '2026-01-01'
  AND transaction_date <  '2027-01-01'
-- Bad
WHERE LOWER(email) = 'jack@example.com'

-- Good (if you have functional index OR canonicalize on insert)
WHERE email = 'jack@example.com'
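
The effect shows up directly in plan output; a sketch using SQLite's EXPLAIN QUERY PLAN (schema invented for the demo):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tx (id INTEGER PRIMARY KEY, "
             "transaction_date TEXT, amount REAL)")
conn.execute("CREATE INDEX idx_tx_date ON tx (transaction_date)")

def plan(sql):
    # Last column of each EXPLAIN QUERY PLAN row is the human-readable detail.
    return " ".join(r[-1] for r in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Non-sargable: the function hides the column from the index.
bad = plan("SELECT * FROM tx WHERE strftime('%Y', transaction_date) = '2026'")

# Sargable: a plain range on the indexed column.
good = plan("SELECT * FROM tx WHERE transaction_date >= '2026-01-01' "
            "AND transaction_date < '2027-01-01'")

print(bad)   # a full SCAN of tx, no index
print(good)  # a SEARCH using idx_tx_date
```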

Index appropriately

  • B-tree: equality + range
  • Hash: equality only (Postgres)
  • GIN: full-text, JSONB
  • Partial index: WHERE status = 'active' if hot
CREATE INDEX idx_orders_user_id ON orders (user_id);
CREATE INDEX idx_orders_created ON orders (created_at);
CREATE INDEX idx_orders_active  ON orders (user_id) WHERE status = 'active';
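
SQLite also supports partial indexes, so the WHERE status = 'active' pattern can be tried directly; the index is eligible because the query repeats the index's WHERE condition (table is hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, "
             "user_id INTEGER, status TEXT, total REAL)")
# Partial index: only 'active' rows are indexed, so it stays small and hot.
conn.execute("CREATE INDEX idx_orders_active ON orders (user_id) "
             "WHERE status = 'active'")

detail = " ".join(r[-1] for r in conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders "
    "WHERE user_id = 42 AND status = 'active'"))
print(detail)  # the plan picks the partial index
```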

4. Joins

  • Join on indexed columns — usually PK ↔ FK
  • Filter before joining — use a CTE/subquery to shrink the table first
-- Bad — joins full tables, then filters
SELECT * FROM customers c
JOIN orders o ON o.customer_id = c.id
WHERE c.country = 'US' AND o.created_at >= '2026-01-01';

-- Good — narrow each side first
WITH us_customers AS (
  SELECT id FROM customers WHERE country = 'US'
), recent AS (
  SELECT * FROM orders WHERE created_at >= '2026-01-01'
)
SELECT * FROM us_customers c JOIN recent o ON o.customer_id = c.id;
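
Both forms return the same rows; a quick sanity check in SQLite with a tiny invented dataset. (A good planner often pushes the filters down on its own, so the CTE rewrite mainly helps when the plan shows it didn't.)

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, country TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER,
                     created_at TEXT);
INSERT INTO customers VALUES (1,'US'), (2,'DE'), (3,'US');
INSERT INTO orders VALUES (1,1,'2026-02-01'), (2,2,'2026-03-01'),
                          (3,3,'2024-12-31'), (4,1,'2026-05-01');
""")

naive = conn.execute("""
    SELECT o.id FROM customers c
    JOIN orders o ON o.customer_id = c.id
    WHERE c.country = 'US' AND o.created_at >= '2026-01-01'
    ORDER BY o.id""").fetchall()

narrowed = conn.execute("""
    WITH us_customers AS (SELECT id FROM customers WHERE country = 'US'),
         recent AS (SELECT * FROM orders WHERE created_at >= '2026-01-01')
    SELECT o.id FROM us_customers c JOIN recent o ON o.customer_id = c.id
    ORDER BY o.id""").fetchall()

print(naive)     # [(1,), (4,)]
print(narrowed)  # same rows
```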

5. Batching writes

-- Update 5k at a time to avoid locking 1M rows
UPDATE orders
SET status = 'archived'
WHERE id IN (
  SELECT id FROM orders
  WHERE created_at < '2025-01-01'
    AND status <> 'archived'  -- exclude done rows, or the loop never terminates
  LIMIT 5000
);
-- Loop in your app/script until 0 rows affected
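
The "loop until 0 rows affected" part might look like this in Python (sqlite3 as a stand-in, batch size shrunk to make the loop visible):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, "
             "status TEXT, created_at TEXT)")
conn.executemany("INSERT INTO orders (status, created_at) VALUES (?, ?)",
                 [("open", "2024-06-01")] * 23 + [("open", "2026-01-15")] * 5)

BATCH = 10  # production would use something like 5000
batches = 0
while True:
    cur = conn.execute("""
        UPDATE orders SET status = 'archived'
        WHERE id IN (SELECT id FROM orders
                     WHERE created_at < '2025-01-01'
                       AND status <> 'archived'
                     LIMIT ?)""", (BATCH,))
    conn.commit()  # commit per batch so locks are released between rounds
    if cur.rowcount == 0:
        break
    batches += 1

(archived,) = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE status = 'archived'").fetchone()
print(batches, archived)  # 3 batches to archive all 23 old rows
```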

Common anti-patterns

  • SELECT *: pick columns explicitly
  • WHERE func(col) = ...: rewrite to be sargable
  • OR across columns: often blocks index use; try UNION
  • Implicit cast (int_col = '42'): match the types
  • Cartesian join (missing ON): add the join condition
  • NOT IN with NULLs: use NOT EXISTS
  • Large IN (...) lists: use JOIN against a temp table
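
The NOT IN with NULLs trap is easy to demo: one NULL in the subquery result makes NOT IN return no rows at all, while NOT EXISTS behaves as expected (tables invented for the demo):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY);
CREATE TABLE banned (user_id INTEGER);  -- nullable on purpose
INSERT INTO users VALUES (1), (2), (3);
INSERT INTO banned VALUES (2), (NULL);
""")

# NOT IN: id NOT IN (2, NULL) is never TRUE, so this returns nothing.
not_in = conn.execute(
    "SELECT id FROM users WHERE id NOT IN (SELECT user_id FROM banned)"
).fetchall()

# NOT EXISTS ignores the NULL row and returns the expected users.
not_exists = conn.execute("""
    SELECT id FROM users u
    WHERE NOT EXISTS (SELECT 1 FROM banned b WHERE b.user_id = u.id)
    ORDER BY u.id""").fetchall()

print(not_in)      # []
print(not_exists)  # [(1,), (3,)]
```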
