Datasets

Government / open data

Cloud platforms (free public sets)

Search engines

Domain-specific

Machine learning

Finance

Crypto / blockchain

Geographic

Health

Sports

Climate / weather

Sample / classic datasets

Dataset                                                           Use
Iris                                                              Classification basics
Titanic                                                           Binary classification, missing data
Boston Housing (deprecated for racial bias) → California Housing  Regression
MNIST                                                             Image classification
NYC Taxi                                                          Time series, geo
MovieLens                                                         Recommender systems
Adult / Census Income                                             Classification
IMDb reviews                                                      NLP, sentiment

Synthetic data

When choosing a dataset for a portfolio

  • Real (ROCCC) — Reliable, Original, Comprehensive, Current, Cited
  • Story-rich — has interesting patterns to find
  • Cleanable — needs work to demonstrate skill
  • Sized appropriately — large enough to matter, small enough to ship
  • Public / shareable — license allows posting analysis
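
Two of the criteria above, "Cleanable" and "Sized appropriately", can be roughly screened in code before committing to a dataset. The following is a minimal sketch using only the Python standard library. The function name `screen_dataset`, the row thresholds, and the inline sample are all hypothetical and chosen for illustration; the sample rows are made up, not real data.

```python
import csv
import io

# Hypothetical thresholds; tune them to your own project.
MIN_ROWS = 5
MAX_ROWS = 1_000_000


def screen_dataset(csv_text, min_rows=MIN_ROWS, max_rows=MAX_ROWS):
    """Return row count, per-column missing-value share, and two rough flags."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    n = len(rows)
    columns = list(rows[0].keys()) if rows else []
    # Share of rows where the cell is empty or whitespace-only.
    missing = {
        col: sum(1 for r in rows if (r[col] or "").strip() == "") / n
        for col in columns
    }
    return {
        "rows": n,
        "missing_share": missing,
        "sized_ok": min_rows <= n <= max_rows,
        "needs_cleaning": any(share > 0 for share in missing.values()),
    }


# Made-up sample rows for illustration only.
SAMPLE = """name,age,fare
A,22,7.25
B,,71.28
C,26,
D,35,53.10
E,,8.05
"""

report = screen_dataset(SAMPLE)
print(report)
```

The remaining criteria (ROCCC, story-rich, license) call for reading the dataset's documentation, so a script like this is best treated as a first pass rather than a decision.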