Datasets¶
Government / open data¶
- data.gov — US federal data
- data.gov.uk — UK
- Eurostat — EU
- census.gov/data — US Census
- Open Data Network — multi-source aggregator
- World Bank Open Data
- WHO data
Cloud platforms (free public sets)¶
Search engines¶
Domain-specific¶
Machine learning¶
- UCI ML Repository — classic ML datasets
- OpenML
- Kaggle competition datasets
Finance¶
Crypto / blockchain¶
Geographic¶
- OpenStreetMap
- Natural Earth — vector and raster
- USGS
Health¶
Sports¶
Climate / weather¶
Sample / classic datasets¶
| Dataset | Use |
|---|---|
| Iris | Classification basics |
| Titanic | Binary classification, missing data |
| Boston Housing (removed from scikit-learn 1.2 over a racially biased feature) → use California Housing instead | Regression |
| MNIST | Image classification |
| NYC Taxi | Time series, geo |
| MovieLens | Recommender systems |
| Adult / Census Income | Classification |
| IMDb reviews | NLP, sentiment |
Synthetic data¶
- Faker (Python) — generate fake but realistic data
- Mockaroo — web-based generator
- SDV (Synthetic Data Vault) — model-based synthesis
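To illustrate what "fake but realistic" means in practice, here is a minimal sketch that uses only Python's standard library rather than any of the tools listed above. The field names, value pools, and the `make_customers` helper are made up for illustration; Faker and similar tools offer far richer generators (addresses, emails, locales) for the same idea.

```python
import random

# Hypothetical value pools, chosen only for this example.
FIRST_NAMES = ["Ana", "Ben", "Chloe", "Dev", "Elif"]
CITIES = ["Austin", "Leeds", "Lyon", "Osaka"]


def make_customers(n, seed=0):
    """Return n fake customer rows; a fixed seed keeps the output reproducible."""
    rng = random.Random(seed)  # private generator, does not touch global state
    rows = []
    for i in range(1, n + 1):
        rows.append({
            "customer_id": i,
            "name": rng.choice(FIRST_NAMES),
            "city": rng.choice(CITIES),
            "age": rng.randint(18, 80),  # inclusive on both ends
            "monthly_spend": round(rng.uniform(5, 500), 2),
        })
    return rows


customers = make_customers(100, seed=42)
print(customers[0])
```

Seeding the generator matters for portfolio work: anyone who reruns the notebook gets the same rows, so charts and summary numbers in the write-up stay consistent.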
When choosing a dataset for portfolio¶
- Real (ROCCC) — Reliable, Original, Comprehensive, Current, Cited
- Story-rich — has interesting patterns to find
- Cleanable — needs work to demonstrate skill
- Sized appropriately — large enough to matter, small enough to ship
- Public / shareable — license allows posting analysis
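Two of the criteria above, "Cleanable" and "Sized appropriately", can be checked quickly before committing to a dataset. The sketch below is one possible way to do that with the standard library: it counts rows and the share of empty cells per column. The `profile_rows` helper and the inline sample data are hypothetical; in practice you would pass a `csv.DictReader` opened on the real file.

```python
import csv
import io


def profile_rows(reader):
    """Return (row_count, {column: share of empty cells}) for a csv.DictReader."""
    total = 0
    missing = {}
    for row in reader:
        total += 1
        for col, value in row.items():
            if col is None:
                continue  # extra unnamed fields in a malformed row
            empty = value is None or value.strip() == ""
            missing[col] = missing.get(col, 0) + (1 if empty else 0)
    if total == 0:
        return 0, {}
    return total, {col: count / total for col, count in missing.items()}


# Tiny inline sample standing in for a real file.
sample = io.StringIO(
    "name,age,city\n"
    "Ana,34,Lyon\n"
    "Ben,,Leeds\n"
    "Chloe,29,\n"
    "Dev,,Osaka\n"
)
rows, missing_share = profile_rows(csv.DictReader(sample))
print(rows, missing_share)
```

A dataset with no missing values at all leaves little cleaning to show, while one where most columns are largely empty may not support any analysis; where to draw that line is a judgment call this sketch does not make.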