Python Libraries for Data Analysis
The core stack. Install with uv, for example `uv pip install pandas numpy matplotlib`.
pandas — tabular data
DataFrame and Series. The workhorse.
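As a rough illustration of the two structures, the sketch below builds a small DataFrame, pulls one column out as a Series, and aggregates by group. The column names and values are made up for the example and do not come from any real dataset.

```python
import pandas as pd

# Hypothetical sales records; column names and values are illustrative only.
df = pd.DataFrame(
    {
        "category": ["books", "books", "games", "games"],
        "revenue": [120.0, 80.0, 200.0, 150.0],
    }
)

# Selecting a single column of a DataFrame yields a Series.
revenue = df["revenue"]

# Group rows by category and sum the revenue within each group.
totals = df.groupby("category")["revenue"].sum()
print(totals)
```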
Docs: pandas.pydata.org
NumPy — numerical arrays
Underlies pandas. Vectorized math.
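A minimal sketch of what "vectorized" means here: arithmetic is applied to whole arrays at once rather than in a Python loop. The arrays below are invented for illustration.

```python
import numpy as np

# Illustrative values, not taken from any real dataset.
prices = np.array([10.0, 20.0, 30.0])
quantities = np.array([3, 2, 1])

# Element-wise multiplication over the whole array, no explicit loop.
line_totals = prices * quantities

# Reduce the array to a single number.
total = line_totals.sum()
print(line_totals, total)
```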
Docs: numpy.org
matplotlib — plotting
Foundational charting library.
```python
import matplotlib.pyplot as plt

# x and y are equal-length sequences defined elsewhere.
plt.plot(x, y)
plt.xlabel('Date')
plt.ylabel('Revenue')
plt.show()
```
Docs: matplotlib.org
seaborn — statistical viz
Built on matplotlib, opinionated, prettier defaults.
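```python
import seaborn as sns

sns.boxplot(data=df, x='category', y='revenue')
# numeric_only=True skips text columns such as 'category', which would
# otherwise raise an error in pandas 2.0 and later.
sns.heatmap(df.corr(numeric_only=True), annot=True)
sns.pairplot(df)
```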
Docs: seaborn.pydata.org
plotly — interactive viz
Hoverable, zoomable, dashboard-ready.
```python
import plotly.express as px

fig = px.scatter(df, x='ad_spend', y='revenue', color='channel', size='clicks')
fig.show()  # needed to display the figure when running as a script
```
Docs: plotly.com/python
SciPy — scientific computing
Statistical tests, optimization, signal processing.
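A short sketch touching two of these areas: a two-sample t-test from `scipy.stats` and a one-dimensional minimization from `scipy.optimize`. The samples are synthetic (generated with a fixed seed) and the objective function is a toy quadratic chosen only for illustration.

```python
import numpy as np
from scipy import optimize, stats

# Synthetic samples with a fixed seed; the means and sizes are arbitrary.
rng = np.random.default_rng(0)
a = rng.normal(loc=0.0, scale=1.0, size=200)
b = rng.normal(loc=0.5, scale=1.0, size=200)

# Welch's t-test: compares the two sample means without assuming equal variances.
result = stats.ttest_ind(a, b, equal_var=False)
print(result.statistic, result.pvalue)

# Minimize a toy quadratic whose minimum is at x = 2.
opt = optimize.minimize_scalar(lambda x: (x - 2.0) ** 2)
print(opt.x)
```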
Docs: scipy.org
statsmodels — econometrics & stats models
Linear / logistic regression, time series, ANOVA. Returns rich statistical summaries.
```python
import statsmodels.api as sm

# add_constant adds an intercept column; OLS does not include one by default.
X = sm.add_constant(df[['x1', 'x2']])
model = sm.OLS(df['y'], X).fit()
print(model.summary())
```
Docs: statsmodels.org
scikit-learn — machine learning
Models, preprocessing, evaluation.
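```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# X is the feature matrix and y the label vector, defined elsewhere.
# random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression().fit(X_train, y_train)
model.score(X_test, y_test)  # mean accuracy on the held-out 20%
```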
Docs: scikit-learn.org
Jupyter — interactive notebooks
Docs: jupyter.org
Other useful libraries
| Library | Purpose |
|---|---|
| polars | Fast DataFrame in Rust; pandas alternative |
| duckdb | In-process SQL on parquet/CSV/pandas |
| pyarrow | Python bindings for Apache Arrow, a columnar in-memory format |
| great-expectations | Data validation framework |
| pandera | Schema validation for DataFrames |
| requests | HTTP API calls |
| openpyxl | Excel I/O |
| sqlalchemy | SQL toolkit and ORM; database connections |