
Python Libraries for Data Analysis

The core stack. Install with uv:

uv add pandas numpy matplotlib seaborn plotly scipy statsmodels scikit-learn jupyter

pandas — tabular data

DataFrame (2-D table) and Series (1-D labeled column). The workhorse for loading, cleaning, and aggregating data.

import pandas as pd
df = pd.read_csv('data.csv')              # load a CSV file into a DataFrame
df.groupby('country')['revenue'].sum()    # total revenue per country
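For a version that runs without data.csv, here is a minimal sketch using a small in-memory table. The column names follow the snippet above; the rows are invented purely for illustration.

```python
import pandas as pd

# Hypothetical sample rows standing in for data.csv (values are made up).
df = pd.DataFrame(
    {
        'country': ['US', 'US', 'DE', 'DE', 'FR'],
        'revenue': [100.0, 150.0, 80.0, 20.0, 60.0],
    }
)

# Sum revenue within each country; the result is a Series indexed by country.
revenue_by_country = df.groupby('country')['revenue'].sum()
print(revenue_by_country)
```

The result can be sorted with revenue_by_country.sort_values(ascending=False) to rank countries by revenue.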

Docs: pandas.pydata.org

NumPy — numerical arrays

N-dimensional arrays with vectorized math. Underlies pandas and most of the libraries below.

import numpy as np
a = np.array([1, 2, 3])
np.mean(a)              # arithmetic mean
np.std(a)               # population standard deviation (ddof=0)
np.percentile(a, 95)    # 95th percentile
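"Vectorized" means an operation applies to a whole array at once instead of looping over elements in Python. A minimal sketch, with hypothetical prices and quantities chosen only for illustration:

```python
import numpy as np

# Hypothetical values, not taken from any real dataset.
prices = np.array([10.0, 20.0, 30.0])
quantities = np.array([1, 2, 3])

# Element-wise multiplication: no explicit Python loop is needed.
line_totals = prices * quantities

# Reductions collapse an array to a single value.
total = line_totals.sum()

# A boolean mask selects the elements that satisfy a condition.
large = line_totals[line_totals > 30]

print(line_totals, total, large)
```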

Docs: numpy.org

matplotlib — plotting

Foundational charting library.

import matplotlib.pyplot as plt
# x and y are equal-length sequences, e.g. dates and revenue values
plt.plot(x, y)
plt.xlabel('Date')
plt.ylabel('Revenue')
plt.show()
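The snippet above uses the implicit pyplot state. For scripts and multi-panel figures, the explicit Figure/Axes interface is often easier to manage. A minimal sketch, assuming made-up daily revenue values and a hypothetical output file name, and using the non-interactive Agg backend so it runs without a display:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; omit this in a notebook
import matplotlib.pyplot as plt

# Hypothetical data for illustration only.
days = [1, 2, 3, 4, 5]
revenue = [120, 135, 128, 150, 162]

# Create one figure containing one set of axes.
fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(days, revenue, marker='o')
ax.set_xlabel('Day')
ax.set_ylabel('Revenue')
ax.set_title('Daily revenue')

# Write the chart to disk instead of opening a window.
fig.tight_layout()
fig.savefig('revenue.png', dpi=150)
```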

Docs: matplotlib.org

seaborn — statistical viz

Built on matplotlib, opinionated, prettier defaults.

import seaborn as sns
sns.boxplot(data=df, x='category', y='revenue')
sns.heatmap(df.corr(numeric_only=True), annot=True)  # skip non-numeric columns such as 'category'
sns.pairplot(df)
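A correlation heatmap only makes sense for numeric columns, and in pandas 2.0 and later df.corr() raises an error if the frame still contains text columns such as 'category'. A minimal sketch that restricts the correlation to numeric columns, assuming a small invented table and a hypothetical output file name:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; omit this in a notebook
import pandas as pd
import seaborn as sns

# Hypothetical data: one text column and two numeric columns.
df = pd.DataFrame(
    {
        'category': ['A', 'A', 'B', 'B', 'C', 'C'],
        'revenue': [10.0, 12.0, 20.0, 25.0, 15.0, 18.0],
        'clicks': [100, 120, 210, 240, 150, 170],
    }
)

# numeric_only=True drops 'category' before computing pairwise correlations.
corr = df.corr(numeric_only=True)

# Fix the color scale to [-1, 1] so colors are comparable across charts.
ax = sns.heatmap(corr, annot=True, vmin=-1, vmax=1, cmap='coolwarm')
ax.figure.savefig('corr.png')
```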

Docs: seaborn.pydata.org

plotly — interactive viz

Hoverable, zoomable, dashboard-ready.

import plotly.express as px
fig = px.scatter(df, x='ad_spend', y='revenue', color='channel', size='clicks')
fig.show()

Docs: plotly.com/python

SciPy — scientific computing

Statistical tests, optimization, signal processing.

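from scipy import stats
stats.ttest_ind(group_a, group_b)   # two-sample t-test (assumes equal variances by default)
stats.pearsonr(x, y)                # Pearson correlation coefficient and p-value
stats.zscore(arr)                   # standardize to mean 0, standard deviation 1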
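The calls above return result objects rather than bare numbers. A minimal sketch of reading those results, using synthetic samples from a seeded random generator. The group means and sizes are arbitrary assumptions, and the t-test here uses equal_var=False (Welch's variant) rather than the default:

```python
import numpy as np
from scipy import stats

# Synthetic samples; loc, scale, and size are arbitrary illustrative choices.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=100, scale=10, size=200)
group_b = rng.normal(loc=105, scale=10, size=200)

# Welch's t-test: does not assume the two groups share a variance.
result = stats.ttest_ind(group_a, group_b, equal_var=False)
print(result.statistic, result.pvalue)

# pearsonr unpacks into the correlation coefficient and its p-value.
r, p = stats.pearsonr(group_a, group_b)
print(r, p)

# zscore rescales a sample to mean 0 and standard deviation 1.
z = stats.zscore(group_a)
print(z.mean(), z.std())
```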

Docs: scipy.org

statsmodels — econometrics & stats models

Linear / logistic regression, time series, ANOVA. Returns rich statistical summaries.

import statsmodels.api as sm
X = sm.add_constant(df[['x1', 'x2']])  # OLS fits no intercept unless a constant column is added
model = sm.OLS(df['y'], X).fit()
print(model.summary())

Docs: statsmodels.org

scikit-learn — machine learning

Models, preprocessing, evaluation.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

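# X is the feature matrix and y the label vector; hold out 20% of rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression().fit(X_train, y_train)
model.score(X_test, y_test)  # mean accuracy on the held-out test set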
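The snippet above leaves X and y undefined. A minimal end-to-end sketch, assuming a synthetic dataset from make_classification in place of real data. The sample count, feature count, and seeds are arbitrary choices:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification data standing in for a real feature matrix.
X, y = make_classification(n_samples=500, n_features=5, random_state=0)

# Fixed random_state makes the split reproducible; stratify keeps class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# max_iter is raised from the default as a precaution against convergence warnings.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# score() reports mean accuracy on data the model has not seen.
accuracy = model.score(X_test, y_test)
print(accuracy)
```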

Docs: scikit-learn.org

Jupyter — interactive notebooks

uv add jupyter
uv run jupyter lab

Docs: jupyter.org

Other useful libraries

Library             Purpose
polars              Fast DataFrame library written in Rust; a pandas alternative
duckdb              In-process SQL over Parquet, CSV, and pandas DataFrames
pyarrow             Apache Arrow columnar in-memory format; Parquet I/O
great-expectations  Data validation framework
pandera             Schema validation for DataFrames
requests            HTTP client for calling APIs
openpyxl            Excel (.xlsx) read/write
sqlalchemy          SQL toolkit and ORM; database connections
