Glossary
Searchable reference of terms.
A
| Term | Meaning |
| --- | --- |
| Alternative hypothesis (H₁) | Statement that contradicts the null hypothesis; accepted only with sufficient evidence |
| ANOVA | Analysis of Variance; tests difference of means between three or more groups |
| ANCOVA | Analysis of Covariance; ANOVA with control for covariates |
| ARPU | Average Revenue Per User |
| Array | Collection of values in cells (spreadsheet) |
B
| Term | Meaning |
| --- | --- |
| Backward elimination | Variable selection: start full, remove the variable adding least explanatory power |
| Bagging | Bootstrap + aggregating; ensemble method |
| Base learner | Each individual model in an ensemble |
| Bayes' theorem | $P(A \mid B) = \dfrac{P(B \mid A)\,P(A)}{P(B)}$ |
| Bayesian inference | Updates a prior belief with observed data to obtain a posterior distribution |
| Bias (model) | Simplifies predictions via assumptions; high bias → underfit |
| Bias-variance tradeoff | Balance between bias and variance to minimize generalization error |
| Binomial distribution | Discrete; models successes in n independent trials |
| Bins | Tableau term for custom segments |
| Blackbox model | Model whose predictions can't be precisely explained |
| Bounce rate | % of visitors who leave after viewing one page |
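
As a rough numeric illustration of the Bayes' theorem entry, the sketch below computes a posterior from a prior and two conditional probabilities. The function name `posterior` and the 1% / 90% / 5% figures are assumptions made up for this example, not values from the glossary.

```python
def posterior(prior_a: float, p_b_given_a: float, p_b_given_not_a: float) -> float:
    """Return P(A | B) = P(B | A) * P(A) / P(B).

    P(B) is expanded with the law of total probability:
    P(B) = P(B | A) * P(A) + P(B | not A) * (1 - P(A)).
    """
    evidence = p_b_given_a * prior_a + p_b_given_not_a * (1 - prior_a)
    return p_b_given_a * prior_a / evidence


# Hypothetical inputs: 1% base rate, 90% true-positive rate, 5% false-positive rate.
p = posterior(prior_a=0.01, p_b_given_a=0.90, p_b_given_not_a=0.05)
print(round(p, 4))  # 0.1538
```

Even with a 90% true-positive rate, the low prior keeps the posterior near 15%, which is the kind of update Bayesian inference formalizes.
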
C
| Term | Meaning |
| --- | --- |
| CAC | Customer Acquisition Cost |
| Categorical variables | Variables with finite groups |
| Causation | Cause-and-effect; X directly causes Y |
| Centroid | Center of a cluster (mean of all points) |
| Child node | Node pointed to from another node (decision tree) |
| CLT | Central Limit Theorem |
| CLV / LTV | Customer Lifetime Value |
| Confidence band | Area around regression line showing uncertainty |
| Confidence interval | Range expected to contain a parameter at given confidence |
| Confidence level | How often intervals built this way would contain the true parameter across repeated samples (e.g., 95%) |
| Confusion matrix | TP/TN/FP/FN visualization for classifier |
| Content-based filtering | Recommendation based on content attributes |
| Continuous variables | Infinite, uncountable values |
| Conversion rate | % completing a desired action |
| Customer churn | Rate at which customers stop using a product |
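
To make the confusion matrix entry concrete, here is a minimal sketch that tallies TP/TN/FP/FN for a binary classifier. The helper name `confusion_counts` and the label lists are hypothetical; libraries such as scikit-learn provide their own equivalents.

```python
from collections import Counter


def confusion_counts(actual, predicted, positive=1):
    """Count true/false positives and negatives for binary labels."""
    counts = Counter()
    for a, p in zip(actual, predicted):
        if p == positive:
            # Predicted positive: correct if the actual label is also positive.
            counts["TP" if a == positive else "FP"] += 1
        else:
            # Predicted negative: a miss if the actual label was positive.
            counts["FN" if a == positive else "TN"] += 1
    return counts


# Made-up labels for illustration only.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0, 1, 0]
cm = confusion_counts(actual, predicted)
print(cm["TP"], cm["TN"], cm["FP"], cm["FN"])  # 3 3 1 1
```
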
D
| Term | Meaning |
| --- | --- |
| Data tidying | Structuring datasets to facilitate analysis |
| Decision node | Tree node where decisions are made |
| Decision tree | Flow-chart-like supervised classifier |
| Dependent events | Two events where one's occurrence changes the other's probability |
| Dirty data | Incomplete, incorrect, or irrelevant data |
| Duplicate data | Records appearing more than once |
| Dummy variables | 0/1 variables indicating presence/absence |
E
| Term | Meaning |
| --- | --- |
| EDA | Exploratory Data Analysis |
| Empirical rule | 68/95/99.7 distribution rule for normal |
| Ensemble learning | Aggregating model outputs for better prediction |
| Estimated response rate | Expected % of survey respondents |
F
| Term | Meaning |
| --- | --- |
| Feature engineering | Using domain knowledge to create features |
| Field | A single piece of info from a row/column |
| Field length | Number of characters allowed |
| Foreign key | Field that's a primary key in another table |
| Forward selection | Variable selection: start null, add variable with most explanatory power |
G
| Term | Meaning |
| --- | --- |
| GBMs | Gradient Boosting Machines |
| Gradient boosting | Boosting where each base learner predicts residuals of the prior |
| GridSearch | Tool to find best hyperparameter combination |
H
| Term | Meaning |
| --- | --- |
| Heatmap | Visualization using color intensity to show magnitude |
| Histogram | Visualization of value distribution |
| Hold-out sample | Random sample not used to fit the model |
| Homoscedasticity | Constant variance of residuals |
| Hypothesis | Theory based on evidence, not yet proven |
| Hypothesis Testing | Procedure for checking whether survey/experiment results are statistically meaningful |
I
| Term | Meaning |
| --- | --- |
| Inertia | Sum of squared distances between observations and nearest centroid |
| Input validation | Analyzing data to ensure it's complete, error-free, high quality |
| Intercept | Y when X = 0 |
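
The inertia definition can be sketched directly: for each observation, take the squared Euclidean distance to its nearest centroid and sum the results. The function name `inertia`, the points, and the centroids below are illustrative assumptions.

```python
def inertia(points, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    total = 0.0
    for point in points:
        # Squared Euclidean distance to every centroid; keep the smallest.
        total += min(
            sum((p - c) ** 2 for p, c in zip(point, centroid))
            for centroid in centroids
        )
    return total


# Hypothetical 2-D data with two obvious clusters.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centroids = [(0.0, 1.0), (10.0, 1.0)]
print(inertia(points, centroids))  # 4.0
```

Lower inertia means tighter clusters, which is why it is often plotted against the number of clusters when tuning K-means.
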
K
| Term | Meaning |
| --- | --- |
| K-means | Unsupervised clustering algorithm |
L
| Term | Meaning |
| --- | --- |
| Label encoding | Assigning unique numbers to categories |
| Leaf node | Tree node where final prediction is made |
| Library | Reusable collection of code |
| Likelihood | Probability of observing actual data given parameters |
| Logit (log-odds) | Logarithm of odds; linear in logistic regression |
M
| Term | Meaning |
| --- | --- |
| Machine Learning | Use of algorithms/models so systems learn from data |
| MAE | Mean Absolute Error |
| MANOVA / MANCOVA | Multivariate ANOVA / ANCOVA |
| Margin of error | Max amount sample results expected to differ from population |
| Maximum likelihood estimation (MLE) | Estimate parameters that maximize observation likelihood |
| Mean | Average value |
| Measures of central tendency | Mean, median, mode |
| Measures of dispersion | Range, variance, standard deviation |
| Measures of position | Position relative to others (percentile, quartile) |
| Median | Middle value |
| Mode | Most frequent value |
| Model assumptions | Statements about data that must hold to use a technique |
| Modulo | Operator returning remainder (MOD in spreadsheets) |
| MRR / ARR | Monthly / Annual Recurring Revenue |
| MSE | Mean Squared Error |
| Multicollinearity | High correlation among predictors |
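
MAE and MSE differ only in how each error is penalized, which a short sketch makes visible. The function names and the sample values are assumptions for illustration.

```python
def mae(actual, predicted):
    """Mean Absolute Error: average of |actual - predicted|."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mse(actual, predicted):
    """Mean Squared Error: average of (actual - predicted) squared."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Made-up observations and predictions.
actual = [3.0, 5.0, 2.0, 7.0]
predicted = [2.0, 5.0, 4.0, 8.0]
print(mae(actual, predicted))  # 1.0
print(mse(actual, predicted))  # 1.5
```

The single error of size 2 contributes 4 to the squared sum, so MSE is more sensitive to large misses than MAE.
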
N
| Term | Meaning |
| --- | --- |
| Null | Empty field (missing value) |
| Null hypothesis (H₀) | Assumed true unless evidence shows otherwise |
O
| Term | Meaning |
| --- | --- |
| OLS | Ordinary Least Squares |
| One hot encoding | Turn one categorical variable into several binary variables |
| Outdated data | Old data needing replacement |
| Outlier | Value far from others; often flagged when below Q1 − 1.5×IQR or above Q3 + 1.5×IQR |
| Overfitting | Model fits training data too specifically; poor generalization |
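
The 1.5×IQR rule in the outlier entry can be sketched with the standard library. The helper name `iqr_outliers` and the data are hypothetical, and quartile values depend on the interpolation method (here the default of `statistics.quantiles`), so fences may differ slightly from other tools.

```python
from statistics import quantiles


def iqr_outliers(values):
    """Return values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR."""
    q1, _median, q3 = quantiles(values, n=4)  # three quartile cut points
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]


# Made-up sample with one obviously extreme value.
data = [2, 4, 4, 5, 6, 7, 8, 9, 30]
outliers = iqr_outliers(data)
print(outliers)  # [30]
```
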
P
| Term | Meaning |
| --- | --- |
| Parameter (stats) | Characteristic of a population |
| Percentile | Value below which a percentage of data falls |
| Pivot table | Data summarization tool |
| Poisson distribution | Discrete; models the probability of n events in a fixed interval |
| Population | Total set you sample from |
| Predicted values | Estimated Y for each X |
| Primary key | Column with unique value per row |
| P-value | Probability of observing results at least as extreme as those observed, assuming H₀ is true |
R
| Term | Meaning |
| --- | --- |
| Random experiment | Process whose outcome can't be predicted with certainty |
| Range | Max − Min |
| Recommendation systems | Unsupervised techniques offering relevant suggestions |
| Residual | Observed − Predicted |
| R² | Proportion of variance explained by regressors |
| Root node | First node of a decision tree |
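
Residuals and R² are linked: one common way to compute R² is 1 − SSR/SST, where SSR is the sum of squared residuals and SST is the total sum of squares around the mean. The sketch below assumes that definition; the function name and numbers are illustrative.

```python
def r_squared(actual, predicted):
    """R^2 = 1 - SSR / SST, using residual = observed - predicted."""
    mean_actual = sum(actual) / len(actual)
    residuals = [a - p for a, p in zip(actual, predicted)]
    ssr = sum(r ** 2 for r in residuals)               # unexplained variation
    sst = sum((a - mean_actual) ** 2 for a in actual)  # total variation
    return 1 - ssr / sst


# Made-up observations and model predictions.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
r2 = r_squared(actual, predicted)
print(round(r2, 2))  # 0.98
```
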
S
| Term | Meaning |
| --- | --- |
| Sample | Representative subset of population |
| Sample size | Number of items in a sample |
| Sampling | Drawing a subset of data from a population |
| Scatterplot matrix | Series of pairwise scatterplots |
| Schema | Description of data organization |
| Significance level (α) | Probability of rejecting H₀ when true |
| Silhouette score | Mean of silhouette coefficients of clustered observations |
| Slope | Y change per unit X change |
| Sort range | Sort selected cells only (in isolation from the rest of the sheet) |
| Sort sheet | Sort entire sheet (rows kept together) |
| SQL dialect | Variant of SQL (list) |
| SSR | Sum of Squared Residuals |
| Standard deviation | Measure of spread from mean |
| Standardization | Putting variables on the same scale |
| Statistic | Characteristic of a sample |
| Statistical Power | Probability of correctly rejecting a false H₀; 0.8 (80%) is a common target |
| Statistical significance | Results not explainable by chance alone |
| Summary statistics | Single-number summary of data |
| Supervised ML | Uses labeled datasets |
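
One common way to put variables on the same scale is the z-score: subtract the mean and divide by the standard deviation. The sketch below assumes that form of standardization and uses the population standard deviation; the function name and data are made up.

```python
from statistics import mean, pstdev


def standardize(values):
    """Return z-scores: (value - mean) / population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # would be 0 for constant data; not handled here
    return [(v - mu) / sigma for v in values]


# Illustrative data with mean 5 and population standard deviation 2.
scores = [2, 4, 4, 4, 5, 5, 7, 9]
z_scores = standardize(scores)
print(z_scores)  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

After standardizing, every variable has mean 0 and standard deviation 1, so features measured in different units become comparable.
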
T
| Term | Meaning |
| --- | --- |
| Tidy dataset | Easy to manipulate, model, visualize |
| Type I error | Rejecting true H₀ (false positive) |
| Type II error | Failing to reject false H₀ (false negative) |
| Typecasting | Converting data from one type to another |
U
| Term | Meaning |
| --- | --- |
| Unsupervised ML | Uses unlabeled data |
V
| Term | Meaning |
| --- | --- |
| Variance | Average squared difference from mean |
| Variance Inflation Factor (VIF) | Quantifies how much correlation among predictors inflates the variance of a coefficient estimate |
X
| Term | Meaning |
| --- | --- |
| XGBoost | Extreme Gradient Boosting; optimized GBM |