Glossary

Searchable reference of terms.

A

Term Meaning
Alternative hypothesis (H₁) Statement that contradicts the null hypothesis; accepted only with sufficient evidence
ANOVA Analysis of Variance; tests difference of means between three or more groups
ANCOVA Analysis of Covariance; ANOVA with control for covariates
ARPU Average Revenue Per User
Array Collection of values in cells (spreadsheet)

B

Term Meaning
Backward elimination Variable selection: start full, remove the variable adding least explanatory power
Bagging Bootstrap + aggregating; ensemble method
Base learner Each individual model in an ensemble
Bayes' theorem $P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$
Bayesian inference Statistical inference that updates a prior distribution with observed data to obtain a posterior distribution
Bias (model) Error introduced by simplifying assumptions; high bias → underfitting
Bias-variance tradeoff Balance between bias and variance to minimize generalization error
Binomial distribution Discrete; models successes in n independent trials
Bins Tableau term for custom segments
Blackbox model Model whose predictions can't be precisely explained
Bounce rate % of visitors who leave after viewing one page
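
As a rough illustration of the Bayes' theorem entry in this section, the sketch below computes a posterior probability P(A|B) from a prior P(A), a likelihood P(B|A), and a false positive rate P(B|not A). The function name `bayes_posterior` and the screening-test numbers are hypothetical, chosen only for the example.

```python
def bayes_posterior(prior, likelihood, false_positive_rate):
    """Return P(A|B) given P(A), P(B|A), and P(B|not A)."""
    # Law of total probability: P(B) = P(B|A)P(A) + P(B|not A)P(not A)
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence


# Hypothetical screening test: 1% prevalence, 95% sensitivity, 5% false positives.
posterior = bayes_posterior(prior=0.01, likelihood=0.95, false_positive_rate=0.05)
print(round(posterior, 3))  # 0.161
```

Even with a fairly accurate test, the low prior keeps the posterior small, which is the usual reason this calculation is worth doing explicitly.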

C

Term Meaning
CAC Customer Acquisition Cost
Categorical variables Variables with finite groups
Causation Cause-and-effect; X directly causes Y
Centroid Center of a cluster (mean of all points)
Child node Node pointed to from another node (decision tree)
CLT Central Limit Theorem
CLV / LTV Customer Lifetime Value
Confidence band Area around regression line showing uncertainty
Confidence interval Range expected to contain a parameter at given confidence
Confidence level Long-run share of intervals, built by the same method, that would contain the true population parameter (e.g., 95%)
Confusion matrix TP/TN/FP/FN visualization for classifier
Content-based filtering Recommendation based on content attributes
Continuous variables Variables that can take any value within a range (infinitely many, uncountable)
Conversion rate % completing a desired action
Customer churn Rate at which customers stop using a product
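
The confusion matrix entry in this section can be made concrete with a small sketch that tallies TP/TN/FP/FN for a binary classifier. The helper `confusion_counts` and the label lists are made up for illustration; libraries such as scikit-learn provide their own equivalents.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for binary labels."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            # Predicted positive: correct only if the actual label is positive.
            counts["TP" if actual == positive else "FP"] += 1
        else:
            # Predicted negative: a miss if the actual label is positive.
            counts["FN" if actual == positive else "TN"] += 1
    return counts


y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # hypothetical ground truth
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]  # hypothetical classifier output
counts = confusion_counts(y_true, y_pred)
print(counts)  # {'TP': 3, 'TN': 3, 'FP': 1, 'FN': 1}
```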

D

Term Meaning
Data tidying Structuring datasets to facilitate analysis
Decision node Tree node where decisions are made
Decision tree Flow-chart-like supervised classifier
Dependent events Two events where one's occurrence changes the other's probability
Dirty data Incomplete, incorrect, or irrelevant data
Duplicate data Records appearing more than once
Dummy variables 0/1 variables indicating presence/absence

E

Term Meaning
EDA Exploratory Data Analysis
Empirical rule For a normal distribution, about 68% / 95% / 99.7% of values fall within 1 / 2 / 3 standard deviations of the mean
Ensemble learning Aggregating model outputs for better prediction
Estimated response rate Expected % of survey respondents
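
The 68/95/99.7 figures of the empirical rule can be checked numerically. The sketch below uses `statistics.NormalDist` from the Python standard library (3.8+) and assumes a standard normal distribution; the variable names are illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

# Share of the distribution within k standard deviations of the mean.
coverage = {k: standard_normal.cdf(k) - standard_normal.cdf(-k) for k in (1, 2, 3)}

for k, share in coverage.items():
    print(f"within {k} sd: {share:.4f}")
# within 1 sd: 0.6827
# within 2 sd: 0.9545
# within 3 sd: 0.9973
```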

F

Term Meaning
Feature engineering Using domain knowledge to create features
Field A single piece of info from a row/column
Field length Number of characters allowed
Foreign key Field that's a primary key in another table
Forward selection Variable selection: start null, add variable with most explanatory power

G

Term Meaning
GBMs Gradient Boosting Machines
Gradient boosting Boosting where each base learner predicts residuals of the prior
GridSearch Tool to find best hyperparameter combination

H

Term Meaning
Heatmap Visualization using two colors to show magnitude
Histogram Visualization of value distribution
Hold-out sample Random sample not used to fit the model
Homoscedasticity Constant variance of residuals
Hypothesis Theory based on evidence, not yet proven
Hypothesis Testing Statistical procedure that uses sample data to judge whether survey/experiment results are statistically significant

I

Term Meaning
Inertia Sum of squared distances between observations and nearest centroid
Input validation Analyzing data to ensure it's complete, error-free, high quality
Intercept Y when X = 0
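
To illustrate the inertia entry in this section, here is a minimal sketch that sums squared Euclidean distances from each observation to its nearest centroid. The points and centroids are invented, and the function is not tied to any particular clustering library.

```python
def inertia(points, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    total = 0.0
    for point in points:
        # Squared Euclidean distance to every centroid; keep the smallest.
        total += min(
            sum((p - c) ** 2 for p, c in zip(point, centroid))
            for centroid in centroids
        )
    return total


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]  # hypothetical data
centroids = [(0.0, 1.0), (10.0, 1.0)]                        # hypothetical cluster centers
result = inertia(points, centroids)
print(result)  # 4.0
```

Lower inertia means tighter clusters, though it always decreases as the number of clusters grows, so it is usually read alongside other measures.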

K

Term Meaning
K-means Unsupervised clustering algorithm

L

Term Meaning
Label encoding Assigning unique numbers to categories
Leaf node Tree node where final prediction is made
Library Reusable collection of code
Likelihood Probability of observing actual data given parameters
Logit (log-odds) Logarithm of the odds p / (1 − p); modeled as a linear function of the predictors in logistic regression
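
The logit entry can be sketched in a few lines: the log-odds of a probability p, together with its inverse (the logistic function), which maps a log-odds value back to a probability. The function names are illustrative.

```python
import math


def logit(p):
    """Log-odds of a probability p, for 0 < p < 1."""
    return math.log(p / (1 - p))


def inverse_logit(z):
    """Map a log-odds value back to a probability (logistic function)."""
    return 1 / (1 + math.exp(-z))


print(logit(0.5))            # 0.0
print(round(logit(0.8), 4))  # 1.3863
```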

M

Term Meaning
Machine Learning Use of algorithms/models so systems learn from data
MAE Mean Absolute Error
MANOVA / MANCOVA Multivariate ANOVA / ANCOVA
Margin of error Max amount sample results expected to differ from population
Maximum likelihood estimation (MLE) Estimate parameters that maximize observation likelihood
Mean Average value
Measures of central tendency Mean, median, mode
Measures of dispersion Range, variance, standard deviation
Measures of position Position relative to others (percentile, quartile)
Median Middle value
Mode Most frequent value
Model assumptions Statements about data that must hold to use a technique
Modulo Operator returning remainder (MOD in spreadsheets)
MRR / ARR Monthly / Annual Recurring Revenue
MSE Mean Squared Error
Multicollinearity High correlation among predictors
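
The MAE and MSE entries in this section differ only in how each error is penalized. The sketch below computes both on a small set of made-up values; the function names are illustrative.

```python
def mae(y_true, y_pred):
    """Mean Absolute Error: average of |observed - predicted|."""
    return sum(abs(a - p) for a, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean Squared Error: average of (observed - predicted)^2."""
    return sum((a - p) ** 2 for a, p in zip(y_true, y_pred)) / len(y_true)


y_true = [3.0, 5.0, 2.0, 7.0]  # hypothetical observed values
y_pred = [2.0, 5.0, 4.0, 8.0]  # hypothetical predictions
print(mae(y_true, y_pred))  # 1.0
print(mse(y_true, y_pred))  # 1.5
```

Because MSE squares each error, the single error of 2 contributes more to it than to MAE.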

N

Term Meaning
Null Empty field (missing value)
Null hypothesis (H₀) Assumed true unless evidence shows otherwise

O

Term Meaning
OLS Ordinary Least Squares
One hot encoding Turn one categorical variable into several binary variables
Outdated data Old data needing replacement
Outlier Value far from others; often defined as below Q1 − 1.5×IQR or above Q3 + 1.5×IQR
Overfitting Model fits training data too specifically; poor generalization
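
The 1.5×IQR convention from the outlier entry can be sketched as follows. Quartile definitions vary between tools; this example assumes the default "exclusive" method of `statistics.quantiles`, so other software may place the fences slightly differently. The data are invented.

```python
from statistics import quantiles


def iqr_outliers(values):
    """Return values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, _, q3 = quantiles(values, n=4)  # assumption: default 'exclusive' method
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lower or v > upper]


data = [1, 2, 3, 4, 5, 6, 7, 8, 100]  # hypothetical sample
outliers = iqr_outliers(data)
print(outliers)  # [100]
```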

P

Term Meaning
Parameter (stats) Characteristic of a population
Percentile Value below which a percentage of data falls
Pivot table Data summarization tool
Poisson distribution Probability of n events in fixed interval
Population Total set you sample from
Predicted values Estimated Y for each X
Primary key Column with unique value per row
P-value Probability of observing results at least as extreme as those observed, assuming H₀ is true
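
As a hedged illustration of the P-value entry, the sketch below computes a two-sided p-value for a z statistic. It assumes the test statistic follows a standard normal distribution under H₀; the function name and the example value 1.96 are chosen for illustration.

```python
from statistics import NormalDist


def two_sided_p_value(z):
    """P(|Z| >= |z|) for a standard normal Z, i.e. under H0."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


p = two_sided_p_value(1.96)
print(round(p, 3))  # 0.05
```

A z statistic of about 1.96 sits right at the conventional α = 0.05 threshold for a two-sided test.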

R

Term Meaning
Random experiment Process whose outcome can't be predicted with certainty
Range Max − Min
Recommendation systems Unsupervised techniques offering relevant suggestions
Residual Observed − Predicted
R² (coefficient of determination) Proportion of variance in the dependent variable explained by regressors
Root node First node of a decision tree
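
The residual and proportion-of-variance-explained entries in this section are closely linked. The sketch below computes residuals (observed − predicted) and then the proportion of variance explained, commonly written R², as 1 − SSR/SST, where SST (an abbreviation introduced here) is the total sum of squares around the mean. The data are invented.

```python
def r_squared(y, y_hat):
    """Proportion of variance in y explained by the predictions y_hat."""
    residuals = [obs - pred for obs, pred in zip(y, y_hat)]  # observed - predicted
    ssr = sum(r ** 2 for r in residuals)                     # sum of squared residuals
    mean_y = sum(y) / len(y)
    sst = sum((obs - mean_y) ** 2 for obs in y)              # total sum of squares
    return 1 - ssr / sst


y = [1.0, 2.0, 3.0, 4.0]      # hypothetical observed values
y_hat = [1.1, 1.9, 3.2, 3.8]  # hypothetical predicted values
r2 = r_squared(y, y_hat)
print(round(r2, 2))  # 0.98
```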

S

Term Meaning
Sample Subset of a population, ideally representative of it
Sample size Number of items in a sample
Sampling Drawing a subset of data from a population
Scatterplot matrix Series of pairwise scatterplots
Schema Description of data organization
Significance level (α) Probability of rejecting H₀ when true
Silhouette score Mean of silhouette coefficients of clustered observations
Slope Y change per unit X change
Sort range Sort only the selected cells, in isolation from the rest of the sheet
Sort sheet Sort entire sheet (rows kept together)
SQL dialect Variant of SQL specific to a database system
SSR Sum of Squared Residuals
Standard deviation Measure of spread from mean
Standardization Putting variables on the same scale
Statistic Characteristic of a sample
Statistical Power Probability of correctly rejecting a false H₀ (1 − Type II error rate); 0.8 / 80% is the conventional target
Statistical significance Results not explainable by chance alone
Summary statistics Single-number summary of data
Supervised ML Uses labeled datasets
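
One common way to carry out the standardization described in this section is the z-score: subtract the mean and divide by the standard deviation. The sketch below assumes the population standard deviation (`statistics.pstdev`); other scalings exist, and the data are invented.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]  # hypothetical measurements
z_scores = standardize(heights_cm)
print([round(z, 3) for z in z_scores])  # [-1.414, -0.707, 0.0, 0.707, 1.414]
```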

T

Term Meaning
Tidy dataset Easy to manipulate, model, visualize
Type I error Rejecting true H₀ (false positive)
Type II error Failing to reject false H₀ (false negative)
Typecasting Converting data from one type to another

U

Term Meaning
Unsupervised ML Uses unlabeled data

V

Term Meaning
Variance Average squared difference from mean
Variance Inflation Factor (VIF) Quantifies how much correlation among predictors inflates the variance of a coefficient estimate
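
The variance entry's "average squared difference from mean" corresponds to the population variance. The sketch below computes it by hand and compares it with the standard library; `statistics.variance` is included to show the sample version, which divides by n − 1 instead of n. The data are invented.

```python
from statistics import pvariance, variance


def population_variance(values):
    """Average squared difference from the mean (divides by n)."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)


data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical observations
print(population_variance(data))  # 4.0
print(pvariance(data))            # same quantity via the standard library
print(variance(data))             # sample variance, slightly larger (divides by n - 1)
```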

X

Term Meaning
XGBoost Extreme Gradient Boosting; optimized GBM