Glossary

Searchable reference of terms.

A

Term	Meaning
Alternative hypothesis (H₁)	Statement that contradicts the null hypothesis; accepted only with sufficient evidence
ANOVA	Analysis of Variance; tests whether means differ across three or more groups
ANCOVA	Analysis of Covariance; ANOVA with control for covariates
ARPU	Average Revenue Per User
Array	Collection of values in cells (spreadsheet)

B

Term	Meaning
Backward elimination	Variable selection: start full, remove the variable adding least explanatory power
Bagging	Bootstrap aggregating; ensemble method that trains base learners on bootstrap samples and combines their predictions
Base learner	Each individual model in an ensemble
Bayes' theorem	$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$
Bayesian inference	Method that updates a prior distribution with observed data to obtain a posterior distribution
Bias (model)	Error introduced by simplifying assumptions in a model; high bias → underfitting
Bias-variance tradeoff	Balance between bias and variance to minimize generalization error
Binomial distribution	Discrete; models successes in n independent trials
Bins	Tableau term for custom segments
Blackbox model	Model whose predictions can't be precisely explained
Bounce rate	% of visitors who leave after viewing one page
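
As a worked illustration of the Bayes' theorem entry above, the sketch below computes a posterior probability from a prior and two conditional probabilities. The function name `posterior` and the input numbers (1% prior, 95% true-positive rate, 10% false-positive rate) are made up for illustration only.

```python
def posterior(prior: float, p_e_given_h: float, p_e_given_not_h: float) -> float:
    """Return P(H | E) using Bayes' theorem.

    P(E) is expanded with the law of total probability:
    P(E) = P(E | H) P(H) + P(E | not H) P(not H).
    """
    p_e = p_e_given_h * prior + p_e_given_not_h * (1 - prior)
    return p_e_given_h * prior / p_e


# Hypothetical inputs: rare condition, fairly accurate test.
p = posterior(prior=0.01, p_e_given_h=0.95, p_e_given_not_h=0.10)
print(round(p, 4))  # 0.0876
```

Even with a 95% true-positive rate, the posterior stays below 9% here because the prior is so small.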

C

Term	Meaning
CAC	Customer Acquisition Cost
Categorical variables	Variables that take one of a finite set of groups (categories)
Causation	Cause-and-effect; X directly causes Y
Centroid	Center of a cluster (mean of all points)
Child node	Node pointed to from another node (decision tree)
CLT	Central Limit Theorem
CLV / LTV	Customer Lifetime Value
Confidence band	Area around regression line showing uncertainty
Confidence interval	Range expected to contain a parameter at given confidence
Confidence level	Long-run proportion of intervals, built the same way from repeated samples, that would contain the true parameter (e.g., 95%)
Confusion matrix	TP/TN/FP/FN visualization for classifier
Content-based filtering	Recommendation based on content attributes
Continuous variables	Variables that can take any value within a range (uncountably many possible values)
Conversion rate	% completing a desired action
Customer churn	Rate at which customers stop using a product
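
To make the confusion matrix entry above concrete, here is a minimal sketch that tallies the four cells for a binary classifier. The helper name `confusion_counts` and the label lists are hypothetical, and the positive class is assumed to be encoded as `1`.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for binary labels."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            # Predicted positive: correct if the actual label is positive too.
            counts["TP" if actual == positive else "FP"] += 1
        else:
            # Predicted negative: a miss if the actual label was positive.
            counts["FN" if actual == positive else "TN"] += 1
    return counts


y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # made-up actual labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # made-up predictions
counts = confusion_counts(y_true, y_pred)
print(counts)  # {'TP': 3, 'TN': 3, 'FP': 1, 'FN': 1}
```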

D

Term	Meaning
Data tidying	Structuring datasets to facilitate analysis
Decision node	Tree node where decisions are made
Decision tree	Flow-chart-like supervised classifier
Dependent events	Two events where one's occurrence changes the other's probability
Dirty data	Incomplete, incorrect, or irrelevant data
Duplicate data	Records appearing more than once
Dummy variables	0/1 variables indicating presence/absence

E

Term	Meaning
EDA	Exploratory Data Analysis
Empirical rule	For a normal distribution, about 68%, 95%, and 99.7% of values fall within 1, 2, and 3 standard deviations of the mean
Ensemble learning	Aggregating model outputs for better prediction
Estimated response rate	Expected % of people invited to a survey who will complete it

F

Term	Meaning
Feature engineering	Using domain knowledge to create features
Field	A single piece of info from a row/column
Field length	Number of characters allowed
Foreign key	Field that references the primary key of another table
Forward selection	Variable selection: start null, add variable with most explanatory power

G

Term	Meaning
GBMs	Gradient Boosting Machines
Gradient boosting	Boosting where each base learner is fit to the residual errors of the learners before it
GridSearch	Tool that searches a specified grid of hyperparameter values to find the best combination

H

Term	Meaning
Heatmap	Visualization that uses color (hue or intensity) to show magnitude
Histogram	Visualization of value distribution
Hold-out sample	Random sample not used to fit the model
Homoscedasticity	Constant variance of residuals
Hypothesis	Testable statement or proposed explanation based on evidence, not yet proven
Hypothesis Testing	Procedure for deciding whether survey or experiment results are statistically significant

I

Term	Meaning
Inertia	Sum of squared distances between observations and nearest centroid
Input validation	Analyzing data to ensure it's complete, error-free, high quality
Intercept	Y when X = 0
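
The inertia entry above can be sketched directly from its definition. This is an illustrative computation with given centroids, not a K-means implementation; the function name `inertia` and the sample points are assumptions.

```python
def inertia(points, centroids):
    """Sum of squared Euclidean distances from each point to its nearest centroid."""
    total = 0.0
    for point in points:
        # Squared distance to every centroid; keep the smallest one.
        total += min(
            sum((a - b) ** 2 for a, b in zip(point, centroid))
            for centroid in centroids
        )
    return total


# Two made-up clusters, each with a centroid halfway between its two points.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centroids = [(0.0, 1.0), (10.0, 1.0)]
print(inertia(points, centroids))  # 4.0
```

Lower inertia means tighter clusters, though it always decreases as more centroids are added.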

K

Term	Meaning
K-means	Unsupervised clustering algorithm

L

Term	Meaning
Label encoding	Assigning unique numbers to categories
Leaf node	Tree node where final prediction is made
Library	Reusable collection of code
Likelihood	Probability of observing actual data given parameters
Logit (log-odds)	Logarithm of the odds, $\ln\frac{p}{1-p}$; modeled as a linear function of the predictors in logistic regression
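
A small sketch of the logit entry above, using only the standard library. The function names `logit` and `inverse_logit` are illustrative choices; the inverse is the logistic (sigmoid) function that maps log-odds back to a probability.

```python
import math


def logit(p: float) -> float:
    """Log-odds of a probability p, for 0 < p < 1."""
    return math.log(p / (1 - p))


def inverse_logit(z: float) -> float:
    """Logistic function: maps log-odds z back to a probability."""
    return 1 / (1 + math.exp(-z))


print(logit(0.5))            # 0.0 (even odds)
print(round(logit(0.8), 4))  # 1.3863 (odds of 4 to 1)
```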

M

Term	Meaning
Machine Learning	Use of algorithms/models so systems learn from data
MAE	Mean Absolute Error
MANOVA / MANCOVA	Multivariate ANOVA / ANCOVA
Margin of error	Max amount sample results expected to differ from population
Maximum likelihood estimation (MLE)	Method that estimates parameters by choosing the values that maximize the likelihood of the observed data
Mean	Average value
Measures of central tendency	Mean, median, mode
Measures of dispersion	Range, variance, standard deviation
Measures of position	Position relative to others (percentile, quartile)
Median	Middle value of the sorted data
Mode	Most frequent value
Model assumptions	Statements about data that must hold to use a technique
Modulo	Operator returning remainder (`MOD` in spreadsheets)
MRR / ARR	Monthly / Annual Recurring Revenue
MSE	Mean Squared Error
Multicollinearity	High correlation among predictors
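
The MAE and MSE entries above differ only in how each error is penalized, which a short sketch makes visible. The helper names and the observed/predicted values are made up for illustration.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |observed - predicted|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Average of (observed - predicted) squared; penalizes large errors more."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [3.0, 5.0, 2.0, 7.0]  # hypothetical observed values
y_pred = [2.0, 5.0, 4.0, 8.0]  # hypothetical predictions
print(mean_absolute_error(y_true, y_pred))  # 1.0
print(mean_squared_error(y_true, y_pred))   # 1.5
```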

N

Term	Meaning
Null	Empty field (missing value)
Null hypothesis (H₀)	Assumed true unless evidence shows otherwise

O

Term	Meaning
OLS	Ordinary Least Squares
One hot encoding	Turn one categorical variable into several binary variables
Outdated data	Old data needing replacement
Outlier	Value far from others; often flagged when below Q1 − 1.5×IQR or above Q3 + 1.5×IQR
Overfitting	Model fits training data too specifically; poor generalization
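
The 1.5×IQR rule from the outlier entry above can be sketched with the standard library. Note the assumption: `statistics.quantiles` uses its default "exclusive" method here, and other quartile conventions can shift the fences slightly. The function name `iqr_outliers` and the data are hypothetical.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return values below Q1 - k*IQR or above Q3 + k*IQR."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


data = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]  # made-up sample
outliers = iqr_outliers(data)
print(outliers)  # [102]
```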

P

Term	Meaning
Parameter (stats)	Characteristic of a population
Percentile	Value below which a percentage of data falls
Pivot table	Data summarization tool
Poisson distribution	Discrete; models the number of events occurring in a fixed interval at a constant average rate
Population	Total set you sample from
Predicted values	Estimated Y for each X
Primary key	Column (or set of columns) whose value uniquely identifies each row
P-value	Probability of observing results at least as extreme as those observed, assuming H₀ is true

R

Term	Meaning
Random experiment	Process whose outcome can't be predicted with certainty
Range	Max − Min
Recommendation systems	Techniques, often unsupervised, that offer relevant suggestions to users
Residual	Observed − Predicted
R²	Proportion of variance in the dependent variable explained by the regressors
Root node	First node of a decision tree
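
The residual and R² entries above fit together in a simple one-predictor least-squares fit. The sketch below is illustrative only: the data are made up, and the closed-form slope and intercept formulas assume ordinary least squares with a single predictor.

```python
x = [1.0, 2.0, 3.0, 4.0, 5.0]  # hypothetical predictor values
y = [2.0, 4.0, 5.0, 4.0, 5.0]  # hypothetical observed values

mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)

# Closed-form OLS estimates for a single predictor.
slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sum(
    (xi - mean_x) ** 2 for xi in x
)
intercept = mean_y - slope * mean_x

predicted = [intercept + slope * xi for xi in x]
residuals = [yi - pi for yi, pi in zip(y, predicted)]  # observed - predicted

ssr = sum(r ** 2 for r in residuals)         # sum of squared residuals
sst = sum((yi - mean_y) ** 2 for yi in y)    # total sum of squares
r_squared = 1 - ssr / sst

print(round(slope, 2), round(intercept, 2), round(r_squared, 2))  # 0.6 2.2 0.6
```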

S

Term	Meaning
Sample	Representative subset of population
Sample size	Number of items in a sample
Sampling	Drawing a subset of data from a population
Scatterplot matrix	Series of pairwise scatterplots
Schema	Description of data organization
Significance level (α)	Probability of rejecting H₀ when H₀ is true (the Type I error rate)
Silhouette score	Mean of silhouette coefficients of clustered observations
Slope	Y change per unit X change
Sort range	Sort only the selected cells, in isolation from the rest of the sheet
Sort sheet	Sort entire sheet (rows kept together)
SQL dialect	Variant of SQL implemented by a particular database system
SSR	Sum of Squared Residuals
Standard deviation	Measure of spread from mean
Standardization	Putting variables on the same scale
Statistic	Characteristic of a sample
Statistical Power	Probability of correctly rejecting a false H₀ (1 − β); 0.8 (80%) is a common minimum target
Statistical significance	Results not explainable by chance alone
Summary statistics	Single-number summary of data
Supervised ML	Uses labeled datasets
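
As an illustration of the standard deviation and standardization entries above, the sketch below converts values to z-scores. It assumes the population standard deviation (`statistics.pstdev`); using the sample version (`statistics.stdev`) would give slightly different scores. The function name `standardize` and the data are made up.

```python
import statistics


def standardize(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


data = [2, 4, 4, 4, 5, 5, 7, 9]  # mean 5, population standard deviation 2
z_scores = standardize(data)
print(z_scores)  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```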

T

Term	Meaning
Tidy dataset	Dataset structured so that it is easy to manipulate, model, and visualize
Type I error	Rejecting true H₀ (false positive)
Type II error	Failing to reject false H₀ (false negative)
Typecasting	Converting data from one type to another

U

Term	Meaning
Unsupervised ML	Uses unlabeled data

V

Term	Meaning
Variance	Average squared difference from mean
Variance Inflation Factor (VIF)	Quantifies how much the variance of a coefficient estimate is inflated by correlation among predictors

X

Term	Meaning
XGBoost	Extreme Gradient Boosting; optimized GBM