What is supervised learning?
In supervised learning, you give the model both the inputs and the correct outputs during training. It learns the mapping from one to the other. At prediction time, it applies that learned mapping to inputs it has never seen. The "supervision" is the labeled training data.
Most production ML systems run on supervised learning. Fraud detection, churn prediction, price forecasting, medical diagnosis, recommendation scoring — these are classification or regression problems with labeled historical data. That's the domain this series lives in.
Within supervised learning, there are three core task types. Regression predicts a continuous number (house price, temperature, demand). Classification predicts a discrete class (spam vs not, churn vs retain, healthy vs diseased). Ranking orders items by relevance (search results, feed ranking, recommendations). The model families below handle one or more of these.
Five families, one decision
Every supervised learning model belongs to one of five families, based on how it constructs the input-to-output mapping. Knowing which family a model belongs to tells you its core assumptions, where it tends to break down, and whether it's even the right class of tool for your problem.
Each family block below covers the intuition and failure modes for that family, then lists every model in it with a short description. Models with a completed deep dive are linked directly. Everything else is marked coming soon and will be added over time.
Model families
Linear models assume the output is a weighted sum of the input features. For regression, that's a direct linear combination. For classification, a linear boundary separates the classes. This is a strong assumption, and it pays off when the data actually has that structure: fast training, fast inference, and coefficients you can hand to a stakeholder and explain.
The catch is that real-world relationships are rarely linear. You can extend these models with polynomial features or kernel tricks (SVM), but at that point you're adding complexity that a tree-based model handles more naturally. That said, linear models make excellent baselines. If a logistic regression already solves your problem, adding a gradient booster is just overhead.
Regularization (Ridge, Lasso, ElasticNet) is critical in practice. Without it, linear models overfit whenever features outnumber samples or when features are correlated. Lasso also does implicit feature selection by driving irrelevant coefficients to exactly zero.
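Lasso's implicit feature selection can be sketched in a few lines. This is an illustrative example on synthetic data, not a recipe: the data, the `alpha` value, and the signal strengths are all made up for the demonstration.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic regression data: 100 samples, 20 features,
# but only the first 3 features actually matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
true_coef = np.zeros(20)
true_coef[:3] = [2.0, -1.0, 0.5]
y = X @ true_coef + rng.normal(scale=0.1, size=100)

# Lasso drives small, irrelevant coefficients to exactly zero,
# acting as implicit feature selection.
model = Lasso(alpha=0.1).fit(X, y)
selected = np.flatnonzero(model.coef_)
print(selected)  # indices of the surviving (nonzero) coefficients
```

The informative features should survive the L1 penalty; depending on `alpha`, a few noise features may too, which is why `alpha` is normally tuned by cross-validation (e.g. `LassoCV`).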
Linear & Logistic Regression (Coming soon)
The starting point for any ML problem. Regression for continuous output, logistic for classification. The place to understand gradient descent before anything else.

Ridge, Lasso & ElasticNet (Coming soon)
Regularized linear models. Ridge penalizes large weights, Lasso zeros out irrelevant ones, ElasticNet does both. Essential when features outnumber samples.

Support Vector Machines (Coming soon)
Finds the maximum-margin boundary between classes. The kernel trick maps inputs to a higher-dimensional space where a linear boundary becomes non-linear in the original space.

Instance-based models are the odd one out: there is no training phase in the traditional sense. The model just stores the training data. When a new input arrives, it finds the most similar stored examples (by distance) and uses their labels to make a prediction. All the work happens at inference time, not training time.
K-Nearest Neighbors is the canonical example. You pick K, and at prediction time it finds the K closest points in the training set (by Euclidean distance, cosine similarity, or whatever metric fits the problem) and takes a majority vote for classification or an average for regression.
The tradeoffs are real. No training cost, but inference cost scales linearly with training set size. Predictions are fully interpretable since you can point to exactly which training examples drove the output. But the model is extremely sensitive to feature scale (feature scaling is mandatory) and degrades fast in high-dimensional spaces because distance metrics stop being meaningful. This is the curse of dimensionality hitting this family the hardest.
Instance-based models are less common in production than they used to be. Tree-based models and neural networks usually outperform them at scale. But KNN still shows up in recommendation systems (item similarity), anomaly detection, and as a simple strong baseline.
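The effect of feature scaling on KNN is easy to demonstrate. A minimal sketch, using scikit-learn's bundled breast cancer dataset (whose features span very different scales) and an arbitrary K of 5:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Without scaling, large-magnitude features dominate the distance metric.
raw = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# With scaling, every feature contributes comparably to the distance.
scaled = make_pipeline(
    StandardScaler(), KNeighborsClassifier(n_neighbors=5)
).fit(X_train, y_train)

print(raw.score(X_test, y_test), scaled.score(X_test, y_test))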
K-Nearest Neighbors (KNN) (Coming soon)
Predicts based on the K most similar training examples. No explicit model, no training. Useful as a baseline, and genuinely good for small datasets with well-scaled features.

Radius Neighbors (Coming soon)
A variant of KNN in which all neighbors within a given radius vote, instead of a fixed K. More adaptive in regions of varying density.

Probabilistic models build an explicit statistical model of the data and use Bayes' theorem to compute class probabilities directly. Instead of learning a decision boundary, they model the probability of each class given the input features.
Naive Bayes is the most common member of this family. It's called "naive" because it assumes all features are conditionally independent given the class label. That assumption is almost never literally true, but the model works surprisingly well in practice, especially on text. For spam filtering and document classification it's been competitive with much more complex models for decades.
The appeal is speed and interpretability. Naive Bayes trains in a single pass through the data (just computing frequencies), handles high-dimensional feature spaces well, and works with very small datasets. The failure mode is when feature independence is badly violated and the probability estimates become unreliable, though the classification decision is often still correct even when the probabilities themselves are off.
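The single-pass, frequency-counting nature of Naive Bayes makes a text classifier almost trivial to set up. A sketch with a tiny made-up corpus (the documents and labels are invented for illustration; any real labeled text works the same way):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny illustrative corpus: 1 = spam, 0 = ham.
docs = [
    "win cash now", "free prize claim now", "claim your free cash",
    "meeting at noon", "lunch with the team", "project status meeting",
]
labels = [1, 1, 1, 0, 0, 0]

# Multinomial NB over word counts: training is one pass of counting
# word frequencies per class, plus Laplace smoothing.
clf = make_pipeline(CountVectorizer(), MultinomialNB()).fit(docs, labels)
pred = clf.predict(["free cash prize", "team meeting at noon"])
print(pred)  # → [1 0]
```

Every token in the first query appears only in spam documents and every token in the second only in ham, so the smoothed class probabilities point the obvious way.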
Naive Bayes (Coming soon)
Fast probabilistic classifier. Three variants: Gaussian (continuous features), Multinomial (word counts), Bernoulli (binary features). The go-to for text classification baselines.

Linear Discriminant Analysis (Coming soon)
Models each class as a Gaussian, finds the linear boundary that maximally separates them. Also useful as a dimensionality reduction technique before classification.

Tree-based models are the most practical family for tabular data. They make no assumptions about feature distributions, handle missing values natively in some implementations, work on mixed numeric and categorical data, and capture non-linear interactions between features without any feature engineering.
The family splits into two strategies: bagging and boosting. Bagging (Random Forest) trains many trees independently on random data subsets and averages their predictions. Errors are reduced by variance cancellation. Boosting (XGBoost, LightGBM, CatBoost) trains trees sequentially, with each new tree correcting the errors of the previous ensemble. Boosting tends to hit higher accuracy but requires more tuning and is easier to overfit.
Gradient boosting methods dominate Kaggle competitions on tabular data for a reason: they're robust, well-regularized, handle feature interactions automatically, and scale to datasets too large for neural networks to be practical without massive compute.
Decision Trees (Coming soon)
The atomic building block. A flowchart of if/else splits that partitions data into leaf nodes. Interpretable and fast, but high variance on its own — a single tree overfits easily.

Random Forest (Coming soon)
Many trees trained on random data and feature subsets. Averages out individual tree errors. Hard to overfit and requires minimal tuning. The solid default before reaching for boosting.

XGBoost (Coming soon)
The framework that kicked off the gradient boosting era on Kaggle. Second-order gradients, level-wise tree growth, built-in L1/L2 regularization, and the best documentation in the space.

LightGBM (Read →)
Microsoft's take on gradient boosting. Histogram-based split finding, leaf-wise tree growth, GOSS sampling, and EFB feature bundling. Faster than XGBoost at scale with competitive accuracy.

CatBoost (Coming soon)
Yandex's gradient boosting framework, built around native categorical handling. Symmetric trees, ordered target encoding, strong out-of-the-box defaults. Best choice when your data is heavy on categoricals.

Neural networks learn by stacking layers of transformations, where each layer extracts increasingly abstract features from the input. A feedforward network is the baseline: inputs flow forward through weighted layers, a loss is computed at the output, and gradients flow backward to update the weights. That's backpropagation.
From that foundation, specialized architectures handle specific data types. Convolutional networks (CNNs) exploit spatial structure in images. Recurrent networks (RNNs, LSTMs) handle sequences. Transformers replaced both of those for most tasks by 2021 and are now the architecture behind basically every large model you've heard of.
On tabular data, neural networks historically underperform tree-based models without significant effort. They need more data, more tuning, and more compute for equivalent results. Architectures like TabNet and FT-Transformer have closed the gap somewhat, but gradient boosting is still the first thing to reach for on structured data.
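The forward-loss-backward loop described above fits in a few lines of NumPy. A minimal sketch of a one-hidden-layer network trained with manual backpropagation on XOR; the layer size, learning rate, and iteration count are arbitrary choices for the toy problem:

```python
import numpy as np

# Toy problem: XOR, which no linear model can solve.
rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.normal(size=(2, 8)); b1 = np.zeros(8)
W2 = rng.normal(size=(8, 1)); b2 = np.zeros(1)
sigmoid = lambda z: 1 / (1 + np.exp(-z))
lr = 0.5

for _ in range(5000):
    # Forward pass: inputs flow through the weighted layers.
    h = np.tanh(X @ W1 + b1)
    p = sigmoid(h @ W2 + b2)
    # Backward pass: cross-entropy gradients flow back layer by layer.
    dz2 = (p - y) / len(X)
    dW2, db2 = h.T @ dz2, dz2.sum(axis=0)
    dz1 = (dz2 @ W2.T) * (1 - h ** 2)
    dW1, db1 = X.T @ dz1, dz1.sum(axis=0)
    # Gradient step on every parameter.
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(np.round(p.ravel(), 2))  # close to [0, 1, 1, 0] after training
```

Specialized architectures change what happens in the forward pass (convolutions, attention), but the train loop — forward, loss, backward, update — is the same everywhere.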
Feedforward Networks (Coming soon)
The foundation. Layers of neurons with activation functions, trained via backpropagation. Getting this right — initialization, activation choice, regularization — is a prerequisite for everything else.

CNNs (Coming soon)
Convolutional networks for spatial data. Filters slide across inputs to detect local patterns. The reason deep learning took off in computer vision.

Transformers (Coming soon)
Attention-based architecture that processes sequences in parallel. Now the dominant architecture for NLP, vision, and multimodal tasks. The core of every major LLM.