Concept Exploration

Data Preprocessing

The conditioning layer between raw data and a model that can learn. Each chapter covers one step in the pipeline, why it exists, and how to apply it without breaking things downstream.

What is it?

Raw data is messy, inconsistently scaled, and not shaped for learning algorithms. Preprocessing is the set of transformations applied to raw data before it goes near a model. It does not create new information. It restructures existing information so the algorithm's internal machinery can work correctly.

A lot of people mix up EDA, preprocessing, and feature engineering because they all happen before training. They are not the same thing. EDA is for understanding the data without touching it. Preprocessing fixes and reshapes what is there. Feature engineering then builds new variables on top of that cleaned foundation. Getting the order wrong, or skipping a step, usually shows up later as a quiet bug that is painful to trace.

Where Preprocessing Sits in the Pipeline
Raw Data
|
EDA — understand, don't change
|
Preprocessing ← this series
|
Feature Engineering
|
Model Training → Evaluation → Deployment

Skipping preprocessing does not mean nothing happens. It means the algorithm receives raw, inconsistent data and handles it on its own. Distance-based models go blind to small-scale features. Gradient-based models converge slowly or not at all. Tree models are largely immune to feature scale. Knowing which situation you are in, and why, is what this series is for.
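To make the distance-based case concrete, here is a minimal sketch in plain Python with made-up numbers: a thousand-dollar income gap swamps a 35-year age gap until both features are scaled to the same range.

```python
import math

# Two features on wildly different scales. Euclidean distance is
# dominated by the large-scale feature (income), so age barely matters.
a = {"income": 52_000, "age": 25}
b = {"income": 51_000, "age": 60}   # very different person, income aside
c = {"income": 52_000, "age": 26}   # nearly identical person

def euclidean(p, q):
    return math.sqrt(sum((p[k] - q[k]) ** 2 for k in p))

raw_ab = euclidean(a, b)   # ~1000.6, almost all of it from income
raw_ac = euclidean(a, c)   # 1.0

# Min-max scaling puts both features on [0, 1] and restores balance.
def min_max(points):
    keys = points[0].keys()
    lo = {k: min(p[k] for p in points) for k in keys}
    hi = {k: max(p[k] for p in points) for k in keys}
    return [{k: (p[k] - lo[k]) / (hi[k] - lo[k]) for k in keys}
            for p in points]

sa, sb, sc = min_max([a, b, c])
scaled_ab = euclidean(sa, sb)   # ~1.41: a and b now differ on both axes
scaled_ac = euclidean(sa, sc)   # ~0.03: a and c are genuinely close
```

After scaling, the nearest neighbour of `a` flips from "whoever has the closest income" to "whoever is actually most similar" — which is the whole point.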

Topics in this series

Each topic is a standalone deep dive. They follow a logical order but can be read on their own. The goal is a complete mental model of each step, not just a how-to.

01

Feature Scaling

Why algorithms break on unscaled data, which method fits which situation, and how to apply it correctly without leaking the test set into your training process.

Read →
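As a taste of the leakage point above, a minimal sketch with made-up numbers: the statistics are computed from the training split only, then reused on the test split. This is the same fit-on-train, transform-on-both pattern that sklearn's StandardScaler enforces.

```python
# Standardise correctly: mean and std come from the training split only,
# then the *same* statistics are applied to the test split.
train = [12.0, 15.0, 14.0, 10.0, 9.0]
test = [30.0, 11.0]   # never touches the fitted statistics

mean = sum(train) / len(train)                                  # 12.0
std = (sum((x - mean) ** 2 for x in train) / len(train)) ** 0.5

def standardise(xs):
    return [(x - mean) / std for x in xs]

train_z = standardise(train)   # mean ~0 by construction
test_z = standardise(test)     # 30.0 lands several std devs out — fine
```

Recomputing `mean` and `std` over train and test combined is the leak: the model's inputs would then encode information about data it is supposed to be evaluated on.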
02

Feature Encoding

Making categorical variables usable by algorithms that only work with numbers. Ordinal, one-hot, and target encoding, and when each one actually makes sense.

Coming soon
03

Handling Missing Data

Strategies for gaps in the dataset. Mean imputation, model-based imputation, and when it is better to just drop the row or column and move on.

Coming soon
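The baseline strategy from that list, mean imputation, fits in a few lines. A sketch with `None` standing in for a missing value:

```python
# Mean imputation: replace missing entries (None) with the mean
# of the observed entries in the same column.
def impute_mean(col):
    observed = [x for x in col if x is not None]
    m = sum(observed) / len(observed)
    return [m if x is None else x for x in col]

filled = impute_mean([3.0, None, 5.0, None, 4.0])
# [3.0, 4.0, 5.0, 4.0, 4.0] — gaps filled with the mean of 3, 5, 4
```

It is simple and preserves the column mean, but it shrinks the variance and ignores relationships between columns — the trade-offs the article digs into.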
04

Outlier Treatment

How to find extreme values, decide whether they are signal or noise, and handle them without distorting what the rest of the data is trying to say.

Coming soon
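One common way to do the "find extreme values" part is the IQR rule of thumb: flag anything outside [Q1 − 1.5·IQR, Q3 + 1.5·IQR]. A sketch using Python's standard library:

```python
import statistics

# Flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
def iqr_outliers(xs):
    q1, _, q3 = statistics.quantiles(xs, n=4)   # quartile cut points
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in xs if x < lo or x > hi]

data = [10, 12, 11, 13, 12, 11, 90]
flagged = iqr_outliers(data)   # [90]
```

Flagging is the easy part; deciding whether 90 is a sensor glitch or your most interesting customer is the judgment call the article is about.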