Project Deep Dive · Machine Learning

California Housing Price Prediction

A hands-on ML learning series — building a housing price predictor concept by concept, decision by decision.
Documented for absolute beginners.

5 Chapters ~20 min read

ML Pipeline — Where We Are

1. Load Data
2. EDA
3. Visualize
4. Feature Engineering
5. Sampling
6. Model Training
7. Evaluation
Chapter 1

Loading the Dataset

What problem triggered this step?

Before building any model, we need data. The project goal is to predict median house prices across California districts using census statistics — population, income levels, room counts, and geographic coordinates.

The first practical question was: how do we load this dataset and inspect its structure so we can start reasoning about it? At this stage we are not training models. We simply want to get the data into a workable form.

What is this concept?

This step is called data ingestion — reading the dataset, verifying it loaded correctly, and confirming its structure makes sense. In any ML project, this is non-negotiable. If the data loads wrong, every downstream step breaks silently.

In this dataset: each row is one California district, and each column is a feature describing that district.

How does it apply here?

The dataset's ten columns are: longitude, latitude, housing_median_age, total_rooms, total_bedrooms, population, households, median_income, ocean_proximity, and our target variable: median_house_value. Everything except the target becomes an input feature for prediction.

The code — explained

import pandas as pd
import numpy as np

Standard imports. pandas provides the DataFrame; NumPy handles numerical operations.

df = pd.read_csv("/path/to/housing.csv")

Loads the CSV into a Pandas DataFrame — essentially a spreadsheet in Python memory.

df.index = range(1, len(df) + 1)

Resets the row index to start at 1 instead of 0. Not required for modeling, but improves readability during exploration.

Real error encountered: The notebook initially threw a FileNotFoundError. The fix was checking the actual working directory with os.getcwd() and correcting the relative path. A simple mistake that blocks everything downstream.
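A minimal sketch of that debugging step. The filename and relative location here are assumptions — adjust to wherever the CSV actually lives:

```python
import os
from pathlib import Path

cwd = Path(os.getcwd())            # where the notebook is actually running
csv_path = cwd / "housing.csv"     # hypothetical relative location
found = csv_path.exists()          # check before read_csv to fail fast
```

If found is False, printing cwd usually reveals that the notebook launched from a different directory than expected.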

What could go wrong?

  • File path errors (what actually happened here)
  • Encoding problems with special characters
  • Corrupted CSV files or missing headers
  • Incorrect column types inferred by pandas

All of these must be caught before they become invisible bugs deeper in the pipeline.
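A few of these checks can be wired in right after loading. The frame below is a synthetic stand-in (the column names and values are illustrative, not the real data):

```python
import pandas as pd

# Synthetic stand-in for the loaded CSV; column names mirror the real dataset
df = pd.DataFrame({
    "longitude": [-122.23, -118.30],
    "median_income": [8.3252, 3.1250],
    "median_house_value": [452600.0, 185000.0],
})

# Catch missing headers and mis-inferred dtypes immediately after loading
expected_cols = {"longitude", "median_income", "median_house_value"}
assert expected_cols.issubset(df.columns), "missing expected columns"
assert df["median_house_value"].dtype.kind == "f", "target should be numeric"
```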

What assumption are we making?

We assume each row is an independent district and that the dataset is reasonably representative of California housing. These assumptions matter when we evaluate model performance.

What did we observe?

The dataset loads successfully with 20,640 rows and 10 columns — a moderate-sized dataset well suited for experimentation.

Chapter 2

First Exploratory Data Analysis

What problem triggered this step?

Once the dataset loaded, the next question became: what does this data actually look like? Before building models, we need to understand how many features exist, whether there are missing values, and what the scale and distribution of each feature looks like. Without this, model building is guesswork.

What is this concept?

Exploratory Data Analysis (EDA) is the practice of systematically examining a dataset before modeling it. It answers questions like: Are there missing values? Are features skewed? Are values capped? Are there obvious patterns?

EDA is what prevents you from building a model on misunderstood data.

How does it apply here?

The housing dataset includes variables like population and room counts that vary dramatically across districts. Understanding these distributions tells us whether to scale features, whether to engineer new ones, and how to split the dataset fairly later on.

The code — explained

df.head(5)

Prints the first five rows. Lets you visually verify columns loaded correctly and values look plausible.

df.describe()

Produces statistical summaries — mean, median, standard deviation, min, max. This is where you catch skewed distributions and capped values before they become model problems.

df.shape  # (20640, 10)

Confirms the dataset dimensions: 20,640 rows, 10 columns.

df.info()

Shows column types, non-null counts, and memory usage. Critical discovery: total_bedrooms has missing values.
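That discovery can be quantified in one line. A toy frame stands in for the real dataset here; the null mimics the gap in total_bedrooms:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset; the NaN mimics total_bedrooms
df = pd.DataFrame({
    "total_rooms":    [880.0, 7099.0, 1467.0],
    "total_bedrooms": [129.0, np.nan, 190.0],
})

missing = df.isnull().sum()    # per-column count of null entries
```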

What assumption are we making?

We assume the statistical summaries reflect the true structure of the data. But some values may be censored — and we confirmed this: median_house_value appears capped around $500,000. This is a limitation to carry forward into model evaluation.
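One quick way to confirm a censored target is to count how many rows sit exactly at the maximum. The values below are invented to illustrate the check; only the technique carries over to the real data:

```python
import pandas as pd

# Toy target values with an artificial cap, mimicking the real censoring
prices = pd.Series([120_000, 500_001, 340_000, 500_001, 87_500, 500_001])
cap = prices.max()
capped_share = (prices == cap).mean()   # fraction of rows pinned at the cap
```

A suspiciously large capped_share is the signature of a censored variable rather than a natural maximum.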

What did we observe?

  • 20,640 districts total
  • total_bedrooms has missing values — must be handled before modeling
  • Several features are highly right-skewed
  • Target variable appears capped at ~$500k

Chapter 3

Visualizing the Data

What problem triggered this step?

Statistics tell one story. Visuals tell another. Some patterns — especially geographic ones — are invisible in a table but immediately obvious in a chart. The question here was: what can we see in this data that summary stats can't show us?

What is this concept?

Data visualization uses charts and plots to reveal structure that numbers obscure — skewed distributions, geographic clusters, nonlinear relationships, and density patterns. Many of the most important ML insights come from looking at the data rather than computing it.

How does it apply here?

The housing dataset includes longitude and latitude. That means we can literally plot California's geography and see whether location correlates with price. If it does, we should see spatial clustering — and we do.

The code — explained

df.hist(bins=50, figsize=(12, 8))

Generates histograms for every numerical feature at once. Reveals skew, caps, and unusual distributions across the whole dataset in a single call.

df.plot(kind='scatter', x='longitude', y='latitude', alpha=0.1)

Each dot is a district plotted by geographic coordinates. The alpha=0.1 transparency means overlapping dots create darker regions — density becomes visible without any extra computation.

df.plot(
    kind="scatter",
    x="longitude",
    y="latitude",
    alpha=0.4,
    s=df["population"] / 100,   # dot size = population
    label="population",
    c="median_house_value",     # dot color = price
    cmap="jet",
    colorbar=True
)

Four variables in one chart. Position encodes geography. Dot size encodes population. Color encodes house price. Blue → green → yellow → red maps cheap → expensive.

Visual Property       Variable Encoded
X / Y position        Geographic coordinates (longitude, latitude)
Dot size              Population
Dot color             Median house value
Color scale (jet)     Blue = cheap → Red = expensive

What assumption are we making?

We assume geographic clustering reflects real market dynamics — proximity to the coast, urban economic activity, job density. These are reasonable assumptions about California real estate.

What did we observe?

Coastal districts are significantly more expensive. Dense urban clusters appear around Los Angeles and the Bay Area. Inland areas trend cheaper. Location is clearly a strong predictor — and this plot made it immediately obvious.

Chapter 4

Feature Engineering

What problem triggered this step?

Correlation analysis revealed something surprising: many raw features had weak correlation with house prices. total_rooms barely moved the needle. population was almost flat. This suggested the raw features weren't expressing the real signal.

The question became: can we create better features from what we already have?

What is this concept?

Feature engineering transforms raw variables into more informative representations. Often the best predictors aren't raw counts — they're ratios or densities that normalize for scale differences between districts.

This is a core systems thinking insight: the same raw number means something entirely different depending on context.

How does it apply here?

Consider two districts that both have 2,000 total rooms. One has 100 households. The other has 400. The raw count is identical — but one district is significantly more spacious. Without deriving rooms_per_household, the model can't distinguish them.
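Working the two hypothetical districts through directly:

```python
import pandas as pd

# Same raw room count, very different household counts
districts = pd.DataFrame({
    "total_rooms": [2000, 2000],
    "households":  [100, 400],
})
districts["rooms_per_household"] = (
    districts["total_rooms"] / districts["households"]
)
# 20 rooms per home vs 5 — the ratio exposes what the raw count hides
```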

The code — explained

df['rooms_per_household'] = df['total_rooms'] / df['households']

Average rooms per home. Normalizes for district size. Directly measures spaciousness rather than raw count.

df['bedrooms_per_room'] = df['total_bedrooms'] / df['total_rooms']

What fraction of all rooms are bedrooms? Lower values suggest more living space. Higher values suggest cramped conditions.

df['population_per_household'] = df['population'] / df['households']

Population density per home. Captures household crowding more meaningfully than raw population.

What could go wrong?

  • Division by zero when denominators contain zeros — and silent NaN propagation when they contain nulls
  • Ratios can amplify noise in small districts
  • Engineered features can become redundant with existing ones

All engineered features must be validated empirically — intuition alone isn't enough.
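One form that validation can take is a correlation check. The sketch below uses synthetic data deliberately constructed so the ratio carries the signal; on the real dataset the magnitudes will differ, but the comparison is the same:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
households = rng.integers(50, 500, size=200).astype(float)
ratio = rng.uniform(3, 8, size=200)            # hidden "spaciousness" signal
toy = pd.DataFrame({
    "total_rooms": households * ratio,
    "households": households,
})
# Price driven by spaciousness, not by raw room count
toy["median_house_value"] = ratio * 50_000 + rng.normal(0, 5_000, size=200)

toy["rooms_per_household"] = toy["total_rooms"] / toy["households"]
corrs = toy.corr()["median_house_value"]
```

Comparing corrs["rooms_per_household"] against corrs["total_rooms"] shows whether the engineered feature actually earns its place.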

What did we observe?

After engineering, correlation improved measurably. bedrooms_per_room in particular showed stronger correlation with median_house_value than the raw bedroom or room counts ever did. The ratio captured something the raw numbers couldn't.

Chapter 5

Stratified Sampling

What problem triggered this step?

EDA revealed that median_income has the strongest single-feature correlation with house prices. This creates a sampling problem: if we randomly split the dataset into train and test sets, we might end up with different income distributions in each. A model trained on one income distribution and evaluated on another gives a misleading picture of real-world performance.

What is this concept?

Stratified sampling ensures that important variables maintain the same proportional distribution across train and test sets. Instead of splitting purely at random, we split within predefined categories of the key variable — guaranteeing balance.

This matters because ML models generalize based on the patterns they see in training. If the training distribution doesn't match the real-world distribution, the model is learning the wrong thing.

The code — explained

df['income_cat'] = pd.cut(
    df['median_income'],
    bins=[0, 1.5, 3.0, 4.5, 6.0, np.inf],
    labels=[1, 2, 3, 4, 5]
)

Groups income into 5 categories based on the distribution observed in EDA. This is the variable we'll stratify on.

from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)

One split, 80/20 ratio, fixed seed for reproducibility. random_state=42 means anyone running this notebook gets the same split.

for train_index, test_index in split.split(df, df['income_cat']):
    strat_train_set = df.iloc[train_index]
    strat_test_set  = df.iloc[test_index]

Executes the split while preserving income category proportions in both resulting sets.
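"Preserving proportions" is directly verifiable. This end-to-end sketch uses synthetic incomes (a gamma draw chosen only so all five bins are populated); the proportion check at the end is the same one you would run on the real split:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic incomes standing in for median_income
rng = np.random.default_rng(42)
df = pd.DataFrame({"median_income": rng.gamma(shape=2.0, scale=1.9, size=1000)})
df["income_cat"] = pd.cut(df["median_income"],
                          bins=[0, 1.5, 3.0, 4.5, 6.0, np.inf],
                          labels=[1, 2, 3, 4, 5])

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(split.split(df, df["income_cat"]))

# Category proportions: full dataset vs test set
overall = df["income_cat"].value_counts(normalize=True).sort_index()
in_test = df.iloc[test_idx]["income_cat"].value_counts(normalize=True).sort_index()
gap = (overall - in_test).abs().max()   # largest per-category deviation
```

A near-zero gap confirms the split preserved the income distribution; a plain random split typically shows a noticeably larger one.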

What could go wrong?

If we stratify on the wrong variable — one that doesn't actually matter — we add complexity without benefit. The choice of stratification variable must be justified by correlation analysis, not intuition alone. We're justified here because income is the strongest predictor we found.

What assumption are we making?

We assume income distribution must remain consistent between train and test sets for evaluation to be meaningful. Given the strong correlation with the target, this assumption is clearly justified.

What did we observe?

The train and test sets now maintain nearly identical income distributions. Evaluation will reflect how the model performs across the full range of income levels — not just whichever ones happened to cluster in a random split.

This series documents a live learning process. The goal isn't a polished tutorial — it's an honest record of how ML concepts connect to real implementation decisions, one step at a time.

← Back to Learning Log