Supervised learning involves finding a final hypothesis g that approximates an unknown target function f: X → Y.

Core Data Structures

  • Input Space (X): A d-dimensional space containing the features. Inputs can be Numeric (age), Ordinal (low/medium/high), or Categorical (car brands).
  • Output Space (Y): The target values (e.g., house prices for regression or categories for classification).
  • Training Set (D): A collection of N input-output pairs: D = {(x⁽¹⁾, y⁽¹⁾), …, (x⁽ᴺ⁾, y⁽ᴺ⁾)}.
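These structures can be sketched directly in Python; the feature names and values below are purely illustrative.

```python
# A toy training set D: each element pairs an input vector x with a label y.
# Hypothetical features: [age, income]; labels: 1 = bought, 0 = did not buy.
training_set = [
    ([25, 40_000], 0),
    ([47, 85_000], 1),
    ([35, 62_000], 1),
]

X = [x for x, _ in training_set]      # samples from the input space
y = [label for _, label in training_set]  # samples from the output space
print(len(training_set), y)
```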

The Design Matrix

To process data efficiently, all input vectors from the training set are often organised into a Design Matrix.

  • Each row represents one training example x⁽ⁱ⁾.
  • Each column represents a specific independent variable (feature).
  • A “bias” column of 1s is often added as the first column to account for the intercept term w₀.
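Building a design matrix is a one-liner with NumPy; the feature values here are made up for illustration.

```python
import numpy as np

# Stack the raw input vectors row-wise, then prepend a bias column of 1s.
X_raw = np.array([
    [25.0, 40_000.0],
    [47.0, 85_000.0],
    [35.0, 62_000.0],
])

bias = np.ones((X_raw.shape[0], 1))          # column of 1s for the intercept
design_matrix = np.hstack([bias, X_raw])     # shape (N, d + 1)
print(design_matrix.shape)  # (3, 3)
```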

Logistic Regression: Hypothesis Set

Despite its name, Logistic Regression is a model for classification, specifically binary classification (y ∈ {0, 1}).
From Linear Scores to Probabilities

  1. The Score: We calculate a linear combination of the inputs: z = wᵀx.
  2. The Problem: z is unbounded (−∞ to +∞), but probabilities must lie in [0, 1].
  3. The Solution (Logit): We model the logit (log-odds) as the linear combination: ln(p / (1 − p)) = wᵀx.
  4. The Activation (Sigmoid): Solving for p gives us the Sigmoid function, which squashes the score into a probability: p = σ(z) = 1 / (1 + e⁻ᶻ).
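The score-to-probability pipeline can be sketched in a few lines; the weight and input values are hypothetical.

```python
import numpy as np

def sigmoid(z):
    """Squash an unbounded score z into a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights (first entry is the bias w0) and one input
# vector whose first entry is the bias feature 1.
w = np.array([-1.0, 0.5, 0.25])
x = np.array([1.0, 2.0, 4.0])

z = w @ x        # linear score w^T x = -1 + 1 + 1 = 1
p = sigmoid(z)   # P(y = 1 | x)
print(round(p, 3))  # 0.731
```

Note that sigmoid(0) = 0.5, which is what makes the 0.5 probability threshold coincide with the sign of the score.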

Decision Boundary

  • The model predicts Class 1 if σ(z) ≥ 0.5 (which occurs when wᵀx ≥ 0).
  • The model predicts Class 0 if σ(z) < 0.5 (which occurs when wᵀx < 0).
  • Decision Boundary: The hyperplane is defined by wᵀx = 0.
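This decision rule never needs the sigmoid at prediction time: thresholding the probability at 0.5 is equivalent to thresholding the raw score at 0. A sketch with made-up weights:

```python
import numpy as np

def predict(w, X_design):
    """Classify each row of a design matrix via the sign of the score w^T x."""
    scores = X_design @ w
    # sigmoid(score) >= 0.5 exactly when score >= 0, so comparing the raw
    # score against 0 gives the same labels as comparing p against 0.5.
    return (scores >= 0).astype(int)

w = np.array([-3.0, 1.0])          # hypothetical weights: bias -3, slope 1
X_design = np.array([[1.0, 2.0],   # score -1 -> Class 0
                     [1.0, 3.0],   # score  0 -> Class 1 (on the boundary)
                     [1.0, 5.0]])  # score  2 -> Class 1
print(predict(w, X_design))  # [0 1 1]
```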

Tip

Distance and Confidence: The larger the absolute value of the score |wᵀx|, the further the point lies from the decision boundary, and the higher the model’s confidence in its prediction.
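The geometric reading of this tip: for a score z = wᵀx + w₀, the Euclidean distance of x from the hyperplane is |z| / ‖w‖ (with the bias excluded from the norm). A minimal sketch with hypothetical weights:

```python
import numpy as np

# Distance from a point to the decision boundary w^T x + w0 = 0.
w = np.array([3.0, 4.0])   # hypothetical feature weights (no bias), ||w|| = 5
w0 = -10.0                 # hypothetical intercept term

def distance_to_boundary(x):
    score = w @ x + w0
    return abs(score) / np.linalg.norm(w)

near = distance_to_boundary(np.array([2.0, 1.0]))  # score 0: on the boundary
far = distance_to_boundary(np.array([6.0, 4.0]))   # score 24: far from it
print(near, far)  # 0.0 4.8
```

A larger |score| thus translates directly into a larger geometric margin, which is why it can be read as confidence.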