The Need for Nonlinear Transformations

Standard linear models, like basic Logistic Regression, assume that a straight line (a linear decision boundary where $\mathbf{w}^\top \mathbf{x} + b = 0$) can separate classes.
In many real-world scenarios, classes are not linearly separable. Forcing a straight line on such data leads to underfitting, where the model is too simple to capture the underlying structure. By transforming the data, we can use the power of linear algorithms to solve nonlinear problems.

Feature Mapping & Basis Functions

To solve nonlinear problems, we apply a transformation function $\phi$ to the input vector $\mathbf{x}$, producing $\phi(\mathbf{x})$. This creates a new feature space.

Polynomial Basis Expansion

A common transformation is to create polynomials of degree $d$. This expansion includes all possible terms up to that degree based on the original variables.

  • Example: Degree 2 Expansion (1 Input Variable)
    • Original: $x$
    • Transformed: $\phi(x) = (1,\ x,\ x^2)$
  • Example: Degree 2 Expansion (2 Input Variables)
    • Original: $(x_1,\ x_2)$
    • Transformed: $\phi(\mathbf{x}) = (1,\ x_1,\ x_2,\ x_1^2,\ x_1 x_2,\ x_2^2)$
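The degree-2 expansion above can be sketched in plain Python. The helper name `poly_expand` and its list-based interface are assumptions made for this sketch, not something defined in the text:

```python
from itertools import combinations_with_replacement

def poly_expand(x, degree=2):
    """Map an input vector x to all monomials of total degree <= degree,
    including the constant (bias) term 1."""
    features = [1.0]  # degree-0 term
    for d in range(1, degree + 1):
        # combinations_with_replacement enumerates each monomial exactly once
        for combo in combinations_with_replacement(range(len(x)), d):
            term = 1.0
            for i in combo:
                term *= x[i]
            features.append(term)
    return features

# Degree-2 expansion of (x1, x2) = (2, 3):
# terms 1, x1, x2, x1^2, x1*x2, x2^2
print(poly_expand([2.0, 3.0]))  # [1.0, 2.0, 3.0, 4.0, 6.0, 9.0]
```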

Tip

Key Insight: A linear decision boundary in the high-dimensional feature space translates back to a nonlinear decision boundary (like a parabola or circle) in the original input space.

Non-Polynomial Transformations

Transformations are not limited to polynomials. You can use any nonlinear function that helps separate the data, such as exponential functions, for example the Gaussian basis $\phi_j(x) = \exp\!\left(-\frac{(x - \mu_j)^2}{2s^2}\right)$.
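As one concrete non-polynomial choice, a Gaussian (exponential) basis can be sketched as follows; the centres $\mu_j$ and width $s$ are illustrative parameters chosen for this example, not values given in the text:

```python
import math

def gaussian_basis(x, centres, s=1.0):
    """One Gaussian feature per centre mu_j, plus a bias term.
    phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2))."""
    return [1.0] + [math.exp(-(x - mu) ** 2 / (2 * s ** 2)) for mu in centres]

# Feature vector for x = 0 with three centres
phi = gaussian_basis(0.0, centres=[-1.0, 0.0, 1.0])
```

The feature for the centre at 0 equals 1 (the input sits exactly on it), while the off-centre features decay smoothly with distance.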

Implementation in Logistic Regression

When using a transformation, the core logic of Logistic Regression remains the same, but every instance of $\mathbf{x}$ is replaced by $\phi(\mathbf{x})$.

Mathematical Adaptation:

  • Transformed Logit: $z = \mathbf{w}^\top \phi(\mathbf{x})$
  • Probability Model: $P(y = 1 \mid \mathbf{x}) = \sigma\!\left(\mathbf{w}^\top \phi(\mathbf{x})\right) = \dfrac{1}{1 + e^{-\mathbf{w}^\top \phi(\mathbf{x})}}$
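The probability model amounts to a dot product followed by a sigmoid. A minimal sketch, with `w` and the transformed features passed as plain lists (names chosen for this example):

```python
import math

def sigmoid(z):
    """Logistic function sigma(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, phi_x):
    """P(y = 1 | x) = sigma(w . phi(x))."""
    z = sum(wi * fi for wi, fi in zip(w, phi_x))
    return sigmoid(z)

# z = 0*1 + 1*2 + (-1)*2 = 0, so the model outputs 0.5
p = predict_proba([0.0, 1.0, -1.0], [1.0, 2.0, 2.0])
```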

Optimisation Formulas

To find the optimal weights $\mathbf{w}$, we minimise the error function $E(\mathbf{w})$ (the cross-entropy) using the transformed training set $\{(\phi(\mathbf{x}_n), y_n)\}_{n=1}^{N}$.

  • Gradient: $\nabla E(\mathbf{w}) = \sum_{n=1}^{N} \left(\sigma(\mathbf{w}^\top \phi(\mathbf{x}_n)) - y_n\right) \phi(\mathbf{x}_n)$
  • Hessian: $\mathbf{H} = \sum_{n=1}^{N} \sigma_n (1 - \sigma_n)\, \phi(\mathbf{x}_n)\, \phi(\mathbf{x}_n)^\top$, where $\sigma_n = \sigma(\mathbf{w}^\top \phi(\mathbf{x}_n))$
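The gradient sum can be sketched directly in pure Python; `Phi` here is an assumed name for the list of transformed samples $\phi(\mathbf{x}_n)$:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient(w, Phi, y):
    """Gradient of the cross-entropy error:
    sum_n (sigma(w . phi_n) - y_n) * phi_n."""
    g = [0.0] * len(w)
    for phi_n, y_n in zip(Phi, y):
        # residual (prediction minus label) scales this sample's features
        err = sigmoid(sum(wi * fi for wi, fi in zip(w, phi_n))) - y_n
        for j, fj in enumerate(phi_n):
            g[j] += err * fj
    return g

# With w = 0 the model predicts 0.5 everywhere, so residuals are +-0.5
g = gradient([0.0, 0.0], [[1.0, 2.0], [1.0, -2.0]], [1, 0])
```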

The Linearity Paradox

Is a model with a nonlinear transformation still a "linear model"?
Yes. In machine learning, the "linearity" of a model typically refers to its parameters $\mathbf{w}$, not the input variables $\mathbf{x}$.

  • The model is nonlinear in the input space (it draws curves).
  • The model is linear in the feature space (it draws a straight line in the embedding).
  • The model is linear in its parameters (weights are not squared or used in exponents).

Advantages and Risks

Advantages

  • Efficiency: We can continue using well-understood and fast linear learning algorithms.
  • Robustness: Linear models often have strong generalisation properties.

Caveats & Common Mistakes

  • Overfitting: As the polynomial degree increases, the model gains more "power" and might start fitting to noise in the training data rather than the actual trend.
  • Dimensionality Explosion: The number of features in $\phi(\mathbf{x})$ can grow very quickly as you increase the number of input variables and the degree of expansion, making calculations slower.
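The dimensionality explosion is easy to quantify: the number of monomials of total degree at most $d$ in $n$ variables (bias included) is $\binom{n + d}{d}$, a standard combinatorial fact:

```python
from math import comb

def num_poly_features(n_inputs, degree):
    """Count of monomials of total degree <= degree in n_inputs variables,
    including the bias term: C(n_inputs + degree, degree)."""
    return comb(n_inputs + degree, degree)

print(num_poly_features(2, 2))    # 6, matching the degree-2, 2-variable example
print(num_poly_features(100, 3))  # 176851 features from only 100 inputs
```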

Step-by-Step Adoption Process

  1. Define $\phi$: Choose the basis functions (e.g., quadratic expansion).
  2. Data Transformation: Apply $\phi$ to all training samples to create a new dataset of $(\phi(\mathbf{x}_n), y_n)$ pairs.
  3. Train Linear Model: Use standard algorithms (like Gradient Descent) to find the weights $\mathbf{w}$ for the transformed data.
  4. Predict: To classify a new point $\mathbf{x}$, first transform it to $\phi(\mathbf{x})$, then apply the linear model $\sigma(\mathbf{w}^\top \phi(\mathbf{x}))$.
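The four steps above can be sketched end-to-end. The toy dataset, labelling rule ($y = 1$ when $|x| > 1$, a nonlinearly separable rule in one dimension), learning rate, and iteration count are all illustrative choices for this sketch:

```python
import math

def phi(x):
    # Step 1: quadratic basis for a single input
    return [1.0, x, x * x]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Step 2: transform a toy dataset; the label depends nonlinearly on x
xs = [i / 10.0 for i in range(-30, 31)]
data = [(phi(x), 1.0 if abs(x) > 1.0 else 0.0) for x in xs]

# Step 3: train with plain (batch) gradient descent
w = [0.0, 0.0, 0.0]
lr = 0.1
for _ in range(5000):
    g = [0.0, 0.0, 0.0]
    for f, y in data:
        err = sigmoid(sum(wi * fi for wi, fi in zip(w, f))) - y
        for j in range(3):
            g[j] += err * f[j]
    for j in range(3):
        w[j] -= lr * g[j] / len(data)

# Step 4: predict a new point by transforming it first
def predict(x):
    return sigmoid(sum(wi * fi for wi, fi in zip(w, phi(x))))
```

A straight line in $(1, x, x^2)$-space becomes the two-sided boundary $|x| \approx 1$ in the original space: points far from zero on either side get high probability.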