27 Neural Language Models and Transformers

Representing Language for Neural Networks

Neural networks cannot interpret raw text; they require numerical inputs.

Indexing: Assigning a unique integer to every word in a vocabulary.
One-hot Encoding: Representing a word as a sparse vector with a $1$ at its index and $0$ elsewhere.
Learned Embedding (e.g., Word2Vec): Mapping each index to a fixed-size, dense vector, trained so that proximity in the vector space reflects similarity in meaning.
- Semantic Relationships: Words with related meanings (e.g., “Woman” and “Man”) lie close together, and analogous word pairs are separated by similar vector offsets.
- Syntactic Relationships: Grammatical patterns (e.g., “Small” to “Smallest”) are captured through vector offsets.
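
The three representations above can be sketched in a few lines. This is a minimal illustration, not a trained model: the four-word vocabulary, the embedding size of 3, and every number in `embedding_table` are made-up assumptions.

```python
# Toy vocabulary; the words and the embedding size are illustrative assumptions.
vocab = ["man", "woman", "small", "smallest"]

# Indexing: map each word to a unique integer.
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Sparse vector with a 1 at the word's index and 0 elsewhere."""
    vec = [0] * len(vocab)
    vec[word_to_index[word]] = 1
    return vec

# Embedding table: one dense row per word. The values are invented here;
# in practice they are learned (for example by Word2Vec).
embedding_table = [
    [0.9, 0.1, 0.0],  # man
    [0.8, 0.2, 0.0],  # woman
    [0.0, 0.1, 0.7],  # small
    [0.0, 0.1, 0.9],  # smallest
]

def embed(word):
    """Look up the dense vector for a word by its index."""
    return embedding_table[word_to_index[word]]
```

Multiplying a one-hot vector by the embedding table selects exactly the row that `embed` returns, which is why embedding layers are usually implemented as a plain index lookup.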

Probabilistic Language Modelling

The Markov Assumption

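Predicting the next word $w_{n}$ from the entire history $w_{1} \dots w_{n - 1}$ is impractical: the number of possible histories grows exponentially with length, so most of them never appear in the training data. The Markov assumption simplifies this by conditioning only on the most recent words.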

The N-gram Formula

$$P(w_{n} \mid w_{1:n-1}) \approx P(w_{n} \mid w_{n-N+1:n-1})$$

Bigrams (N=2): Predicts the next word based on only one previous word.
Trigrams (N=3): Predicts based on two previous words.
Sequence Probability: Under a bigram model, the probability of a whole sequence is the product of these conditional probabilities: $P(w_{1:n}) \approx \prod_{k=1}^{n} P(w_{k} \mid w_{k-1})$, where $w_{0}$ is a start-of-sequence token.
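
As a rough sketch of how a bigram model is estimated and used, the snippet below counts word pairs in a tiny made-up corpus and multiplies the resulting conditional probabilities. The corpus, the `<s>` start token, and the function names are assumptions for illustration; real models also need smoothing for unseen pairs.

```python
from collections import Counter

# Tiny illustrative corpus; "<s>" is an assumed start-of-sentence token.
corpus = [
    ["<s>", "i", "am", "a", "student"],
    ["<s>", "i", "am", "a", "teacher"],
    ["<s>", "you", "are", "a", "student"],
]

bigram_counts = Counter()
context_counts = Counter()
for sentence in corpus:
    for prev, curr in zip(sentence, sentence[1:]):
        bigram_counts[(prev, curr)] += 1
        context_counts[prev] += 1

def bigram_prob(prev, curr):
    """Maximum-likelihood estimate of P(curr | prev); 0.0 if prev was never seen."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, curr)] / context_counts[prev]

def sequence_prob(words):
    """Product of bigram probabilities over the sequence, starting from "<s>"."""
    prob = 1.0
    prev = "<s>"
    for word in words:
        prob *= bigram_prob(prev, word)
        prev = word
    return prob
```

For example, `sequence_prob(["i", "am", "a", "student"])` multiplies $P(\text{i} \mid \text{<s>})$, $P(\text{am} \mid \text{i})$, $P(\text{a} \mid \text{am})$ and $P(\text{student} \mid \text{a})$. Any sequence containing an unseen pair gets probability zero, which is the sparsity problem that motivates smoothing and, later, neural models.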

The Shift to Transformers

Traditional architectures like Convolutional Neural Networks (CNNs) struggle with language because dependencies (like a subject and its verb) can be separated by an arbitrary number of words.

Note

If asked why Transformers replaced RNNs or CNNs for NLP, focus on the “long-range dependency” problem. CNNs have a fixed receptive field, and RNNs must carry information step by step through every intermediate word, whereas Transformers can relate any two words directly regardless of distance.

The Transformer Architecture

The Transformer is a conditional generative model consisting of two main stacks:

Encoder: Processes the input sequence into a Latent Code.
Decoder: Uses the latent code to generate an output sequence (e.g., translating “Je suis étudiant” to “I am a student”).

Key Components

Positional Encoding: Since Transformers process all tokens simultaneously (rather than one-by-one), they need positional encodings to understand word order.
Multi-Layer Perceptron (MLP): Standard feed-forward layers that process each token independently after attention.
Latent Code: The compressed representation of the input passed from the encoder to the decoder.
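
The notes do not say which positional encoding is used. One common choice is the fixed sinusoidal scheme from the original Transformer paper, where even dimensions use a sine and odd dimensions a cosine of the position at different frequencies. A minimal sketch, assuming that scheme and an even model dimension (the function and parameter names are illustrative):

```python
import math

def positional_encoding(num_positions, d_model):
    """Return a num_positions x d_model table of sinusoidal encodings (d_model assumed even)."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(0, d_model, 2):
            # Each pair of dimensions shares one frequency; higher i means a slower wave.
            angle = pos / (10000 ** (i / d_model))
            row.append(math.sin(angle))  # even dimension
            row.append(math.cos(angle))  # odd dimension
        table.append(row)
    return table
```

Each row is added to the embedding of the token at that position, so two identical words at different positions enter the model with different vectors.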

The Attention Mechanism

Attention allows the model to “focus” on specific parts of the input when producing an output.

The Query, Key, Value (Q, K, V) Intuition

Think of it like a Python Dictionary or a JSON File:

Query (Q): What you are looking for (e.g., “Date of birth”).
Key (K): The labels in the database (e.g., “Name”, “Address”, “Date of birth”).
Value (V): The actual information associated with the key (e.g., “May 5th 2000”).

Self-Attention Formula

The model calculates a weight (similarity) between each Query and all Keys, then uses those weights to compute a weighted sum of the Values.

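$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q K^{T}}{\sqrt{d_{k}}}\right) V$$

$Q K^{T}$ : Computes the similarity score between queries and keys.
$\sqrt{d_{k}}$ : A scaling factor, where $d_{k}$ is the dimension of the key vectors. It keeps the dot products from growing large, which would saturate the softmax and leave very small gradients.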
Softmax: Normalises the scores into a probability distribution that sums to 1.
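
The computation can be sketched in plain Python, dividing the scores by $\sqrt{d_k}$ as in the standard scaled dot-product formulation. This is an illustrative, unoptimised version that treats `Q`, `K` and `V` as lists of row vectors; real implementations use batched matrix operations.

```python
import math

def softmax(scores):
    """Turn a list of scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for lists of row vectors."""
    d_k = len(K[0])
    outputs = []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        outputs.append(out)
    return outputs
```

When a query matches two keys equally, the output is the average of their values; when it matches one key far more strongly than the rest, the output is close to that key's value, which is the "soft dictionary lookup" behaviour described above.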

Multi-Head Attention

Instead of one attention calculation, the model runs multiple “heads” in parallel.

Intuition: Each head can learn different types of relationships (e.g., one head for grammar, another for context).
Process: The outputs ( $z_{i}$ ) of all heads are concatenated and projected through a learned weight matrix $W^{O}$ to produce the final result.
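
Written out, the process looks as follows. The per-head projection matrices $W_{i}^{Q}$, $W_{i}^{K}$, $W_{i}^{V}$ and the head count $h$ are notation from the standard formulation rather than from these notes:

$$z_{i} = \text{Attention}(Q W_{i}^{Q}, K W_{i}^{K}, V W_{i}^{V}), \quad i = 1, \dots, h$$

$$\text{MultiHead}(Q, K, V) = \text{Concat}(z_{1}, \dots, z_{h}) \, W^{O}$$

Each head typically works in a smaller subspace (commonly of size $d_{\text{model}} / h$), so running $h$ heads costs roughly the same as one full-size attention.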

Autoregressive Generative Models

These models treat data (text, audio or images) as a sequence and factorise the joint probability distribution.

Factorisation Formula

$$p_{\theta}(x) = \prod_{i=1}^{n} p_{\theta}(x_{i} \mid x_{1}, \dots, x_{i-1})$$
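
As a small worked case, for a sequence of length $n = 3$ the factorisation is just the chain rule of probability, with no approximation involved:

$$p_{\theta}(x_{1}, x_{2}, x_{3}) = p_{\theta}(x_{1}) \, p_{\theta}(x_{2} \mid x_{1}) \, p_{\theta}(x_{3} \mid x_{1}, x_{2})$$

Unlike the N-gram model, each factor conditions on the full history; the model's job is to represent those conditionals compactly.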

Examples

WaveNet: A generative model for raw audio that predicts the next audio sample based on previous ones using causal convolutions.
PixelRNN: Generates images pixel-by-pixel, where each pixel’s value depends on all previously generated pixels.
Transformer Decoder: Uses Masked Attention to ensure that while generating word $i$ , it can only “see” words $1 \dots i - 1$ .
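
Masked attention can be sketched by setting the scores of future positions to $-\infty$ before the softmax, so their weights become exactly zero. The sketch below uses zero-based indices and assumes the usual setup in which the decoder input is shifted by one position, so row `i` attends to inputs `0..i` and is used to predict the token that follows them. The function name is illustrative.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax over a square score matrix with future positions masked out."""
    n = len(scores)
    weights = []
    for i in range(n):
        # Position i may attend to positions 0..i; later ones get -inf.
        masked = [scores[i][j] if j <= i else float("-inf") for j in range(n)]
        peak = max(masked)  # always finite, because position i itself is unmasked
        exps = [math.exp(s - peak) for s in masked]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights
```

The resulting weight matrix is lower triangular, which lets training process a whole sequence in parallel while still respecting the left-to-right factorisation.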

Ayush Acharjya's Notes
