26 Autoregressive Models & Language Modelling

Autoregressive (AR) Models

Autoregressive models treat complex, high-dimensional data (like images or audio) as a sequence of variables. Instead of modelling the joint distribution $p(x)$ directly, we factorise it into a product of conditional probabilities using the chain rule of probability.

The Factorisation Formula

For a sequence $x = (x_{1}, x_{2}, \dots, x_{n})$, the joint probability is

$$p(x) = \prod_{i=1}^{n} p(x_{i} \mid x_{1}, \dots, x_{i-1})$$
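
As a quick worked case, taking $n = 3$ in the factorisation above gives the first expansion below. Taking logarithms turns the product into a sum; this log form is a standard identity rather than something stated in these notes, and it is the quantity usually summed during training:

$$p(x_{1}, x_{2}, x_{3}) = p(x_{1})\, p(x_{2} \mid x_{1})\, p(x_{3} \mid x_{1}, x_{2})$$

$$\log p(x) = \sum_{i=1}^{n} \log p(x_{i} \mid x_{1}, \dots, x_{i-1})$$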

Core Characteristics

Training: A neural network with parameters $θ$ is trained to predict the distribution of the next element $x_{i}$ given all previous elements $x_{1}, \dots, x_{i - 1}$.
Generation: Elements are sampled one by one. Each new sample is appended to the input to predict the next one.
Pros/Cons: These models give exact, tractable likelihoods, but sampling is slow because elements must be generated sequentially, one at a time.
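
The generation loop and the exact likelihood computation described above can be sketched in a few lines. This is a minimal illustration, not a real model: `next_dist` is a hypothetical stand-in for the trained network, and its toy rule (stopping becomes more likely as the prefix grows) and the three-token vocabulary are assumptions made only for this example.

```python
import math
import random

def next_dist(prefix):
    """Hypothetical stand-in for a trained network.

    Returns p(x_i | x_1, ..., x_{i-1}) as a dict over a toy vocabulary.
    Toy rule (an assumption): the longer the prefix, the likelier "<eos>".
    """
    p_stop = min(0.9, 0.1 + 0.2 * len(prefix))
    p_other = (1.0 - p_stop) / 2
    return {"a": p_other, "b": p_other, "<eos>": p_stop}

def sample_sequence(rng, max_len=10):
    """Generate one element at a time, feeding each sample back as input."""
    seq = []
    for _ in range(max_len):
        dist = next_dist(seq)
        tokens = list(dist)
        token = rng.choices(tokens, weights=[dist[t] for t in tokens])[0]
        seq.append(token)
        if token == "<eos>":
            break
    return seq

def log_prob(seq):
    """Exact log-likelihood: sum of log conditionals (chain rule)."""
    return sum(math.log(next_dist(seq[:i])[tok]) for i, tok in enumerate(seq))

rng = random.Random(0)
print(sample_sequence(rng))
# Equals log(0.45 * 0.35 * 0.5) under the toy rule above.
print(log_prob(["a", "b", "<eos>"]))
```

Note that `log_prob` needs only one pass over a known sequence, while `sample_sequence` cannot produce element $i$ before element $i - 1$ exists, which is the source of the slow sampling.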

Example Architecture

WaveNet: A fully convolutional architecture for raw audio generation.

PixelRNN/PixelCNN: Models that generate images pixel by pixel.

Language Modelling (LM)

A Language Model is a system that predicts upcoming words in a sequence. It can assign a probability to a specific word or an entire body of text.

The Estimation Problem

Mathematically, the goal is to compute $P(W) = P(w_{1}, w_{2}, \dots, w_{n})$.

The Naive Approach: We could try to “count and divide” (Maximum Likelihood Estimation) using a massive corpus.
The Problem: Language is creative. Most sentences are unique and will never appear in a training set, meaning we cannot get accurate counts for entire sentences (the sparsity or zeros problem).

N-gram Models & The Markov Assumption

To solve the sparsity problem, we use the Markov Assumption: the probability of a word is approximated by looking only at a short history of $N - 1$ preceding words.

N-gram Types

Unigram: Predicts words based on their individual frequency, ignoring all context.
Bigram $(N = 2)$ : Approximates the probability of a word given only the single preceding word.

$$P(w_{n} \mid w_{1:n-1}) \approx P(w_{n} \mid w_{n-1})$$

N-gram: Looks back $N - 1$ words:

$$P(w_{n} \mid w_{1:n-1}) \approx P(w_{n} \mid w_{n-N+1:n-1})$$
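
For instance, setting $N = 3$ (a trigram model) in the general approximation conditions each word on the two words before it:

$$P(w_{n} \mid w_{1:n-1}) \approx P(w_{n} \mid w_{n-2}, w_{n-1})$$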

Estimating Probabilities (MLE)

For a bigram, the probability is estimated by counting the occurrences of the pair in a corpus and dividing by the count of the first word.

$$P(w_{i} \mid w_{i-1}) = \frac{C(w_{i-1}, w_{i})}{C(w_{i-1})}$$
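
The count-and-divide estimate can be sketched as follows. The three-sentence corpus is a toy example, and the `<s>` / `</s>` boundary markers and the function names are assumptions added for illustration; the notes do not specify how sentence boundaries are handled.

```python
from collections import Counter

def train_bigram(sentences):
    """Collect C(w_{i-1}) and C(w_{i-1}, w_i) from whitespace-tokenised text."""
    context_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        # Assumed convention: pad each sentence with boundary markers.
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        # Count each token as a history word; "</s>" never precedes anything.
        context_counts.update(tokens[:-1])
        bigram_counts.update(zip(tokens, tokens[1:]))
    return context_counts, bigram_counts

def bigram_prob(prev, word, context_counts, bigram_counts):
    """MLE estimate P(word | prev) = C(prev, word) / C(prev)."""
    if context_counts[prev] == 0:
        return 0.0  # unseen history: no estimate available
    return bigram_counts[(prev, word)] / context_counts[prev]

corpus = [
    "I am Sam",
    "Sam I am",
    "I do not like green eggs and ham",
]
contexts, bigrams = train_bigram(corpus)
print(bigram_prob("<s>", "I", contexts, bigrams))    # 2 of 3 sentences start with "I"
print(bigram_prob("I", "am", contexts, bigrams))     # "I" occurs 3 times, "I am" twice
print(bigram_prob("am", "green", contexts, bigrams)) # unseen pair gives 0.0
```

The last call shows the zeros problem in miniature: any pair absent from the corpus gets probability 0, even if it is perfectly plausible.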

Evaluation: Perplexity

Perplexity (PP) measures how well a language model predicts a test set. It is the inverse probability of the test set, normalised by taking the $N$-th root, where $N$ is the number of words in the test set (not the N-gram order).

Formula

$$PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_{i} \mid w_{1} \dots w_{i-1})}}$$

Intuition

Lower perplexity means the model is less surprised by the test data and thus has a better “grasp” of the language.
Minimising perplexity is mathematically equivalent to maximising the probability of the test sequence under the model.
Typical Values: For the Wall Street Journal corpus, perplexity drops significantly as $N$ increases (Unigram: 962 $\to$ Bigram: 170 $\to$ Trigram: 109).
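
A minimal sketch of the perplexity computation, assuming the per-word conditional probabilities have already been obtained from some model. It works in log space, using the identity that the $N$-th root of an inverse product equals the exponential of the negative mean log-probability, which avoids numerical underflow on long texts. The function name and the input lists are illustrative only.

```python
import math

def perplexity(word_probs):
    """Perplexity from per-word conditional probabilities P(w_i | history).

    Computes exp(-(1/N) * sum(log p_i)), which is algebraically the same
    as the N-th root of the product of 1/p_i.
    """
    n = len(word_probs)
    log_sum = sum(math.log(p) for p in word_probs)
    return math.exp(-log_sum / n)

# A model that assigns probability 0.1 to every word is as uncertain as a
# uniform choice among 10 words, so its perplexity is approximately 10.
print(perplexity([0.1] * 10))

# Higher probabilities on the observed words give lower perplexity.
print(perplexity([0.5, 0.125]))  # approximately 4
```

This also makes the intuition concrete: perplexity can be read as the effective number of equally likely choices the model faces at each word.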

Generating Text: Sampling Strategies

Once a model is trained, we use various algorithms to select the next word from the predicted distribution $P(w \mid \text{history})$.

Greedy Search: Always choose the word with the highest probability. It is fast but can lead to repetitive or sub-optimal sequences.
Beam Search: Keeps track of the top $k$ most likely sequences (hypotheses) at each step. This often finds better global sequences than greedy search.
Sampling with Temperature $(T)$ : Adjusts the softmax distribution to control “creativity”.
- High $T$ makes the distribution flatter (more diverse/random).
- Low $T$ makes it “peakier” (more confident/conservative).
Top-K Sampling: Only samples from the $K$ most likely next words.
Top-P (Nucleus) Sampling: Samples from the smallest set of words whose cumulative probability exceeds a threshold $P$.
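
Several of the strategies above can be sketched over a single next-word distribution. This is a simplified illustration using only the standard library: the function names are hypothetical, words are represented by their indices, and the top-p filter stops once the cumulative probability reaches or exceeds the threshold. Beam search is omitted because it needs a model to score whole sequences.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities; temperature rescales the scores."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(probs):
    """Index of the single most probable word."""
    return max(range(len(probs)), key=probs.__getitem__)

def top_k_filter(probs, k):
    """Keep the k most probable words and renormalise."""
    keep = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def top_p_filter(probs, p):
    """Keep the smallest high-probability set whose mass reaches p."""
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def sample(dist, rng):
    """Draw one index from a {index: probability} mapping."""
    indices = list(dist)
    return rng.choices(indices, weights=[dist[i] for i in indices])[0]

probs = [0.5, 0.3, 0.15, 0.05]
rng = random.Random(0)
print(greedy(probs))                          # index of the 0.5 entry
print(sample(top_k_filter(probs, 2), rng))    # only index 0 or 1 is possible
print(sample(top_p_filter(probs, 0.7), rng))  # nucleus here is also {0, 1}
print(softmax_with_temperature([2.0, 1.0, 0.0], temperature=2.0))
```

The difference between the two filters is that top-k always keeps a fixed number of candidates, while top-p keeps fewer candidates when the model is confident and more when the distribution is flat.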

Neural Language Models

Traditional N-grams suffer from two main flaws:

Long-distance dependencies: They cannot remember context from several sentences ago.
Similarity: They don’t understand that “ate breakfast” and “ate lunch” are semantically similar.

Neural Language Models use Word Embeddings (vectors in a continuous space) to represent word meanings. Because similar words get nearby vectors, these models can generalise across synonyms and related words, and architectures like RNNs, LSTMs, and Transformers let them handle much longer contexts.
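
To illustrate how a continuous vector space captures similarity, the sketch below compares words by cosine similarity. The three-dimensional vectors are made up by hand purely for illustration; real embeddings are learned from data and typically have hundreds of dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical, hand-written embeddings (not from any trained model).
embeddings = {
    "breakfast": [0.9, 0.8, 0.1],
    "lunch": [0.85, 0.75, 0.2],
    "car": [0.1, 0.0, 0.95],
}

meal_similarity = cosine_similarity(embeddings["breakfast"], embeddings["lunch"])
unrelated_similarity = cosine_similarity(embeddings["breakfast"], embeddings["car"])
print(meal_similarity > unrelated_similarity)  # True for these toy vectors
```

An N-gram model treats “breakfast” and “lunch” as unrelated symbols, whereas a model working on vectors like these can transfer what it learned about one word to the other.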

Ayush Acharjya's Notes
