Autoregressive (AR) Models
Autoregressive models interpret complex, high-dimensional data (like images or audio) as a sequence of variables. Instead of modelling the entire distribution directly, we factorise it into a product of conditional probabilities.
The Factorisation Formula
For a sequence $x = (x_1, x_2, \dots, x_T)$, the joint probability is
$$p(x) = \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1})$$
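As a small worked instance of the chain rule, assuming a sequence of just three variables $x_1, x_2, x_3$ (the length 3 is chosen only for illustration), the factorisation expands to:

$$p(x_1, x_2, x_3) = p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_1, x_2)$$

Each factor conditions only on elements that come earlier in the sequence, which is what allows the elements to be generated one after another.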
Core Characteristics
- Training: A neural network with parameters $\theta$ is trained to predict the distribution of the next element given all previous elements, $p_\theta(x_t \mid x_{<t})$.
- Generation: Elements are sampled one by one. Each new sample is appended to the input to predict the next one.
- Pros/Cons: These models give exact likelihoods (the product of the conditionals), but sampling is slow because it must be done sequentially.
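The two operations above, exact likelihood and sequential sampling, can be sketched with a toy model. This is a minimal illustration, not a real architecture: the lookup table `COND`, the vocabulary, and the function names are all hypothetical, and the table stands in for a neural network that would normally condition on the whole prefix.

```python
import math
import random

# Hypothetical toy "model": a lookup table standing in for a neural network.
# COND[prev] gives p(x_t | previous element); a real AR model would condition
# on the entire prefix x_{<t}.
VOCAB = ["a", "b", "<eos>"]
START = "<s>"
COND = {
    "<s>": {"a": 0.5, "b": 0.4, "<eos>": 0.1},
    "a": {"a": 0.1, "b": 0.6, "<eos>": 0.3},
    "b": {"a": 0.5, "b": 0.1, "<eos>": 0.4},
}

def next_dist(prefix):
    """Return p(x_t | x_{<t}) as a dict; this toy only looks at the last element."""
    prev = prefix[-1] if prefix else START
    return COND[prev]

def log_prob(seq):
    """Exact log-likelihood: the sum of log conditionals (chain rule)."""
    total = 0.0
    for t, x in enumerate(seq):
        total += math.log(next_dist(seq[:t])[x])
    return total

def sample(rng, max_len=20):
    """Ancestral sampling: draw one element at a time and feed it back in."""
    seq = []
    for _ in range(max_len):
        dist = next_dist(seq)
        x = rng.choices(list(dist), weights=list(dist.values()))[0]
        seq.append(x)
        if x == "<eos>":
            break
    return seq

print(math.exp(log_prob(["a", "b", "<eos>"])))  # 0.5 * 0.6 * 0.4
print(sample(random.Random(0)))
```

Note that `log_prob` can score all positions of a known sequence independently, whereas `sample` cannot start step t until step t-1 has produced its output, which is the source of the slow generation.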
Example Architecture
- WaveNet: A fully convolutional architecture for raw audio generation.
- PixelRNN/PixelCNN: Models that generate images pixel by pixel.
Language Modeling (LM)
A Language Model is a system that predicts upcoming words in a sequence. It can assign a probability to a specific word or an entire body of text.
The Estimation Problem
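Mathematically, the goal is to compute $P(w_1, w_2, \dots, w_n)$, the probability of a word sequence, or $P(w_n \mid w_1, \dots, w_{n-1})$, the probability of the next word given its history.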
- The Naive Approach: We could try to “count and divide” (Maximum Likelihood Estimation) using a massive corpus.
- The Problem: Language is creative. Most sentences are unique and will never appear in a training set, meaning we cannot get accurate counts for entire sentences (the sparsity or zeros problem).
N-gram Models & The Markov Assumption
To solve the sparsity problem, we use the Markov Assumption: the probability of a word is approximated by looking only at a short history of preceding words.
N-gram Types
- Unigram: Predicts words based on their individual frequency, ignoring all context.
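- Bigram ($N = 2$): Approximates the probability of a word given only the single preceding word: $P(w_n \mid w_{1:n-1}) \approx P(w_n \mid w_{n-1})$.
- N-gram: Looks back $N-1$ words: $P(w_n \mid w_{1:n-1}) \approx P(w_n \mid w_{n-N+1:n-1})$.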
Estimating Probabilities (MLE)
For a bigram, the probability is estimated by counting the occurrences of the pair in a corpus and dividing by the count of the first word:
$$P(w_n \mid w_{n-1}) = \frac{C(w_{n-1}\, w_n)}{C(w_{n-1})}$$
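The count-and-divide estimate for bigrams can be sketched in a few lines. This is a minimal illustration under some assumptions: the three-sentence corpus is a toy example, the `<s>` and `</s>` boundary markers are a common convention rather than something these notes specify, and the function names are hypothetical.

```python
from collections import Counter

def train_bigram(sentences):
    """Return (unigram_counts, bigram_counts), padding each sentence with <s> and </s>."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def bigram_prob(prev, word, unigrams, bigrams):
    """MLE estimate P(word | prev) = C(prev word) / C(prev)."""
    if unigrams[prev] == 0:
        return 0.0  # unseen history: no estimate is available
    return bigrams[(prev, word)] / unigrams[prev]

# Toy corpus, chosen only for illustration.
corpus = ["I am Sam", "Sam I am", "I do not like green eggs and ham"]
unigrams, bigrams = train_bigram(corpus)

print(bigram_prob("<s>", "I", unigrams, bigrams))   # 2/3
print(bigram_prob("I", "am", unigrams, bigrams))    # 2/3
print(bigram_prob("am", "Sam", unigrams, bigrams))  # 1/2
print(bigram_prob("am", "ham", unigrams, bigrams))  # 0.0, pair never observed
```

The last line shows the zeros problem at the bigram level: any pair absent from the corpus gets probability 0 under plain MLE.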
Evaluation: Perplexity
Perplexity (PP) measures how well a language model predicts a test set. It is the inverse probability of the test set, normalised by the number of words.
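Formula
For a test set $W = w_1 w_2 \dots w_N$ of $N$ words (here $N$ is the test-set length, not the n-gram order):
$$PP(W) = P(w_1 w_2 \dots w_N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(w_1 w_2 \dots w_N)}}$$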
Intuition
- Lower perplexity means the model is less surprised by the test data and thus has a better “grasp” of the language.
- Minimising perplexity is mathematically equivalent to maximising the probability of the test sequence.
- Typical Values: For the Wall Street Journal corpus, perplexity drops significantly as the n-gram order $N$ increases (Unigram: 962, Bigram: 170, Trigram: 109).
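To make the definition concrete, here is a minimal sketch that computes perplexity from the per-word probabilities a model assigns to a test set. The function name and the input probabilities are made up for illustration. The computation is done in log space, a common choice because multiplying many small probabilities directly can underflow.

```python
import math

def perplexity(word_probs):
    """Perplexity of a test set given the model's probability for each word.

    Implements PP = (p_1 * p_2 * ... * p_N) ** (-1 / N) via a sum of logs.
    """
    n = len(word_probs)
    log_sum = sum(math.log(p) for p in word_probs)
    return math.exp(-log_sum / n)

# A model that assigns 1/10 to every word has perplexity of about 10:
# it is as uncertain as a uniform choice among 10 words.
print(perplexity([0.1] * 5))

# Higher per-word probabilities give lower perplexity.
print(perplexity([0.5, 0.25, 0.5, 0.25]))
```

Because the exponent is negative, any change that raises the probability of the test sequence lowers the perplexity, which is the equivalence noted above.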
Generating Text: Sampling Strategies
Once a model is trained, we use various algorithms to select the next word from the predicted distribution $P(w_n \mid w_{1:n-1})$.
- Greedy Search: Always chooses the single most probable word. It is fast but can lead to repetitive or sub-optimal sequences.
- Beam Search: Keeps track of the top $B$ most likely sequences (hypotheses) at each step, where $B$ is the beam width. This often finds sequences with higher overall probability than greedy search.
- Sampling with Temperature ($T$): Rescales the logits by $1/T$ before the softmax to control “creativity”.
  - High $T$ makes the distribution flatter (more diverse/random).
  - Low $T$ makes it “peakier” (more confident/conservative).
- Top-K Sampling: Only samples from the $K$ most likely next words.
- Top-P (Nucleus) Sampling: Samples from the smallest set of words whose cumulative probability exceeds a threshold $p$.
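The strategies above can be sketched in plain Python. This is a minimal illustration under stated assumptions: all function names are hypothetical, the logits and the two-step toy model are invented for the example, and the top-p filter stops as soon as the cumulative probability reaches the threshold.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; the logits are divided by the temperature first."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(probs):
    """Index of the single most likely token."""
    return max(range(len(probs)), key=lambda i: probs[i])

def _renormalise(probs, keep):
    """Zero out everything outside `keep` and rescale the rest to sum to 1."""
    kept = set(keep)
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]

def top_k_filter(probs, k):
    """Keep only the k most likely tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return _renormalise(probs, order[:k])

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return _renormalise(probs, keep)

def sample(probs, rng=random):
    """Draw one token index according to probs."""
    return rng.choices(range(len(probs)), weights=probs)[0]

def beam_search(next_log_probs, beam_width, steps):
    """Keep the beam_width highest-scoring prefixes after each step."""
    beams = [((), 0.0)]  # (prefix, cumulative log-probability)
    for _ in range(steps):
        candidates = []
        for prefix, score in beams:
            for token, lp in next_log_probs(prefix).items():
                candidates.append((prefix + (token,), score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams

def toy_next_log_probs(prefix):
    """Hypothetical two-step model where the greedy first choice is not the best overall."""
    table = {
        (): {"A": 0.5, "B": 0.4, "C": 0.1},
        ("A",): {"A": 0.4, "B": 0.3, "C": 0.3},
        ("B",): {"A": 0.9, "B": 0.05, "C": 0.05},
        ("C",): {"A": 0.4, "B": 0.3, "C": 0.3},
    }
    return {tok: math.log(p) for tok, p in table[prefix].items()}

logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a 4-word vocabulary
print(softmax(logits, temperature=0.5))  # peakier
print(softmax(logits, temperature=2.0))  # flatter
print(beam_search(toy_next_log_probs, 1, 2)[0][0])  # beam width 1 behaves like greedy
print(beam_search(toy_next_log_probs, 2, 2)[0][0])
```

In the toy model, a beam width of 1 commits to "A" first and ends with a sequence of probability 0.2, while a beam width of 2 keeps "B" alive and finds the sequence ("B", "A") with probability 0.36.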
Neural Language Models
Traditional N-grams suffer from two main flaws:
- Long-distance dependencies: They cannot remember context from several sentences ago.
- Similarity: They don’t understand that “ate breakfast” and “ate lunch” are semantically similar.
Neural Language Models use Word Embeddings (vectors in a continuous space) to represent meanings. This allows them to model synonyms effectively and handle much longer context through architectures like RNNs, LSTMs, and Transformers.
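The idea that nearby vectors represent similar meanings can be illustrated with cosine similarity. The embeddings below are made up by hand purely for illustration; real models learn vectors with far more dimensions from data.

```python
import math

# Hand-made 3-dimensional vectors, for illustration only (not learned embeddings).
EMBEDDINGS = {
    "breakfast": [0.9, 0.8, 0.1],
    "lunch": [0.8, 0.9, 0.2],
    "bicycle": [0.1, 0.2, 0.9],
}

def cosine(u, v):
    """Cosine similarity: 1.0 for vectors pointing the same way, near 0 for unrelated ones."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine(EMBEDDINGS["breakfast"], EMBEDDINGS["lunch"]))    # close to 1
print(cosine(EMBEDDINGS["breakfast"], EMBEDDINGS["bicycle"]))  # much lower
```

An n-gram model treats "breakfast" and "lunch" as unrelated symbols, whereas a model working on vectors like these can transfer what it learned about "ate breakfast" to "ate lunch".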