Autoregressive (AR) Models

Autoregressive models interpret complex, high-dimensional data (like images or audio) as a sequence of variables. Instead of modelling the entire distribution directly, we factorise it into a product of conditional probabilities.

The Factorisation Formula

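For a sequence of variables $x_1, \dots, x_n$, the joint probability is

$$p(x_1, \dots, x_n) = \prod_{i=1}^{n} p(x_i \mid x_1, \dots, x_{i-1})$$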

Core Characteristics

  • Training: A neural network with parameters $\theta$ is trained to predict the distribution of the next element given all previous elements, $p_\theta(x_i \mid x_1, \dots, x_{i-1})$.
  • Generation: Elements are sampled one by one. Each new sample is appended to the input to predict the next one.
  • Pros/Cons: These models can compute the exact probability of a sequence, but sampling is slow because it must be done sequentially.
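The training and generation steps above can be sketched in a few lines. This is a minimal illustration, not a real architecture: `prob_one` is a hypothetical hand-written rule standing in for the neural network, and the sequences are binary so that the exact likelihood is easy to check by hand.

```python
import math
import random

def prob_one(prefix):
    """Hypothetical stand-in for a trained network: p(x_i = 1 | x_<i)."""
    if not prefix:
        return 0.5
    # Assumed toy rule: tend to repeat the previous bit.
    return 0.8 if prefix[-1] == 1 else 0.2

def log_likelihood(x):
    """Exact log p(x) via the chain rule: sum_i log p(x_i | x_<i)."""
    total = 0.0
    for i, xi in enumerate(x):
        p1 = prob_one(x[:i])
        total += math.log(p1 if xi == 1 else 1.0 - p1)
    return total

def sample(n, rng):
    """Sequential sampling: each draw is appended and conditions the next."""
    x = []
    for _ in range(n):
        x.append(1 if rng.random() < prob_one(x) else 0)
    return x

rng = random.Random(0)
x = sample(8, rng)
print(x, math.exp(log_likelihood(x)))
```

Note that `log_likelihood` needs one pass over a known sequence, while `sample` cannot produce element $i$ until elements $1, \dots, i-1$ exist, which is the sequential bottleneck mentioned above.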

Example Architectures

  • WaveNet: A fully convolutional architecture for raw audio generation.
  • PixelRNN/PixelCNN: Models that generate images pixel by pixel.

Language Modelling (LM)

A Language Model is a system that predicts upcoming words in a sequence. It can assign a probability to a specific word or an entire body of text.

The Estimation Problem

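Mathematically, the goal is to compute the probability of a word sequence, $P(w_1, w_2, \dots, w_n)$, or of the next word given its history, $P(w_n \mid w_1, \dots, w_{n-1})$.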

  • The Naive Approach: We could try to “count and divide” (Maximum Likelihood Estimation) using a massive corpus.
  • The Problem: Language is creative. Most sentences are unique and will never appear in a training set, meaning we cannot get accurate counts for entire sentences (the sparsity or zeros problem).

N-gram Models & The Markov Assumption

To mitigate the sparsity problem, we use the Markov Assumption: the probability of a word is approximated by conditioning only on a short history of preceding words.

N-gram Types

  • Unigram: Predicts words based on their individual frequency, ignoring all context.
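  • Bigram ($N = 2$): Approximates the probability of a word given only the single preceding word, $P(w_i \mid w_1, \dots, w_{i-1}) \approx P(w_i \mid w_{i-1})$.
  • N-gram: Looks back $N-1$ words: $P(w_i \mid w_1, \dots, w_{i-1}) \approx P(w_i \mid w_{i-N+1}, \dots, w_{i-1})$.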
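As a small worked illustration, take the sentence "the cat sat". The sentence and the start-of-sentence marker $\langle s \rangle$ are assumptions chosen for this example. The exact chain rule conditions each word on everything before it, while a bigram model keeps only the previous word:

$$
\begin{aligned}
P(\text{the cat sat}) &= P(\text{the} \mid \langle s \rangle)\, P(\text{cat} \mid \langle s \rangle\ \text{the})\, P(\text{sat} \mid \langle s \rangle\ \text{the cat}) \\
&\approx P(\text{the} \mid \langle s \rangle)\, P(\text{cat} \mid \text{the})\, P(\text{sat} \mid \text{cat})
\end{aligned}
$$

Each factor in the second line involves only a pair of words, which is far more likely to have been observed in a corpus than the full prefix.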

Estimating Probabilities (MLE)

For a bigram, the probability is estimated by counting the occurrences of the pair in a corpus and dividing by the count of the first word:

$$P(w_i \mid w_{i-1}) = \frac{C(w_{i-1}\, w_i)}{C(w_{i-1})}$$
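The count-and-divide estimate can be sketched with the standard library. The three-sentence corpus and the `<s>` / `</s>` boundary markers below are assumptions made for illustration, not data from these notes.

```python
from collections import Counter

# Toy corpus (an assumption); <s> and </s> mark sentence boundaries.
corpus = [
    "<s> i am sam </s>",
    "<s> sam i am </s>",
    "<s> i do not like green eggs and ham </s>",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    # Pair each token with its successor inside the same sentence only.
    bigram_counts.update(zip(tokens, tokens[1:]))

def bigram_prob(prev, word):
    """MLE estimate P(word | prev) = C(prev word) / C(prev)."""
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_prob("<s>", "i"))     # 2 of the 3 sentences start with "i"
print(bigram_prob("i", "am"))      # "i" occurs 3 times, followed by "am" twice
print(bigram_prob("sam", "eggs"))  # unseen pair, so the estimate is zero
```

The last call shows the zeros problem in miniature: any pair absent from the corpus gets probability zero, even if it is perfectly plausible.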

Evaluation: Perplexity

Perplexity (PP) measures how well a language model predicts a test set. It is the inverse probability of the test set, normalised by the number of words.

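Formula

For a test set $W = w_1 w_2 \dots w_n$ of $n$ words:

$$PP(W) = P(w_1 w_2 \dots w_n)^{-\frac{1}{n}} = \sqrt[n]{\frac{1}{P(w_1 w_2 \dots w_n)}}$$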

Intuition

  • Lower perplexity means the model is less surprised by the test data and thus has a better “grasp” of the language.
  • Minimising perplexity is mathematically equivalent to maximising the probability the model assigns to the test sequence.
  • Typical Values: For the Wall Street Journal corpus, perplexity drops significantly as the N-gram order $N$ increases (unigram: 962, bigram: 170, trigram: 109).
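In practice perplexity is usually computed in log space, since multiplying many small probabilities underflows. The sketch below assumes the per-word conditional probabilities have already been obtained from some model; the function name and the example values are illustrative only.

```python
import math

def perplexity(word_probs):
    """Perplexity from per-word probabilities P(w_i | history).

    Uses PP = exp(-(1/n) * sum_i log P(w_i | history)), which equals
    P(w_1 ... w_n) ** (-1/n) but avoids numerical underflow.
    """
    n = len(word_probs)
    if n == 0:
        raise ValueError("need at least one word")
    if any(p <= 0.0 for p in word_probs):
        # A single zero-probability word makes the perplexity infinite.
        return math.inf
    log_prob = sum(math.log(p) for p in word_probs)
    return math.exp(-log_prob / n)

# A model that gives every word probability 1/10 has perplexity of about 10.
print(perplexity([0.1] * 5))
# Mixed probabilities: the result is the inverse geometric mean.
print(perplexity([0.5, 0.25, 0.5, 0.25]))
```

One way to read the first example: a perplexity of 10 means the model is, on average, as uncertain as if it were choosing uniformly among 10 words at each step.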

Generating Text: Sampling Strategies

Once a model is trained, we use various algorithms to select the next word from the predicted distribution $P(w_i \mid w_1, \dots, w_{i-1})$.

  1. Greedy Search: Always choose the word with the highest probability. It is fast but can lead to repetitive or sub-optimal sequences.
  2. Beam Search: Keeps track of the $B$ most likely partial sequences (hypotheses) at each step, where $B$ is the beam width. This often finds higher-probability sequences than greedy search.
  3. Sampling with Temperature ($T$): Divides the logits by $T$ before the softmax to control “creativity”.
    • High $T$ makes the distribution flatter (more diverse/random).
    • Low $T$ makes it “peakier” (more confident/conservative).
  4. Top-K Sampling: Samples only from the $K$ most likely next words.
  5. Top-P (Nucleus) Sampling: Samples from the smallest set of words whose cumulative probability exceeds a threshold $p$.
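Greedy selection, temperature scaling, top-K and top-P can be sketched over a plain list of logits. This is a minimal illustration rather than any library's implementation: the function names and the example logits are assumptions, and the top-P cut-off uses "reaches or exceeds" the threshold, a detail on which implementations differ. Beam search is omitted because it needs a model that scores whole sequences.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; temperature must be positive."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def _renormalise(probs, keep):
    """Zero every index outside `keep` and rescale the rest to sum to 1."""
    kept = set(keep)
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]

def top_k_filter(probs, k):
    """Keep only the k most likely words."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return _renormalise(probs, order[:k])

def top_p_filter(probs, p):
    """Keep the smallest set of words whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return _renormalise(probs, keep)

def greedy(probs):
    """Index of the single most likely word."""
    return max(range(len(probs)), key=lambda i: probs[i])

def sample(probs, rng):
    """Draw one index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for a 4-word vocabulary
rng = random.Random(0)
print(greedy(softmax(logits)))
print(softmax(logits, temperature=2.0))  # flatter than temperature=1.0
print(sample(top_k_filter(softmax(logits), 2), rng))
print(sample(top_p_filter(softmax(logits), 0.9), rng))
```

The filters return a renormalised distribution rather than a sampled word, so they can be combined, for example temperature scaling followed by top-P.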

Neural Language Models

Traditional N-grams suffer from two main flaws:

  • Long-distance dependencies: They cannot remember context from several sentences ago.
  • Similarity: They don’t understand that “ate breakfast” and “ate lunch” are semantically similar.

Neural Language Models use Word Embeddings (vectors in a continuous space) to represent word meanings. This allows them to treat similar words similarly and to handle much longer contexts through architectures like RNNs, LSTMs, and Transformers.
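To make the similarity point concrete, the sketch below compares words by the cosine of the angle between their vectors. The three-dimensional vectors are hand-picked toy values, not learned embeddings; real embeddings have hundreds of dimensions and are learned from data.

```python
import math

# Hand-picked toy vectors (assumptions, NOT learned embeddings), chosen so
# that the two meal words point in a similar direction.
embeddings = {
    "breakfast": [0.9, 0.8, 0.1],
    "lunch": [0.8, 0.9, 0.2],
    "carburettor": [0.1, 0.0, 0.9],
}

def cosine(u, v):
    """Cosine similarity: 1 for the same direction, 0 for orthogonal vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine(embeddings["breakfast"], embeddings["lunch"]))        # close to 1
print(cosine(embeddings["breakfast"], embeddings["carburettor"]))  # much lower
```

An N-gram model sees "breakfast" and "lunch" as unrelated symbols, so counts for one tell it nothing about the other. With vectors like these, a model that has seen "ate breakfast" can assign a sensible probability to "ate lunch" as well.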