Representing Language for Neural Networks

Neural networks cannot interpret raw text; they require numerical inputs. Common ways to turn words into numbers:

  • Indexing: Assigning a unique integer to every word in a vocabulary.
  • One-hot Encoding: Representing a word as a sparse vector with a 1 at its index and 0 elsewhere.
  • Learned Embedding (Word2Vec): Transforming indexes into fixed-size, dense vectors where spatial proximity represents meaning.
    • Semantic Relationships: Words with similar meanings (e.g., “Woman” and “Man”) occupy similar relative positions.
    • Syntactic Relationships: Grammatical patterns (e.g., “Small” to “Smallest”) are captured through vector offsets.
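
The three representations above can be sketched in a few lines. This is a minimal illustration, not a trained model: the vocabulary, the embedding size of 3, and the names `word_to_index`, `one_hot` and `embed` are all assumptions made for the example, and the embedding rows are random placeholders that training (e.g. Word2Vec) would normally adjust.

```python
import random

# Toy vocabulary; the words and the embedding size are illustrative assumptions.
vocab = ["the", "cat", "sat", "on", "mat"]

# Indexing: each word gets a unique integer.
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Sparse vector with a 1 at the word's index and 0 elsewhere."""
    vec = [0] * len(vocab)
    vec[word_to_index[word]] = 1
    return vec

# Learned embedding: a table with one dense row per word. The rows here are
# random placeholders; training would move similar words closer together.
random.seed(0)
embedding_dim = 3
embedding_table = [[random.uniform(-1, 1) for _ in range(embedding_dim)]
                   for _ in vocab]

def embed(word):
    """Look up the dense vector for a word by its index."""
    return embedding_table[word_to_index[word]]

print(word_to_index["cat"])   # 1
print(one_hot("cat"))         # [0, 1, 0, 0, 0]
print(len(embed("cat")))      # 3
```

Note that the one-hot vector grows with the vocabulary, while the embedding stays at a fixed size regardless of how many words there are.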

Probabilistic Language Modelling

The Markov Assumption

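Predicting the next word from the entire history of preceding words $w_1, \dots, w_{t-1}$ is computationally difficult: the number of possible histories grows exponentially with their length, and most of them never appear in the training data. The Markov assumption simplifies this by looking only at the recent past.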

The N-gram Formula

$$P(w_t \mid w_1, \dots, w_{t-1}) \approx P(w_t \mid w_{t-N+1}, \dots, w_{t-1})$$

  • Bigrams (N=2): Predicts the next word based on only one previous word.
  • Trigrams (N=3): Predicts based on two previous words.
  • Sequence Probability: The total probability of a sequence is the product of these conditional probabilities: $P(w_1, \dots, w_T) \approx \prod_{t=1}^{T} P(w_t \mid w_{t-N+1}, \dots, w_{t-1})$.
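
As a rough sketch of the bigram case, the conditional probabilities can be estimated by counting adjacent word pairs. The corpus below is a made-up toy example, the function names are hypothetical, and the estimate is plain maximum likelihood with no smoothing, so unseen pairs get probability 0.

```python
from collections import Counter

# Tiny illustrative corpus (an assumption, not data from these notes).
corpus = "the cat sat on the mat the cat ran".split()

# Count how often each word appears as a context, and how often each
# adjacent pair (previous word, next word) occurs.
context_counts = Counter(corpus[:-1])
bigram_counts = Counter(zip(corpus[:-1], corpus[1:]))

def bigram_prob(prev, word):
    """Maximum-likelihood estimate of P(word | prev)."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / context_counts[prev]

def sequence_prob(words):
    """Product of bigram conditionals; the first word's own probability is omitted."""
    prob = 1.0
    for prev, word in zip(words[:-1], words[1:]):
        prob *= bigram_prob(prev, word)
    return prob

print(bigram_prob("the", "cat"))             # 2 of the 3 "the" contexts are followed by "cat"
print(sequence_prob(["the", "cat", "sat"]))  # P(cat | the) * P(sat | cat)
```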

The Shift to Transformers

Traditional architectures like Convolutional Neural Networks (CNNs) struggle with language because dependencies (like a subject and its verb) can be separated by an arbitrary number of words.

Note

If asked why Transformers replaced RNNs or CNNs for NLP, focus on the “long-range dependency” problem. CNNs have a fixed receptive field, whereas Transformers can relate any two words regardless of distance.

The Transformer Architecture

The Transformer is a conditional generative model consisting of two main stacks:

  • Encoder: Processes the input sequence into a Latent Code.
  • Decoder: Uses the latent code to generate an output sequence (e.g., translating “Je suis étudiant” to “I am a student”).

Key Components

  • Positional Encoding: Since Transformers process all tokens simultaneously (rather than one-by-one), they need positional encodings to understand word order.
  • Multi-Layer Perceptron (MLP): Standard feed-forward layers that process each token independently after attention.
  • Latent Code: The compressed representation of the input passed from the encoder to the decoder.
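
To make the positional encoding component concrete, here is a minimal sketch of one common choice, the fixed sinusoidal encoding. These notes do not say which scheme is intended, so the sinusoidal form, the constant 10000 and the function name are assumptions; learned position embeddings are another option.

```python
import math

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a seq_len x d_model table; even dims use sine, odd dims cosine."""
    table = []
    for pos in range(seq_len):
        row = []
        for dim in range(d_model):
            # Each sine/cosine pair shares one frequency.
            pair_index = dim // 2
            angle = pos / (10000 ** (2 * pair_index / d_model))
            row.append(math.sin(angle) if dim % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

pe = sinusoidal_positional_encoding(seq_len=4, d_model=6)
print(pe[0])  # position 0: every sine is 0.0 and every cosine is 1.0
```

Each row is added to the embedding of the token at that position, so two identical words at different positions end up with different input vectors.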

The Attention Mechanism

Attention allows the model to “focus” on specific parts of the input when producing an output.

The Query, Key, Value (Q, K, V) Intuition

Think of it as a lookup in a Python dictionary or a JSON file:

  1. Query (Q): What you are looking for (e.g., “Date of birth”).
  2. Key (K): The labels in the database (e.g., “Name”, “Address”, “Date of birth”).
  3. Value (V): The actual information associated with the key (e.g., “May 5th 2000”).
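
The analogy can be sketched by contrasting an exact dictionary lookup with a "soft" lookup that blends all values according to how well each key matches the query. The record contents (apart from the date), the vectors and the name `soft_lookup` are illustrative assumptions.

```python
import math

# Hard lookup: the query must match a key exactly. The field values other than
# the date are made-up placeholders.
record = {"Name": "A. Student", "Address": "1 Example Street", "Date of birth": "May 5th 2000"}
print(record["Date of birth"])  # May 5th 2000

def soft_lookup(query, keys, values):
    """Weight every value by softmax(query . key) instead of exact matching."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    max_score = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    result = sum(w * v for w, v in zip(weights, values))
    return result, weights

# The query points in the same direction as the first key, so the first
# value dominates the blended result.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [100.0, 200.0]
result, weights = soft_lookup([10.0, 0.0], keys, values)
print(weights)
print(result)
```

Unlike the dictionary, the soft lookup never fails on a missing key; it always returns a weighted mixture of the values.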

Self-Attention Formula

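The model calculates a weight (similarity) between the Query and all Keys, then uses those weights to form a weighted sum of the Values:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

  • $QK^\top$: Computes the similarity score between queries and keys.
  • $\sqrt{d_k}$: A scaling factor (the square root of the key dimension) that keeps the dot products from growing with $d_k$; large scores would saturate the softmax and make its gradients vanishingly small.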
  • Softmax: Normalises the scores into a probability distribution that sums to 1.
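
The steps above (similarity scores, scaling, softmax, weighted sum) can be sketched in plain Python. This is a simplified illustration: matrices are lists of rows, the toy input `X` is made up, and the learned projections that normally produce Q, K and V from the input are omitted, so Q, K and V are all set to `X` directly. A real implementation would use a tensor library.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Q: n x d_k, K: m x d_k, V: m x d_v, all as lists of rows."""
    d_k = len(K[0])
    output, all_weights = [], []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        output.append(out)
        all_weights.append(weights)
    return output, all_weights

# Self-attention on three toy 2-d token vectors: Q, K and V all come from X.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, weights = scaled_dot_product_attention(X, X, X)
for row in weights:
    print([round(w, 3) for w in row])  # each row sums to 1
```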

Multi-Head Attention

Instead of one attention calculation, the model runs multiple “heads” in parallel.

  • Intuition: Each head can learn different types of relationships (e.g., one head for grammar, another for context).
  • Process: The outputs ($\text{head}_1, \dots, \text{head}_h$) of all heads are concatenated and projected through a learned weight matrix $W^O$ to produce the final result.
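
A minimal sketch of this process, assuming two heads over a model dimension of 4. All weight matrices are random stand-ins for learned parameters, and the helper names (`matmul`, `attention`, `random_matrix`, `multi_head_attention`) are hypothetical; the point is only the shape of the computation: project per head, attend, concatenate, project again.

```python
import math
import random

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def attention(Q, K, V):
    """Scaled dot-product attention for one head."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(x * y for x, y in zip(q, k)) / math.sqrt(d_k) for k in K]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out

def random_matrix(rows, cols, rng):
    """Random placeholder for a learned weight matrix."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def multi_head_attention(X, heads, W_o):
    """heads: list of (W_q, W_k, W_v) triples, one per head."""
    head_outputs = [attention(matmul(X, W_q), matmul(X, W_k), matmul(X, W_v))
                    for W_q, W_k, W_v in heads]
    # Concatenate the heads along the feature dimension, token by token.
    concat = [sum((h[i] for h in head_outputs), []) for i in range(len(X))]
    # Final projection through the output matrix.
    return matmul(concat, W_o)

rng = random.Random(0)
d_model, num_heads = 4, 2
d_head = d_model // num_heads
X = random_matrix(3, d_model, rng)  # 3 tokens
heads = [tuple(random_matrix(d_model, d_head, rng) for _ in range(3))
         for _ in range(num_heads)]
W_o = random_matrix(num_heads * d_head, d_model, rng)
Y = multi_head_attention(X, heads, W_o)
print(len(Y), len(Y[0]))  # 3 4
```

Splitting the model dimension across heads keeps the total cost close to that of a single full-width attention while letting each head use its own projections.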

Autoregressive Generative Models

These models treat data (text, audio or images) as a sequence and factorise the joint probability distribution.

Factorisation Formula

$$p(x_1, \dots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1})$$

Examples

  • WaveNet: A generative model for raw audio that predicts the next audio sample based on previous ones using causal convolutions.
  • PixelRNN: Generates images pixel-by-pixel, where each pixel’s value depends on all previously generated pixels.
  • Transformer Decoder: Uses Masked Attention to ensure that while generating word $w_t$, it can only “see” words $w_1, \dots, w_{t-1}$.
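
Masked attention can be sketched as follows, assuming the usual approach of setting the scores for future positions to negative infinity before the softmax. The function name and the all-zero toy scores are illustrative. Here position i holds an already generated token and is used to predict the next one, so it may attend to positions up to and including itself but never beyond.

```python
import math

def causal_attention_weights(scores):
    """scores[i][j] is the raw score of position i attending to position j."""
    weights = []
    for i, row in enumerate(scores):
        # Future positions (j > i) get -inf, so softmax gives them weight 0.
        masked = [s if j <= i else -math.inf for j, s in enumerate(row)]
        m = max(masked)
        exps = [math.exp(s - m) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# With equal raw scores, row i spreads its weight evenly over positions 0..i
# and puts exactly zero weight on every later position.
scores = [[0.0] * 4 for _ in range(4)]
causal_weights = causal_attention_weights(scores)
for row in causal_weights:
    print([round(w, 2) for w in row])
```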