Here's an expanded map with more details, equations, and extensions:
Neural Network Architectures
- Transformer Architecture
- Self-Attention Mechanism
- Scaled Dot-Product Attention: Attention(Q, K, V) = softmax(QKᵀ / √d_k)V (sketched in code after this block)
- Multi-Head Attention: MultiHead(Q, K, V) = Concat(head_1, ..., head_h)W^O, where head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)
- Positional Encoding
- Sinusoidal Positional Encoding: PE_(pos, 2i) = sin(pos/10000^(2i/d_model)), PE_(pos, 2i+1) = cos(pos/10000^(2i/d_model))
- Learned Positional Encoding
- Layer Normalization: LN(x) = (x - μ) / √(σ^2 + ε) * γ + β
- Residual Connections: h(x) = F(x) + x
- Feed-Forward Networks: FFN(x) = max(0, xW_1 + b_1)W_2 + b_2
- Gated Linear Units (GLU): GLU(x) = (xW_1 + b_1) ⊙ σ(xW_2 + b_2)
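
To make the formulas above concrete, here is a minimal NumPy sketch of single-head scaled dot-product attention and sinusoidal positional encoding. It assumes an even d_model and omits batching, masking, and the multi-head projections.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_q, n_k) similarity logits
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # weighted sum of value vectors

def sinusoidal_positional_encoding(n_pos, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...); d_model assumed even."""
    pos = np.arange(n_pos)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((n_pos, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Tiny usage example: self-attention (Q = K = V) over a length-4 sequence.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8)) + sinusoidal_positional_encoding(4, 8)
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # (4, 8)
```
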
- Variants and Extensions
- BERT (Bidirectional Encoder Representations from Transformers)
- Masked Language Modeling (MLM): p(x_i|x_{\i}) = softmax(h_iW + b), where h_i is the encoder output at the masked position
- Next Sentence Prediction (NSP)
- GPT (Generative Pre-trained Transformer)
- Causal Language Modeling: p(x_t|x_1, ..., x_{t-1}) = softmax(h_tW + b)
- T5 (Text-to-Text Transfer Transformer)
- Encoder-Decoder Architecture
- Unified Text-to-Text Format
- XLNet (Generalized Autoregressive Pretraining)
- Permutation Language Modeling: max_θ E_z~Z_T [Σ_t log p(x_z(t)|x_z(1), ..., x_z(t-1))]
- ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately)
- Replaced Token Detection: D(x, i) = σ(wᵀh_i + b), predicting whether token i was replaced by the generator
- Longformer (Long-Document Transformer)
- Attention with Linear Complexity
- Sliding Window Attention
- Global Attention
- Reformer (Efficient Transformer)
- Locality-Sensitive Hashing (LSH) Attention
- Reversible Residual Layers
- Sparse Transformers
- Sparse Attention Patterns
- Factorized Self-Attention: each head attends over a restricted index set S_i (e.g., strided and fixed patterns), reducing full attention's O(n^2) cost to roughly O(n·√n) (see the sketch after this block)
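
A minimal sketch of the strided factorized attention pattern: one head attends to a recent local window, the other to every stride-th earlier position, so together they cover the full causal context. The exact mask layout (sliding window rather than block-local) is an illustrative assumption, not the production kernel.

```python
import numpy as np

def strided_sparse_masks(n, stride):
    """Boolean (n, n) masks for two factorized attention heads."""
    i = np.arange(n)[:, None]   # query positions
    j = np.arange(n)[None, :]   # key positions
    causal = j <= i
    local = causal & (i - j < stride)            # head A: recent local window
    strided = causal & ((i - j) % stride == 0)   # head B: every stride-th earlier token
    return local, strided

local, strided = strided_sparse_masks(n=16, stride=4)
print(local.sum(), strided.sum())  # far fewer allowed pairs than the full 16 * 16
```
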
Optimization Techniques
- Gradient Descent
- Stochastic Gradient Descent (SGD): θ_{t+1} = θ_t - η * ∇J(θ_t)
- Mini-Batch Gradient Descent
- Momentum: v_t = γv_{t-1} + η∇J(θ), θ_{t+1} = θ_t - v_t
- Nesterov Accelerated Gradient (NAG): v_t = γv_{t-1} + η∇J(θ - γv_{t-1}), θ_{t+1} = θ_t - v_t
- Adaptive Optimization Methods
- AdaGrad: θ_{t+1, i} = θ_{t, i} - (η / √(G_{t, ii} + ε)) * g_{t, i}
- RMSprop: E[g^2]_t = βE[g^2]_{t-1} + (1 - β)g_t^2, θ_{t+1} = θ_t - (η / √(E[g^2]_t + ε)) * g_t
- Adam: m_t = β_1 * m_{t-1} + (1 - β_1) * g_t, v_t = β_2 * v_{t-1} + (1 - β_2) * g_t^2, m̂_t = m_t / (1 - β_1^t), v̂_t = v_t / (1 - β_2^t), θ_t = θ_{t-1} - α * m̂_t / (√v̂_t + ε)
- AdamW (Adam with Decoupled Weight Decay): same moment updates and bias correction as Adam, with θ_t = θ_{t-1} - α * (m̂_t / (√v̂_t + ε) + λθ_{t-1}) (a single-step sketch follows this list)
- Learning Rate Scheduling
- Step Decay: η_t = η_0 * γ^⌊t/s⌋
- Exponential Decay: η_t = η_0 * γ^t
- 1/t Decay: η_t = η_0 / (1 + kt)
- Cosine Annealing: η_t = η_min + (1/2)(η_max - η_min)(1 + cos(tπ/T))
- Linear Warmup: η_t = η_0 * min(1, t/T_w)
- Inverse Square Root Schedule: η_t = η_0 * (1/√t)
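
A minimal NumPy sketch of one AdamW step (with bias correction and decoupled weight decay) plus a cosine schedule with linear warmup, matching the update rules above. The hyperparameters and single-tensor state are simplifying assumptions.

```python
import numpy as np

def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    m = beta1 * m + (1 - beta1) * grad            # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2       # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                  # bias correction
    v_hat = v / (1 - beta2 ** t)
    # Decoupled weight decay: applied directly to the weights, not folded into the gradient.
    theta = theta - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * theta)
    return theta, m, v

def cosine_lr_with_warmup(step, total_steps, warmup_steps, lr_max, lr_min=0.0):
    if step < warmup_steps:
        return lr_max * step / max(1, warmup_steps)                   # linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + np.cos(np.pi * progress))

# Usage: one step on the toy loss L(theta) = 0.5 * ||theta||^2, whose gradient is theta.
theta, m, v = np.ones(3), np.zeros(3), np.zeros(3)
theta, m, v = adamw_step(theta, grad=theta.copy(), m=m, v=v, t=1,
                         lr=cosine_lr_with_warmup(1, 1000, 100, 1e-3))
print(theta)
```
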
Regularization Techniques
- Dropout
- Standard Dropout: h_i^(l+1) = r_i * h_i^l, where r_i ~ Bernoulli(p)
- Gaussian Dropout: h_i^(l+1) = (1 + ε_i) * h_i^l, where ε_i ~ N(0, σ^2)
- Variational Dropout
- Weight Decay (L2 Regularization): L(θ) = L_0(θ) + (λ/2)||θ||_2^2
- Label Smoothing: q'(k|x) = (1 - ε) * δ_{k,y} + ε / K
- Early Stopping
- Mixup: x' = λx_i + (1 - λ)x_j, y' = λy_i + (1 - λ)y_j, where λ ~ Beta(α, α)
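
A minimal sketch of mixup as defined above: convex combinations of a batch with a shuffled copy of itself, using a Beta-distributed mixing coefficient. Pairing examples via a batch permutation is a common implementation choice, not the only one.

```python
import numpy as np

def mixup_batch(x, y_onehot, alpha=0.2, rng=None):
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)              # lambda ~ Beta(alpha, alpha)
    perm = rng.permutation(len(x))            # pair each example with another in the batch
    x_mix = lam * x + (1 - lam) * x[perm]
    y_mix = lam * y_onehot + (1 - lam) * y_onehot[perm]
    return x_mix, y_mix

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))
y = np.eye(3)[rng.integers(0, 3, size=8)]     # one-hot labels for 3 classes
x_mix, y_mix = mixup_batch(x, y, rng=rng)
print(x_mix.shape, y_mix.sum(axis=1))         # mixed labels still sum to 1
```
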
Loss Functions
- Cross-Entropy Loss: L(y, ŷ) = -Σ_{i=1}^{n} y_i * log(ŷ_i)
- Focal Loss: FL(p_t) = -α_t(1 - p_t)^γ * log(p_t)
- Contrastive Loss
- InfoNCE: L_N = -E_{(x, y) ~ D}[log(exp(f(x)ᵀg(y)) / Σ_{y' ~ D}exp(f(x)ᵀg(y')))]
- NT-Xent (Normalized Temperature-Scaled Cross-Entropy)
- Knowledge Distillation Loss: L_KD = α * H(y, p_student) + (1 - α) * T^2 * KL(p_teacher^(T) || p_student^(T)), where p^(T) is the softmax at temperature T (sketched below)
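
A minimal NumPy sketch of the distillation loss above: hard-label cross-entropy on the student plus temperature-scaled KL divergence between teacher and student distributions; the T^2 factor keeps gradient magnitudes comparable across temperatures. The α and T values are illustrative.

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, y_onehot, alpha=0.5, T=2.0):
    p_s = softmax(student_logits)                       # student at temperature 1
    hard = -(y_onehot * np.log(p_s + 1e-12)).sum(-1)    # cross-entropy with hard labels
    p_t_T = softmax(teacher_logits, T)                  # softened teacher distribution
    p_s_T = softmax(student_logits, T)                  # softened student distribution
    kl = (p_t_T * (np.log(p_t_T + 1e-12) - np.log(p_s_T + 1e-12))).sum(-1)
    return (alpha * hard + (1 - alpha) * (T ** 2) * kl).mean()

rng = np.random.default_rng(0)
loss = distillation_loss(rng.normal(size=(4, 10)), rng.normal(size=(4, 10)),
                         np.eye(10)[[1, 3, 5, 7]])
print(loss)
```
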
Evaluation Metrics
- Perplexity: PP(q) = 2^(-Σ_{x∈X} p(x) * log_2 q(x)); in practice computed as the exponential of the average per-token negative log-likelihood (see the sketch after this list)
- BLEU Score: BLEU = BP * exp(Σ_{n=1}^{N} w_n * log(p_n))
- ROUGE Score
- ROUGE-N: ROUGE-N = (Σ_{S∈{ReferenceSummaries}} Σ_{gram_n∈S} Count_match(gram_n)) / (Σ_{S∈{ReferenceSummaries}} Σ_{gram_n∈S} Count(gram_n))
- ROUGE-L: ROUGE-L = (2 * P * R) / (P + R)
- Exact Match and F1 Score
- METEOR: METEOR = F_mean * (1 - Penalty)
- chrF (Character n-gram F-score)
- TER (Translation Edit Rate)
- BERT Score
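
A minimal sketch of corpus perplexity computed from per-token model probabilities; using the natural log with exp is equivalent to the base-2 formulation above.

```python
import numpy as np

def perplexity(token_probs):
    """token_probs: the probability the model assigned to each observed token."""
    token_probs = np.asarray(token_probs)
    nll = -np.log(token_probs)          # per-token negative log-likelihood
    return float(np.exp(nll.mean()))    # exponential of the mean NLL

print(perplexity([0.25, 0.5, 0.1, 0.05]))  # higher token probabilities -> lower perplexity
```
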
Tokenization and Subword Units
- Byte Pair Encoding (BPE), with a merge-learning sketch after this list
- WordPiece
- SentencePiece
- Unigram Language Model
- Subword Regularization
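
A minimal sketch of learning BPE merges over a toy word-frequency table: repeatedly count adjacent symbol pairs and merge the most frequent one. Real tokenizers add byte-level fallback, special tokens, and much faster counting.

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge every standalone occurrence of the pair into a single symbol."""
    bigram = re.escape(" ".join(pair))
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Toy corpus: words stored as space-separated symbols with an end-of-word marker.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}
merges = []
for _ in range(10):
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)   # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    merges.append(best)
print(merges[:3])  # e.g. [('e', 's'), ('es', 't'), ('est', '</w>')]
```
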
Embedding Techniques
- Word Embeddings
- Word2Vec
- Skip-Gram: max_θ Σ_{t=1}^{T} Σ_{-c≤j≤c, j≠0} log p(w_{t+j}|w_t) (negative-sampling sketch after this list)
- Continuous Bag-of-Words (CBOW): max_θ Σ_{t=1}^{T} log p(w_t|w_{t-c}, ..., w_{t+c})
- GloVe: min_{W, \tilde{W}, b, \tilde{b}} Σ_{i,j=1}^{V} f(X_{ij})(w_i^ᵀ\tilde{w}_j + b_i + \tilde{b}_j - log X_{ij})^2
- FastText
- Contextual Embeddings
- ELMo (Embeddings from Language Models)
- BERT (Bidirectional Encoder Representations from Transformers)
- RoBERTa (Robustly Optimized BERT Pretraining Approach)
- XLNet (Generalized Autoregressive Pretraining)
- ALBERT (A Lite BERT)
- Character Embeddings
- CharCNN
- Flair Embeddings
- Sentence Embeddings
- Sent2Vec
- InferSent
- Universal Sentence Encoder
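
A minimal NumPy sketch of the skip-gram objective with negative sampling for a single (center, context) pair. The vocabulary size, embedding dimension, and uniform negative sampler are toy assumptions (word2vec draws negatives from a smoothed unigram distribution).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def skipgram_neg_sampling_loss(V_in, U_out, center, context, negatives):
    v = V_in[center]                                            # center-word vector
    pos = np.log(sigmoid(U_out[context] @ v) + 1e-12)           # observed context word
    neg = np.log(sigmoid(-U_out[negatives] @ v) + 1e-12).sum()  # k sampled negatives
    return -(pos + neg)                                         # loss to minimize

rng = np.random.default_rng(0)
vocab_size, dim = 50, 16
V_in = rng.normal(scale=0.1, size=(vocab_size, dim))    # input (center) embeddings
U_out = rng.normal(scale=0.1, size=(vocab_size, dim))   # output (context) embeddings
loss = skipgram_neg_sampling_loss(V_in, U_out, center=3, context=7,
                                  negatives=rng.integers(0, vocab_size, size=5))
print(loss)
```
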
Attention Variants
- Sparse Attention
- Relative Position Representations (see the sketch after this list)
- Locality-Sensitive Hashing (LSH) Attention
- Reformer (Efficient Transformer)
- Linformer (Self-Attention with Linear Complexity)
- Longformer (Long-Document Transformer)
- Big Bird (Transformers for Longer Sequences)
- Sinkhorn Attention
- Synthesizer (Rethinking Self-Attention for Transformer Models)
- Routing Transformer
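
A minimal sketch of relative position information added to the attention logits as a learned per-offset bias (a T5-style variant of the relative position representations listed above). The clipping distance and random bias table are illustrative assumptions.

```python
import numpy as np

def relative_position_bias(n, bias_table, max_distance):
    """Return an (n, n) bias matrix indexed by clipped relative offset j - i."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    offsets = np.clip(j - i, -max_distance, max_distance)   # clipped relative offsets
    return bias_table[offsets + max_distance]                # shift into table indices

max_distance = 4
bias_table = np.random.default_rng(0).normal(size=2 * max_distance + 1)
logits = np.zeros((6, 6)) + relative_position_bias(6, bias_table, max_distance)
print(logits.shape)  # this bias is added to QK^T / sqrt(d_k) before the softmax
```
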
Mixture of Experts (MoE) Variants
- Switch Transformer
- Routing Function: r(x) = softmax(W_r * x)
- Top-1 Expert Selection: y = r_{i*}(x) * E_{i*}(x), where i* = argmax_i r_i(x) (routing sketch after this list)
- GShard (Scaling Giant Models with Conditional Computation and Automatic Sharding)
- Top-2 Gating: G(x) = softmax(W_g * x), with each token dispatched to its two highest-scoring experts
- Auxiliary Load-Balancing Loss: L_aux = Σ_{e=1}^{E} f_e * g_e, where f_e is the fraction of tokens dispatched to expert e and g_e its mean gating probability
- BASE Layers (Balanced Assignment of Sparse Experts)
- Balanced Routing: token-to-expert assignment solved as a linear assignment problem so each expert receives an equal share of tokens
- Expert Computation: E_i(x) = FFN_i(x)
- Hash Layers
- Hash-based Routing: h(x) = hash(x) % N
- Expert Computation: E_i(x) = FFN_i(x) if i = h(x), else 0
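
A minimal NumPy sketch of Switch-style top-1 routing with the auxiliary load-balancing loss α * N * Σ_i f_i * P_i, where f_i is the fraction of tokens routed to expert i and P_i the mean router probability for expert i. The expert FFNs are stubbed with random weights, and capacity limits and token dropping are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def switch_route(tokens, W_router, experts, alpha=0.01):
    probs = softmax(tokens @ W_router)        # (n_tokens, n_experts) router probabilities
    choice = probs.argmax(axis=-1)            # top-1 expert per token
    n_experts = probs.shape[-1]
    out = np.zeros_like(tokens)
    for i in range(n_experts):
        mask = choice == i
        if mask.any():
            # Scale each expert output by its router probability (the gate value).
            out[mask] = probs[mask, i:i + 1] * experts[i](tokens[mask])
    f = np.bincount(choice, minlength=n_experts) / len(tokens)   # load fractions f_i
    P = probs.mean(axis=0)                                       # mean router probs P_i
    aux_loss = alpha * n_experts * np.sum(f * P)
    return out, aux_loss

rng = np.random.default_rng(0)
d, n_experts = 8, 4
experts = [lambda x, W=rng.normal(scale=0.1, size=(d, d)): np.maximum(x @ W, 0)
           for _ in range(n_experts)]   # stub FFN experts with fixed random weights
out, aux = switch_route(rng.normal(size=(16, d)), rng.normal(size=(d, n_experts)), experts)
print(out.shape, aux)
```
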
Efficient Implementations
- Mixed-Precision Training
- FP16, BFloat16
- Quantization
- INT8, INT4, Binary (a symmetric INT8 sketch follows this list)
- Pruning
- Magnitude Pruning
- Structured Pruning
- Knowledge Distillation
- DistilBERT, TinyBERT, MobileBERT
- Model Parallelism
- Tensor Parallelism
- Pipeline Parallelism
- Data Parallelism
- Distributed Training
- Zero Redundancy Optimizer (ZeRO)
- Gradient Checkpointing
- Reversible Layers
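
A minimal sketch of symmetric per-tensor INT8 post-training quantization: map floats to [-127, 127] with a single scale, then dequantize. Real deployments use per-channel scales, calibration data, and fused integer kernels.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization to int8 with a single scale factor."""
    max_abs = np.abs(w).max()
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).normal(size=(4, 4)).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)
print(np.abs(w - w_hat).max())  # round-trip error is bounded by roughly scale / 2
```
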
This expanded map covers neural network architectures, optimization techniques, regularization methods, loss functions, evaluation metrics, tokenization approaches, embedding methods, attention variants, and efficient implementations, together with the key equations behind state-of-the-art transformer-based language models and a few illustrative code sketches. It is still not exhaustive; natural language processing and deep learning are vast and rapidly evolving fields.