NATURAL LANGUAGE PROCESSING: COURSE PROGRAM, ACADEMIC YEAR 2023-2024

What follows is a subset of the topics presented in class; these topics will be the subject of evaluation at the final exam. The course textbook is 'Speech and Language Processing' by Dan Jurafsky and James H. Martin, 3rd edition, draft of February 3rd, 2024, freely available on the web. For a few topics you also need the auxiliary textbook 'Introduction to Natural Language Processing' by Jacob Eisenstein, October 2019, MIT Press, preprint version freely available on the web, and the on-line course 'NLP Course | For You' by Elena Voita, University of Edinburgh.

Lecture 01: Natural language processing: An unexpected journey.
Content: What is natural language processing? A few case studies: finance, social networks, health. A short history of natural language. Why is natural language processing tricky? Ambiguity, composition, recursion and hidden structure. How does natural language processing work? A very short history of natural language processing. Learning & knowledge. Search & learning. Market, environment and ethics.
References: Slides from the lecture; search & learning is from Eisenstein, section 1.2.2.

Lecture 02: Essentials of linguistics.
Content: What is linguistics? Phonology. Morphology: inflectional and derivational morphology. Syntax: part-of-speech tags, phrase structure trees and dependency trees. Lexical semantics and general semantics. The principle of compositionality. Pragmatics, discourse and dialogue.
References: Slides from the lecture.

Lecture 03: Text normalization.
Content: Regular expressions and extended regular expressions, substitution and back-reference. Words, tokens, types and vocabulary. Herdan/Heaps law and Zipf/Mandelbrot law. Word-forms and lemmas; multi-element word-forms. Corpora. Text normalization: language identification, spell checking, contractions, punctuation and special characters. Text tokenization: word tokenization, character tokenization, and subword tokenization. Subword tokenization: learning algorithm, encoder and decoder. Byte-pair encoding: algorithm and examples. WordPiece. Sentence segmentation and case folding. Stop words, stemming and lemmatization.
References: Jurafsky & Martin, chapter 2; skip section 2.8. I take it for granted that you already know about regular expressions, which are presented in section 2.1.

Lecture 04: Words and meaning.
Content: Lexical semantics: word senses and word relationships. Distributional semantics. Review of vectors: vector length, vector normalization and cosine similarity. Vector semantics and the term-context matrix. Pointwise mutual information (PMI) and positive pointwise mutual information (PPMI). Probability estimation using word frequency. Practical issues. Truncated singular value decomposition. Neural static word embeddings. Word2vec and skip-gram with negative sampling: target embedding, context embedding, classifier algorithm derived from logistic regression, and training algorithm. Practical issues. Other kinds of static embeddings: FastText and GloVe. Visualizing word embeddings. Semantic properties of word embeddings. Bias and word embeddings. Evaluation of word embeddings. Cross-lingual word embeddings.
References: Jurafsky & Martin, chapter 6; skip sections 6.3.1, 6.3.2, 6.5, 6.7, and equations (6.35)-(6.40). Truncated singular value decomposition is taken from Eisenstein, section 14.3. Some of the topics under practical issues have been taken from the on-line course 'NLP Course | For You' by Elena Voita. Use the lecture slides for cross-lingual word embeddings.
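As a concrete illustration of the PPMI weighting and cosine similarity covered in Lecture 04, here is a minimal Python sketch; the term-context counts are invented toy numbers, not data taken from the textbook.

    import numpy as np

    # Toy term-context count matrix: rows are target words, columns are context words.
    counts = np.array([[2.0, 1.0, 0.0],
                       [1.0, 4.0, 1.0],
                       [0.0, 1.0, 3.0]])

    total = counts.sum()
    p_wc = counts / total                     # joint probabilities P(w, c)
    p_w = p_wc.sum(axis=1, keepdims=True)     # marginals P(w)
    p_c = p_wc.sum(axis=0, keepdims=True)     # marginals P(c)

    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log2(p_wc / (p_w * p_c))     # pointwise mutual information
    ppmi = np.where(np.isfinite(pmi), np.maximum(pmi, 0.0), 0.0)   # clip negative and undefined values to 0

    # Cosine similarity between the first two word vectors (rows of the PPMI matrix).
    v, w = ppmi[0], ppmi[1]
    print(float(v @ w / (np.linalg.norm(v) * np.linalg.norm(w))))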
Lecture 05: Language models.
Content: Language modeling and applications. Relative frequency estimation. N-gram model, N-gram probabilities and the bias-variance tradeoff. Practical issues. Evaluation: the perplexity measure and practical issues. Sampling sentences. Sparse data: Laplace smoothing and add-k smoothing; stupid backoff and linear interpolation; out-of-vocabulary words. Limitations of the N-gram model. Neural language models: general architecture. Feed-forward neural LM: inference and training. Recurrent neural LM: inference and training. Character-level and character-aware neural LMs. Practical issues: weight tying, adaptive softmax, softmax temperature, contrastive evaluation.
References: Jurafsky & Martin, chapter 3; skip sections 3.7 and 3.8. The general architecture of neural language models has been taken from the on-line course 'NLP Course | For You' by Elena Voita, section Language Modeling. The feed-forward neural LM is in Jurafsky & Martin, sections 7.5 and 7.7. The recurrent neural LM is in Jurafsky & Martin, section 9.2.

Lecture 06: Large language models.
Content: Review of transformers. Static embeddings vs. contextualized embeddings. The ELMo architecture. The BERT architecture; masked language modeling and next sentence prediction. Large language models (LLMs). The GPT-n family. Other LLMs; open-source LLMs. Classification of LLMs: encoder, decoder and encoder-decoder. Multi-lingual LLMs: monolingual training and training based on parallel corpora. Sentence BERT: training and inference. Miscellanea: emergent abilities, hallucinations, and mixture of experts. Adaptation: feature extraction and fine-tuning. Parameter-efficient fine-tuning: adapters and LoRA. Transfer learning. Prompt learning. Retrieval-augmented generation. Contextualized embeddings and ethics.
References: Jurafsky & Martin, chapter 11; skip sections 11.3, 11.4.3 and 11.5. ELMo, the GPT-n models, adapters and transfer learning have all been taken from the on-line course 'NLP Course | For You' by Elena Voita, section Transfer Learning. Use the lecture slides for the following topics: other LLMs, open-source LLMs and the classification of LLMs, multi-lingual LLMs, sentence BERT, LoRA, prompt learning, and retrieval-augmented generation. Read Jurafsky & Martin, section 10.10, on ethical issues for LLMs.

Lecture 06aux: ChatBots.
Content: Pre-training. Supervised fine-tuning. Reward model. Reinforcement learning.
References: Lecture slides and the video by A. Karpathy (see the Day 10 box).

Lecture 07: Part-of-Speech Tagging.
Content: Parts of speech (PoS) and the PoS tagging task. Evaluation for PoS tagging. Hidden Markov model (HMM), emission and transition probabilities. Probability estimation. HMMs as automata. The Viterbi algorithm for decoding. The forward algorithm and the trellis data structure. The backward algorithm. The forward-backward algorithm, E-step and M-step. Conditional random fields (CRFs) and global features. Linear-chain CRF, local features and feature templates. Decoding for CRFs using the Viterbi algorithm. Training algorithm for CRFs: loss function, regularization, and stochastic gradient descent (sketch only). Neural PoS taggers using local search: fixed-window feed-forward neural model, recurrent neural model, and recurrent bidirectional model. Neural PoS taggers using global search: a neural model combining an RNN and a CRF (sketch only). Named entity recognition and other sequence labelling tasks.
References: Jurafsky & Martin, chapter 8; skip section 8.5.2. Use the lecture slides for the forward algorithm, the trellis, the backward algorithm, and the forward-backward algorithm, or else look into Jurafsky & Martin, appendix A (available through the textbook web page only). The training algorithm for the linear-chain CRF is taken from Eisenstein, section 7.5.3. Use the lecture slides for the fixed-window feed-forward neural model. The recurrent bidirectional model and neural structured prediction are taken from Eisenstein, section 7.6.1.
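To make the trellis recursion of Lecture 07 concrete, below is a minimal sketch of Viterbi decoding for a two-tag HMM; the tag set and all probabilities are invented for illustration only.

    import numpy as np

    # Toy HMM with two tags; all probabilities below are invented for illustration.
    tags = ["DET", "NOUN"]
    pi = np.array([0.7, 0.3])                 # initial tag probabilities
    A = np.array([[0.1, 0.9],                 # transition probabilities: A[i, j] = P(tag j | tag i)
                  [0.4, 0.6]])
    B = {"the": np.array([0.8, 0.05]),        # emission probabilities: B[word][i] = P(word | tag i)
         "dog": np.array([0.01, 0.5])}

    def viterbi(words):
        """Return the most probable tag sequence for the observed words."""
        n, k = len(words), len(tags)
        score = np.zeros((n, k))              # best log-probability of a path ending in tag j at position t
        back = np.zeros((n, k), dtype=int)    # backpointers for recovering the best path
        score[0] = np.log(pi) + np.log(B[words[0]])
        for t in range(1, n):
            for j in range(k):
                cand = score[t - 1] + np.log(A[:, j]) + np.log(B[words[t]][j])
                back[t, j] = np.argmax(cand)
                score[t, j] = cand[back[t, j]]
        path = [int(np.argmax(score[-1]))]
        for t in range(n - 1, 0, -1):
            path.append(int(back[t, path[-1]]))
        return [tags[i] for i in reversed(path)]

    print(viterbi(["the", "dog"]))            # prints ['DET', 'NOUN'] for this toy model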
References: Jurafsky & Martin, chapter 8; skip section 8.5.2. Use lecture slides for forward algorithm, trellis, backward algorithm, and forward-backward algorithm or else look into Jurafsky & Martin appendix A (available through the textbook web page only). Training algorithm for linear chain CRF is taken from Eisenstein, section 7.5.3. Use lecture slides for the fixed-window feed-forward neural model. Recurrent bidirectional model and neural structured prediction are taken from Eisenstein, section 7.6.1. Lecture 10: Dependency Parsing Content: Dependency trees and grammatical functions. Dependency formalisms, projective and non-projective trees. Dependency tree banks and universal dependency project. Transition-based dependency parsing. Arc-standard parser: transition definition, oracle algorithm, and generation of training data. Feature extraction: feature functions and feature templates. Alternative models for dependency parsing (very basic notions only): arc-eager parser, Attardi parser (non-projective), transition-based parsing with beam search, and graph-based dependency parsing (non-projective). Neural architectures for dependency parsing: case study, Kiperwasser & Goldberg 2016; feature extraction using bidirectional LSTM; training algorithm using hinge loss. References: Jurafsky & Martin, chapter 18; skip section 18.3. Slides from the lecture for the part on neural dependency parsing. Lecture 11: Semantic Parsing Content: Semantics and referential meaning. Lexical semantics. Word senses and semantic relations between word senses. WordNet and the notion of synset. Word sense disambiguation. Semantic roles and thematic grid. Proposition Bank and semantic roles. FrameNet and the notion of frame. Semantic role labeling and a neural algorithm; datasets and evaluation. Selectional restrictions. Meaning representation and formalisms. Abstract meaning representation (AMR) and transition-based AMR parsing. References: Jurafsky & Martin, chapter 20; skip sections 20.6.1, 20.7.1 and 20.7.2. Slides from the lecture for the part on meaning representation, AMR and semantic parsing algorithm for AMR. Lecture 12: Machine Translation Content: Word ordering and V,S,O language classification. Word translation and word alignment relation. Statistical machine translation (SMT) using language model and translation model. Neural machine translation (NMT). Encoder-decoder neural architecture (seq2seq). Autoregressive encoder-decoder using RNN: greedy inference algorithm; training algorithm, average cross-entropy loss, teacher forcing. Encoder-decoder RNN using attention techniques: dynamic context, dot-product attention, and bilinear attention. Transformer-based architecture for NMT: cross-attention, query, key and value. Search tree and beam search. Parallel corpora. Evaluation: BLEU metric and METEOR metric. Leaderboards. References: Jurafsky & Martin, chapter 13; skip SentencePiece from section 13.2.1, and formula (13.10) from section 13.2.2. Jurafsky & Martin, chapter 9: sections 9.7, 9.8. Jurafsky & Martin, chapter 10: section 10.1 is taken for granted from the deep learning prerequisite course. Use lecture slides for word alignment, statistical machine translation, and beam-search. Lecture 13: Question Answering Content: Question answering (QA) and factoid questions. Text-based QA, information retrieval and reading comprehension. Answer span extraction using contextual embeddings: start and end vectors, fine-tuning loss, and negative examples. 
Lecture 13: Question Answering.
Content: Question answering (QA) and factoid questions. Text-based QA, information retrieval and reading comprehension. Answer span extraction using contextual embeddings: start and end vectors, fine-tuning loss, and negative examples. Answer span extraction using RNNs and attention: the Stanford attentive reader; bilinear product and attention. Practical issues. Retrieval-augmented generation. Datasets, evaluation measures, and leaderboards. Knowledge-based QA; entity linking.
References: Jurafsky & Martin, chapter 14; skip sections 14.1 and 14.2; skip definition (14.22) from section 14.4. The Stanford attentive reader is taken from Eisenstein, section 17.5.2. Use the lecture slides for retrieval-augmented generation, knowledge-based QA and entity linking.

References: Jurafsky & Martin, chapter 15. Section 15.5: just read.
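To close, here is a rough sketch of the start/end span scoring used for answer span extraction in Lecture 13: each span is scored by the sum of a start score and an end score over the passage tokens, and the best legal span ends no earlier than it starts. The token embeddings and the start/end vectors below are random stand-ins for the output and parameters of a fine-tuned encoder.

    import numpy as np

    rng = np.random.default_rng(1)
    d, n = 16, 10                             # embedding size and passage length (made up)
    tokens = rng.normal(size=(n, d))          # contextual embeddings of the passage tokens
    start_v = rng.normal(size=(d,))           # learned start vector
    end_v = rng.normal(size=(d,))             # learned end vector

    start_scores = tokens @ start_v           # start score of every token
    end_scores = tokens @ end_v               # end score of every token

    # Score every span (i, j) with j >= i and pick the best one.
    span_scores = start_scores[:, None] + end_scores[None, :]
    legal = np.triu(np.ones((n, n), dtype=bool))
    span_scores = np.where(legal, span_scores, -np.inf)
    i, j = np.unravel_index(np.argmax(span_scores), span_scores.shape)
    print(int(i), int(j))                     # token indices of the predicted answer span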