NATURAL LANGUAGE PROCESSING 2025-2026 - INQ0091105
Content: A detailed description of the course content and prerequisites can be found here.
Textbook: The adopted textbook is Speech and Language Processing (3rd Edition, draft, Jan 6th 2026) by Dan Jurafsky and James H. Martin, available here.
Logistics: Lectures are on Thursday 10:30-12:30 (room Ce) and on Friday 10:30-12:30 (room Ce).
Office hours: Thursday 12:30-14:30, by email appointment only. Meetings can be held in person or online at this Zoom link.
-
Forum for general news and announcements. Only the lecturer can post in this forum. Every student registered for this course is subscribed automatically.
-
Forum for discussion of the technical material presented during the lectures. Any student with a unipd account can post in this forum. Every student registered for this course is subscribed automatically.
-
Forum for project discussion. Any student with a unipd account can post in this forum. Every student registered for this course is subscribed automatically.
-
March 5th, Thursday (10:30-12:30)
Course administration and presentation
- Content outline
- Laboratory sessions
- Course requirements
- Textbook
- Project
- Coursework
Natural language processing: An unexpected journey
- What is natural language processing?
- Very short history of natural language processing
- Why is natural language processing tricky?
- Ambiguity, composition, recursion and hidden structure
- Language & learning
- Miscellanea
References
- Slides from the lecture
Resources
-
March 6th, Friday (10:30-12:30)
Text normalization
- Word types and word tokens
- Herdan's law and Zipf's law
- Morphology and word-form
- Corpora
- Language identification and spell checking
- Text normalization: contraction, punctuation and special characters
- Word tokenization, character tokenization, and subword tokenization
- Byte-pair encoding algorithm: learner, encoder and decoder
References
- Jurafsky and Martin, chapter 2
Resources
-
March 12th, Thursday (10:30-12:30)
Text normalization
- Byte-pair encoding algorithm: learner, encoder and decoder (cont'd)
- Sentence segmentation and case folding
- Stop words, stemming and lemmatization
Words and meaning
- Lexical semantics
- Distributional semantics
- Count-based embeddings
- Word2vec and skip-gram
- Logistic regression
- Training
References
- Jurafsky and Martin, chapter 2
- Jurafsky and Martin, chapter 5
Resources
-
March 13th, Friday (10:30-12:30)
Words and meaning
- Training (cont'd)
- Practical issues
- FastText and GloVe
- Semantic properties of neural word embeddings
- Evaluation
- Cross-lingual word embeddings
References
- Jurafsky and Martin, chapter 5
- Voita, NLP Course | For You (web course): Word embeddings
-
March 19th, Thursday (10:30-12:30)
Statistical language models
- Language modeling: prediction and generation
- Language modeling applications
- Relative frequency estimation
- N-gram model
- N-gram probabilities and bias-variance trade-off
- Practical issues
- Evaluation: perplexity measure
- Sampling sentences
- Smoothing: Laplace and add-k smoothing
- Stupid backoff
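As a small illustration of the relative frequency estimation, add-k smoothing and perplexity topics listed above, the following is a minimal sketch of a bigram model in plain Python. The function names (`train_bigram`, `add_k_prob`, `perplexity`) and the toy corpus are assumptions made for this example and are not taken from the course material. The vocabulary is simply the set of token types seen in training, including the padding symbols, and out-of-vocabulary words are not handled.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Count unigrams and bigrams, padding each sentence with <s> and </s>."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def add_k_prob(prev, word, unigrams, bigrams, k=1.0):
    """Add-k estimate of P(word | prev); k=1 gives Laplace smoothing."""
    vocab_size = len(unigrams)  # assumption: vocabulary = types seen in training
    return (bigrams[(prev, word)] + k) / (unigrams[prev] + k * vocab_size)

def perplexity(sentence, unigrams, bigrams, k=1.0):
    """Perplexity of one sentence: exp of the average negative log-probability."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    log_prob = sum(
        math.log(add_k_prob(prev, word, unigrams, bigrams, k))
        for prev, word in zip(tokens, tokens[1:])
    )
    return math.exp(-log_prob / (len(tokens) - 1))

# Toy corpus, chosen only for illustration.
corpus = ["I am Sam", "Sam I am", "I do not like green eggs and ham"]
unigrams, bigrams = train_bigram(corpus)
print(add_k_prob("<s>", "I", unigrams, bigrams))    # (2 + 1) / (3 + 12)
print(perplexity("I am Sam", unigrams, bigrams))
print(perplexity("ham and eggs", unigrams, bigrams))
```

A sentence made of bigrams seen in training gets a lower perplexity than one made only of unseen bigrams, which is the behaviour the perplexity measure is meant to capture.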
References
- Jurafsky and Martin, chapter 3
Resources
-
March 20th, Friday (10:30-12:30)
Statistical language models
- Linear interpolation
- Out-of-vocabulary words
- Limitations of N-gram model
Neural language models (NLM)
- General architecture for NLM
- Feedforward NLM: inference
- Feedforward NLM: training
- Recurrent NLM: inference
- Recurrent NLM: training
Exercises
- Subword tokenization: BPE algorithm
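For self-study alongside this exercise, here is a minimal sketch of a byte-pair encoding learner and encoder in plain Python. It assumes a word-frequency dictionary as input, an end-of-word marker `_`, and tie-breaking by first occurrence; the function names and the toy corpus are illustrative assumptions, not the exercise given in class.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` by its concatenation."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)

def learn_bpe(word_freqs, num_merges):
    """Learner: return the ordered list of merges learned from word frequencies."""
    # Each word starts as a sequence of characters plus the end-of-word marker "_".
    vocab = {tuple(word) + ("_",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Most frequent adjacent pair; ties go to the pair counted first.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        vocab = {merge_pair(symbols, best): freq for symbols, freq in vocab.items()}
    return merges

def encode(word, merges):
    """Encoder: apply the learned merges, in order, to a new word."""
    symbols = tuple(word) + ("_",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

# Toy word counts, chosen only for illustration.
word_freqs = {"low": 5, "lowest": 2, "newer": 6, "wider": 3, "new": 2}
merges = learn_bpe(word_freqs, num_merges=3)
print(merges)                   # [('e', 'r'), ('er', '_'), ('n', 'e')]
print(encode("newer", merges))  # ['ne', 'w', 'er_']
print(encode("lower", merges))  # ['l', 'o', 'w', 'er_']
```

Decoding is the inverse of encoding: concatenate the subword tokens and turn each `_` back into a word boundary.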
References
- Jurafsky and Martin, section 6.5
- Jurafsky and Martin, section 13.2
- Voita, NLP Course | For You (web course): Language Modeling
Resources
-
March 26th, Thursday (10:30-12:30)
Neural language models (NLM)
- Practical issues: parameter freezing, weight tying, softmax temperature
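The effect of the softmax temperature can be seen in a few lines of Python. This is a minimal sketch, assuming the common formulation in which the logits are divided by a temperature T > 0 before the softmax; the function name and the example logits are illustrative assumptions.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / temperature, computed in a numerically stable way."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    shift = max(scaled)  # subtracting the max avoids overflow in exp
    exps = [math.exp(z - shift) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # hypothetical scores for three vocabulary items
for t in (0.5, 1.0, 2.0):
    print(t, softmax_with_temperature(logits, t))
```

A temperature below 1 sharpens the distribution towards the highest-scoring item, while a temperature above 1 flattens it towards uniform.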
Contextual word embeddings
- Static embeddings vs. contextualized embeddings
- ELMo
- BERT: encoder-only model
- BERT: masked language modeling
References
- Jurafsky and Martin, chapter 9
- Voita, NLP Course | For You (web course): Language Modeling
- Slides from lecture
Resources
-
March 27th, Friday (10:30-12:30)
Contextual word embeddings
- BERT: next sentence prediction
- Applications of BERT: sentiment analysis
- Applications of BERT: named-entity recognition
- GPT-n: decoder-only model
- Sentence-BERT
References
- Jurafsky and Martin, chapter 9
- Slides from lecture
-
April 2nd, Thursday (10:30-12:30)
Lab Session I: Static word embeddings
- Introduction to the gensim library
- Common operations with word embeddings: lookup, similarity, NN retrieval
- Visualizing word embeddings: dimensionality reduction with PCA
- Intrinsic evaluation of word embeddings: word similarity and word analogy benchmarks
Large language models and pretraining
- Pretraining and transfer learning
- Large language models
- Language modeling head
- Text completion and decoder-only model
- Casting NLP tasks as text completion
References
- Jurafsky and Martin, chapter 7
- Jurafsky and Martin, chapter 8
Resources
-
April 9th, Thursday (10:30-12:30)
Large language models and pretraining
- Sampling
- Key-Value cache
- Pretraining
References
- Jurafsky and Martin, chapter 7
- Jurafsky and Martin, chapter 8
-
April 10th, Friday (10:30-12:30)
Lab Session II: Hugging Face Transformers
- General overview of the library
- Importing and using BERT
- Using Gemma and chat templates
Large language models and pretraining
- Training corpora
References
- Jurafsky and Martin, chapter 7
- Jurafsky and Martin, chapter 8
- Slides from lecture
Resources
-
April 16th, Thursday (10:30-12:30)
Large language models and pretraining
- Scaling laws for LLMs (cont'd)
- Overview of LLMs
- Classification for LLMs
- Multi-lingual LLMs
- Miscellanea
References
- Jurafsky and Martin, chapter 7
- Jurafsky and Martin, chapter 8
- Slides from lecture
-
April 17th, Friday (10:30-12:30)
Large language models and post-training
- Fine-tuning
- Instruction tuning
- Datasets for instruction tuning
- Model alignment
References
- Jurafsky and Martin, chapter 10
Resources
-
April 23rd, Thursday (10:30-12:30)
Large language models and post-training
- Preference-based learning
- Modeling preferences
- Learning to score preferences
- LLM alignment via preference learning
- Direct preference optimization
- Parameter efficient fine-tuning: adapters
- Parameter efficient fine-tuning: LoRA
- Transfer learning
References
- Jurafsky and Martin, chapter 10
- Jurafsky and Martin, chapter 8
- Voita, NLP Course | For You (web course): Transfer Learning
- Slides from lecture
-
April 24th, Friday (10:30-12:30)
Lab Session III: Retrieval Augmented Generation
- Embedder (BERT) and Tokenizer
- Gemma
- Load documents, embed them and save them to the knowledge base
- Retrieval: Top-n Similar Chunks
- Generation: Answer from Retrieved Context
References
- Slides from lecture
Resources
-
Load directly into Google Colab
-
April 30th, Thursday (10:30-12:30)
Chatbots
- Introduction to chatbots
- Domain adaptation
- Lifecycle of a chatbot
- Datasets
- Prompt
- Prompt engineering
References
- Jurafsky and Martin, chapter 7
- Slides from lecture
Resources
-
External video: Training pipeline of GPT assistants like ChatGPT by Andrej Karpathy, 2023. First part only: stop at timestamp 20:17
-
May 7th, Thursday (10:30-12:30)
Retrieval-Augmented Generation
- Introduction to RAG
- Neural information retrieval
- Cross-encoder
- Bi-encoder
- ColBERT
- Generation
- Advanced RAG methods
- Datasets
- Evaluation
Part-of-speech tagging
- Part-of-speech (PoS) and part-of-speech tagging
- Evaluation
Hidden Markov models
- Structured prediction problems
References
- Jurafsky and Martin, chapter 11
- Jurafsky and Martin, chapter 17
- Slides from lecture
Resources
-
May 8th, Friday (10:30-12:30)
Hidden Markov models
- Definition of Hidden Markov model (HMM)
- Probability estimation for HMM
- HMMs as automata with output
- Decoding via Viterbi algorithm
Neural part-of-speech tagging
- Local search
- Fixed-window neural model
- Recurrent neural model
- Recurrent bidirectional model
Sequence labelling
- Named entity recognition (NER)
- BIO labeling
- NER evaluation
- Other sequence labelling tasks
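To make the BIO labeling scheme concrete, the following minimal sketch converts a sequence of BIO tags into entity spans, which is the representation that entity-level NER evaluation compares. The function name, the span format `(label, start, end)` with an exclusive end index, and the choice to ignore an `I-` tag that does not continue an open span are assumptions made for this example.

```python
def bio_to_spans(tags):
    """Convert BIO tags into a list of (label, start, end) spans, end exclusive."""
    spans, start, label = [], None, None
    # A final "O" sentinel closes any span still open at the end of the sentence.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.append((label, start, i))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans

tokens = ["Jane", "Villanueva", "visited", "Padova"]
tags = ["B-PER", "I-PER", "O", "B-LOC"]
print(bio_to_spans(tags))  # [('PER', 0, 2), ('LOC', 3, 4)]
```

With spans in this form, precision and recall can be computed by comparing the set of predicted spans with the set of gold spans.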
References
- Jurafsky and Martin, chapter 17
- Slides from the lecture
Resources
-
May 14th, Thursday (10:30-12:30)
Dependency parsing
- Dependency trees
- Grammatical functions
- Projective and non-projective dependency trees
- Dependency treebanks
- Transition-based dependency parsing
- Arc-standard parser
- Transitions definition
Exercises
- Viterbi algorithm
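For self-study alongside this exercise, here is a minimal sketch of Viterbi decoding for an HMM in plain Python. It assumes a non-empty observation sequence and probabilities stored in dictionaries; the function name, the two-state tag set and all the numbers are toy assumptions and are not the exercise given in class. A practical implementation would work with log-probabilities to avoid underflow on long sequences.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state sequence and its joint probability."""
    # best[t][s]: probability of the best path that ends in state s at time t.
    best = [{s: start_p[s] * emit_p[s].get(observations[0], 0.0) for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev, score = max(
                ((r, best[-1][r] * trans_p[r][s]) for r in states),
                key=lambda item: item[1],
            )
            scores[s] = score * emit_p[s].get(obs, 0.0)
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the backpointers from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for pointers in reversed(back[1:]):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[-1][last]

# Toy HMM with two tags; all numbers are made up for illustration.
states = ["N", "V"]
start_p = {"N": 0.8, "V": 0.2}
trans_p = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.6, "V": 0.4}}
emit_p = {"N": {"fish": 0.6, "sleep": 0.4}, "V": {"fish": 0.3, "sleep": 0.7}}
path, prob = viterbi(["fish", "sleep"], states, start_p, trans_p, emit_p)
print(path, prob)  # path is ['N', 'V'], probability is about 0.2352
```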
References
- Jurafsky and Martin, chapter 19
- Slides from the lecture
Resources
-
May 21st, Thursday (10:30-12:30)
Dependency parsing
- Ambiguity
- Oracle and generation of training data
- Example
- Feature extraction, feature functions and feature templates
Neural dependency parsing
- Case study: Kiperwasser & Goldberg 2016
- Feature extraction using BiLSTM
- Hinge loss function
- Evaluation: UAS and LAS
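The two evaluation metrics listed above can be computed in a few lines. This is a minimal sketch, assuming each token is represented by a `(head, label)` pair and that gold and predicted analyses cover the same tokens; the function name, the labels and the example trees are illustrative assumptions.

```python
def attachment_scores(gold, predicted):
    """Return (UAS, LAS) for two equally long lists of (head, label) pairs."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and of equal length")
    # UAS counts tokens with the correct head; LAS also requires the correct label.
    correct_heads = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    correct_both = sum(g == p for g, p in zip(gold, predicted))
    return correct_heads / len(gold), correct_both / len(gold)

# "Book me the morning flight": heads are 1-based token indices, 0 is the root.
gold = [(0, "root"), (1, "iobj"), (5, "det"), (5, "compound"), (1, "obj")]
predicted = [(0, "root"), (1, "obj"), (5, "det"), (1, "compound"), (1, "obj")]
uas, las = attachment_scores(gold, predicted)
print(uas, las)  # 0.8 0.6
```

Here four of the five heads are correct, and three of the five tokens have both the correct head and the correct label.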
References
- Jurafsky and Martin, chapter 19
- Simple and Accurate Dependency Parsing Using Bidirectional LSTM Feature Representations, Kiperwasser and Goldberg, TACL, vol. 4, 2016
-
May 22nd, Friday (10:30-12:30)
Lab Session IV: Named-Entity Recognition
- BERT-based NER
- Gemma prompting for NER
- Evaluation
- Exercise
Resources
-
Load directly into Google Colab
-
May 28th, Thursday (10:30-12:30)
Machine translation
- Word order and classification of languages by verb (V), subject (S) and object (O) order
- Word translation and word alignment relation
- Neural machine translation (NMT): general idea
- Encoder-decoder architecture (seq2seq): general idea
Exercises
- Arc-standard oracle
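For self-study alongside this exercise, here is a minimal sketch of a static oracle for the arc-standard parser, following the textbook formulation: LEFT-ARC when the top of the stack is the gold head of the second element, RIGHT-ARC when the second element is the gold head of the top and the top has already collected all its dependents, SHIFT otherwise. The function name, the head-array input format and the unlabeled transitions are assumptions made for this example; it handles projective trees only.

```python
def arc_standard_oracle(heads):
    """heads[i] is the gold head of token i (1-based); heads[0] is a placeholder.

    Returns the transition sequence and the (head, dependent) arcs it builds.
    """
    n = len(heads) - 1
    stack, buffer = [0], list(range(1, n + 1))  # node 0 is the artificial root
    pending = [0] * (n + 1)  # dependents of each node not yet attached
    for dep in range(1, n + 1):
        pending[heads[dep]] += 1
    transitions, arcs = [], []
    while buffer or len(stack) > 1:
        if len(stack) >= 2:
            top, second = stack[-1], stack[-2]
            if second != 0 and heads[second] == top:
                transitions.append("LEFT-ARC")
                arcs.append((top, second))
                pending[top] -= 1
                del stack[-2]
                continue
            if heads[top] == second and pending[top] == 0:
                transitions.append("RIGHT-ARC")
                arcs.append((second, top))
                pending[second] -= 1
                stack.pop()
                continue
        if not buffer:
            raise ValueError("gold tree is not projective")
        transitions.append("SHIFT")
        stack.append(buffer.pop(0))
    return transitions, arcs

# "Book me the morning flight": Book <- root, me <- Book, the <- flight,
# morning <- flight, flight <- Book.
heads = [0, 0, 1, 5, 5, 1]
transitions, arcs = arc_standard_oracle(heads)
print(transitions)
```

For this sentence the oracle produces SHIFT, SHIFT, RIGHT-ARC, SHIFT, SHIFT, SHIFT, LEFT-ARC, LEFT-ARC, RIGHT-ARC, RIGHT-ARC. The check on pending dependents is what delays the RIGHT-ARC for "flight" until "the" and "morning" have been attached to it.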
References
- Jurafsky and Martin, chapter 12
- Slides from the lecture
Resources
-
This box collects the texts of final exams from past sessions and years.
Note that the program for academic year 2025/26 has changed considerably, so several of the questions in previous years' final exams are not covered by the 2025/26 program.