Topic outline

  • INQ0091105 - NATURAL LANGUAGE PROCESSING 2024-2025 - PROF. GIORGIO SATTA

    Content: A detailed description of the course content and prerequisites can be found here.

    Textbook: The adopted textbook is Speech and Language Processing (3rd Edition, draft, January 12, 2025) by Dan Jurafsky and James H. Martin, available here.

    Additional resources: The following textbook can be used for consultation only: Introduction to Natural Language Processing by Jacob Eisenstein, October 2019, MIT Press, preprint version available here. The course also uses electronic forums for the discussion of technical matters and for administrative information. Video recordings of the lectures from academic year 2021/22 are available at this link.

    Logistics: Lectures are on Monday 16:30-18:30 (room Ce) and on Wednesday 16:30-18:30 (room Ce).

    Office hours: Wednesday 12:30-14:30, appointment by email required. Meetings can be face-to-face or online at this Zoom link.

    • Forum for general news and announcements. Only the lecturer can post in this forum. Subscription to this forum is automatic for every student who has registered for this course.

    • Forum for discussion of technical matters presented during the lectures. Any student with a unipd account can post in this forum. Subscription to this forum is automatic for every student who has registered for this course.

    • Forum for discussion of technical matters presented during the open laboratory sessions. Any student with a unipd account can post in this forum. Subscription to this forum is automatic for every student who has registered for this course.

    • Forum for project discussion. Any student with a unipd account can post in this forum. Subscription to this forum is automatic for every student who has registered for this course.

  • Day 01

    February 24th, Monday (16:30-18:30)

    Course administration and presentation

    • Content outline
    • Laboratory sessions
    • Course requirements
    • Textbook
    • Project
    • Coursework
    • Statistics
    • Lecturer evaluation

    Natural language processing: An unexpected journey

    • What is natural language processing?
    • A few case studies: finance, social networks, health
    • Very short history of natural language processing
    • Why is natural language processing tricky?
    • Ambiguity, composition, recursion and hidden structure
    • How does natural language processing work?
    • Learning & knowledge
    • Search & learning
    • Market, environment and ethics

    References

    • Slides from the lecture
    • Eisenstein, chapter 1 for learning & knowledge and for search & learning

    Resources

  • Day 02

    February 26th, Wednesday (16:30-18:30)

    Essentials of linguistics

    • What is linguistics?
    • Phonology
    • Morphology
    • Part of speech
    • Syntax: phrase structure and dependency structure
    • Lexical semantics and general semantics
    • Pragmatics and discourse

    Text normalization

    • Regular expressions
    • Word types and word tokens
    • Corpora
    • Language identification and spell checking
    • Text normalization: contraction, punctuation and special characters

    References

    • Slides from the lecture
    • Jurafsky and Martin, chapter 2

    Resources

  • Day 03

    March 3rd, Monday (16:30-18:30)

    Text normalization

    • Word tokenization, character tokenization, and subword tokenization
    • Byte-pair encoding algorithm
    • Sentence segmentation and case folding
    • Stop words, stemming and lemmatization
    • Research papers
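
    The byte-pair encoding algorithm above can be illustrated with a minimal sketch of the merge-learning loop; the toy corpus and the number of merges are invented for illustration and are not taken from the lecture.

    ```python
    from collections import Counter

    def learn_bpe(corpus, num_merges):
        """Learn BPE merges from a list of words (toy sketch)."""
        # Represent each word as a tuple of symbols, with an end-of-word marker.
        vocab = Counter(tuple(word) + ("_",) for word in corpus)
        merges = []
        for _ in range(num_merges):
            # Count adjacent symbol pairs over the whole vocabulary.
            pairs = Counter()
            for symbols, freq in vocab.items():
                for a, b in zip(symbols, symbols[1:]):
                    pairs[(a, b)] += freq
            if not pairs:
                break
            best = max(pairs, key=pairs.get)  # most frequent adjacent pair
            merges.append(best)
            # Replace every occurrence of the best pair with the merged symbol.
            new_vocab = Counter()
            for symbols, freq in vocab.items():
                out, i = [], 0
                while i < len(symbols):
                    if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                        out.append(symbols[i] + symbols[i + 1])
                        i += 2
                    else:
                        out.append(symbols[i])
                        i += 1
                new_vocab[tuple(out)] += freq
            vocab = new_vocab
        return merges

    print(learn_bpe(["low", "low", "lower", "newest", "newest", "widest"], 5))
    ```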

    Words and meaning

    • Lexical semantics
    • Distributional semantics
    • Review: vectors
    • Term-context matrix

    References

    • Jurafsky and Martin, chapter 2
    • Jurafsky and Martin, chapter 6
    • Eisenstein, section 14.3

    Resources

  • Day 04

    March 5th, Wednesday (16:30-18:30)

    Words and meaning

    • Pointwise mutual information
    • Probability estimation
    • Examples
    • Practical issues
    • Truncated singular value decomposition
    • Neural word embeddings
    • Word2vec and skip-gram
    • Logistic regression
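
    A minimal sketch of how positive pointwise mutual information can be computed from a term-context count matrix; the tiny matrix is invented for illustration.

    ```python
    import numpy as np

    def ppmi(counts):
        """Positive PMI from a term-context count matrix (rows: words, columns: contexts)."""
        total = counts.sum()
        p_wc = counts / total                  # joint probabilities P(w, c)
        p_w = p_wc.sum(axis=1, keepdims=True)  # marginals P(w)
        p_c = p_wc.sum(axis=0, keepdims=True)  # marginals P(c)
        with np.errstate(divide="ignore", invalid="ignore"):
            pmi = np.log2(p_wc / (p_w * p_c))
        pmi[~np.isfinite(pmi)] = 0.0           # zero counts get PMI 0
        return np.maximum(pmi, 0.0)            # clip negative values (PPMI)

    # Toy 3x4 term-context matrix, values invented for illustration only.
    C = np.array([[2.0, 1.0, 0.0, 1.0],
                  [1.0, 3.0, 1.0, 0.0],
                  [0.0, 1.0, 2.0, 2.0]])
    print(ppmi(C))
    ```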

    References

    • Jurafsky and Martin, chapter 6
    • Voita, NLP Course | For You (web course): Word embeddings
  • Day 05

    March 10th, Monday (16:30-18:30)

    Words and meaning

    • Training
    • Practical issues
    • FastText and GloVe
    • Semantic properties of neural word embeddings
    • Evaluation
    • Cross-lingual word embeddings
    • Research papers

    Language models

    • Language modeling: word prediction and sentence distribution
    • Language modeling applications
    • Relative frequency estimation
    • N-gram model
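
    Relative frequency (maximum likelihood) estimation for the bigram case of the N-gram model above amounts to counting and normalizing; the toy corpus is invented for illustration.

    ```python
    from collections import Counter

    # Toy corpus, padded with sentence-boundary markers.
    sentences = [["<s>", "i", "like", "nlp", "</s>"],
                 ["<s>", "i", "like", "pizza", "</s>"],
                 ["<s>", "nlp", "is", "fun", "</s>"]]

    unigrams = Counter(w for s in sentences for w in s)
    bigrams = Counter((a, b) for s in sentences for a, b in zip(s, s[1:]))

    def p_bigram(w_prev, w):
        """Relative frequency estimate P(w | w_prev) = count(w_prev, w) / count(w_prev)."""
        return bigrams[(w_prev, w)] / unigrams[w_prev]

    print(p_bigram("i", "like"))    # 1.0: "like" always follows "i" in this corpus
    print(p_bigram("like", "nlp"))  # 0.5
    ```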

    References

    • Jurafsky and Martin, chapter 6
    • Jurafsky and Martin, chapter 3

    Resources

  • Day 06

    March 12th, Wednesday (16:30-18:30)

    Language models

    • N-gram probabilities and bias-variance trade-off
    • Practical issues
    • Evaluation: perplexity measure
    • Sampling sentences
    • Smoothing: Laplace and add-k smoothing
    • Stupid backoff and linear interpolation
    • Out-of-vocabulary words
    • Limitations of N-gram model
    • Research papers
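
    Perplexity and add-k smoothing, both listed above, can be combined in a short sketch for a bigram model; the vocabulary size, the value of k and the test sentence are arbitrary choices for illustration.

    ```python
    import math
    from collections import Counter

    def add_k_prob(bigrams, unigrams, vocab_size, k, w_prev, w):
        """Add-k smoothed bigram probability P(w | w_prev)."""
        return (bigrams[(w_prev, w)] + k) / (unigrams[w_prev] + k * vocab_size)

    def perplexity(sentence, bigrams, unigrams, vocab_size, k=0.5):
        """exp of the average negative log probability over the bigrams of a sentence."""
        log_prob, n = 0.0, 0
        for w_prev, w in zip(sentence, sentence[1:]):
            log_prob += math.log(add_k_prob(bigrams, unigrams, vocab_size, k, w_prev, w))
            n += 1
        return math.exp(-log_prob / n)

    # Toy training counts, in the style of the relative frequency sketch of Day 05.
    train = [["<s>", "i", "like", "nlp", "</s>"], ["<s>", "i", "like", "pizza", "</s>"]]
    unigrams = Counter(w for s in train for w in s)
    bigrams = Counter((a, b) for s in train for a, b in zip(s, s[1:]))
    print(perplexity(["<s>", "i", "like", "nlp", "</s>"], bigrams, unigrams, len(unigrams)))
    ```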

    Exercises

    • Subword tokenization: BPE algorithm

    References

    • Jurafsky and Martin, chapter 3
  • Day 07

    March 17th, Monday (16:30-18:30)

    Neural language models (NLM)

    • General architecture for NLM
    • Feedforward NLM: inference
    • Feedforward NLM: training
    • Recurrent NLM: inference
    • Recurrent NLM: training
    • Practical issues: parameter freezing, weight tying, softmax temperature
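
    A minimal feedforward neural language model in the spirit of the inference architecture above, written as a PyTorch sketch; the vocabulary size, context length and layer dimensions are arbitrary.

    ```python
    import torch
    import torch.nn as nn

    class FeedforwardNLM(nn.Module):
        """Predict the next word from a fixed window of previous words (toy sketch)."""
        def __init__(self, vocab_size, context_size=3, embed_dim=32, hidden_dim=64):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.hidden = nn.Linear(context_size * embed_dim, hidden_dim)
            self.out = nn.Linear(hidden_dim, vocab_size)

        def forward(self, context_ids):
            # context_ids: (batch, context_size) indices of the previous words
            e = self.embed(context_ids).flatten(start_dim=1)  # concatenate the embeddings
            h = torch.tanh(self.hidden(e))
            return self.out(h)                                # logits over the vocabulary

    model = FeedforwardNLM(vocab_size=100)
    logits = model(torch.randint(0, 100, (4, 3)))             # batch of 4 contexts
    probs = torch.softmax(logits, dim=-1)                     # next-word distributions
    print(probs.shape)                                        # torch.Size([4, 100])
    ```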

    References

    • Voita, NLP Course | For You (web course): Language Modeling
    • Jurafsky and Martin, sections 7.6, 7.7
    • Jurafsky and Martin, section 8.2

    Resources

  • Day 08

    March 19th, Wednesday (16:30-18:30)

    Transformers: short recap

    • Attention
    • Encoder
    • Decoder
    • Residual stream
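
    The attention recap above reduces to the scaled dot-product formula softmax(QK^T / sqrt(d_k)) V; a small numpy sketch with made-up dimensions follows.

    ```python
    import numpy as np

    def scaled_dot_product_attention(Q, K, V):
        """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)                 # (n_queries, n_keys) similarity scores
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
        return weights @ V                              # weighted sum of the values

    rng = np.random.default_rng(0)
    Q, K, V = rng.normal(size=(2, 4)), rng.normal(size=(5, 4)), rng.normal(size=(5, 8))
    print(scaled_dot_product_attention(Q, K, V).shape)  # (2, 8)
    ```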

    Contextualised word embeddings

    • Static embeddings vs. contextualized embeddings
    • ELMo
    • BERT: encoder-only model
    • Masked language modeling
    • Next sentence prediction

    References

    • Jurafsky and Martin, chapter 9
    • Jurafsky and Martin, sections 11.1, 11.2, 11.3
    • Voita, NLP Course | For You (web course): Language Modeling

    Resources

  • Day 09

    March 24th, Monday (16:30-18:30)

    Contextualised word embeddings

    • GPT-n decoder-only model
    • Sentence-BERT

    Large language models

    • Language modeling head
    • Text completion and decoder-only model
    • Casting NLP tasks as text completion
    • Sampling
    • Pretraining
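
    Sampling from the language modeling head, listed above, is typically controlled with a temperature and a top-k cutoff; a numpy sketch over invented logits follows.

    ```python
    import numpy as np

    def sample_next_token(logits, temperature=0.8, top_k=5, rng=np.random.default_rng()):
        """Sample a token id from logits with temperature scaling and top-k filtering."""
        logits = np.asarray(logits, dtype=float) / temperature
        top_ids = np.argsort(logits)[-top_k:]         # keep only the k highest-scoring tokens
        top_logits = logits[top_ids]
        probs = np.exp(top_logits - top_logits.max())
        probs /= probs.sum()
        return int(rng.choice(top_ids, p=probs))

    fake_logits = [1.2, 0.3, -0.5, 2.1, 0.0, 1.7]     # invented logits over a 6-word vocabulary
    print(sample_next_token(fake_logits))
    ```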

    References

    • Jurafsky and Martin, section 9.5
    • Jurafsky and Martin, chapter 10
    • Slides from the lecture
  • Day 10

    March 26th, Wednesday (16:30-18:30)

    Large language models

    • Pretraining
    • Training corpora
    • Scaling laws for LLMs
    • Overview of LLMs
    • Multi-lingual LLMs

    Exercises

    • Positive pointwise mutual information (PPMI)

    References

    • Jurafsky and Martin, chapter 10
    • Slides from the lecture
  • Lab Session I: word embeddings

    March 31st, Monday (8:30-10:30)

    Using pretrained word embeddings

    • Static word embeddings
    • Gensim and pre-trained embeddings
    • Embeddings visualization with PCA
    • Word embeddings evaluation: word similarity and word analogy benchmarks
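
    A possible sketch of the Gensim workflow outlined above; the pretrained model name is one of the sets distributed through gensim-data and is used here only as an example.

    ```python
    import gensim.downloader as api

    # Download a small set of pretrained static embeddings (example model name).
    wv = api.load("glove-wiki-gigaword-100")

    # Word similarity: nearest neighbours in the embedding space.
    print(wv.most_similar("university", topn=5))

    # Word analogy: king - man + woman ≈ queen.
    print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

    # Cosine similarity between two words.
    print(wv.similarity("cat", "dog"))
    ```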

    Exercises

    • Working with pre-trained embeddings
    • Training your own embeddings

    Resources

  • Day 11

    March 31st, Monday (16:30-18:30)

    Post-training

    • Fine-tuning
    • Instruction tuning
    • Model alignment
    • Parameter efficient fine-tuning: adapters
    • Parameter efficient fine-tuning: LoRA
    • Transfer learning
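
    The LoRA idea listed above (freeze the pretrained weight matrix W and learn a low-rank update BA) can be sketched as a wrapper around a frozen linear layer; the dimensions, rank and scaling factor are arbitrary.

    ```python
    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        """Frozen linear layer plus a trainable low-rank update (toy sketch of LoRA)."""
        def __init__(self, linear, rank=4, alpha=8):
            super().__init__()
            self.linear = linear
            for p in self.linear.parameters():
                p.requires_grad = False                      # freeze the pretrained weights
            d_out, d_in = linear.weight.shape
            self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
            self.B = nn.Parameter(torch.zeros(d_out, rank))  # BA = 0 at initialization
            self.scale = alpha / rank

        def forward(self, x):
            return self.linear(x) + self.scale * (x @ self.A.T) @ self.B.T

    layer = LoRALinear(nn.Linear(16, 16))
    print(layer(torch.randn(2, 16)).shape)  # torch.Size([2, 16])
    ```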

    References

    • Jurafsky and Martin, sections 11.4, 12.2, 12.3
    • Voita, NLP Course | For You (web course): Transfer Learning
    • Slides from the lecture
  • Day 12

    April 2nd, Wednesday (16:30-18:30)

    ChatBot

    • ChatBot
    • Datasets
    • Prompt
    • Prompt engineering
    • Retrieval-augmented generation

    References

    • Jurafsky and Martin, chapter 12 (skip sections 12.2 and 12.3, covered in the previous lecture)
    • Slides from the lecture
  • Day 13

    April 7th, Monday (16:30-18:30)

    ChatBot

    • Large reasoning models

    Part-of-speech tagging

    • Part-of-speech (PoS) and part-of-speech tagging
    • Evaluation

    Hidden Markov models

    • Definition of Hidden Markov model (HMM)
    • Probability estimation for HMM

    References

    • Jurafsky and Martin, chapter 17
    • Slides from the lecture

    Resources

  • Day 14

    April 9th, Wednesday (16:30-18:30)

    Hidden Markov models

    • HMMs as automata with output
    • Decoding via Viterbi algorithm
    • Forward algorithm
    • Trellis representation
    • Backward algorithm
    • Forward-backward algorithm: motivation
    • E-step and M-step
    • Research papers
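
    The Viterbi decoding algorithm above can be sketched over the trellis with numpy; the toy initial, transition and emission probabilities are invented for illustration.

    ```python
    import numpy as np

    def viterbi(obs, pi, A, B):
        """Most likely state sequence for an HMM (pi: initial, A: transition, B: emission)."""
        n_states, T = A.shape[0], len(obs)
        delta = np.zeros((T, n_states))           # best path probability ending in each state
        psi = np.zeros((T, n_states), dtype=int)  # backpointers
        delta[0] = pi * B[:, obs[0]]
        for t in range(1, T):
            scores = delta[t - 1][:, None] * A    # rows: previous state, columns: next state
            psi[t] = scores.argmax(axis=0)
            delta[t] = scores.max(axis=0) * B[:, obs[t]]
        # Follow the backpointers from the best final state.
        path = [int(delta[-1].argmax())]
        for t in range(T - 1, 0, -1):
            path.append(int(psi[t][path[-1]]))
        return path[::-1]

    # Toy 2-state, 3-symbol HMM with invented probabilities.
    pi = np.array([0.6, 0.4])
    A = np.array([[0.7, 0.3], [0.4, 0.6]])
    B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
    print(viterbi([0, 1, 2], pi, A, B))
    ```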

    References

    • Jurafsky and Martin, chapter 17
    • Jurafsky and Martin, appendix A (from the book web page)
  • Day 15

    April 14th, Monday (16:30-18:30)

    Conditional random fields

    • Conditional random fields (CRF) and global features
    • Linear chain CRF, local features and feature templates
    • Inference algorithm
    • Training algorithm
    • Research papers

    Neural part-of-speech tagging

    • Local search
    • Fixed-window neural model
    • Recurrent neural model
    • Recurrent bidirectional model
    • Global search
    • Learnable transition features
    • LSTM-CRF model

    References

    • Jurafsky and Martin, chapter 17
    • Eisenstein, section 7.5.3
    • Eisenstein, section 7.6.1
  • Day 16

    April 16th, Wednesday (16:30-18:30)

    Sequence labelling

    • Named entity recognition (NER)
    • BIO labeling
    • NER evaluation
    • Other sequence labelling tasks

    Phrase-structure parsing (part I)

    • Constituents and phrase structure
    • Notions of head, argument and modifier
    • Grammatical relations
    • PP-attachment and wh-movement
    • Treebanks
    • Context-free grammar (CFG)
    • Probabilistic CFG
    • Lexicalized CFG

    References

    • Jurafsky and Martin, chapter 17
    • Jurafsky and Martin, chapter 18
    • Slides from the lecture
  • Lab Session II: introduction to transformers with Hugging Face

    April 21st, off-line

    Transformers & Hugging Face

    • Hugging Face hub
    • Transformer
    • Tokenizer
    • Datasets
    • Fine-tuning a transformer model
    • Evaluation
    • Generation
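
    A possible minimal sketch of the Hugging Face pieces listed above (hub, tokenizer, model, generation); the checkpoint name gpt2 is only a small, commonly available example.

    ```python
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Load a small example checkpoint from the Hugging Face hub.
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    # Tokenize a prompt and generate a continuation.
    inputs = tokenizer("Natural language processing is", return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=20, do_sample=True, top_p=0.9)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
    ```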

    Resources

  • Day 17

    April 23rd, Wednesday (16:30-18:30)

    Dependency parsing

    • Dependency trees
    • Grammatical functions
    • Projective and non-projective dependency trees
    • Dependency treebanks
    • Transition-based dependency parsing

    Exercises

    • Viterbi algorithm

    References

    • Jurafsky and Martin, chapter 19

    Resources

  • Day 18

    April 28th, Monday (16:30-18:30)

    Dependency parsing

    • Arc-standard parser
    • Transitions definition
    • Ambiguity
    • Oracle
    • Example
    • Oracle and generation of training data
    • Feature extraction, feature functions and feature templates
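
    A toy sketch of the arc-standard transition system above, with a configuration represented as (stack, buffer, arcs); the word indices and the example derivation are invented.

    ```python
    def shift(stack, buffer, arcs):
        """Move the first word of the buffer onto the stack."""
        return stack + [buffer[0]], buffer[1:], arcs

    def left_arc(stack, buffer, arcs):
        """Make the second-topmost stack element a dependent of the topmost, and pop it."""
        head, dep = stack[-1], stack[-2]
        return stack[:-2] + [head], buffer, arcs + [(head, dep)]

    def right_arc(stack, buffer, arcs):
        """Make the topmost stack element a dependent of the second-topmost, and pop it."""
        head, dep = stack[-2], stack[-1]
        return stack[:-1], buffer, arcs + [(head, dep)]

    # 0 is the artificial root; 1, 2, 3 stand for "she", "ate", "pizza".
    config = ([0], [1, 2, 3], [])
    config = shift(*config)      # stack [0, 1]
    config = shift(*config)      # stack [0, 1, 2]
    config = left_arc(*config)   # arc 2 -> 1 ("ate" -> "she")
    config = shift(*config)      # stack [0, 2, 3]
    config = right_arc(*config)  # arc 2 -> 3 ("ate" -> "pizza")
    config = right_arc(*config)  # arc 0 -> 2 (root)
    print(config)                # ([0], [], [(2, 1), (2, 3), (0, 2)])
    ```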

    References

    • Jurafsky and Martin, chapter 19
  • Day 19

    May 5th, Monday (16:30-18:30)

    Dependency parsing

    • Feature extraction, feature functions and feature templates
    • Alternative models for dependency parsing: beam search and graph-based dependency parsing

    Neural dependency parsing

    • Case study: Kiperwasser & Goldberg 2016
    • Feature extraction using BiLSTM
    • Hinge loss function

    Evaluation for dependency parsing

    • Unlabelled attachment score (UAS)
    • Labelled attachment score (LAS)
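
    UAS and LAS, listed above, reduce to counting words with the correct head (and, for LAS, the correct label); the gold and predicted arcs below are invented.

    ```python
    def attachment_scores(gold, predicted):
        """gold, predicted: lists of (head, label) pairs, one per word. Returns (UAS, LAS)."""
        n = len(gold)
        correct_heads = sum(g[0] == p[0] for g, p in zip(gold, predicted))
        correct_labeled = sum(g == p for g, p in zip(gold, predicted))
        return correct_heads / n, correct_labeled / n

    # Invented example: 4 words with (head index, dependency label).
    gold = [(2, "nsubj"), (0, "root"), (4, "det"), (2, "obj")]
    pred = [(2, "nsubj"), (0, "root"), (2, "det"), (2, "obj")]
    print(attachment_scores(gold, pred))  # (0.75, 0.75)
    ```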

    References

    • Jurafsky and Martin, chapter 19
    • Slides from the lecture
    • Simple and Accurate Dependency Parsing Using Bidirectional LSTM Feature Representations, Kiperwasser and Goldberg, TACL, vol. 4, 2016
  • Lab Session III: natural language generation

    May 6th, Tuesday (16:30-18:30)

    Natural Language Generation

    • Bigram models
    • Large Language Models with Hugging Face
    • Customizing text generation via decoding strategies

    Exercises

    • Building a trigram model with a backoff strategy
    • Typo detection using an LLM

    Resources

  • Day 20

    May 7th, Wednesday (16:30-18:30)

    Machine translation

    • Word ordering and V,S,O language classification
    • Word translation and word alignment relation
    • Statistical machine translation (SMT)
    • Translation model + language model

    Exercises

    • Arc-standard oracle

    References

    • Jurafsky and Martin, chapter 13
    • Slides from the lecture

    Resources

  • Day 21

    May 12th, Monday (16:30-18:30)

    Neural machine translation

    • Neural machine translation (NMT) and posterior probability
    • Encoder-decoder architecture (seq2seq): general idea
    • Encoder-decoder through RNN and through transformer
    • RNN: autoregressive encoder-decoder
    • RNN: greedy inference algorithm
    • RNN: training algorithm and teacher forcing
    • RNN: attention and dynamic context vector
    • RNN: dot-product attention
    • RNN: bilinear attention
    • Transformer-based architecture
    • Cross-attention, query, key and value
    • Search tree and beam search
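
    Beam search, listed above, keeps the k best partial hypotheses at each step; in this sketch the function next_token_scores is a made-up stand-in for the decoder's next-token distribution.

    ```python
    import math

    def beam_search(next_token_scores, beam_size=3, max_len=5, eos="</s>"):
        """Keep the beam_size best partial hypotheses (sequence, log-probability)."""
        beams = [([], 0.0)]
        for _ in range(max_len):
            candidates = []
            for seq, score in beams:
                if seq and seq[-1] == eos:
                    candidates.append((seq, score))  # finished hypothesis, keep as is
                    continue
                for token, prob in next_token_scores(seq):
                    candidates.append((seq + [token], score + math.log(prob)))
            # Prune to the best beam_size hypotheses by total log-probability.
            beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        return beams

    # Hypothetical toy "decoder": the same distribution regardless of the prefix.
    def next_token_scores(prefix):
        return [("the", 0.5), ("cat", 0.3), ("</s>", 0.2)]

    for seq, score in beam_search(next_token_scores):
        print(seq, round(score, 3))
    ```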

    References

    • Jurafsky and Martin, chapter 13
  • Lab Session IV: Local running of LLMs

    May 13th, Tuesday (16:30-18:30)

    Local running of LLMs

    • Ollama
    • OpenWebUI
    • Infrastructure issues
    • Computational resources
    • Exercise: Connecting to an LLM via a REST API in Python
    • Discussion
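
    A possible sketch of the exercise above, calling a locally running Ollama server over its REST API; the model name and the prompt are examples, and the server is assumed to listen on the default port 11434.

    ```python
    import requests

    # Assumes an Ollama server is running locally and the model has already been pulled,
    # e.g. with `ollama pull llama3`.
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "llama3",
            "prompt": "Explain byte-pair encoding in one sentence.",
            "stream": False,  # return a single JSON object instead of a stream
        },
        timeout=120,
    )
    response.raise_for_status()
    print(response.json()["response"])
    ```
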
  • Day 22

    May 14th, Wednesday (16:30-18:30)

    Neural machine translation

    • Parallel corpora
    • Evaluation: BLEU and METEOR
    • NMT and leaderboard
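
    Sentence-level BLEU, listed above, can be computed with NLTK as a quick illustration; the hypothesis and reference are invented, and corpus-level BLEU is what is normally reported for MT systems.

    ```python
    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    reference = [["the", "cat", "is", "on", "the", "mat"]]  # list of reference translations
    hypothesis = ["the", "cat", "sat", "on", "the", "mat"]

    # Smoothing avoids zero scores when some n-gram orders have no matches.
    score = sentence_bleu(reference, hypothesis,
                          smoothing_function=SmoothingFunction().method1)
    print(round(score, 3))
    ```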

    Question answering

    • Question answering (QA) and factoid questions
    • Text-based QA: IR + machine reading

    Exercises

    • Spurious ambiguity

    References

    • Jurafsky and Martin, chapter 13
    • Slides from the lecture

    Resources

  • Day 23

    May 19th, Monday (16:30-18:30)

    Question answering

    • Machine reading based on contextual embeddings
    • Start and end probabilities
    • Candidate score and fine-tuning loss
    • Negative examples and sliding windows
    • Machine reading based on attention: Stanford attentive reader
    • Bilinear product attention
    • Practical issues
    • Research papers
    • Datasets and leaderboards
    • Answer sentence selection
    • Knowledge-based QA
    • Entity linking

    References

    • Eisenstein, section 17.5.2
    • Slides from the lecture

    Resources

  • Day 24

    May 26th, Wednesday (16:30-18:30)

    Semantic parsing

    • Referential meaning and general semantics
    • Lexical semantics resources: WordNet and word senses
    • Word sense disambiguation
    • Semantic roles and thematic grid
    • Lexical semantic resources: PropBank and FrameNet
    • Semantic role labeling (SRL)
    • Neural algorithm for SRL
    • Argument selection and selectional restrictions
    • Referential meaning and meaning representations
    • Abstract meaning representation formalism (AMR)
    • Semantic parsing and transition-based approaches
    • Research papers

    References

    • Jurafsky and Martin, chapter 21
    • Slides from the lecture

    Resources

  • Lab Session V: retrieval augmented generation

    June 3rd, Tuesday (16:30-18:30)

    Retrieval augmented generation (RAG)

    • Introduction to LangChain
    • Building a knowledge base with Chroma
    • Leveraging open-weight LM from Hugging Face
    • Designing prompt templates
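
    The retrieval side of the pipeline above can be sketched directly with the chromadb client; the lab uses LangChain on top of Chroma, so this stripped-down version, with invented documents, only illustrates the indexing and querying pattern.

    ```python
    import chromadb

    # In-memory Chroma collection holding a tiny, invented knowledge base.
    client = chromadb.Client()
    collection = client.create_collection("course_notes")
    collection.add(
        documents=["The exam consists of a written test and a project discussion.",
                   "Office hours are on Wednesday, by email appointment."],
        ids=["doc1", "doc2"],
    )

    # Retrieve the passage most similar to the user question.
    question = "When are office hours?"
    results = collection.query(query_texts=[question], n_results=1)
    context = results["documents"][0][0]

    # A RAG prompt simply prepends the retrieved context to the question.
    prompt = f"Answer using only the context below.\n\nContext: {context}\n\nQuestion: {question}"
    print(prompt)
    ```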

    Exercises

    • Developing a RAG application using domain-specific knowledge that has emerged after the training cutoff of the selected LMs

    Resources

  • Course syllabus

    The course syllabus is based on

    • the adopted textbook: Speech and Language Processing (3rd Edition, draft, January 12, 2025) by Dan Jurafsky and James H. Martin, available on the web
    • the auxiliary textbook: Introduction to Natural Language Processing by Jacob Eisenstein, October 2019, MIT Press, preprint version available on the web
    • the online course NLP Course | For You by Elena Voita, University of Edinburgh
    • the lecture slides, available on the course website
  • Final Exams