INQ0091105 - NATURAL LANGUAGE PROCESSING 2023-2024
Topic outline
-
Content: A detailed description of the course content and prerequisites can be found here.
Textbook: The adopted textbook is Speech and Language Processing (3rd Edition, draft, January 7th, 2023) by Dan Jurafsky and James H. Martin, available here.
Additional resources: The following textbook can be used for consultation only: Introduction to Natural Language Processing by Jacob Eisenstein, October 2019, MIT Press, preprint version available here. The course also uses an electronic forum for discussion of technical matters and administrative information. You can also access video recordings of the lectures from academic year 2021/22 at this link.
Logistics: Lectures are on Wednesday 10:30-12:30 (room Me) and on Friday 10:30-12:30 (room De).
Office hours: Thursday 12:30-14:30, email appointment required. Meetings can be face-to-face or online at this Zoom link.
-
Forum for general news and announcements. Only the lecturer can post in this forum. Subscription to this forum is automatic for every student who has registered for this course.
-
Forum for discussion of technical matters presented during the lectures. Any student with a unipd account can post in this forum. Subscription to this forum is automatic for every student who has registered for this course.
-
Forum for discussion of technical matters presented during the open laboratory sessions. Any student with a unipd account can post in this forum. Subscription to this forum is automatic for every student who has registered for this course.
-
Forum for project discussion. Any student with a unipd account can post in this forum. Subscription to this forum is automatic for every student who has registered for this course.
-
-
February 28th, Wednesday (10:30-12:30)
Course administration and presentation
- Content outline
- Laboratory sessions
- Course requirements
- Textbook
- Project
- Coursework
- Statistics
- Lecturer evaluation
Natural language processing: An unexpected journey
- What is natural language processing?
- A few case studies: finance, social networks, health
- Very short history of natural language processing
- Why is natural language processing tricky?
- Ambiguity, composition, recursion and hidden structure
- How does natural language processing work?
- Learning & knowledge
- Search & learning
- Market, environment and ethics
References
- Slides from the lecture
- Eisenstein, chapter 1 for learning & knowledge and for search & learning
Resources
-
March 1st, Friday (10:30-12:30)
Essentials of linguistics
- What is linguistics?
- Phonology
- Morphology
- Part of speech
- Syntax: phrase structure and dependency structure
- Lexical semantics and general semantics
- Pragmatics and discourse
Text normalization
- Word types and word tokens
- Corpora
- Language identification and spell checking
- Text normalization: contraction, punctuation and special characters
References
- Slides from the lecture
- Jurafsky and Martin, chapter 2
Resources
-
March 6th, Wednesday (10:30-12:30)
Text normalization
- Word tokenization, character tokenization, and subword tokenization
- Byte-pair encoding algorithm (see the sketch after this list)
- Sentence segmentation and case folding
- Stop words, stemming and lemmatization
- Research papers
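For reference, a minimal Python sketch of the byte-pair encoding merge loop is given below. The toy vocabulary, the end-of-word marker and the function names are illustrative only and are not taken from the lecture or the textbook.

```python
from collections import Counter

# Toy corpus: word -> frequency, each word pre-split into characters
# plus an end-of-word marker "_" (illustrative data).
vocab = {("l", "o", "w", "_"): 5, ("l", "o", "w", "e", "r", "_"): 2,
         ("n", "e", "w", "e", "s", "t", "_"): 6, ("w", "i", "d", "e", "s", "t", "_"): 3}

def merge_step(vocab):
    """Find the most frequent adjacent symbol pair and merge it everywhere."""
    pairs = Counter()
    for word, freq in vocab.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    if not pairs:
        return vocab, None
    best = max(pairs, key=pairs.get)
    merged = {}
    for word, freq in vocab.items():
        out, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = freq
    return merged, best

for _ in range(5):                      # learn 5 merges
    vocab, pair = merge_step(vocab)
    print("merged:", pair)
```

Each learned merge becomes a rule of the subword tokenizer; at inference time the rules are applied in the order they were learned.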
Words and meaning
- Lexical semantics
- Distributional semantics
- Review: vectors
- Term-context matrix
References
- Jurafsky and Martin, chapter 2
- Jurafsky and Martin, chapter 6
- Eisenstein, section 14.3
Resources
-
March 8th, Friday (10:30-12:30)
Words and meaning
- Pointwise mutual information (see the worked sketch after this list)
- Probability estimation
- Examples
- Practical issues
- Truncated singular value decomposition
- Neural word embeddings
- Word2vec and skip-gram
- Logistic regression
- Training
- Practical issues
- FastText and GloVe
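Below is a minimal NumPy sketch of positive pointwise mutual information computed from a toy term-context count matrix. The words and counts are illustrative and not taken from the lecture.

```python
import numpy as np

# Toy term-context count matrix (rows = target words, columns = context words).
words    = ["cherry", "digital", "information"]
contexts = ["pie", "computer", "data"]
C = np.array([[4.0, 0.0, 0.0],
              [0.0, 5.0, 3.0],
              [0.0, 2.0, 6.0]])

total = C.sum()
p_wc = C / total                            # joint probabilities P(w, c)
p_w  = p_wc.sum(axis=1, keepdims=True)      # marginal P(w)
p_c  = p_wc.sum(axis=0, keepdims=True)      # marginal P(c)

with np.errstate(divide="ignore"):
    pmi = np.log2(p_wc / (p_w * p_c))
ppmi = np.where(np.isfinite(pmi) & (pmi > 0), pmi, 0.0)  # clip negatives and -inf to 0

print(np.round(ppmi, 2))
```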
References
- Jurafsky and Martin, chapter 6
- Voita, NLP Course | For You (web course): Word embeddings
-
March 13th, Wednesday (10:30-12:30)
Words and meaning
- Semantic properties of neural word embeddings
- Evaluation
- Cross-lingual word embeddings
- Research papers
Language models
- Language modeling: prediction and generation
- Language modeling applications
- Relative frequency estimation
- N-gram model
- N-gram probabilities and bias-variance trade-off
- Practical issues
- Evaluation: perplexity measure
- Sampling sentences
- Smoothing: Laplace and add-k smoothing (see the sketch after this list)
- Stupid backoff and linear interpolation
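As a companion to the smoothing and perplexity topics, here is a minimal sketch of an add-k smoothed bigram model with per-token perplexity. The toy corpus and the value of k are illustrative only.

```python
import math
from collections import Counter

# Toy corpus of tokenized sentences with sentence boundary markers (illustrative data).
corpus = [["<s>", "i", "like", "nlp", "</s>"],
          ["<s>", "i", "like", "pizza", "</s>"],
          ["<s>", "nlp", "is", "fun", "</s>"]]

k = 0.5
unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    unigrams.update(sent[:-1])                 # history counts
    bigrams.update(zip(sent, sent[1:]))        # (history, word) counts
V = len({w for sent in corpus for w in sent})  # vocabulary size

def prob(w, h):
    """Add-k smoothed bigram probability P(w | h)."""
    return (bigrams[(h, w)] + k) / (unigrams[h] + k * V)

def perplexity(sent):
    """Per-token perplexity of a sentence under the bigram model."""
    logp = sum(math.log2(prob(w, h)) for h, w in zip(sent, sent[1:]))
    return 2 ** (-logp / (len(sent) - 1))

print(perplexity(["<s>", "i", "like", "nlp", "</s>"]))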
References
- Jurafsky and Martin, chapter 6
- Jurafsky and Martin, chapter 3
Resources
-
March 15th, Friday (10:30-12:30)
Language models
- Out-of-vocabulary words
- Limitations of N-gram model
- Research papers
Neural language models (NLM)
- General architecture for NLM
- Feedforward NLM: inference
- Feedforward NLM: training
- Recurrent NLM: inference
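A compact PyTorch sketch of a recurrent neural language model (embedding layer, RNN, output projection) is given below; layer sizes, names and the random input are illustrative only, not the model discussed in class.

```python
import torch
import torch.nn as nn

# Minimal recurrent neural language model (a sketch; sizes and names are illustrative).
class RNNLM(nn.Module):
    def __init__(self, vocab_size, emb_dim=32, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)    # token ids -> vectors
        self.rnn = nn.RNN(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)      # hidden state -> next-token logits

    def forward(self, token_ids):
        h, _ = self.rnn(self.embed(token_ids))
        return self.out(h)                                # logits at every position

vocab_size = 100
model = RNNLM(vocab_size)
x = torch.randint(0, vocab_size, (2, 7))                  # batch of 2 sequences, length 7
logits = model(x[:, :-1])                                 # predict token t+1 from tokens up to t
loss = nn.functional.cross_entropy(logits.reshape(-1, vocab_size), x[:, 1:].reshape(-1))
loss.backward()                                           # an optimizer step would follow during training
print(loss.item())
```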
Exercises
- Subword tokenization: BPE algorithm
References
- Jurafsky and Martin, chapter 3
- Voita, NLP Course | For You (web course): Language Modeling
- Jurafsky and Martin, section 7.5
- Jurafsky and Martin, section 7.7
- Jurafsky and Martin, section 9.2
-
March 20th, Wednesday (10:30-12:30)
Neural language models (NLM)
- Recurrent NLM: inference (continued)
- Recurrent NLM: training
- Practical issues: parameter freezing, weight tying, softmax temperature
Contextualised word embeddings
- Transformers: short recap
- Attention
- Static embeddings vs. contextualized embeddings
- ELMo
References
- Jurafsky and Martin, section 9.2
- Jurafsky and Martin, chapter 11
- Voita, NLP Course | For You (web course): Language Modeling
- Slides from lecture
Resources
-
March 22nd, Friday (10:30-12:30)
Large language models
- BERT: masked language modeling and next sentence prediction (see the sketch after this list)
- Other models
- The GPT-n family of large language models
- Other large language models
- Multi-lingual large language models
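To see BERT's masked-language-modeling objective in action, the Hugging Face fill-mask pipeline can be used as below; the checkpoint name and the example sentence are just illustrative choices.

```python
from transformers import pipeline

# Predict the masked token with a pretrained BERT model (illustrative example).
fill = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill("The capital of France is [MASK]."):
    print(round(pred["score"], 3), pred["token_str"])
```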
References
- Jurafsky and Martin, chapter 11
- Voita, NLP Course | For You (web course): Transfer Learning
- Slides from lecture
-
March 22nd, Friday (16:30-18:30)
Using pretrained word embeddings
- Introduction to the gensim library
- Common operations with word embeddings: lookup, similarity, NN retrieval
- Visualizing word embeddings: dimensionality reduction with PCA
- Intrinsic evaluation of word embeddings: word similarity and word analogy benchmarks
Pretraining word embeddings
- Using gensim to pretrain word embeddings (Word2Vec style)
- Saving and loading embeddings
Extrinsic evaluation of word embeddings
- Using word2vec representations for spam classification
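A condensed sketch of the gensim operations covered in this laboratory session follows; the pretrained model name, the toy corpus and the file name are assumptions for illustration.

```python
import gensim.downloader as api
from gensim.models import Word2Vec

# Pretrained embeddings from the gensim download hub (model name is illustrative).
wv = api.load("glove-wiki-gigaword-50")
print(wv.similarity("cat", "dog"))                   # cosine similarity between two words
print(wv.most_similar("france", topn=3))             # nearest-neighbour retrieval
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))  # analogy

# Pretraining Word2Vec-style embeddings on your own (toy) corpus.
sentences = [["natural", "language", "processing"],
             ["language", "models", "predict", "words"]]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=20)
model.wv.save("my_embeddings.kv")                    # save the keyed vectors for later reuse
```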
Resources
-
March 27th, Wednesday (10:30-12:30)
Large language models
- Multi-lingual large language models (continued)
- Sentence BERT
- Miscellanea: emergent abilities, hallucinations, mixture of experts
- Research papers
Fine-tuning
- Adaptation: feature extraction vs. fine-tuning; catastrophic forgetting
- Adapters
- LoRA (see the sketch after this list)
- Transfer learning
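A hedged sketch of parameter-efficient fine-tuning with LoRA via the peft library is shown below; the checkpoint name, rank and other hyperparameters are assumptions for illustration, not values from the lecture.

```python
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, get_peft_model

# Wrap a pretrained model so that only the low-rank LoRA matrices are trainable.
base = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
config = LoraConfig(task_type="SEQ_CLS", r=8, lora_alpha=16, lora_dropout=0.1)
model = get_peft_model(base, config)
model.print_trainable_parameters()   # reports the small fraction of trainable weights
```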
Exercises
- Positive pointwise mutual information (PPMI)
References
- Jurafsky and Martin, chapter 11
- Voita, NLP Course | For You (web course): Transfer Learning
- Slides from lecture
-
April 3rd, Wednesday (10:30-12:30)
Fine-tuning
- Prompt learning
- Retrieval augmented generation
- Large language models and ethics
- Research papers
Chatbots
- Supervised fine-tuning
- Reward modeling from human feedback
- Reinforcement learning training
References
- Jurafsky and Martin, section 10.10
- Slides from lecture
Resources
-
Slides: Training pipeline of GPT assistants like ChatGPT by Andrej Karpathy, 2023. First part only: stop at slide #30.
-
External video: Training pipeline of GPT assistants like ChatGPT by Andrej Karpathy, 2023. First part only: stop at timestamp 20:17.
-
April 5th, Friday (10:30-12:30)
Part-of-speech tagging
- Part-of-speech (PoS) and part-of-speech tagging
- Evaluation
Hidden Markov models
- Definition of Hidden Markov model (HMM)
- Probability estimation for HMM
- HMMs as automata with output
- Decoding via Viterbi algorithm (see the sketch after this list)
- Forward algorithm
- Trellis representation
- Backward algorithm
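For reference, a minimal NumPy sketch of Viterbi decoding for an HMM tagger is given below; the tag set, transition table and emission probabilities are toy values chosen for illustration.

```python
import numpy as np

states = ["DET", "NOUN", "VERB"]
pi = np.array([0.6, 0.3, 0.1])                   # initial state probabilities
A  = np.array([[0.1, 0.8, 0.1],                  # A[i, j] = P(state j | state i)
               [0.2, 0.3, 0.5],
               [0.5, 0.4, 0.1]])
# B[i, t] = P(word at position t | state i), already looked up for a 3-word toy sentence
B  = np.array([[0.7, 0.1, 0.0],
               [0.2, 0.6, 0.3],
               [0.1, 0.3, 0.7]])

T, N = B.shape[1], len(states)
delta = np.zeros((N, T))                          # best path probabilities
back  = np.zeros((N, T), dtype=int)               # backpointers
delta[:, 0] = pi * B[:, 0]
for t in range(1, T):
    for j in range(N):
        scores = delta[:, t - 1] * A[:, j] * B[j, t]
        back[j, t] = np.argmax(scores)
        delta[j, t] = np.max(scores)

path = [int(np.argmax(delta[:, T - 1]))]          # best final state, then follow backpointers
for t in range(T - 1, 0, -1):
    path.append(int(back[path[-1], t]))
print([states[s] for s in reversed(path)])
```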
References
- Jurafsky and Martin, chapter 8
- Slides from the lecture
Resources
-
April 10th, Wednesday (10:30-12:30)
Hidden Markov models
- Forward-backward algorithm: motivation
- E-step and M-step
- Research papers
Conditional random fields
- Conditional random fields (CRF) and global features
- Linear chain CRF, local features and feature templates
- Inference algorithm
- Training algorithm
- Research papers
References
- Jurafsky and Martin, chapter 8
- Jurafsky and Martin, appendix A
- Eisenstein, section 7.5.3
-
April 12th, Friday (10:30-12:30)
Neural part-of-speech tagging
- Local search
- Fixed-window neural model
- Recurrent neural model
- Recurrent bidirectional model
- Global search
- Learnable transition features
- LSTM-CRF model
Sequence labelling
- Named entity recognition (NER)
- BIO labeling
- NER evaluation
- Other sequence labelling tasks
References
- Jurafsky and Martin, chapter 8
- Eisenstein, section 7.6.1
-
April 17th, Wednesday (16:30-18:30)
Dependency parsing
- Dependency trees
- Grammatical functions
- Projective and non-projective dependency trees
- Dependency treebanks
- Transition-based dependency parsing
Exercises
- N-gram model and add-\(k\) smoothing
References
- Jurafsky and Martin, chapter 18
Resources
-
April 19th, Friday (10:30-12:30)
Dependency parsing
- Arc-standard parser
- Transitions definition
- Ambiguity
- Oracle
- Example
Exercises
- Part-of-speech tagging
- HMM supervised estimation
References
- Jurafsky and Martin, chapter 18
-
April 19th, Friday (16:30-18:30)
Transformers & Huggingface
- Huggingface hub
- Transformer
- Tokenizer
- Datasets
- Fine-tuning a transformer model
- Evaluation
- Generation
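A condensed sketch of the laboratory pipeline (dataset from the Hub, tokenization, fine-tuning with the Trainer API) is given below; the dataset, checkpoint and hyperparameters are illustrative choices, not the exact ones used in the session.

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

dataset = load_dataset("imdb")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

encoded = dataset.map(tokenize, batched=True)
args = TrainingArguments(output_dir="out", per_device_train_batch_size=8,
                         num_train_epochs=1, evaluation_strategy="epoch")
trainer = Trainer(model=model, args=args,
                  train_dataset=encoded["train"].shuffle(seed=0).select(range(2000)),
                  eval_dataset=encoded["test"].select(range(500)))
trainer.train()
```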
Resources
-
April 24th, Wednesday (10:30-12:30)
Dependency parsing
- Arc-standard parser
- Oracle and generation of training data
- Feature extraction, feature functions and feature templates
- Alternative models for dependency parsing: beam search and graph-based dependency parsing
Neural dependency parsing
- Case study: Kiperwasser & Goldberg 2016
- Feature extraction using BiLSTM
- Hinge loss function
References
- Jurafsky and Martin, chapter 18
- Slides from the lecture
- Simple and Accurate Dependency Parsing Using Bidirectional LSTM Feature Representations, Kiperwasser and Goldberg, TACL, vol. 4, 2016
-
April 29th, Monday
Introduction to LangChain Library
- Model I/O
- Data connection
- Chains
- Agents
- Memory
- Callbacks
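A minimal prompt-to-model chain is sketched below, assuming a LangChain 0.1.x-era package layout (langchain plus langchain-openai) and an OpenAI API key; class names and imports may differ in other LangChain versions, so treat this as an assumption rather than the exact lab code.

```python
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain_openai import ChatOpenAI   # requires the langchain-openai package and an API key

# Build a simple chain: prompt template -> chat model.
prompt = PromptTemplate.from_template("Translate the following sentence into Italian:\n{sentence}")
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
chain = LLMChain(llm=llm, prompt=prompt)
print(chain.invoke({"sentence": "Natural language processing is fun."}))
```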
Resources
-
May 3rd, Friday (10:30-12:30)
Dependency parsing
- Alternative models for dependency parsing: graph-based dependency parsing
- Evaluation: UAS and LAS
Exercises
- Viterbi algorithm
References
- Jurafsky and Martin, chapter 18
- Slides from the lecture
-
May 8th, Wednesday (10:30-12:30)
Semantic parsing
- Referential meaning and general semantics
- Lexical semantics resources: WordNet and word senses
- Word sense disambiguation
- Semantic roles and thematic grid
- Lexical semantic resources: PropBank and FrameNet
- Semantic role labeling (SRL)
- Neural algorithm for SRL
- Argument selection and selectional restrictions
- Referential meaning and meaning representations
- Abstract meaning representation formalism (AMR)
- Semantic parsing and transition-based approaches
- Research papers
References
- Jurafsky and Martin, chapter 20
- Slides from the lecture
Resources
-
May 10th, Friday (10:30-12:30)
Machine translation
- Word ordering and V,S,O language classification
- Word translation and word alignment relation
- Statistical machine translation (SMT)
- Translation model + language model
- Neural machine translation (NMT): general idea
- Encoder-decoder architecture (seq2seq): general idea
Exercises
- Arc-standard oracle
References
- Jurafsky and Martin, chapter 13
- Slides from the lecture
Resources
-
May 10th, Friday
Arc-standard model
- Implement an arc-standard parser
- Implement an associated oracle
- Train a neural model
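To get started, a compact sketch of the arc-standard transition system (SHIFT, LEFT-ARC, RIGHT-ARC) on word indices is shown below; the toy sentence and the fixed transition sequence, which here plays the role of the oracle, are illustrative only.

```python
# Parser state: a stack, a buffer, and a set of (head, dependent) arcs.
def shift(stack, buffer, arcs):
    stack.append(buffer.pop(0))

def left_arc(stack, buffer, arcs):
    dep = stack.pop(-2)                 # second-topmost stack word becomes a dependent
    arcs.append((stack[-1], dep))

def right_arc(stack, buffer, arcs):
    dep = stack.pop(-1)                 # topmost stack word becomes a dependent
    arcs.append((stack[-1], dep))

# Toy run on "ROOT she saw dogs" (indices 0..3, with 0 the artificial ROOT).
stack, buffer, arcs = [0], [1, 2, 3], []
for action in [shift, shift, left_arc, shift, right_arc, right_arc]:
    action(stack, buffer, arcs)
print(arcs)   # [(2, 1), (2, 3), (0, 2)]: "saw" heads "she" and "dogs", ROOT heads "saw"
```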
Resources
-
May 15th, Wednesday (10:30-12:30)
Machine translation
- RNN: autoregressive encoder-decoder
- RNN: greedy inference algorithm
- RNN: training algorithm and teacher forcing
- RNN: attention and dynamic context vector
- RNN: dot-product attention
- RNN: bilinear attention
- Transformer-based architecture
- Cross-attention, query, key and value (see the sketch after this list)
- Search tree and beam search
- Evaluation: BLEU and METEOR
- NMT and leaderboard
- Parallel corpora
- Research papers
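A small NumPy sketch of scaled dot-product cross-attention (decoder queries attending over encoder keys and values) follows; the matrix sizes and random values are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
Q = rng.normal(size=(3, d))           # 3 decoder positions (queries)
K = rng.normal(size=(5, d))           # 5 encoder positions (keys)
V = rng.normal(size=(5, d))           # values, one per encoder position

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

scores = Q @ K.T / np.sqrt(d)         # similarity of each query with each key
weights = softmax(scores, axis=-1)    # attention distribution over encoder positions
context = weights @ V                 # dynamic context vector for each decoder position
print(weights.shape, context.shape)   # (3, 5) (3, 8)
```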
References
- Jurafsky and Martin, chapter 13
- Jurafsky and Martin, chapter 9
- Jurafsky and Martin, chapter 10
-
May 17th, Friday (10:30-12:30)
Question answering
- Question answering (QA) and factoid questions
- Text-based QA: IR + machine reading
- Machine reading based on contextual embeddings (see the sketch after this list)
- Start and end probabilities
- Candidate score and fine-tuning loss
- Negative examples and sliding windows
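Extractive question answering of this kind can be tried directly with the Hugging Face question-answering pipeline, which returns the highest-scoring answer span together with its start and end offsets; the checkpoint name and the question/context pair below are illustrative.

```python
from transformers import pipeline

qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")
result = qa(question="Where is the University of Padua located?",
            context="The University of Padua is located in Padua, Italy, and was founded in 1222.")
print(result["answer"], result["score"], result["start"], result["end"])
```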
Exercises
- Spurious ambiguity
References
- Jurafsky and Martin, chapter 14
Resources
-
May 17th, Friday
Summarization
- T5 model
- Dataset
- Data collator and training
- Evaluation: ROUGE measure
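A condensed sketch of the session is given below: summarization with a pretrained T5 checkpoint and ROUGE scoring via the evaluate library; the checkpoint, the input text and the reference summary are illustrative only.

```python
from transformers import pipeline
import evaluate

summarizer = pipeline("summarization", model="t5-small")
article = ("The course covers word embeddings, language models, part-of-speech tagging, "
           "dependency parsing, machine translation, question answering and dialogue systems.")
summary = summarizer(article, max_length=30, min_length=5)[0]["summary_text"]

rouge = evaluate.load("rouge")
scores = rouge.compute(predictions=[summary],
                       references=["The course covers core NLP tasks from embeddings to dialogue."])
print(summary)
print(scores["rouge1"], scores["rougeL"])
```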
Resources
-
May 22nd, Wednesday (10:30-12:30)
Question answering
- Machine reading based on attention: Stanford attentive reader
- Bilinear product attention
- Practical issues
- Retrieval augmented generation
- Research papers
- Datasets and leaderboards
- Answer sentence selection
- Knowledge-based QA
- Entity linking
Dialogue
- Human conversation and turns
- Dialogue and speech acts
- Grounding
- Dialogue structure, adjacency pairs, and sub-dialogue
- Inference
Chatbots
- Rule-based systems
- Corpus-based systems
- Response by retrieval
- Response by generation
- Hybrid systems
- Research papers
Virtual assistants: frame-based
- Frame-based dialogue systems
- Slot/value pairs and question templates
- Domain classification, intent determination, and slot filling
References
- Eisenstein, section 17.5.2
- Jurafsky and Martin, chapter 14
- Slides from the lecture
- Jurafsky and Martin, chapter 15
Resources
-
May 24th, Friday (10:30-12:30)
Exercises
- Word embeddings: parallelogram model
- Dependency tree and ambiguity
Virtual assistants: dialogue-state
- General architecture
- Dialogue acts
- Natural language understanding
- Dialogue state tracker
- Dialogue policy
- Natural language generation
Dialogue systems
- Evaluation
- Ethical issues
References
- Jurafsky and Martin, chapter 15
Resources
-
May 29th, Wednesday (10:30-12:30)
Discussion & conclusions
- NLP timeline
- Open problems
- Explainability
- Grounding
- Theory vs. invention
- Ethics
- The NLP Hype
- NLU Datasets
- Generative AI project lifecycle
- FLAN Datasets
- ChatBot Arena
Wrap up
- Overview of past years final exams
- Overview of course syllabus
References
- Slides from the lecture
Resources
-
The course syllabus is based on
- the adopted textbook: 'Speech and Language Processing' by Dan Jurafsky and James H. Martin, 3rd Edition, draft from January 7th, 2023, available on the web
- the auxiliary textbook 'Introduction to Natural Language Processing' by Jacob Eisenstein, October 2019, MIT Press, preprint version available on the web
- the online course 'NLP Course | For You' by Elena Voita, University of Edinburgh
- lecture slides, available on the course website
-
Instructions for project registration, preparation and submission have already been posted in the Project forum.
In addition, I intend to run a plagiarism-detection tool on your software. To do this, please extract all of your code from the notebook for part II of your project and submit the resulting .py file to the assignment activity below associated with the date of your final exam. Only one student per group should make the .py submission.
-
Opened: Saturday, 1 June 2024, 12:00 AM
Due: Friday, 21 June 2024, 11:59 PM
Upload here a code only version of part II of your project.
-
Opened: Monday, 1 July 2024, 12:00 AM
Due: Sunday, 14 July 2024, 11:59 PM
Upload here a code only version of part II of your project.
-
Opened: Tuesday, 3 September 2024, 12:00 AM
Due: Monday, 9 September 2024, 11:59 PM
-
-
In this box I will post the text of final exams from past sessions/years.