NATURAL LANGUAGE PROCESSING 2025-2026 - INQ0091105
Section outline
-
Content: A detailed description of the course content and prerequisites can be found here.
Textbook: The course textbook is Speech and Language Processing (3rd Edition, draft, Jan 6th 2026) by Dan Jurafsky and James H. Martin, available here.
Logistics: Lectures are on Thursday 10:30-12:30 (room Ce) and on Friday 10:30-12:30 (room Ce).
Office hours: Thursday 12:30-14:30, by email appointment. Meetings can be face-to-face or online at this Zoom link.
-
Forum for general news and announcements. Only the lecturer can post in this forum. Subscription to this forum is automatic for every student who has registered for this course.
-
Forum for discussion of the technical material presented during the lectures. Any student with a unipd account can post in this forum. Subscription to this forum is automatic for every student who has registered for this course.
-
Forum for discussion of the technical material presented during the open laboratory sessions. Any student with a unipd account can post in this forum. Subscription to this forum is automatic for every student who has registered for this course.
-
Forum for project discussion. Any student with a unipd account can post in this forum. Subscription to this forum is automatic for every student who has registered for this course.
-
-
March 5th, Thursday (10:30-12:30)
Course administration and presentation
- Content outline
- Laboratory sessions
- Course requirements
- Textbook
- Project
- Coursework
Natural language processing: An unexpected journey
- What is natural language processing?
- Very short history of natural language processing
- Why is natural language processing tricky?
- Ambiguity, composition, recursion and hidden structure
- Language & learning
- Miscellanea
References
- Slides from the lecture
Resources
-
March 6th, Friday (10:30-12:30)
Text normalization
- Word types and word tokens
- Herdan's law and Zipf's law
- Morphology and word forms
- Corpora
- Language identification and spell checking
- Text normalization: contractions, punctuation and special characters
- Word tokenization, character tokenization, and subword tokenization
- Byte-pair encoding algorithm: learner, encoder and decoder
References
- Jurafsky and Martin, chapter 2
Resources
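As a companion to the byte-pair encoding topic above, here is a minimal sketch of the BPE learner in Python. The toy word counts and the `_` end-of-word marker are illustrative choices, not taken from the lecture:

```python
from collections import Counter

def bpe_learn(words, num_merges):
    """Learn a list of BPE merges from a word-frequency dict (toy sketch)."""
    # Represent each word as a tuple of symbols, with an end-of-word marker.
    vocab = Counter({tuple(w) + ("_",): c for w, c in words.items()})
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        # Rewrite the vocabulary, merging every occurrence of that pair.
        new_vocab = Counter()
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

merges = bpe_learn({"low": 5, "lowest": 2, "newer": 6, "wider": 3}, num_merges=4)
print(merges)
```

The encoder would then apply the learned merges, in order, to the symbols of any new word; the decoder simply concatenates subwords and strips the end-of-word marker.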
-
March 12th, Thursday (10:30-12:30)
Text normalization
- Byte-pair encoding algorithm: learner, encoder and decoder (cont'd)
- Sentence segmentation and case folding
- Stop words, stemming and lemmatization
Words and meaning
- Lexical semantics
- Distributional semantics
- Count-based embeddings
- Word2vec and skip-gram
- Logistic regression
- Training
References
- Jurafsky and Martin, chapter 2
- Jurafsky and Martin, chapter 5
Resources
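Skip-gram with negative sampling reduces embedding training to binary logistic regression: a (target, context) pair is scored with the sigmoid of a dot product, true pairs get label 1 and sampled pairs label 0. A minimal NumPy sketch; the corpus, hyperparameters, and the uniform negative sampling (rather than the unigram^0.75 distribution) are simplifications for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V, d, window, lr = len(vocab), 8, 2, 0.05

W = rng.normal(scale=0.1, size=(V, d))  # target-word embeddings
C = rng.normal(scale=0.1, size=(V, d))  # context-word embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for epoch in range(100):
    for t, w in enumerate(corpus):
        tgt = idx[w]
        for j in range(max(0, t - window), min(len(corpus), t + window + 1)):
            if j == t:
                continue
            pos = idx[corpus[j]]
            negs = rng.integers(0, V, size=2)  # crude uniform negative samples
            for ctx, label in [(pos, 1.0)] + [(int(n), 0.0) for n in negs]:
                score = sigmoid(W[tgt] @ C[ctx])
                g = score - label              # gradient of the logistic loss
                grad_w = g * C[ctx]            # compute both grads first,
                C[ctx] -= lr * g * W[tgt]      # then apply the updates
                W[tgt] -= lr * grad_w
```

After training, either W alone or W + C is typically used as the embedding table, and nearby rows (by cosine similarity) correspond to distributionally similar words.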
-
March 13th, Friday (10:30-12:30)
Words and meaning
- Training (cont'd)
- Practical issues
- FastText and GloVe
- Semantic properties of neural word embeddings
- Evaluation
- Cross-lingual word embeddings
References
- Jurafsky and Martin, chapter 5
- Voita, NLP Course | For You (web course): Word embeddings
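One common intrinsic evaluation mentioned above is the analogy task (a : b :: c : ?), solved by nearest-neighbor search around the vector b - a + c. A sketch with hand-made toy vectors (the embedding values are invented for illustration, not from any trained model):

```python
import numpy as np

# Toy embeddings (illustrative values only).
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.5, 0.9, 0.0]),
    "woman": np.array([0.5, 0.1, 0.9]),
    "apple": np.array([0.1, 0.5, 0.2]),
}

def cos(u, v):
    """Cosine similarity between two vectors."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def analogy(a, b, c):
    """Solve a : b :: c : ? by maximizing cosine to (b - a + c)."""
    target = emb[b] - emb[a] + emb[c]
    return max((w for w in emb if w not in (a, b, c)),
               key=lambda w: cos(emb[w], target))

print(analogy("man", "king", "woman"))  # "queen" with these toy vectors
```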
-
March 19th, Thursday (10:30-12:30)
Statistical language models
- Language modeling: prediction and generation
- Language modeling applications
- Relative frequency estimation
- N-gram model
- N-gram probabilities and bias-variance trade-off
- Practical issues
- Evaluation: perplexity measure
- Sampling sentences
- Smoothing: Laplace and add-k smoothing
- Stupid backoff and linear interpolation
References
- Jurafsky and Martin, chapter 3
Resources
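The pieces above (relative frequency estimation, add-one smoothing, perplexity) fit together in a few lines. A bigram sketch on a toy corpus; treating the token stream as one sequence, so bigrams across the `</s> <s>` boundary are counted too, is a simplification:

```python
import math
from collections import Counter

train = "<s> I am Sam </s> <s> Sam I am </s> <s> I do not like green eggs </s>".split()
V = len(set(train))                       # vocabulary size
unigrams = Counter(train)
bigrams = Counter(zip(train, train[1:]))

def p_laplace(w_prev, w):
    # Add-one (Laplace) smoothed bigram probability.
    return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + V)

def perplexity(tokens):
    # PP(W) = exp(-1/N * sum_i log P(w_i | w_{i-1}))
    logp = sum(math.log(p_laplace(a, b)) for a, b in zip(tokens, tokens[1:]))
    return math.exp(-logp / (len(tokens) - 1))

print(perplexity("<s> I am Sam </s>".split()))
```

Replacing the +1 with +k gives add-k smoothing; dropping smoothing entirely makes the perplexity infinite as soon as a test bigram is unseen, which is exactly why smoothing is needed.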
-
March 20th, Friday (10:30-12:30)
Statistical language models
- Out-of-vocabulary words
- Limitations of N-gram model
- Research papers
Neural language models (NLM)
- General architecture for NLM
- Feedforward NLM: inference
- Feedforward NLM: training
- Recurrent NLM: inference
Exercises
- Subword tokenization: BPE algorithm
References
- Jurafsky and Martin, chapter 3
- Jurafsky and Martin, section 6.5
- Voita, NLP Course | For You (web course): Language Modeling
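The feedforward NLM inference step covered above (embed the previous words, concatenate, hidden layer, softmax over the vocabulary) can be sketched as below. Weights are random and untrained, and all sizes are toy values chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
V, d, h, n_context = 10, 4, 8, 2  # toy sizes: vocab, emb dim, hidden, context

E = rng.normal(size=(V, d))              # embedding matrix
W = rng.normal(size=(h, n_context * d))  # hidden-layer weights
b = np.zeros(h)
U = rng.normal(size=(V, h))              # output-layer weights
c = np.zeros(V)

def softmax(z):
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def nlm_predict(context_ids):
    """P(next word | previous n_context words), one forward pass."""
    x = np.concatenate([E[i] for i in context_ids])  # concatenated embeddings
    hid = np.tanh(W @ x + b)                         # hidden layer
    return softmax(U @ hid + c)                      # distribution over vocab

p = nlm_predict([3, 7])
print(p.argmax(), p.sum())
```

Training (next lecture's topic for the recurrent case) would backpropagate the cross-entropy loss of the observed next word through U, W, and E.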