Section: Day 02 | NATURAL LANGUAGE PROCESSING 2025-2026 - INQ0091105 | STEM

Section outline

March 6th, Friday (10:30-12:30)

Text normalization

Word types and word tokens

Herdan's law and Zipf's law

Morphology and word forms

Corpora

Language identification and spell checking

Text normalization: contractions, punctuation, and special characters

Word tokenization, character tokenization, and subword tokenization

Byte-pair encoding algorithm: learner, encoder and decoder
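
As a study aid for the last topic, the three parts of byte-pair encoding can be sketched in a few lines of Python. This is a minimal sketch, not the lecture's or the textbook's code: the function names (learn_bpe, encode, decode), the "_" end-of-word marker, whitespace pre-tokenization, and the tie-breaking rule (first pair encountered) are all assumptions made for illustration. It also assumes the input text itself contains no "_" characters.

```python
from collections import Counter

END = "_"  # end-of-word marker (an assumption of this sketch)


def _merge_pair(symbols, pair):
    """Replace each adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def learn_bpe(corpus, num_merges):
    """Learner: return the ordered list of merges learned from a whitespace-split corpus."""
    word_counts = Counter(corpus.split())
    # Each word type starts as its characters followed by the end-of-word marker.
    words = {w: list(w) + [END] for w in word_counts}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for w, symbols in words.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += word_counts[w]
        if not pair_counts:
            break  # every word is already a single symbol
        # Most frequent pair; ties go to the pair seen first.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        words = {w: _merge_pair(s, best) for w, s in words.items()}
    return merges


def encode(text, merges):
    """Encoder: apply the learned merges, in the order learned, to each word of `text`."""
    tokens = []
    for word in text.split():
        symbols = list(word) + [END]
        for pair in merges:
            symbols = _merge_pair(symbols, pair)
        tokens.extend(symbols)
    return tokens


def decode(tokens):
    """Decoder: concatenate tokens and turn end-of-word markers back into spaces."""
    return "".join(tokens).replace(END, " ").strip()


corpus = ("low " * 5 + "lowest " * 2 + "newer " * 6 + "wider " * 3 + "new " * 2).strip()
merges = learn_bpe(corpus, num_merges=3)
tokens = encode("newer lower", merges)
print(merges)
print(tokens)
print(decode(tokens))
```

The learner only ever sees word types and their counts, so its cost depends on vocabulary size rather than corpus length. The encoder replays the merges greedily in the order they were learned, which is why an unseen word such as "lower" still gets segmented into known subword units.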

References

Jurafsky and Martin, chapter 2

Resources
- 03_text_normalization (file)