Management and Analysis of Physics Datasets
Period: Second semester
Course unit contents:
Part 1) Data Management
- Introducing data structures
- Exploring storage models
- Storage reliability and data preservation
- Understanding security in data management
- Scalability principles for storage
- Comparing local and distributed file systems
- Examining database management principles
- Managing and retrieving data from relational databases (MySQL)
Part 2) Data Processing
- Basics of computer processing and the limitations of single-threaded CPUs
- Introduction to threading and parallel processing techniques
- Overview of basic parallelization patterns in Python
- Understanding distributed computing systems
- Hadoop as a paradigm for big data processing
- Implementing data processing with Apache Spark
- Employing Dask for data processing tasks
- Understanding Apache Kafka as a distributed streaming platform
For the hands-on sessions on both parts:
- Basics of containerization methodologies (Docker)
Planned learning activities and teaching methods:
Frontal lectures for the introductory topics, with examples and use cases.
Hands-on sessions with live-coding examples run by the lecturers.
Exercises and examples to be done in the IT lab.
(The use of generative-AI tools is permitted for general assistance in understanding course content and organizing study materials, but should not in any way substitute for individual study, problem-solving practice, or interaction with the lecturer. Any use must be clearly declared and must comply with the University of Padova and Physics of Data policies on academic integrity.)
(In addition to contacting the course instructor, students with disabilities, Specific Learning Disorders (SLD), Special Educational Needs (SEN), or other health conditions can reach out to the Student Services Office - Inclusion Unit for more information about opportunities to access teaching with specific support and tools.)