Management and Analysis of Physics Datasets
Period: Second semester
Course unit contents:
Part 1) Data Management
- Introducing data structures
- Exploring storage models
- Storage reliability and data preservation
- Understanding security in data management
- Scalability principles for storage
- Comparing local and distributed file systems
- Examining database management principles
- Managing and retrieving data from relational databases (MySQL)
Part 2) Data Processing
- Basics of computer processing and the limitations of single-threaded CPUs
- Introduction to threading and parallel processing techniques
- Overview of basic parallelization patterns in Python
- Understanding distributed computing systems
- Hadoop as a paradigm for big data processing
- Implementing data processing with Apache Spark
- Employing Dask for data processing tasks
- Understanding Apache Kafka as a distributed streaming platform
For the hands-on sessions on both parts:
- Basics of containerization methodologies (Docker)
Planned learning activities and teaching methods:
Frontal lectures for the introductory topics, with examples and use cases.
Hands-on sessions with live-coding examples run by the lecturers.
Exercises and examples to be done in the IT lab.