Unit STATISTICAL METHODS FOR DATA SCIENCE
- Course
- Informatics
- Study-unit Code
- A002172
- Curriculum
- In all curricula
- Teacher
- Luca Scrucca
- Hours
- 42 hours - Luca Scrucca
- CFU
- 6
- Course Regulation
- Cohort 2023
- Offered
- 2024/25
- Learning activities
- Related/supplementary
- Area
- Related or supplementary learning activities
- Academic discipline
- MAT/06
- Type of study-unit
- Optional
- Type of learning activities
- Single-discipline learning activity
- Language of instruction
- English
- Contents
- Advanced methods for statistical and machine learning, both supervised (classification and regression) and unsupervised (clustering and dimension reduction). Real data case studies are introduced and analyzed using the R software.
- Reference texts
- James G., Witten D., Hastie T., Tibshirani R. (2021) An Introduction to Statistical Learning with Applications in R, 2nd edition, Springer-Verlag (freely available at https://www.statlearning.com)
Course slides available on the UniStudium webpage of the course.
- Educational objectives
- The course provides an introduction to the main methods and techniques of Statistical and Machine Learning for Data Science, in both the supervised (regression and classification) and unsupervised (clustering and dimension reduction) settings.
The main knowledge acquired will be:
• introductory concepts and specific statistical and machine learning models;
• evaluation of the predictive accuracy of regression and classification models using resampling techniques.
The main skills (i.e. the ability to apply the knowledge acquired) will be:
• independently apply the appropriate methods and algorithms to real data regression, classification and clustering problems;
• analyze data using the R software for the estimation of supervised and unsupervised models.
- Prerequisites
- The models and algorithms covered during the course assume a basic knowledge of statistics, both descriptive and inferential, and of the linear regression model. Familiarity with the R software environment for statistical computing and graphics is recommended, although not essential for understanding the topics covered.
- Teaching methods
- Lectures and practical sessions with the use of R.
- Other information
- Attending classes is strongly advised.
- Learning verification modality
- Progress assessments and final oral exam. The computer laboratory activities aim to assess the student's ability to put into practice the methods introduced in the classroom. The final oral examination, in turn, assesses the level of knowledge and understanding achieved by the student regarding the computational and methodological aspects covered during the course.
- Extended program
- The course presents advanced statistical methods for Data Science, both supervised (classification and regression) and unsupervised (clustering and dimension reduction). These methods have been successfully applied in many fields, from finance to economics, from business analytics to the natural and social sciences. Real data case studies will be introduced and analysed using the statistical software R.
Specifically, the following topics will be covered:
- Statistical and machine learning: introduction.
- Prediction vs interpretation.
- Supervised vs unsupervised learning.
- Classification vs regression.
- Evaluating the accuracy of a statistical model.
- Supervised learning: introduction.
- Extensions to the linear model: model selection and regularisation. Polynomial regression.
- Resampling methods: cross-validation and bootstrap.
- Classification: introduction.
- Logistic model and multinomial model.
- Linear and quadratic discriminant analysis.
- Gaussian naive Bayes.
- Gaussian finite mixture models.
- k-nearest neighbour algorithm.
- Advanced methods for regression and classification.
- Generalized Additive Models.
- Artificial neural networks.
- Decision trees.
- Bagging.
- Random forests.
- Boosting.
- Unsupervised learning: introduction.
- Principal component analysis.
- Similarity measures and distance matrix.
- Cluster analysis: hierarchical methods.
- Non-hierarchical methods (k-means).
- Model-based clustering.
- 2030 Agenda Sustainable Development Goals