Unit STATISTICAL METHODS FOR DATA SCIENCE

Course
Informatics
Study-unit Code
A002172
Curriculum
In all curricula
Teacher
Luca Scrucca
Teachers
  • Luca Scrucca
Hours
  • 42 hours - Luca Scrucca
Credits (CFU)
6
Course Regulation
Cohort 2022
Offered
2023/24
Learning activities
Related/supplementary
Area
Related or supplementary learning activities
Academic discipline
MAT/06
Type of study-unit
Optional
Type of learning activities
Single-discipline learning activity
Language of instruction
English
Contents
Advanced statistical methods for Statistical and Machine Learning, both supervised (classification and regression) and unsupervised (clustering and dimension reduction). Real data case studies introduced and analyzed using the software R.
Reference texts
James G., Witten D., Hastie T., Tibshirani R. (2021) An Introduction to Statistical Learning with Applications in R, 2nd edition, Springer-Verlag (freely available at https://www.statlearning.com)
Course slides available on the UniStudium webpage of the course.
Educational objectives
The course provides an introduction to the main methods and techniques of Statistical and Machine Learning for Data Science, both supervised (regression and classification) and unsupervised (clustering and dimension reduction).
The main knowledge acquired will be:
• introductory concepts and specific statistical and machine learning models;
• evaluation of the predictive accuracy of both regression and classification models by resampling techniques.
The main skills (i.e. the ability to apply the knowledge acquired) will be:
• independently applying the appropriate methods and algorithms to real-data regression, classification and clustering problems;
• analysing data with the R software to estimate supervised and unsupervised models.
Prerequisites
The presentation of the models and algorithms covered during the course assumes basic knowledge of descriptive and inferential statistics and of the linear regression model. Familiarity with the R software environment for statistical computing and graphics is recommended, although not essential for understanding the topics covered.
Teaching methods
Lectures and practical sessions using R.
Other information
Attending classes is strongly advised.
Learning verification modality
Progress assessments and final oral exam. The computer laboratory activities are intended to assess the student's ability to put into practice the methods introduced in the classroom. The final oral examination instead assesses the level of knowledge and understanding achieved by the student regarding the computational and methodological aspects covered during the course.
Extended program
The course presents advanced statistical methods for Data Science, both supervised (classification and regression) and unsupervised (clustering and dimension reduction). These methods have been successfully applied in many fields, from finance to economics, from business analytics to the natural and social sciences. Real-data case studies will be introduced and analysed using the statistical software R.
Specifically, the following topics will be covered:
- Statistical and machine learning: introduction.
  - Prediction vs interpretation.
  - Supervised vs unsupervised learning.
  - Classification vs regression.
  - Evaluating the accuracy of a statistical model.
- Supervised learning: introduction.
  - Extensions to the linear model: model selection and regularisation. Polynomial regression.
  - Resampling methods: cross-validation and bootstrap.
- Classification: introduction.
  - Logistic model and multinomial model.
  - Linear and quadratic discriminant analysis.
  - Gaussian naive Bayes.
  - Gaussian finite mixture models.
  - k-nearest neighbour algorithm.
- Advanced methods for regression and classification.
  - Generalized Additive Models.
  - Artificial neural networks.
  - Decision trees.
  - Bagging.
  - Random forests.
  - Boosting.
- Unsupervised learning: introduction.
  - Principal component analysis.
  - Similarity measures and distance matrix.
  - Cluster analysis: hierarchical methods.
  - Non-hierarchical methods (k-means).
  - Model-based clustering.
