Unit STATISTICAL METHODS FOR DATA SCIENCE

Course

Informatics

Study-unit Code

A002172

Curriculum

In all curricula

Teacher

Luca Scrucca

Teachers

Luca Scrucca

Hours

42 hours - Luca Scrucca

CFU

Course Regulation

Cohort 2022

Offered

2023/24

Learning activities

Related/supplementary

Area

Related or supplementary learning activities

Academic discipline

MAT/06

Type of study-unit

Optional

Type of learning activities

Single-discipline learning activity

Language of instruction

English

Contents

Advanced statistical methods for Statistical and Machine Learning, both supervised (classification and regression) and unsupervised (clustering and dimension reduction). Real-data case studies are introduced and analyzed using the statistical software R.

Reference texts

James G., Witten D., Hastie T., Tibshirani R. (2021) An Introduction to Statistical Learning with Applications in R, 2nd edition, Springer-Verlag (freely available at https://www.statlearning.com)
Course slides available on the UniStudium webpage of the course.

Educational objectives

The course provides an introduction to the main methods and techniques of Statistical and Machine Learning for Data Science, in both the supervised (regression and classification) and unsupervised (clustering and dimension reduction) cases.
The main knowledge acquired will be:
• introductory concepts and specific statistical and machine learning models;
• evaluation of the predictive accuracy of regression and classification models using resampling techniques.
The main skills (i.e. the ability to apply the knowledge acquired) will be:
• independently applying the appropriate methods and algorithms to real-data regression, classification and clustering problems;
• analyzing data with the R software to estimate supervised and unsupervised models.

Prerequisites

The models and algorithms covered in the course are presented assuming basic knowledge of descriptive and inferential statistics and of the linear regression model. Familiarity with the R software environment for statistical computing and graphics is recommended, although not essential for understanding the topics covered.

Teaching methods

Lectures and practical sessions using R.

Other information

Attending classes is strongly advised.

Learning verification modality

Progress assessments and final oral exam. The computer laboratory activities are intended to assess the student's ability to put into practice the methods introduced in the classroom. The final oral examination, by contrast, assesses the level of knowledge and understanding achieved by the student regarding the computational and methodological aspects covered during the course.

Extended program

The course presents advanced statistical methods for Data Science, both supervised (classification and regression) and unsupervised (clustering and dimension reduction). These methods have been successfully applied in many fields, from finance to economics, from business analytics to the natural and social sciences. Real-data case studies will be introduced and analyzed using the statistical software R.
Specifically, the following topics will be covered:
- Statistical and machine learning: introduction.
- Prediction vs interpretation.
- Supervised vs unsupervised learning.
- Classification vs regression.
- Evaluating the accuracy of a statistical model.
- Supervised learning: introduction.
- Extensions to the linear model: model selection and regularisation. Polynomial regression.
- Resampling methods: cross-validation and bootstrap.
- Classification: introduction.
- Logistic model and multinomial model.
- Linear and quadratic discriminant analysis.
- Gaussian naive Bayes.
- Gaussian finite mixture models.
- k-nearest neighbour algorithm.
- Advanced methods for regression and classification.
- Generalized Additive Models.
- Artificial neural networks.
- Decision trees.
- Bagging.
- Random forests.
- Boosting.
- Unsupervised learning: introduction.
- Principal component analysis.
- Similarity measures and distance matrix.
- Cluster analysis: hierarchical methods.
- Non-hierarchical methods (k-means).
- Model-based clustering.

2030 Agenda goals for sustainable development