Unit STATISTICS FOR DATA SCIENCE WITH R AND PYTHON

Course: Finance and quantitative methods for economics
Study-unit Code: A003079
Location: PERUGIA
Curriculum: Statistical data science for finance and economics
Teacher: Elena Stanghellini
CFU: 12
Course Regulation: Cohort 2024
Offered: 2024/25
Type of study-unit: Required
Type of learning activities: Integrated learning activity

Module I Generalized linear models

Code	A003092
Location	PERUGIA
CFU	6
Teacher	Simone Del Sarto
Teachers	Simone Del Sarto
Hours	42 hours - Simone Del Sarto
Learning activities	Characterising
Area	Mathematics, statistics, computer science
Academic discipline	SECS-S/01
Type of study-unit	Required
Language of instruction	English
Contents	Review of probability and statistical inference; maximum likelihood theory; simple and multiple linear regression models; method of least squares; model diagnostics; inclusion of categorical explanatory variables and analysis of variance; introduction to generalised linear models; brief introduction to the logistic regression model; Poisson model for count data; numerical methods for maximum likelihood estimation of generalised linear models.
Reference texts	Alan Agresti, Maria Kateri (2021): Foundations of Statistics for Data Scientists (with R and Python). CRC Press, Chapman & Hall. ISBN: 9781003159834
Educational objectives	Students will learn how to correctly formulate statistical models for the main types of response variables, how to estimate them, and how to draw inferential conclusions from observed data. The course also aims to illustrate basic diagnostic techniques for model selection, while conveying the guiding principles of statistical modelling (which often go beyond technicalities).
Prerequisites	Basic knowledge of univariate and bivariate descriptive statistics, probability theory (main random variables and their probability mass/density functions, expected values, variances, etc.) and inferential statistics (point estimation, confidence intervals, hypothesis testing).
Teaching methods	Lectures on theory and practical sessions using suitable software.
Learning verification modality	Oral examination with questions on theory topics, plus analysis of and commentary on software output from the estimation of models covered in the course.
Extended program	Review of probability and statistical inference: main random variables and their moments. Properties of estimators, confidence intervals and hypothesis tests for means and proportions. Likelihood theory: definition of the likelihood function and estimation of parameters through its maximisation. Properties and examples for the parameters of the main distributions. Brief introduction to bootstrap resampling methods. Likelihood ratio test and Wald test. Simple linear regression model: parameter estimation by the least squares method, standard error estimation, interpretation of effects, model diagnostics and goodness of fit. Relationship between regression analysis and linear correlation. Multiple linear regression model: parameter estimation and standard errors, interpretation of effects. Proper specification of the functional form of the model: higher-order effects and interactions. Diagnostic analysis: checking the assumptions underlying the model and remedies for possible violations. Inference on the linear model: F-tests and t-tests for global and local significance. Introduction of categorical explanatory variables and analysis of variance tests. Matrix formulation of linear models. Generalised linear models: introduction of the three key components (random component, linear predictor, link function) and specification for the major distributions: Normal, Binomial and Poisson. Model deviance and likelihood ratio test. Model selection. Poisson model for count data. Numerical methods for estimating the parameters of a generalised linear model: Newton-Raphson algorithm and Fisher scoring.
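As a minimal sketch of the numerical methods named at the end of the extended program, the following Python code fits a Poisson log-linear model with an intercept and one covariate by Newton-Raphson. The function name `fit_poisson_newton`, the toy data and the restriction to a single covariate are illustrative assumptions, not course material. With the canonical log link the observed and expected information coincide, so in this special case the Newton-Raphson and Fisher scoring updates are identical.

```python
import math


def fit_poisson_newton(x, y, max_iter=50, tol=1e-10):
    """Fit log(mu_i) = b0 + b1 * x_i to counts y by Newton-Raphson.

    Hypothetical helper for illustration: one covariate plus intercept,
    so the 2x2 information matrix can be inverted by hand.
    """
    # Start from the intercept-only fit: b0 = log(mean(y)), b1 = 0.
    b0, b1 = math.log(sum(y) / len(y)), 0.0
    for _ in range(max_iter):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Score vector U = X^T (y - mu).
        u0 = sum(yi - mi for yi, mi in zip(y, mu))
        u1 = sum(xi * (yi - mi) for xi, yi, mi in zip(x, y, mu))
        # Information matrix I = X^T diag(mu) X (observed = expected here).
        i00 = sum(mu)
        i01 = sum(xi * mi for xi, mi in zip(x, mu))
        i11 = sum(xi * xi * mi for xi, mi in zip(x, mu))
        det = i00 * i11 - i01 * i01
        # Newton step: solve I * delta = U for the 2x2 system.
        d0 = (i11 * u0 - i01 * u1) / det
        d1 = (i00 * u1 - i01 * u0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1


# Made-up toy data in which the counts double with each unit of x.
x = [0, 1, 2, 3]
y = [1, 2, 4, 8]
b0, b1 = fit_poisson_newton(x, y)
print(b0, b1)
```

Because the toy counts satisfy y = 2^x exactly, the score is zero at b0 = 0 and b1 = log 2, so the estimates should converge to those values up to numerical tolerance. For a non-canonical link the two algorithms would differ, and a general implementation would use matrix routines rather than the hand-coded 2x2 solve used here.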

Credit scoring

Code	A003093
Location	PERUGIA
CFU	6
Teacher	Elena Stanghellini
Teachers	Elena Stanghellini
Hours	42 hours - Elena Stanghellini
Learning activities	Characterising
Area	Mathematics, statistics, computer science
Academic discipline	SECS-S/01
Type of study-unit	Required
Language of instruction	English
Contents	Classification tools: logistic regression and discriminant analysis. These techniques will be applied in the Credit Scoring context, so the theoretical and practical notions of Credit Scoring will also be introduced. Definition and phases; probability and independence; logistic models as classifiers; ROC and CAP curves and other validation methods (such as the Hosmer-Lemeshow test). Rare outcomes and retrospective sampling for unbalanced data will also be addressed.
Reference texts	Alan Agresti, Maria Kateri (2021): Foundations of Statistics for Data Scientists (with R and Python). CRC Press, Chapman & Hall. ISBN: 9781003159834. Lecture notes in English (translation of the Italian book: Stanghellini (2009), Introduzione ai metodi statistici per il Credit Scoring, Springer Italia, Chapters 1-5).
Educational objectives	A major part of Data Science concerns classification. Students will acquire knowledge of the major parametric classification techniques. The techniques will be applied in the Credit Scoring context to measure the probability of default of a credit position. Analysing real data and case studies with the software R and Python will give students confidence in performing a data analysis in this context and in constructing a statistical model that measures the risk of default.
Prerequisites	To successfully complete the module, students should have completed the first module, Generalized Linear Models (or any other advanced statistics course with analogous content). More specifically, students should have successfully completed a module on Multiple Linear Regression covering: a) assumptions and unknown parameters; b) inferential procedures to estimate the parameters: Ordinary Least Squares, Maximum Likelihood; c) sampling distribution of the estimators and large-sample distributions of the estimators; d) confidence intervals and hypothesis testing on the parameters and on the model, including the F-test for the model; e) heteroskedasticity: problems and inference in heteroskedastic models.
Teaching methods	Each week there will be four hours of lectures and two hours of practical exercises in the computer lab. Students are strongly advised to attend both the lectures and the exercises. In addition, every two to three weeks students are assigned homework, which may be completed in groups of 3 or 4 students. Participation in the homework scheme exempts students from submitting the written report three days before the exam session (see Learning verification modality below). Students are strongly advised to join the scheme.
Other information	Incoming students on Erasmus and other exchange programmes are most welcome.
Learning verification modality	Oral examination on both the theoretical aspects covered during the lectures and their application to real data analysis. Students must complete a written report analysing some given datasets, following the instructions in the file uploaded to the course web page on Unistudium. This report should be sent to the instructor by email three days before the exam date. Students who attend the lectures may sign up for the programme of regular homework, to be completed on a fortnightly basis. Students may do these exercises in groups. The exercises will be provided by the instructor during the lecture period and involve solving real problems on real data. Taking part in this programme replaces the written report described above.
Extended program	Credit Scoring as a classification problem. Phases of Credit Scoring. Classification errors and the choice of the cut-off. ROC and CAP curves. Training and validation samples. Categorical random variables. Independence. Logistic model as a generalized linear model. Interpretation of parameters. Maximum likelihood estimation of the parameters. Confidence intervals and hypothesis testing on the training sample. Logistic model as a classifier. Validation sample: the confusion matrix, the Hosmer-Lemeshow test. Retrospective sampling and rebalancing techniques. Linear and Quadratic Discriminant Analysis. Estimating the parameters of the discriminant function; the plug-in method. Monitoring the score over time. Reject inference. Implementation of the techniques with the software R and Python for statistical computing will also be part of the course.
2030 Agenda for Sustainable Development goals	The module contributes to the achievement of Goal no. 4 "Quality education" of the 2030 Agenda for Sustainable Development, as it provides tools for the critical analysis of data in finance and economics, a crucial aspect in the era of big data.
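To illustrate the validation tools listed in the Credit scoring extended program (confusion matrix, choice of the cut-off, ROC curve), here is a minimal Python sketch. The function names `confusion_matrix`, `roc_points` and `auc`, and the scores and default labels, are made up for illustration and are not taken from the course material. In practice the scores would be estimated default probabilities from a logistic model evaluated on a validation sample.

```python
def confusion_matrix(scores, labels, cutoff):
    """Return (tp, fp, tn, fn), predicting default (1) when score >= cutoff."""
    tp = fp = tn = fn = 0
    for s, y in zip(scores, labels):
        predicted = 1 if s >= cutoff else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def roc_points(scores, labels):
    """(false positive rate, true positive rate) pairs, one per cut-off."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    # Sweep the cut-off over the observed scores, from strictest to loosest.
    for cutoff in sorted(set(scores), reverse=True):
        tp, fp, tn, fn = confusion_matrix(scores, labels, cutoff)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Made-up validation sample: estimated default probabilities and outcomes.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

print(confusion_matrix(scores, labels, cutoff=0.5))  # (3, 1, 3, 1)
print(auc(roc_points(scores, labels)))               # 0.8125
```

Moving the cut-off trades false positives against false negatives, and each cut-off gives one point on the ROC curve. The area of 0.8125 equals the share of (defaulter, non-defaulter) pairs in which the defaulter has the higher score, here 13 out of 16.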