Unit STATISTICS FOR DATA SCIENCE WITH R AND PYTHON
 Course
 Finance and quantitative methods for economics
 Studyunit Code
 A003079
 Location
 PERUGIA
 Curriculum
 Statistical data science for finance and economics
 Teacher
 Marco Doretti
 CFU
 12
 Course Regulation
 Cohort 2023
 Offered
 2023/24
 Type of studyunit
 Required
 Type of learning activities
 Integrated learning activity
Module I Generalized linear models
Code  A003092 

Location  PERUGIA 
CFU  6 
Teacher  Marco Doretti 
Learning activities  Core (characterizing)
Area  Mathematics, statistics, computer science
Academic discipline  SECS-S/01
Type of studyunit  Required
Language of instruction  English 
Contents  Recap on probability and statistical inference; likelihood theory; simple and multiple linear regression; ordinary least squares; model diagnostics; inclusion of categorical explanatory variables and analysis of variance; introduction to generalized linear models; a brief look at logistic regression; Poisson model for count data; numerical methods for maximum likelihood estimation of generalized linear models.
Reference texts  Alan Agresti, Maria Kateri (2021): Foundations of Statistics for Data Scientists (with R and Python). CRC Press, Chapman & Hall. ISBN: 9781003159834. Further material will be provided by the instructor.
Educational objectives  Students will learn tools to correctly formulate the statistical models used in Data Science for the main types of outcome variables. They will learn how to estimate these models and to draw inferential conclusions based on the observed data. The course also aims to illustrate the main diagnostic techniques for model selection, as well as general principles of statistical modelling (which often go beyond technicalities).
Teaching methods  Lectures on theory, practical sessions with statistical software. 
Learning verification modality  Oral examination concerning theory as well as analysis of software output of fitted models. 
Extended program  Recap on probability and inference: main random variables and their moments. Properties of estimators, confidence intervals and hypothesis tests for means, proportions, differences of means and proportions. Likelihood theory: definition of the likelihood function and parameter estimation through its maximization. Properties and examples for the parameters of the main distributions. A brief look at resampling methods such as the bootstrap. Likelihood ratio test, score test and Wald test. Simple linear regression model: parameter estimation with ordinary least squares, estimation of standard errors, effect interpretation, model diagnostics and goodness of fit. Relationship between regression and correlation analysis. Multiple linear regression model: parameter and standard error estimation, effect interpretation. A brief look at causal analysis: distinction between association and causation, spurious correlation. Correct model specification: higher-order effects and interactions. Diagnostics: checking assumptions and remedies for possible violations. Inference on the linear model: t-test and F-test for local and global significance. Introduction of categorical explanatory variables and analysis of variance testing. Matrix form of linear models. Generalized linear models: definition of the three key components and specification for the main distributions: Normal, Binomial and Poisson. Model deviance and likelihood ratio test. Model selection. Poisson model for count data. Numerical methods for parameter estimation in a generalized linear model: Newton-Raphson and Fisher scoring algorithms.
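As an illustration of the numerical methods listed at the end of the extended program, the following is a minimal sketch of Fisher scoring for a Poisson model with log link and a single explanatory variable. It is not taken from the course material: the function name `fit_poisson_fisher_scoring`, the starting values and the toy data are assumptions made for this example. With the canonical log link the expected and observed information coincide, so the same iteration is also the Newton-Raphson update.

```python
import math


def fit_poisson_fisher_scoring(x, y, tol=1e-10, max_iter=50):
    """Fit log(mu_i) = b0 + b1 * x_i by Fisher scoring (hypothetical helper)."""
    # Assumed starting point: the intercept-only fit, b0 = log(mean(y)), b1 = 0.
    b0, b1 = math.log(sum(y) / len(y)), 0.0
    for _ in range(max_iter):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Score vector U = X^T (y - mu).
        u0 = sum(yi - mi for yi, mi in zip(y, mu))
        u1 = sum(xi * (yi - mi) for xi, yi, mi in zip(x, y, mu))
        # Expected information I = X^T diag(mu) X, a symmetric 2x2 matrix.
        i00 = sum(mu)
        i01 = sum(xi * mi for xi, mi in zip(x, mu))
        i11 = sum(xi * xi * mi for xi, mi in zip(x, mu))
        det = i00 * i11 - i01 * i01
        # Update beta <- beta + I^{-1} U, using the closed-form 2x2 inverse.
        d0 = (i11 * u0 - i01 * u1) / det
        d1 = (-i01 * u0 + i00 * u1) / det
        b0, b1 = b0 + d0, b1 + d1
        if max(abs(d0), abs(d1)) < tol:
            break
    return b0, b1


# Toy count data, invented for illustration only.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1, 2, 4, 7, 12]
b0, b1 = fit_poisson_fisher_scoring(x, y)
print(f"intercept = {b0:.4f}, slope = {b1:.4f}")
```

At convergence the score vector is numerically zero, which is the likelihood equation that the maximum likelihood estimate must satisfy.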
Credit scoring
Code  A003093 

Location  PERUGIA 
CFU  6 
Teacher  Elena Stanghellini 
Learning activities  Core (characterizing)
Area  Mathematics, statistics, computer science
Academic discipline  SECS-S/01
Type of studyunit  Required
Language of instruction  English 
Contents  Major classification tools: logistic regression and discriminant analysis. These techniques will be applied to the Credit Scoring context, and the relevant theoretical and practical notions of Credit Scoring will be defined. Credit Scoring: definition and phases; probability and independence of random variables; logistic models as classifiers; ROC and CAP curves and other validation methods. Rare outcomes and retrospective sampling for unbalanced data.
Reference texts  Alan Agresti, Maria Kateri (2021): Foundations of Statistics for Data Scientists (with R and Python). CRC Press, Chapman & Hall. ISBN: 9781003159834. Lecture notes in English (translation of the Italian book Stanghellini (2009), Introduzione ai metodi statistici per il Credit Scoring, Springer Italia, chapters 1-5).
Educational objectives  A major part of Data Science concerns classification. Students will acquire knowledge of the major parametric classification techniques. The techniques will be applied to the Credit Scoring context, to measure the probability of default of a credit position. The analysis of real data and of case studies with R and Python will give students confidence in performing a data analysis in this context and in constructing a statistical model to measure the risk of default.
Prerequisites  In order to successfully complete the module, students should have completed the first module, Generalized Linear Models (or any other advanced statistics course with analogous content). More specifically, students should have successfully completed a module on Multiple Linear Regression covering: a) assumptions and unknown parameters; b) inferential procedures to estimate the parameters: Ordinary Least Squares, Maximum Likelihood; c) sampling distribution of the estimators and large-sample distributions of the estimators; d) confidence intervals and hypothesis testing, on the parameters and on the model, including the F-test for the model; e) heteroskedasticity: problems and inference in heteroskedastic models.
Teaching methods  Each week there will be four hours of lectures and two hours of practical exercises in the computer lab. Students are strongly advised to attend both the lectures and the exercises. In addition, every two or three weeks students are assigned a homework, which may be completed in groups of 3 or 4 students. Taking part in the homework scheme exempts students from submitting the written report three days before the exam session (see Learning verification modality below). Students are strongly advised to join the scheme.
Other information  Incoming students in Erasmus and other Exchange programs are most welcome. 
Learning verification modality  Oral examination on both the theoretical aspects covered during the lectures and their application to real data analysis. Students are required to complete a written report analysing some given datasets, following the instructions in the file uploaded to the course web page on Unistudium. This report should be sent to the instructor by email three days before the exam date. Students who attend the lectures may instead enrol in the programme of regular homeworks, to be completed on a fortnightly basis. Students may do these exercises in groups. The exercises will be provided by the instructor during the lecture period and involve solving real problems on real data. Completing them replaces the written report described above.
Extended program  Credit Scoring as a classification problem. Phases of Credit Scoring. Classification errors and the choice of the cutoff. ROC and CAP curves. Training and validation samples. Categorical random variables. Independence. Logistic model as a generalized linear model. Interpretation of parameters. Maximum likelihood estimation of the parameters. Confidence intervals and hypothesis testing on the training sample. Validation sample: the confusion matrix, the Hosmer-Lemeshow test. Retrospective sampling and rebalancing techniques. Linear and Quadratic Discriminant Analysis. Estimating the parameters of the discriminant function; the plug-in method. Implementation of the techniques in R and Python will also be part of the course.
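The validation tools named in the extended program (choice of the cutoff, confusion matrix, ROC curve) can be sketched in a few lines of plain Python. The sketch below is illustrative only and is not part of the course material: the function names, the convention that a score at or above the cutoff is classified as a default, and the toy scores are all assumptions made for this example.

```python
def confusion_matrix(y_true, scores, cutoff):
    """Return (tp, fp, tn, fn), classifying score >= cutoff as default (1)."""
    tp = fp = tn = fn = 0
    for actual, score in zip(y_true, scores):
        predicted = 1 if score >= cutoff else 0
        if predicted == 1 and actual == 1:
            tp += 1
        elif predicted == 1 and actual == 0:
            fp += 1
        elif predicted == 0 and actual == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def roc_points(y_true, scores):
    """(false positive rate, true positive rate) for each distinct cutoff."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]  # cutoff above every score: nothing flagged
    for cutoff in sorted(set(scores), reverse=True):
        tp, fp, _, _ = confusion_matrix(y_true, scores, cutoff)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy validation sample, invented for illustration: 1 = default, 0 = no default.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(confusion_matrix(y_true, scores, cutoff=0.5))  # (3, 1, 2, 0)
print(auc(roc_points(y_true, scores)))
```

On this toy sample the area under the curve equals 8/9, the share of (default, non-default) pairs in which the default receives the higher score. Moving the cutoff trades false positives against false negatives, which is the trade-off the ROC curve displays.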