Unit STATISTICS FOR DATA SCIENCE WITH R AND PYTHON

Course: Finance and quantitative methods for economics
Study-unit Code: A003079
Location: PERUGIA
Curriculum: Statistical data science for finance and economics
Teacher: Marco Doretti
CFU: 12
Course Regulation: Cohort 2022
Offered: 2022/23
Type of study-unit: Required
Type of learning activities: Integrated learning activity

Module I Generalized linear models

Code	A003092
Location	PERUGIA
CFU	6
Teacher	Marco Doretti
Teachers	Marco Doretti
Hours	42 hours - Marco Doretti
Learning activities	Core
Area	Mathematics, statistics, computer science
Academic discipline	SECS-S/01
Type of study-unit	Required
Language of instruction	English
Contents	Recap of statistical inference; maximum likelihood theory; overview of Bayesian inference; simple and multiple linear regression models; ordinary least squares method; model diagnostics; inclusion of categorical explanatory variables and analysis of variance; introduction to generalized linear models; overview of the logistic regression model; Poisson model for count data; numerical methods for maximum likelihood estimation of generalized linear models.
Reference texts	Alan Agresti, Maria Kateri (2021): Foundations of Statistics for Data Scientists (with R and Python). CRC Press, Chapman & Hall. ISBN: 9781003159834. Further material provided by the instructor.
Educational objectives	Students will learn tools to correctly formulate the statistical models used in Data Science for the main types of outcome variables. They will learn how to estimate these models and how to draw inferential conclusions from the observed data. The course also aims to illustrate the main diagnostic techniques for model selection, as well as general principles of statistical modelling (which often go beyond technicalities).
Prerequisites	Basic knowledge of Descriptive Statistics (univariate and bivariate) and of Inferential Statistics (point estimation, interval estimation, hypothesis testing).
Teaching methods	Lectures on theory, practical sessions with statistical software.
Learning verification modality	Oral examination concerning theory as well as analysis of software output of fitted models.
Extended program	Recap on point and interval estimation: estimators' properties, confidence intervals. Inference on means, proportions, differences of means and differences of proportions. Sample size definition. Likelihood theory: definition of the likelihood function and parameter estimation through its maximization. Properties and examples for the parameters of the main distributions. Overview of resampling methods (bootstrap) and Bayesian inference: prior and posterior distributions, conjugate distributions. Relationships between hypothesis tests and confidence intervals: likelihood ratio test and Wald test. Simple linear regression model: ordinary least squares estimates, standard errors, effect interpretation, model diagnostics and goodness of fit. Relationship between regression analysis and correlation. Multiple linear regression model: parameter estimation and standard errors, effect interpretation. Overview of causal analysis: distinction between associational and causal effects, spurious correlation. Correct specification of the functional form: higher-order effects and interactions. Diagnostic analysis: assumption checking and remedies for possible misspecification. Inference on linear models: F-test and t-test for global and local significance. Introduction of categorical explanatory variables and analysis of variance. Matrix formulation of linear models. Generalized linear models: introduction of the three key components and specification for the main distributions: Normal, Binomial, Poisson. Model deviance and likelihood ratio test. Model selection. Poisson model for count data. Numerical methods for the estimation of a generalized linear model: Newton-Raphson and Fisher Scoring algorithms.
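
The Fisher Scoring algorithm named in the extended program can be illustrated with a minimal sketch. It is not taken from the course material: the function name `fit_poisson_fisher_scoring` and the toy data are hypothetical, and the model is restricted to a Poisson regression with log link, an intercept and a single covariate, so that the 2x2 linear system can be solved by hand. With the canonical log link the observed and expected information coincide, so in this case Newton-Raphson and Fisher Scoring perform the same update.

```python
import math


def fit_poisson_fisher_scoring(x, y, n_iter=25, tol=1e-10):
    """Fit log(mu_i) = b0 + b1 * x_i by Fisher Scoring (illustrative sketch)."""
    # Start from the intercept-only fit: b0 = log(mean(y)), b1 = 0.
    b0, b1 = math.log(sum(y) / len(y)), 0.0
    for _ in range(n_iter):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Score vector U = X^T (y - mu).
        u0 = sum(yi - mi for yi, mi in zip(y, mu))
        u1 = sum(xi * (yi - mi) for xi, yi, mi in zip(x, y, mu))
        # Fisher information I = X^T diag(mu) X, a symmetric 2x2 matrix.
        i00 = sum(mu)
        i01 = sum(xi * mi for xi, mi in zip(x, mu))
        i11 = sum(xi * xi * mi for xi, mi in zip(x, mu))
        det = i00 * i11 - i01 * i01
        # Solve I * step = U in closed form, then update the estimates.
        s0 = (i11 * u0 - i01 * u1) / det
        s1 = (i00 * u1 - i01 * u0) / det
        b0, b1 = b0 + s0, b1 + s1
        if abs(s0) + abs(s1) < tol:
            break
    return b0, b1


# Hypothetical toy data: counts that grow with the covariate.
x_toy = [0, 1, 2, 3, 4]
y_toy = [1, 2, 4, 7, 12]
b0_hat, b1_hat = fit_poisson_fisher_scoring(x_toy, y_toy)
print(b0_hat, b1_hat)
```

At convergence the score vector is numerically zero, which is the maximum likelihood condition; in practice a library routine for generalized linear models in R or Python would be used rather than a hand-written loop.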

Credit scoring

Code	A003093
Location	PERUGIA
CFU	6
Teacher	Elena Stanghellini
Teachers	Elena Stanghellini
Hours	42 hours - Elena Stanghellini
Learning activities	Core
Area	Mathematics, statistics, computer science
Academic discipline	SECS-S/01
Type of study-unit	Required
Language of instruction	English
Contents	Classification tools: logistic regression and discriminant analysis. These techniques will be applied to the Credit Scoring context, so theoretical and practical notions of Credit Scoring will also be introduced. Definition and phases; probability and independence; logistic models as classifiers; ROC and CAP curves and other validation methods.
Reference texts	Alan Agresti, Maria Kateri (2021): Foundations of Statistics for Data Scientists (with R and Python). CRC Press, Chapman & Hall. ISBN: 9781003159834. Lecture notes in English (a translation of the Italian book: Stanghellini (2009), Introduzione ai metodi statistici per il Credit Scoring, Springer Italia, Chapters 1-5).
Educational objectives	A major part of a data scientist's work concerns classification. Students will acquire knowledge of the main parametric statistical techniques for classification. The techniques will be applied to the Credit Scoring context, to measure the probability of default of a credit position. The analysis of real data and case studies with the software R and Python will give students confidence in performing a data analysis in this context, and they will learn how to build a statistical model that measures the risk of default.
Prerequisites	In order to successfully complete the module, students should have completed the module Generalized Linear Models (or any other advanced statistics course with analogous content). To be more specific: students should have successfully completed a module with Multiple Linear Regression covering: a) assumptions and unknown parameters; b) inferential procedures to estimate the parameters: Ordinary Least Squares, Maximum Likelihood; c) Sampling distribution of the estimators. Large sample distributions of the estimators; d) Confidence intervals. Hypothesis testing: on the parameter, on the model. F-test for the model; e) Heteroskedasticity: problems and inference in heteroskedastic models.
Teaching methods	There are four hours of lectures and two hours of practical exercises in the computer lab each week. Students are strongly advised to attend both the lectures and the exercises. In addition, homework is assigned every two to three weeks and may be completed in groups of 3 or 4 students. Participation in the homework scheme exempts students from submitting the written report three days before the exam session (see Learning verification modality below). Students are strongly advised to join the scheme.
Other information	Incoming students in Erasmus and other Exchange programs are most welcome.
Learning verification modality	Oral examination on both the theoretical aspects covered in the lectures and their application to real data analysis. Students are required to complete a written report analysing some given datasets, following the instructions in the file uploaded to the course web page on Unistudium. This report must be sent to the instructor by email three days before the exam date. Students who attend the lectures may instead join the programme of regular homework, to be completed on a fortnightly basis. These exercises may be done in groups. They are provided by the instructor during the teaching period and involve solving real problems on real data. Completing them replaces the written report described above.
Extended program	Logistic model as a generalized linear model. Interpretation of parameters. Maximum likelihood estimation of the parameters. Confidence intervals and hypothesis testing. Phases of Credit Scoring. Classification errors. Tools for assessing the efficacy of the classifier and the accuracy of the predictors, such as the ROC and CAP curves, the confusion matrix and the Hosmer-Lemeshow test. Retrospective sampling and rebalancing techniques. Discriminant analysis. Implementation of the techniques with the statistical software R and Python will also be part of the course.
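
As an illustration of the validation tools listed in the extended program, the sketch below computes a confusion matrix and the area under the ROC curve for a scored portfolio. It is not taken from the course material: the function names, the labels (1 = default, 0 = non-default) and the scores (standing in for predicted default probabilities from a logistic model) are all hypothetical.

```python
def confusion_matrix(labels, scores, threshold):
    """Return (tp, fp, fn, tn), predicting default (1) when score >= threshold."""
    tp = fp = fn = tn = 0
    for y, s in zip(labels, scores):
        predicted = 1 if s >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def roc_points(labels, scores):
    """(false positive rate, true positive rate) pairs, strictest threshold first."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp, fp, _, _ = confusion_matrix(labels, scores, t)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical data: three defaults and three non-defaults.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(confusion_matrix(labels, scores, 0.5))  # (3, 1, 0, 2)
print(auc(roc_points(labels, scores)))        # approximately 0.889 (8/9)
```

Each threshold yields one confusion matrix and therefore one point on the ROC curve; the area under the curve summarises the classifier over all thresholds.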
