Unit STATISTICS FOR DATA SCIENCE WITH R AND PYTHON
- Course
- Finance and quantitative methods for economics
- Study-unit Code
- A003078
- Location
- PERUGIA
- Curriculum
- Data science for finance and insurance
- Teacher
- Marco Doretti
- Teachers
- Marco Doretti
- Hours
- 42 hours - Marco Doretti
- CFU
- 6
- Course Regulation
- Cohort 2022
- Offered
- 2022/23
- Learning activities
- Characterizing (core)
- Area
- Mathematics, statistics, computer science
- Academic discipline
- SECS-S/01
- Type of study-unit
- Required
- Type of learning activities
- Single-discipline learning activity
- Language of instruction
- English
- Contents
- Recap of statistical inference; maximum likelihood theory; brief introduction to Bayesian inference; simple and multiple linear regression models; ordinary least squares method; model diagnostics; inclusion of categorical explanatory variables and analysis of variance; introduction to generalized linear models; brief introduction to the logistic regression model; Poisson model for count data; numerical methods for maximum likelihood estimation of generalized linear models.
- Reference texts
- Alan Agresti, Maria Kateri (2021): Foundations of Statistics for Data Scientists (with R and Python). CRC Press, Chapman & Hall. ISBN: 9781003159834
Further material provided by the instructor.
- Educational objectives
- Students will learn tools to correctly formulate the statistical models used in Data Science for the main types of outcome variables. They will learn how to estimate these models and to draw inferential conclusions based on the observed data. The course also aims to illustrate the main diagnostic techniques for model selection, as well as general principles of statistical modelling, which often go beyond technicalities.
- Prerequisites
- Basic knowledge of Descriptive Statistics (univariate and bivariate) and of Inferential Statistics (point estimation, interval estimation, hypothesis testing).
- Teaching methods
- Lectures on theory, practical sessions with statistical software.
- Learning verification modality
- Oral examination concerning theory as well as analysis of software output of fitted models.
- Extended program
- Recap of point and interval estimation: estimators' properties, confidence intervals. Inference on means, proportions, differences of means and differences of proportions. Sample size determination. Likelihood theory: definition of the likelihood function and parameter estimation through its maximization. Properties and examples for the parameters of the main distributions. Overview of resampling methods (bootstrap) and Bayesian inference: prior and posterior distributions, conjugate distributions. Relationship between hypothesis tests and confidence intervals: likelihood ratio test and Wald test. Simple linear regression model: ordinary least squares estimates, standard errors, effect interpretation, model diagnostics and goodness of fit. Relationship between regression analysis and correlation. Multiple linear regression model: parameter estimation and standard errors, effect interpretation. Overview of causal analysis: distinction between associational and causal effects, spurious correlation. Correct specification of the functional form: higher-order effects and interactions. Diagnostic analysis: assumption checking and remedies for possible misspecification. Inference on linear models: F-test and t-test for global and local significance. Introduction of categorical explanatory variables and analysis of variance. Matrix formulation of linear models. Generalized linear models: introduction of the three key components and specification for the main distributions: Normal, Binomial, Poisson. Model deviance and likelihood ratio test. Model selection. Poisson model for count data. Numerical methods for the estimation of a generalized linear model: Newton-Raphson and Fisher scoring algorithms.
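As an illustration of the last topic in the programme, the following is a minimal Python sketch (not taken from the course material) of Fisher scoring for a Poisson model with log link. With this canonical link the Fisher scoring and Newton-Raphson updates coincide. The function name `fit_poisson_fisher_scoring`, the simulated data and the coefficient values are assumptions made purely for illustration.

```python
import numpy as np


def fit_poisson_fisher_scoring(X, y, n_iter=25, tol=1e-10):
    """Fisher scoring for a Poisson GLM with log link (hypothetical helper).

    X is an (n, p) design matrix including the intercept column,
    y is an (n,) vector of counts.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = np.exp(X @ beta)              # fitted means under the log link
        score = X.T @ (y - mu)             # gradient of the log-likelihood
        info = X.T @ (X * mu[:, None])     # Fisher information X' W X, W = diag(mu)
        step = np.linalg.solve(info, score)
        beta = beta + step                 # scoring update
        if np.max(np.abs(step)) < tol:     # stop once the update is negligible
            break
    return beta


# Simulated data, for illustration only
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=500)
X = np.column_stack([np.ones_like(x), x])
true_beta = np.array([0.5, 1.0])           # assumed coefficients
y = rng.poisson(np.exp(X @ true_beta))

beta_hat = fit_poisson_fisher_scoring(X, y)
print(beta_hat)
```

At convergence the score vector is numerically zero, which is the first-order condition defining the maximum likelihood estimate; in practice one would normally rely on a fitted-model routine from R or a Python statistics library rather than a hand-written loop.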