Unit STATISTICS FOR DATA SCIENCE WITH R AND PYTHON

Course

Finance and quantitative methods for economics

Study-unit Code

A003078

Location

PERUGIA

Curriculum

Data science for finance and insurance

Teacher

Marco Doretti

Teachers

Marco Doretti

Hours

42 hours - Marco Doretti

CFU

Course Regulation

Cohort 2022

Offered

2022/23

Learning activities

Characterizing

Area

Mathematics, statistics, computer science

Academic discipline

SECS-S/01

Type of study-unit

Required

Type of learning activities

Single-discipline learning activity

Language of instruction

English

Contents

Recap of statistical inference; maximum likelihood theory; overview of Bayesian inference; simple and multiple linear regression models; ordinary least squares method; model diagnostics; inclusion of categorical explanatory variables and analysis of variance; introduction to generalized linear models; overview of the logistic regression model; Poisson model for count data; numerical methods for maximum likelihood estimation of generalized linear models.

Reference texts

Alan Agresti, Maria Kateri (2021): Foundations of Statistics for Data Scientists (with R and Python). CRC Press, Chapman & Hall. ISBN: 9781003159834

Further material provided by the instructor

Educational objectives

Students will learn the tools needed to correctly formulate the statistical models used in Data Science for the main types of outcome variables. They will learn how to estimate these models and how to draw inferential conclusions from the observed data. The course also aims to illustrate the main diagnostic techniques for model selection, as well as general principles of statistical modelling (which often go beyond technicalities).

Prerequisites

Basic knowledge of Descriptive Statistics (univariate and bivariate) and of Inferential Statistics (point estimation, interval estimation, hypothesis testing).

Teaching methods

Lectures on theory, practical sessions with statistical software.

Learning verification modality

Oral examination concerning theory as well as analysis of software output of fitted models.

Extended program

Recap of point and interval estimation: properties of estimators, confidence intervals. Inference on means, proportions, differences of means and differences of proportions. Sample size determination. Likelihood theory: definition of the likelihood function and parameter estimation through its maximization. Properties and examples for the parameters of the main distributions. Overview of resampling methods (bootstrap) and Bayesian inference: prior and posterior distributions, conjugate distributions. Relationship between hypothesis tests and confidence intervals: likelihood ratio test and Wald test. Simple linear regression model: ordinary least squares estimates, standard errors, effect interpretation, model diagnostics and goodness of fit. Relationship between regression analysis and correlation. Multiple linear regression model: parameter estimation and standard errors, effect interpretation. Overview of causal analysis: distinction between associational and causal effects, spurious correlation. Correct specification of the functional form: higher-order effects and interactions. Diagnostic analysis: assumption checking and remedies for possible misspecification. Inference on linear models: F-test and t-test for global and local significance. Introduction of categorical explanatory variables and analysis of variance. Matrix formulation of linear models. Generalized linear models: introduction of the three key components and specification for the main distributions: Normal, Binomial, Poisson. Model deviance and likelihood ratio test. Model selection. Poisson model for count data. Numerical methods for the estimation of a generalized linear model: Newton-Raphson and Fisher Scoring algorithms.
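
As an illustration of the last topic, the following is a minimal sketch (not course material) of Newton-Raphson estimation for a Poisson model with log link and a single covariate. The function name `fit_poisson_newton`, the starting values and the toy data are assumptions made up for this example. With the canonical log link, the observed and expected information matrices coincide, so Newton-Raphson and Fisher Scoring perform the same iteration here.

```python
import math

def fit_poisson_newton(x, y, max_iter=50, tol=1e-10):
    """Newton-Raphson for log(mu_i) = b0 + b1 * x_i (hypothetical helper)."""
    # Assumed starting point: intercept-only fit, slope set to zero.
    b0, b1 = math.log(sum(y) / len(y)), 0.0
    for _ in range(max_iter):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Score vector: gradient of the Poisson log-likelihood.
        s0 = sum(yi - mi for yi, mi in zip(y, mu))
        s1 = sum(xi * (yi - mi) for xi, yi, mi in zip(x, y, mu))
        # Information matrix X'WX with W = diag(mu).
        i00 = sum(mu)
        i01 = sum(xi * mi for xi, mi in zip(x, mu))
        i11 = sum(xi * xi * mi for xi, mi in zip(x, mu))
        det = i00 * i11 - i01 * i01
        # Newton step: solve the 2x2 system (information) * step = score.
        d0 = (i11 * s0 - i01 * s1) / det
        d1 = (i00 * s1 - i01 * s0) / det
        b0, b1 = b0 + d0, b1 + d1
        if max(abs(d0), abs(d1)) < tol:
            break
    return b0, b1

# Made-up toy counts, for illustration only.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [1, 2, 2, 4, 6, 9]
b0_hat, b1_hat = fit_poisson_newton(x, y)
print(b0_hat, b1_hat)
```

At convergence the score equations are approximately zero, which is the defining property of the maximum likelihood estimate. In practice one would normally rely on an established routine, such as `glm` in R, rather than a hand-written loop.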