Unit BIOINFORMATICS AND BIOSTATISTICS

Course

Biotechnology

Study-unit Code

GP004129

Curriculum

In all curricula

Teacher

Roberto Maria Pellegrino

Teachers

Roberto Maria Pellegrino

Hours

52 hours - Roberto Maria Pellegrino

CFU

Course Regulation

Cohort 2018

Offered

2020/21

Learning activities

Other

Area

Computer and telematics skills

Academic discipline

BIO/11

Type of study-unit

Required

Type of learning activities

Single-discipline learning activity

Language of instruction

Italian and English

Contents

Theoretical Lessons:
Fundamentals of IT (Information Technology): computers, algorithms, programs. Introduction to the R programming environment. Relational databases. Descriptive, inferential, and multivariate statistics. Bioinformatics and biostatistics applications in the "omics" sciences. Concepts of molecular evolution. Sequence alignment; use of genomic and proteomic web servers. Introduction to the study of proteomics by mass spectrometry.

Laboratory activities:
Computer literacy: using spreadsheets, working with tables, producing graphs, statistical processing. Importing data and using functions developed in the R environment. Multivariate statistical analysis with MetaboAnalyst and other web applications. Sequence alignment with Dotlet, similarity searches with BLAST, and other genome-based activities on the NCBI web server. Mass spectrometry applied to proteomics: interpretation of peptide mass spectra and protein sequencing strategies.

Reference texts

Manuela Helmer Citterich et al., "Fondamenti di Bioinformatica", Zanichelli.

Michael C. Whitlock, Dolph Schluter, "Analisi statistica dei dati biologici", Zanichelli.

Scientific articles from specialized journals will be provided in PDF format during the course.

Educational objectives

At the end of the course the student will know the architecture and basic functions of computers and the methods for implementing algorithms, and will be able to evaluate whether a problem is tractable by computational means. The student will also know the structure and functioning of relational databases.

Using free software and web platforms, the student will be able to draw graphs and tables from univariate data to represent the analytical results of biological experiments and biometric surveys.

The student will be able to design a biological experiment of appropriate size, interpret its result, and evaluate the statistical significance of that result.

In the field of "omics" sciences, the student will be able to analyze complex data matrices to determine the possible presence of latent variables and any correlations between samples. Using multivariate statistical analysis techniques, the student will also be able to identify candidate biomarkers or determine differentially expressed elements, and project the data onto web platforms for Enrichment Analysis, Pathway Analysis and Joint Pathway Analysis.

In the field of genomics and proteomics, the student will be able to discover homology relationships between sequences by means of global, local and multiple alignment algorithms. The student will be able to query public databases for homologous sequences and to tune the search parameters so as to obtain results that are significant for the research purpose. Furthermore, having identified a set of differentially expressed genes from a transcriptomic analysis, the student will be able to query the Gene Ontology vocabularies and interpret the result.

In the field of proteomics the student will be able to apply bioinformatic strategies for the interpretation of spectrometric data and project the results on web platforms for conducting biotechnological investigations.

Prerequisites

To take the course the student must have a good foundation in mathematics, organic chemistry, biochemistry, and molecular biology.

Teaching methods

Lectures with explanations on the blackboard and slide presentations.

The laboratory activities will be carried out in the classroom using personal computers.

Other information

To participate actively in the laboratory hours and to do the home exercises, students must install the free LibreOffice Calc software and R on their personal computers.

Learning verification modality

Written test plus oral exam, as follows:

Written exam (compulsory): 30 multiple-choice questions. Maximum time: 2 hours. Scoring: right answer = 1; wrong answer = -0.5; no answer = 0.

Evaluation of the written test:
From 0 to 11.5 = insufficient skills, exam to be repeated in the next session;
From 12 to 20.5 = compulsory oral exam;
From 21 to 30 = optional oral exam.

Oral exam: 3 questions on different topics, drawn by lot.
Maximum 3 points per question; oral grade: minimum 0, maximum 9.

Overall assessment: written grade + oral grade.
(Decimals are rounded up.)
Final grade < 17.5: insufficient skills, exam to be repeated in the next session;
Final grade >= 18: sufficient skills;
Final grade > 30: 30 cum laude.
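
As an illustration only, the grading rules above can be sketched in Python. The function names are hypothetical, and two details are assumptions rather than stated rules: a written score below 12 (including negative totals) is treated as insufficient, and a student who skips the optional oral exam is given an oral grade of 0.

```python
import math


def written_score(right, wrong, blank=0):
    """Score the 30-question test: +1 per right answer, -0.5 per wrong one, 0 if blank."""
    if right + wrong + blank != 30:
        raise ValueError("the written test has exactly 30 questions")
    return right * 1.0 - wrong * 0.5


def oral_requirement(score):
    """Map a written score to the outcome bands listed above."""
    if score < 12:  # assumption: anything below 12 counts as insufficient
        return "repeat"
    if score < 21:
        return "compulsory oral"
    return "optional oral"


def final_grade(written, oral=0.0):
    """Add written and oral grades, round decimals up, and cap at 30 cum laude."""
    total = written + oral
    if total < 17.5:
        return "insufficient"
    rounded = math.ceil(total)
    return "30 cum laude" if rounded > 30 else str(rounded)
```

For example, 20 right answers, 6 wrong and 4 blank give a written score of 17, which requires the oral exam; an oral grade of 5 then yields a final grade of 22.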

Extended program

1) Basic computer science elements: Computer architecture, Operating systems, Algorithms and programs, Programming languages, Introduction to the use of R, operations with variables, vectors and matrices. Servers and web servers, Databases, the relational model, the normalization process, relational algebra and querying a relational database, Boolean operators.

2) Elements of descriptive statistics: Definitions, populations and samples, types of sampling, types of data and variables, frequency distributions. Representation of frequency distributions: bar charts, pie charts, frequency tables and histograms for numerical data. Median and interquartile range, boxplot representation, arithmetic mean and standard deviation, comparison of measures of location and dispersion. The normal distribution: formula of the normal distribution and its properties, the standard normal distribution, statistical tables. Central limit theorem. Sampling distribution of an estimate, measuring the uncertainty of an estimate, confidence interval. Formulation and use of hypothesis tests: null hypothesis, alternative hypothesis, p-value. Z-test, t-test, ANOVA, F-test, ROC analysis.
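
A minimal sketch of a few of the quantities listed in point 2, written in Python with only the standard library (the course itself uses spreadsheets and R). The function names and the sample values are hypothetical. The confidence interval uses the normal approximation rather than the t distribution, and the Z-test assumes the population standard deviation is known.

```python
import math
from statistics import NormalDist, mean, median, quantiles, stdev


def describe(sample):
    """Return location and dispersion measures for a numeric sample."""
    q1, _, q3 = quantiles(sample, n=4)  # quartile cut points
    return {
        "mean": mean(sample),
        "sd": stdev(sample),  # sample standard deviation
        "median": median(sample),
        "iqr": q3 - q1,  # interquartile range
    }


def mean_confidence_interval(sample, level=0.95):
    """Normal-approximation confidence interval for the mean (an assumption)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * stdev(sample) / math.sqrt(len(sample))
    centre = mean(sample)
    return centre - half_width, centre + half_width


def z_test_p_value(sample, mu0, sigma):
    """Two-sided Z-test of H0: population mean == mu0, with known sigma."""
    z = (mean(sample) - mu0) / (sigma / math.sqrt(len(sample)))
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical measurements, for illustration only.
measurements = [4.1, 5.0, 5.2, 4.8, 5.5, 4.9, 5.1, 5.3]
summary = describe(measurements)
low, high = mean_confidence_interval(measurements)
```

For small samples such as this one, a t-based interval would be wider than the normal approximation shown here.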

3) Multivariate Statistical Analysis: data matrix properties: data filtering, transformation and scaling. Covariance and covariance matrix, Heatmap graphic representation, ANOVA analysis, Volcano Plot. PCA, LDA, PLS analysis methods. Use of dedicated web platforms (MetaboAnalyst) and introduction to the use of statistical functions developed in R.

4) Biological and molecular evolution, molecular mechanisms underlying evolutionary processes, Homologous Genes, orthologues and paralogues.

5) Alignment and comparison between biological sequences, Global alignment of pairs of sequences, Dynamic programming, Substitution matrices, Local alignment of pairs of sequences, Similarity searches in databases, BLAST: input and output parameters, Significance of sequence alignments, Interpretation of results. Alignment of sequences to genomes, Multiple alignment of sequences.
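
To illustrate global pairwise alignment by dynamic programming, here is a minimal Python sketch of the Needleman-Wunsch algorithm. The scoring scheme (match +1, mismatch -1, linear gap penalty -1) is an assumption chosen for simplicity; in practice a substitution matrix would replace the match/mismatch scores.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    n, m = len(a), len(b)

    def pair_score(i, j):
        # Score for aligning a[i-1] with b[j-1] (simple match/mismatch scheme).
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + pair_score(i, j),  # align two residues
                score[i - 1][j] + gap,                   # gap in b
                score[i][j - 1] + gap,                   # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair_score(i, j):
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1

    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))
```

Calling `needleman_wunsch("ACGT", "AGT")` aligns `ACGT` against `A-GT`: three matches and one gap. Local alignment (Smith-Waterman) differs mainly in clamping cell scores at zero and starting the traceback from the highest-scoring cell.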

6) Overview of the main nucleic acid sequencing platforms and of genome reconstruction and annotation. Proteins and proteomes: Functional annotation of proteins, Databases: UniProt, PROSITE, ELM, PDB, PDBe, IntAct, MINT, STRING.

7) Proteomic analysis by mass spectrometry, interpretation of spectra, use of databases and web services dedicated to proteomics (Mascot).