Statistics 410-01 Bioinformatics Spring 2003

Time and Place: Tu Th 12:30-1:45, CLAS 344

Instructor: Professor Lynn Kuo

Email: lynn@stat.uconn.edu

Office: CLAS 330

Phone: 486-2951

Office hours: Tu Th 2-3:30

Guest Lecturers:

Dr. Winfried Krueger, supervisor of the Genomics Core facility, UCONN Health Center, email: WKRUEGER@PANDA.UCHC.EDU

Mr. Pascal Lapierre, researcher at the lab of Professor Peter Gogarten (Molecular and Cell Biology), email: Pascal.Lapierre@Huskymail.uconn.edu

Professor Dong-Guk Shin of the Computer Science Department, email: shin@engr.uconn.edu

Required Textbooks:

(1) Durbin, R., Eddy, S., Krogh, A. and Mitchison, G. (1998). Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press.

(2) Greg Gibson and Spencer V. Muse (2002). A Primer of Genome Science, Sinauer Associates, Inc.

(3) M. Kanehisa (2000). Post-genome informatics, Oxford University Press.

Recommended Texts:

(1) Michael S. Waterman (1995). Introduction to Computational Biology: Maps, Sequences and Genomes. Chapman & Hall.

(2) Pierre Baldi and Wesley Hatfield (2002). DNA Microarrays and Gene Expression: From Experiments to Data Analysis and Modeling. Cambridge University Press.

(3) Rex A. Dwyer (2002). Genomic Perl: From Concepts to Working Code. Cambridge University Press.

(4) Editor: M. Schena (1999). DNA Microarrays, Oxford University Press.

(5) Editor: Terry Speed (2002). Statistical Analysis of Gene Expression Microarray Data. Chapman & Hall/CRC.

(6) Editors: Andreas D. Baxevanis and B. F. Francis Ouellette (2001). Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, 2nd Ed. Wiley.

(7) Hastie, T., Tibshirani, R. and Friedman, J. (2001). The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer-Verlag.

(8) Han, J. and Kamber, M. (2001). Data Mining: Concepts and Techniques, Academic Press.

(9) Lodish et al. (2000). Molecular Cell Biology, 4th edition, Freeman.

(10) David P. Clark and Lonnie D. Russell (2000). Molecular Biology Made Simple and Fun, Cache River Press.

This is a research-oriented interdisciplinary course. Preference will be given to Ph.D. students in Molecular and Cell Biology or related fields, Computer Science, and Statistics. M.S. students in these fields will be accepted provided space is available.

The availability of massive amounts of DNA sequence data and protein structure data has spurred the need to extract the embedded information by computational and analytical means. This need is the major impetus for developing bioinformatics and computational biology. In this course, we will explore topics in gene expression studies, sequence alignment, and protein structure prediction.

The philosophy of this class is to train collaborators, not hybrids, in bioinformatics. Making hybrids, that is, teaching biologists enough math, or teaching statisticians (or computer scientists) enough biology, so that each group can be useful on its own, is not practical. The winning strategy is to train collaborators. There are mathematicians who want to solve problems in biology, but they solve problems that have no impact on the real world. Statisticians need to ask how to use what we know to solve the problems that biologists have and, having solved a problem, how to explain the solution so that biologists will understand it. We have plenty of people who know math and plenty who know biology, but no communication between them. The emphasis of this course is therefore on communication among biologists, computer scientists, and statisticians.

Several group projects will be developed in the course. Each group is expected to include at least one biologist, one statistician, and one computer scientist. A term paper and an oral presentation are required at the end of the semester, and your grade will be based on both.

Week 1 (Jan. 23 and 28): Organization of the course (Kuo)

Biology of Gene and Protein (Lapierre)

Week 2 (Jan. 30 and Feb. 4): Genome Sequencing and Annotation (Kuo)

Week 3 (Feb. 6): Mining the Genome (Shin) (Shin_1, Shin_2, Shin_3)

Week 3 (Feb. 11): Microarray Structure (Krueger) (Krueger_1)

Week 4 (Feb. 13): Mining the Genome

Week 4 (Feb. 18): Data Interpretation for Microarray (Krueger) (Krueger_2)

Week 5 (Feb. 24, 6-9pm): Hierarchical modeling and variance components, MCMC, normalization methods

Week 6 (Feb. 27 and March 4): Comparative analysis, false discovery rates, permutation analysis (Wong&Tseng_1, Wong&Tseng_2)

Week 7 (March 11 and 13): Clustering, self-organizing maps, and dimension reduction (Han&Kamber_8)

Week 8 (March 25 and 27): Supervised learning, neural networks, classification, prediction (Han&Kamber_7)

Week 9 (April 1): Support vector machines (SVM)

Week 10 (April 3 and 8): Proteomics and Functional Genomics

Week 11 (April 10 and 15): Building phylogenetic trees

Week 12 (April 17 and 22): Integrative Genomics

Weeks 13, 14 and 15: Research Presentations