April 21, 2015

April 9 (3:30 p.m.) & April 10 (11:00 a.m.), 2015
Distinguished Lecture Series in Statistical Science

Room 230, Fields Institute

Terry Speed
Walter & Eliza Hall Institute of Medical Research, Melbourne

Terry Speed is currently a Senior Principal Research Scientist at the Walter and Eliza Hall Institute of Medical Research. His lab has a particular focus on molecular data collected by cancer researchers, but also works with scientists studying immune and infectious diseases, and those who do research in basic biomedical science. His research interests are broad, but include the statistical and bioinformatic analysis of microarray, DNA sequence and mass spectrometry data from genetics, genomics, proteomics, and metabolomics. The lab works with molecular data at several different levels, from the lowest level where the data come directly from the instruments that generate them, up to the tasks of data integration, and of relating molecular to clinical data. Speed has served on the editorial boards of many publications including the Journal of Computational Biology, JASA, Bernoulli and the Australian and New Zealand Journal of Statistics, and has been recognized with the Australian Prime Minister's Prize for Science and the Eureka Prize for Scientific Leadership, among other awards.

General lecture: Epigenetics: A New Frontier

Abstract: Scientists have now mapped the human genome. The next frontier is understanding human epigenomes: the 'instructions' which tell the DNA whether to make skin cells or blood cells or other body parts. Apart from a few exceptions, the DNA sequence of an organism is the same whatever cell is considered. So why are the blood, nerve, skin and muscle cells so different, and what mechanism is employed to create this difference? The answer lies in epigenetics. If we compare the genome sequence to text, the epigenome is the punctuation and shows how the DNA should be read. Advances in DNA sequencing in the last 5-8 years have allowed large amounts of DNA sequence data to be compiled. For every single reference human genome, there will be literally hundreds of reference epigenomes, and their analysis will occupy biologists, bioinformaticians and biostatisticians for some time to come. In this talk I will introduce the topic and the data, and outline some of the challenges.

Specialized lecture: Normalization of omic data after 2007
(joint with Johann Gagnon-Bartsch and Laurent Jacob)

Abstract: For over a decade now, normalization of transcriptomic, genomic and, more recently, metabolomic and proteomic data has been something you do to "raw" data to remove biases, technical artifacts and other systematic non-biological features. These features could be due to sample preparation and storage, reagents, equipment, people and so on. It was a "one-off" approach to what I'm going to call removing unwanted variation. Since around 2007, a more nuanced approach has been available, due to JT Leek and J Storey (SVA) and O Stegle et al (PEER). These new approaches do two things differently. First, they do not assume the sources of unwanted variation are known in advance; instead, the sources are inferred from the data. Second, they deal with the unwanted variation in a model-based way, not "up front." That is, they do it in a problem-specific manner, where different inference problems warrant different model-based solutions. For example, the solution for removing unwanted variation in estimation is not necessarily the same as the one for prediction. Over the last few years, I have been working with Johann Gagnon-Bartsch and Laurent Jacob on these same problems by making use of positive and negative controls, a strategy which we think has some advantages. In this talk I'll review the area, and highlight some of the advantages of working with controls. Illustrations will be from microarray, mass spec and RNA-seq data.
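
To make the idea of negative controls concrete, the following is a minimal sketch and not the speakers' implementation. It assumes a simple linear model on simulated data, in which a hidden unwanted factor is correlated with the factor of interest and a set of genes is known to be unaffected by the factor of interest. All names, sizes and simulation settings (x, w, controls, W_hat, fit_beta and so on) are illustrative assumptions. The unwanted factor is estimated from the control genes alone and then included as a covariate in the regression.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data (all values below are illustrative assumptions).
n, m, k = 60, 500, 1                      # samples, genes, unwanted factors
x = np.repeat([0.0, 1.0], n // 2)         # factor of interest: two groups
w = 2.0 * x + rng.normal(size=n)          # hidden unwanted factor, correlated with x
beta = np.zeros(m)
beta[:50] = 1.0                           # only the first 50 genes respond to x
alpha = 2.0 * rng.normal(size=m)          # effect of the unwanted factor on each gene
Y = np.outer(x, beta) + np.outer(w, alpha) + 0.5 * rng.normal(size=(n, m))

# Negative controls: genes assumed to be unaffected by x (beta is zero here).
controls = np.arange(300, 500)


def fit_beta(Y, x, extra=None):
    """Least-squares coefficient of x for every gene, with optional extra covariates."""
    cols = [np.ones_like(x), x]
    if extra is not None:
        cols.extend(extra.T)              # one column per estimated unwanted factor
    design = np.column_stack(cols)
    coef, *_ = np.linalg.lstsq(design, Y, rcond=None)
    return coef[1]                        # row 1 holds the coefficients of x


def rmse(estimate, truth):
    return float(np.sqrt(np.mean((estimate - truth) ** 2)))


# Step 1: estimate the unwanted factor from the control genes only.
Yc = Y[:, controls]
Yc = Yc - Yc.mean(axis=0)                 # remove per-gene means
U, s, Vt = np.linalg.svd(Yc, full_matrices=False)
W_hat = U[:, :k]                          # top-k left singular vectors

# Step 2: regress every gene on x, with and without the estimated factor.
naive_rmse = rmse(fit_beta(Y, x), beta)
adjusted_rmse = rmse(fit_beta(Y, x, W_hat), beta)
print(f"RMSE without adjustment: {naive_rmse:.3f}")
print(f"RMSE with control-based adjustment: {adjusted_rmse:.3f}")
```

In this simulation the unadjusted estimates absorb the unwanted factor because it is correlated with the group labels, while the control-based adjustment removes most of that bias. The number of factors k is fixed by hand here; choosing it in practice is a separate question.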


April 23 (3:30 p.m.) & 24 (11:00 a.m.), 2015
Distinguished Lecture Series in Statistical Science

Room 230, Fields Institute

Bin Yu
University of California, Berkeley

Bin Yu is Chancellor’s Professor in the Departments of Statistics and of Electrical Engineering & Computer Science at the University of California at Berkeley. Her current research interests focus on statistics and machine learning theory, methodologies, and algorithms for solving high-dimensional data problems. Her group is engaged in interdisciplinary research with scientists from genomics, neuroscience, and remote sensing. She is a Member of the U.S. National Academy of Sciences and a Fellow of the American Academy of Arts and Sciences. She was a Guggenheim Fellow in 2006, an Invited Speaker at ICIAM in 2011, and the Tukey Memorial Lecturer of the Bernoulli Society in 2012. She was President of IMS (Institute of Mathematical Statistics) in 2013-2014.

April 23, 2015 at 3:30: Stability

Reproducibility is imperative for any scientific discovery. More often than not, modern scientific findings rely on statistical analysis of high-dimensional data. At a minimum, reproducibility manifests itself in stability of statistical results relative to “reasonable” perturbations to data and to the model used. Jackknife, bootstrap, and cross-validation are based on perturbations to data, while robust statistics methods deal with perturbations to models.
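
As a small illustration of what a perturbation to data means, the sketch below computes jackknife and bootstrap standard errors for a sample mean. It is a generic textbook example on simulated data, not material from the talk; the function names, sample size and number of bootstrap replicates are assumptions chosen for the example.

```python
import numpy as np


def jackknife_se(data, estimator):
    """Standard error from leave-one-out perturbations of the data."""
    data = np.asarray(data, dtype=float)
    n = data.size
    loo = np.array([estimator(np.delete(data, i)) for i in range(n)])
    return float(np.sqrt((n - 1) / n * np.sum((loo - loo.mean()) ** 2)))


def bootstrap_se(data, estimator, n_boot=2000, seed=0):
    """Standard error from resampling the data with replacement."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    reps = np.array([
        estimator(rng.choice(data, size=data.size, replace=True))
        for _ in range(n_boot)
    ])
    return float(reps.std(ddof=1))


# Simulated sample (illustrative assumption).
sample = np.random.default_rng(1).normal(loc=5.0, scale=2.0, size=50)

jk = jackknife_se(sample, np.mean)
bs = bootstrap_se(sample, np.mean)
classical = float(sample.std(ddof=1) / np.sqrt(sample.size))

print(f"jackknife SE: {jk:.4f}")
print(f"bootstrap SE: {bs:.4f}")
print(f"classical SE: {classical:.4f}")
```

For the sample mean the jackknife standard error coincides with the classical formula s/sqrt(n), and the bootstrap value is close to it, so the estimate is stable under both kinds of perturbation. Passing a different estimator (for example np.median) shows how the two schemes can disagree for less smooth statistics.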

In this talk, a case is made for the importance of stability in statistics. Firstly, we motivate the necessity of stability of interpretable encoding models for movie reconstruction from brain fMRI signals. Secondly, we find strong evidence in the literature to demonstrate the central role of stability in statistical inference. Thirdly, a smoothing parameter selector based on estimation stability (ES), ES-CV, is proposed for Lasso, in order to bring stability to bear on cross-validation (CV). ES-CV is then utilized in the encoding models to reduce the number of predictors by 60% with almost no loss (1.3%) of prediction performance across over 2,000 voxels. Lastly, a novel “stability” argument is seen to drive new results that shed light on the intriguing interactions between sample-to-sample variability and heavier-tailed error distributions (e.g. double-exponential) in high-dimensional regression models with p predictors and n independent samples. In particular, when p/n belongs to (0.3, 1) and the error is double-exponential, the Least Squares (LS) estimator is better than the Least Absolute Deviation (LAD) estimator.

This talk draws materials from papers with S. Nishimoto, A. T. Vu, T. Naselaris, Y. Benjamini, J. L. Gallant, with C. Lim, and with N. El Karoui, D. Bean, P. Bickel, and C. Lim.


April 24, 2015 at 11: The multi-facets of a data science project to answer: how are organs formed?

Genome-wide data reveal an intricate landscape where gene actions and interactions in diverse spatial areas are common both during development and in normal and abnormal tissues. Understanding local gene networks is thus key to developing treatments for human diseases. Given the size and complexity of recently available systematic spatial data, defining the biologically relevant spatial areas and modeling the corresponding local biological networks present an exciting and ongoing challenge. It requires the integration of biology, statistics and computer science; that is, it requires data science.

In this talk, I present results from a current project co-led by biologist Erwin Frise from Lawrence Berkeley National Lab (LBNL) to answer the fundamental systems biology question in the talk title. My group (Siqi Wu, Antony Joseph, Karl Kumbier) collaborates with Dr. Frise and other biologists (Ann Hommands) of Celniker’s lab at LBNL, which generates the Drosophila spatial expression embryonic image data. We leverage our group’s prior research experience from computational neuroscience to use appropriate ideas of statistical machine learning in order to create a novel image representation decomposing spatial data into building blocks (or principal patterns). These principal patterns provide an innovative and biologically meaningful approach for the interpretation and analysis of large complex spatial data. They are the basis for constructing local gene networks, and we have been able to reproduce almost all the links in the Nobel Prize-winning (local) gap-gene network. In fact, Celniker’s lab is running knock-out experiments to validate our predictions on gene-gene interactions. Moreover, to understand the decomposition algorithm of images, we have derived sufficient and almost necessary conditions for local identifiability of the algorithm in the noiseless and complete case. Finally, we are collaborating with Dr. Wei Xu from Tsinghua University to devise a scalable open software package to manage the acquisition and computation of imaged data, designed in a manner that will be usable by biologists and expandable by developers.

