
THE FIELDS
INSTITUTE FOR RESEARCH IN MATHEMATICAL SCIENCES 

April 9 (3:30 p.m.) & April 10 (11:00 a.m.), 2015
Distinguished Lecture Series in Statistical Science
Room 230, Fields Institute
Terry Speed
Walter & Eliza Hall Institute of Medical Research, Melbourne



Terry Speed is currently a Senior Principal Research Scientist at the Walter
and Eliza Hall Institute of Medical Research. His lab has a particular focus
on molecular data collected by cancer researchers, but also works with scientists
studying immune and infectious diseases, and those who do research in basic
biomedical science. His research interests are broad and include the statistical
and bioinformatic analysis of microarray, DNA sequence and mass spectrometry
data from genetics, genomics, proteomics, and metabolomics. The lab works
with molecular data at several different levels, from the lowest level where
the data come directly from the instruments that generate them, up to the tasks
of data integration, and of relating molecular to clinical data. Speed has
served on the editorial boards of many publications including the Journal of
Computational Biology, JASA, Bernoulli and the Australian and New Zealand
Journal of Statistics, and has been recognized with the Australian Prime Minister's
Prize for Science and the Eureka Prize for Scientific Leadership, among other awards.
General lecture: Epigenetics: A New Frontier
Abstract: Scientists have now mapped the human genome; the next
frontier is understanding human epigenomes: the 'instructions' which tell
the DNA whether to make skin cells or blood cells or other body parts.
Apart from a few exceptions, the DNA sequence of an organism is the same
whatever cell is considered. So why are the blood, nerve, skin and muscle
cells so different and what mechanism is employed to create this difference?
The answer lies in epigenetics. If we compare the genome sequence to text,
the epigenome is the punctuation and shows how the DNA should be read.
Advances in DNA sequencing in the last 5-8 years have allowed large amounts
of DNA sequence data to be compiled. For every single reference human
genome, there will be literally hundreds of reference epigenomes, and
their analysis will occupy biologists, bioinformaticians and biostatisticians
for some time to come. In this talk I will introduce the topic and the
data, and outline some of the challenges.
Specialized lecture: Normalization of omic data after 2007
(joint with Johann Gagnon-Bartsch and Laurent Jacob)
Abstract: For over a decade now, normalization of transcriptomic,
genomic and more recently metabolomic and proteomic data has been something
you do to "raw" data to remove biases, technical artifacts and
other systematic non-biological features. These features could be due
to sample preparation and storage, reagents, equipment, people and so
on. It was a "one-off" fix for the problem I'm going to call removing
unwanted variation. Since around 2007, a more nuanced approach has been
available, due to JT Leek and J Storey (SVA) and O Stegle et al (PEER).
These new approaches do two things differently. The first is that they
do not assume the sources of unwanted variation are known in advance;
they are inferred from the data. The second is that they deal with the unwanted
variation in a model-based way, not "up front." That is, they
do it in a problem-specific manner, where different inference problems
warrant different model-based solutions. For example, the solution for
removing unwanted variation in estimation is not necessarily the same
as the one for prediction. Over the last few years, I have been working
with Johann Gagnon-Bartsch and Laurent Jacob on these same problems by
making use of positive and negative controls, a strategy which we think
has some advantages. In this talk I'll review the area, and highlight
some of the advantages of working with controls. Illustrations will be
from microarray, mass spec and RNA-seq data.
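
To make the idea of negative controls concrete, here is a minimal sketch and
not the method presented in the lecture. It assumes a single unwanted factor,
estimates it as the average centred profile of genes assumed to be unaffected
by the biology of interest, and regresses that factor out of every gene. The
function names, gene names and expression values are all made up for
illustration, and the toy design is deliberately balanced so that the unwanted
factor is uncorrelated with the treatment.

```python
def centre(row):
    """Subtract the row mean from every value."""
    m = sum(row) / len(row)
    return [v - m for v in row]


def remove_unwanted(data, control_genes):
    """Regress out one unwanted factor estimated from negative controls.

    data: dict mapping gene name -> list of per-sample values (hypothetical).
    control_genes: names of genes assumed to carry only unwanted variation.
    """
    n = len(next(iter(data.values())))
    # Estimate the unwanted factor W (one value per sample) as the
    # average centred profile of the negative-control genes.
    controls = [centre(data[g]) for g in control_genes]
    w = [sum(c[j] for c in controls) / len(controls) for j in range(n)]
    ww = sum(v * v for v in w)
    adjusted = {}
    for gene, row in data.items():
        # Least-squares slope of this gene on W, then subtract the fitted part.
        alpha = sum(r * v for r, v in zip(centre(row), w)) / ww if ww else 0.0
        adjusted[gene] = [r - alpha * v for r, v in zip(row, w)]
    return adjusted


# Made-up data: 8 samples, the last 4 from a second batch (shift of +1),
# with treatment alternating 0, 1, 0, 1, ... (effect of +2 on geneX).
expr = {
    "ctrl1": [5, 5, 5, 5, 6, 6, 6, 6],
    "ctrl2": [3, 3, 3, 3, 4, 4, 4, 4],
    "geneX": [4, 6, 4, 6, 5, 7, 5, 7],
}
adjusted = remove_unwanted(expr, ["ctrl1", "ctrl2"])
print(adjusted["geneX"])
```

In this toy case the batch shift disappears from geneX while the treatment
difference is kept. When the unwanted factor is correlated with the factor of
interest, subtracting it this way can also remove real signal, which is one
reason the abstract favours handling unwanted variation inside the model.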

April 23 (3:30 p.m.) & April 24 (11:00 a.m.), 2015
Distinguished Lecture Series in Statistical Science
Room 230, Fields Institute
Bin Yu
University of California, Berkeley


Bin Yu is Chancellor’s Professor in the Departments of Statistics and
of Electrical Engineering & Computer Science at the University of California
at Berkeley. Her current research interests focus on statistics and machine
learning theory, methodologies, and algorithms for solving high-dimensional
data problems. Her group is engaged in interdisciplinary research with scientists
from genomics, neuroscience, and remote sensing. She is a Member of the U.S.
National Academy of Sciences and a Fellow of the American Academy of Arts and
Sciences. She was a Guggenheim Fellow in 2006, an Invited Speaker at ICIAM
in 2011, and the Tukey Memorial Lecturer of the Bernoulli Society in 2012.
She was President of IMS (Institute of Mathematical Statistics) in 2013-2014.
April 23, 2015 at 3:30: Stability
Reproducibility is imperative for any scientific discovery. More
often than not, modern scientific findings rely on statistical analysis of
high-dimensional data. At a minimum, reproducibility manifests itself in stability
of statistical results relative to “reasonable” perturbations to
data and to the model used. Jackknife, bootstrap, and cross-validation are
based on perturbations to data, while robust statistics methods deal with
perturbations to models.
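
As a small illustration of stability under perturbations to data, the sketch
below recomputes two statistics on every leave-one-out subsample (the
perturbation used by the jackknife) and reports how far each estimate moves.
It is an illustrative sketch only; the helper names and the sample values,
including the single outlier, are made up.

```python
import statistics


def leave_one_out(data, statistic):
    """Recompute `statistic` on each subsample that drops one observation."""
    return [statistic(data[:i] + data[i + 1:]) for i in range(len(data))]


def spread(values):
    """Range of the perturbed estimates: a crude measure of instability."""
    return max(values) - min(values)


# Made-up sample with one outlying observation.
sample = [2.1, 2.4, 1.9, 2.2, 2.0, 2.3, 9.5]

mean_spread = spread(leave_one_out(sample, statistics.mean))
median_spread = spread(leave_one_out(sample, statistics.median))

print(f"leave-one-out spread of the mean:   {mean_spread:.3f}")
print(f"leave-one-out spread of the median: {median_spread:.3f}")
```

On this sample the median moves far less than the mean when one observation
is removed, which is the sense in which it is the more stable summary here.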
In this talk, a case is made for the importance of stability in statistics.
Firstly, we motivate the necessity of stability of interpretable encoding
models for movie reconstruction from brain fMRI signals. Secondly, we find
strong evidence in the literature to demonstrate the central role of stability
in statistical inference. Thirdly, a smoothing parameter selector based
on estimation stability (ES), ESCV, is proposed for Lasso, in order to
bring stability to bear on cross-validation (CV). ESCV is then utilized
in the encoding models to reduce the number of predictors by 60% with almost
no loss (1.3%) of prediction performance across over 2,000 voxels. Lastly,
a novel “stability” argument is seen to drive new results that
shed light on the intriguing interactions between sample-to-sample variability
and heavier-tailed error distributions (e.g. double-exponential) in high-dimensional
regression models with p predictors and n independent samples. In particular,
when p/n belongs to (0.3, 1) and the error is double-exponential, the Least
Squares (LS) estimator is better than the Least Absolute Deviation (LAD)
estimator.
This talk draws material from papers with S. Nishimoto, A. T. Vu, T. Naselaris,
Y. Benjamini, J. L. Gallant, with C. Lim, and with N. El Karoui, D. Bean,
P. Bickel, and C. Lim.
April 24, 2015 at 11:00: The multi-facets of a data science project
to answer: how are organs formed?
Genome-wide data reveal an intricate landscape where gene actions and interactions
in diverse spatial areas are common both during development and in normal
and abnormal tissues. Understanding local gene networks is thus key to developing
treatments for human diseases. Given the size and complexity of recently
available systematic spatial data, defining the biologically relevant spatial
areas and modeling the corresponding local biological networks present an
exciting and ongoing challenge. It requires the integration of biology,
statistics and computer science; that is, it requires data science.
In this talk, I present results from a current project co-led by biologist
Erwin Frise from Lawrence Berkeley National Lab (LBNL) to answer the fundamental
systems biology question in the talk title. My group (Siqi Wu, Antony Joseph,
Karl Kumbier) collaborates with Dr. Frise and other biologists (Ann Hammonds)
of Celniker’s Lab at LBNL, which generates the Drosophila spatial expression
embryonic image data. We leverage our group’s prior research experience
from computational neuroscience to use appropriate ideas of statistical
machine learning in order to create a novel image representation decomposing
spatial data into building blocks (or principal patterns). These principal
patterns provide an innovative and biologically meaningful approach for
the interpretation and analysis of large complex spatial data. They are
the basis for constructing local gene networks, and we have been able to
reproduce almost all the links in the Nobel-prize-winning (local) gap-gene
network. In fact, Celniker’s lab is running knockout experiments to
validate our predictions on gene-gene interactions. Moreover, to understand
the decomposition algorithm of images, we have derived sufficient and almost
necessary conditions for local identifiability of the algorithm in the noiseless
and complete case. Finally, we are collaborating with Dr. Wei Xu from Tsinghua
University to devise a scalable open software package to manage the acquisition
and computation of imaged data, designed in a manner that will be usable
by biologists and expandable by developers.

