The National Program on
Complex Data Structures


The First Canadian Workshop on Statistical Genomics
to be held at The Fields Institute, September 3-5, 2003



Vincent J. Carey, Harvard Medical School
Genomic EDA and Modeling with R/Bioconductor
Bioconductor pursues the creation of flexible and portable tools for statistical analysis of genomic data. I will describe Bioconductor facilities for exploratory data analysis and flexible statistical inference with microarray data. Particular examples will include classification of gene expression densities, visualization and inference on genomic network structures, and flexible methods for testing hypotheses about the roles of pathways and pathway components in gene expression studies.

Christopher Field, Dalhousie University
Robustness Issues in Phylogeny
To estimate the tree structure for a set of taxa, we typically use a statistical model for evolution and compute the maximum likelihood estimate. Molecular biologists recognize that the model is a rough approximation to reality, and there is a considerable literature on the effects of model deviations. In this talk, I will examine some of these deviations, paying particular attention to the type of robustness methodology needed to successfully estimate the tree and make reliable inferences.

Greg Gloor, University of Western Ontario
Co-evolution and mutual information of amino acid positions in protein families
Proteins are extremely complicated molecular machines that have evolved to perform a particular cellular function. While knowing the structure of a given protein often gives valuable insights into its function, there are also many unanswered questions. This is because each structure is a snapshot of one particular conformation of a protein isolated from one individual species. In many instances functionally important amino acid positions are conserved, but mutation studies show that many non-conserved positions are equally important. We are using mutual information to find these important, yet variable, amino acid positions in protein families. I will describe our progress on this project, and present some strengths and limitations of the current generation of tools used to show the correspondence between structure and sequence.
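As a rough illustration of the idea, the mutual information between two columns of a multiple sequence alignment can be estimated from residue frequencies. The sketch below is only a plain plug-in estimate; the abstract does not say which estimator or corrections the speaker uses, and the function name `column_mutual_information` is hypothetical.

```python
from collections import Counter
from math import log2


def column_mutual_information(col_a, col_b):
    """Plug-in mutual information (in bits) between two alignment columns.

    col_a and col_b hold the residue found at each position in every
    sequence of the family, in the same sequence order.
    """
    if len(col_a) != len(col_b) or len(col_a) == 0:
        raise ValueError("columns must be non-empty and of equal length")
    n = len(col_a)
    count_a = Counter(col_a)               # residue counts in column A
    count_b = Counter(col_b)               # residue counts in column B
    count_ab = Counter(zip(col_a, col_b))  # joint residue-pair counts
    mi = 0.0
    for (a, b), c in count_ab.items():
        p_ab = c / n
        # p_ab / (p_a * p_b), written in terms of raw counts
        mi += p_ab * log2(c * n / (count_a[a] * count_b[b]))
    return mi


# Two variable columns that change together share information ...
covarying = column_mutual_information("AAVV", "LLII")
# ... while a fully conserved column shares none with any other column.
conserved = column_mutual_information("AAAA", "LLII")
```

A fully conserved position has zero entropy and therefore zero mutual information with every other position, which is why this measure points toward variable positions rather than conserved ones.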

David Sankoff, University of Ottawa
Far-reaching effects of missing map data and local shuffling on the inference of genome rearrangement history
Joint work with Phil Trinh. Until recently, algorithms for studying the evolution of gene order could only be applied to small genomes (mitochondria, chloroplasts, prokaryotes), the difficulty with mammalian and other larger eukaryotic nuclear genomes lying not so much in their much greater length but rather in the absence of comprehensive lists of genes and their orthologs. Pavel Pevzner and Glenn Tesler (PNAS 2003) have suggested a way to bypass gene finding and ortholog identification by using the order of syntenic blocks constructed solely from sequence data as input to a genome rearrangement algorithm. The method focuses on major evolutionary events by glossing over small block-internal rearrangements and neglecting intervening blocks smaller than a threshold length. This use of large "sanitized" blocks, and the neglect of short blocks, may, however, blur important parts of the historical derivation of the genomes. We model the effects of eliminating and amalgamating short blocks, concentrating on the summary statistic of "breakpoint re-use" introduced by Pevzner and Tesler. They did not conceive of this as an evolutionary distance, but in the context of their protocol it effectively measures to what extent genomes have diverged in becoming random permutations of blocks with respect to each other. We use analytic and simulation methods to investigate breakpoint re-use as a function of threshold size and of rearrangement parameters. We discuss the implications of our findings for the comparison of mammalian genomes and suggest a number of mathematical, algorithmic and statistical lines for further developing the Pevzner-Tesler approach.
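To make the summary statistic concrete, here is a minimal sketch that counts breakpoints between two block orders. It assumes one genome is written as a signed permutation of blocks 1..n relative to the other, and it assumes breakpoint re-use is summarized as 2d/b, where d is a rearrangement distance and b is the number of breakpoints. The abstract does not give this formula, so treat it as an assumption. Computing d is beyond this sketch and it is passed in by the caller. Both function names are hypothetical.

```python
def count_breakpoints(signed_perm):
    """Count breakpoints of a signed block order relative to 1, 2, ..., n.

    A negative entry means the block appears in reversed orientation.
    """
    n = len(signed_perm)
    if sorted(abs(x) for x in signed_perm) != list(range(1, n + 1)):
        raise ValueError("expected a signed permutation of 1..n")
    # Frame the genome with 0 and n+1 so that the two ends are checked too.
    framed = [0] + list(signed_perm) + [n + 1]
    # An adjacency (left, right) is conserved exactly when right - left == 1;
    # this also covers reversed runs such as (-3, -2).
    return sum(1 for left, right in zip(framed, framed[1:]) if right - left != 1)


def reuse_rate(distance, breakpoints):
    """Assumed summary: two breakpoint uses per rearrangement, per breakpoint."""
    if breakpoints <= 0:
        raise ValueError("need at least one breakpoint")
    return 2 * distance / breakpoints


# One reversal of blocks 2..3 creates two breakpoints and re-uses neither.
b = count_breakpoints([1, -3, -2, 4])
r = reuse_rate(1, b)
```

Under this assumed summary, a value of 1 means every breakpoint was used once, and larger values mean that rearrangements shared breakpoints. Dropping short blocks removes breakpoints and can therefore change the ratio.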

David Tritchler, University of Toronto
A Spectral Clustering Method for Microarray Data
Joint work with Shafagh Fallah and Joseph Beyene. Cluster analysis is a commonly used dimension reduction technique. This talk introduces a clustering method motivated by a multivariate analysis of variance model and computationally based on eigenanalysis (thus the term "spectral" in the title). Our focus is on large problems, and we present the method in the context of clustering genes and arrays using microarray expression data. The computational algorithm for the method has complexity linear in the number of genes.

Of the numerous methods for constructing clusters from microarray data, many require that the number of clusters believed present in the data be specified a priori, and in general judgements about the appropriate number of clusters are problematic. We also introduce a method for assessing the number of clusters exhibited in microarray data based on the eigenvalues of a particular matrix.

Jean Yee Hwa Yang, University of California, San Francisco
Statistical Issues in the Design of Microarray Experiments
Microarray experiments performed in many areas of biological sciences generate large and complex multivariate datasets. This talk addresses statistical design and analysis issues, which are essential to improve the efficiency and reliability of cDNA microarray experiments. We discuss various considerations unique to the design of cDNA microarrays, and examine how different types of replication affect design decisions. We calculate variances of two classes of estimates of differential gene expression based on log ratios of fluorescence intensities from cDNA microarray experiments: direct estimates, using measurements from the same slide, and indirect estimates, using measurements from different slides. These variances are compared and numerical estimates are obtained from a small experiment. Some qualitative and quantitative conclusions are drawn which have potential relevance to the design of cDNA microarray experiments.
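The comparison of direct and indirect estimates can be illustrated with a simplified calculation. It assumes that every slide yields a log ratio with the same variance $\sigma^2$ and that slides are independent. Both are assumptions made here for illustration, and the talk's own variance calculations may include correlations that this sketch ignores. The symbols $M_{AB}$, $M_{AR}$, $M_{BR}$ and the common reference sample $R$ are introduced only for this example.

```latex
% Direct estimate: samples A and B hybridized together on one slide.
\operatorname{Var}\bigl(M_{AB}\bigr) = \sigma^2

% Indirect estimate: A and B each compared with a reference R on separate slides.
\operatorname{Var}\bigl(M_{AR} - M_{BR}\bigr)
  = \operatorname{Var}\bigl(M_{AR}\bigr) + \operatorname{Var}\bigl(M_{BR}\bigr)
  = 2\sigma^2

% With the same budget of two slides, averaging two direct comparisons gives
\operatorname{Var}\Bigl(\tfrac{1}{2}\bigl(M_{AB}^{(1)} + M_{AB}^{(2)}\bigr)\Bigr)
  = \frac{\sigma^2}{2}
```

Under these assumptions, two slides spent on direct comparisons give a variance four times smaller than the same two slides spent on an indirect comparison through a reference.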

Kenny Q Ye and Anil Dhundale, SUNY at Stony Brook
Pooling or not pooling in microarray experiments - an experimental design point of view
Microarray experiments are often used to detect differences in gene expression between two populations of cells: a test population and a control population. However, in many cases, such as when the samples come from individuals in a population, biological variability can produce changes that are irrelevant to the question of interest, and it then becomes important to assay many individual samples to obtain statistically meaningful results. Unfortunately, the cost of performing some types of microarray experiments can be prohibitive. A potentially effective but not well-publicized alternative is to pool individual RNA samples together for hybridization on a single microarray. This method can dramatically reduce the experimental costs while maintaining high power in detecting the changes in expression levels that relate to the specific question of interest. In this talk, we will discuss why this technique works and the optimal design strategy for pooling. The idea will also be illustrated by a synthetic experiment and a real experiment that studies Afib (cardiac atrial fibrillation), a serious health condition that affects a large percentage of the population but whose mechanism remains poorly understood.
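One way to see why pooling can work is a simple variance-components sketch. It assumes that each array measures the average expression of the individuals in its pool plus independent technical noise, so that biological variance is divided by the pool size while technical variance is not. This is a common simplifying assumption, not the speakers' design or analysis. The function name and the variance values are hypothetical.

```python
def variance_of_group_mean(n_arrays, pool_size, var_biological, var_technical):
    """Variance of a group's mean expression estimate under the assumed model.

    Each array carries a pool of `pool_size` individuals. Pooling averages
    the biological variation within the pool but leaves the technical
    (array-level) variation untouched.
    """
    if n_arrays < 1 or pool_size < 1:
        raise ValueError("n_arrays and pool_size must be at least 1")
    per_array = var_biological / pool_size + var_technical
    return per_array / n_arrays


# Hypothetical variances: biological variation dominates technical noise.
individual = variance_of_group_mean(8, 1, 1.0, 0.25)   # 8 arrays, 8 subjects
pooled = variance_of_group_mean(8, 3, 1.0, 0.25)       # 8 arrays, 24 subjects
many_arrays = variance_of_group_mean(20, 1, 1.0, 0.25) # 20 arrays, 20 subjects
```

With these illustrative numbers, eight arrays of three-subject pools come close to the precision of twenty individually hybridized arrays. The gain shrinks as technical variance grows relative to biological variance, because pooling only reduces the biological term.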
