June 14, 2024

Applications of Statistical Science

Seminar Series

Additive Logistic Regression: a Statistical View of Boosting

Rob Tibshirani, Stanford University

December 17th 1998

Boosting (Freund and Schapire, 1995) is one of the most important recent developments in classification methodology. Boosting works by sequentially applying a classification algorithm to reweighted versions of the training data, and then taking a weighted majority vote of the sequence of classifiers thus produced. For many classification algorithms, this simple strategy results in dramatic improvements in performance. We show that this seemingly mysterious phenomenon can be understood in terms of well-known statistical principles, namely additive modeling and maximum likelihood. For the two-class problem, boosting can be viewed as an approximation to additive modeling on the logistic scale using maximum Bernoulli likelihood as a criterion. We develop more direct approximations and show that they produce results nearly identical to those of boosting. Direct multi-class generalizations based on multinomial likelihood are derived that exhibit performance comparable to other recently proposed multi-class generalizations of boosting in most situations, and far superior in some. We suggest a minor modification to boosting that can reduce computation, often by factors of 10 to 50. Finally, we apply these insights to produce an alternative formulation of boosting decision trees. This approach, based on best-first truncated tree induction, often leads to better performance, and can provide interpretable descriptions of the aggregate decision rule. It is also much faster computationally, making it more suitable for large-scale data mining applications.

This is joint work with Jerome Friedman and Trevor Hastie.
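
The reweight-and-vote procedure described in the abstract can be illustrated with a minimal sketch of discrete AdaBoost. This is not the speakers' code: the one-dimensional decision stumps, the toy data, and the function names (stump_predict, fit_stump, adaboost, predict) are assumptions made for illustration, and the weight update shown is the standard discrete AdaBoost rule rather than any of the modified procedures proposed in the talk.

```python
import math

def stump_predict(x, threshold, sign):
    """Predict `sign` when x > threshold and `-sign` otherwise."""
    return sign if x > threshold else -sign

def fit_stump(xs, ys, weights):
    """Search all 1-D stumps; return (weighted_error, threshold, sign) of the best."""
    best = None
    for threshold in sorted(set(xs)):
        for sign in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if stump_predict(x, threshold, sign) != y)
            if best is None or err < best[0]:
                best = (err, threshold, sign)
    return best

def adaboost(xs, ys, rounds):
    """Fit stumps sequentially to reweighted data; labels must be +1 or -1."""
    n = len(xs)
    weights = [1.0 / n] * n          # start from uniform weights
    ensemble = []
    for _ in range(rounds):
        err, threshold, sign = fit_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1.0 - 1e-10)   # guard against log(0)
        alpha = 0.5 * math.log((1.0 - err) / err)  # vote weight of this stump
        ensemble.append((alpha, threshold, sign))
        # Misclassified points gain weight, correctly classified points lose it.
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, sign))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    """Weighted majority vote of the fitted stumps."""
    score = sum(alpha * stump_predict(x, t, s) for alpha, t, s in ensemble)
    return 1 if score >= 0 else -1

# Toy data (an assumption): no single threshold separates the two classes.
xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, 1]
model = adaboost(xs, ys, rounds=3)
train_errors = sum(predict(model, x) != y for x, y in zip(xs, ys))
```

On this toy data no single stump misclassifies fewer than three of the ten points, while the three-round weighted vote classifies all ten correctly.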

Statistics and the Genetics of Complex Human Disease

Shelley Bull
Samuel Lunenfeld Research Institute of Mount Sinai Hospital and
Department of Public Health Sciences, University of Toronto

February 2nd 1999

The relationship between statistics and research in human genetics goes back to the work of early statisticians such as Fisher, Haldane, Pearson, and others. Because of random aspects of the process of meiosis that produces human gametes, transmission of genetic information from parents to offspring is inherently probabilistic. In many cases, the subject of interest, the gene, is not directly observable and its location on the genome is unknown, so typically inference is based on likelihoods that consider all possible observations that are consistent with known information. Similarly, because of the importance of pedigrees that extend across generations, data on some family members are often incomplete. The development of new molecular technologies has dramatically changed the nature and the volume of data that are available for the study of complex human disease. It is estimated that the completion of the Human Genome Project will reveal the location of 100,000 genes. As investigators turn their attention from simple single-gene Mendelian disease to the "genetic dissection of complex traits" such as diabetes and cardiovascular disease that depend on many genes, statistical methods such as regression and mixture distributions are required to model heterogeneity from known and unspecified sources. I will review current statistical approaches to finding genes for complex human disease and highlight some recent advances in these approaches.

The smell of greasepaint, the roar of the computer:
25 years of television network election night forecasting

David Andrews, University of Toronto

March 2nd 1999

Election-night forecasting includes statistical components for data collection, prediction and display. The evolution of these components and their interaction is reviewed. The statistical aspects of this activity have application to a broader class of problems. The role of a mathematical scientist in complex activities is discussed.

Computational Solutions to Problems Encountered in Spatial Data Analysis

Carl G. Amrhein, PhD
Department of Geography and The Graduate Program in Planning, Faculty of Arts and Science, University of Toronto

(Joint work with David Buckeridge, MD, MSc
Department of Public Health Sciences, Faculty of Medicine, University of Toronto)

April 6th 1999

The goal of this seminar is to provide an overview of current problems in spatial analysis, and explore computational solutions to some of these problems.

The seminar will begin with a review of the basic assumptions underlying spatial statistics, then move on to briefly examine the issue of spatial autocorrelation. Following this, a number of current problems in spatial analysis will be presented, including edge effects, the overlay problem, incompatible data sets, and aggregation effects. After this overview, the problem of aggregation effects will be examined more closely. Specifically, the importance of the problem will be discussed, and work done in this area will be reviewed.

An epidemiological study will then be used to examine some applied problems and solutions in spatial analysis with a focus on spatial autocorrelation. In this study, a GLM is used to model the relationship between hospital admission for respiratory illness and estimated exposure to motor vehicle emissions. Issues associated with data source overlay and georeferencing are briefly illustrated. Moran's coefficient of spatial autocorrelation is used to describe the degree of spatial autocorrelation in individual variables, and to examine regression residuals in order to explore the possible effect of spatial autocorrelation on the standard error of regression parameter estimates. Finally, some further issues will be briefly discussed, such as regression models that explicitly incorporate spatial autocorrelation, and the choice of connectivity measures in spatial analysis.
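
As a rough illustration of the Moran coefficient mentioned above, the sketch below computes I = (n / S0) * sum_ij w_ij (x_i - mean)(x_j - mean) / sum_i (x_i - mean)^2, where S0 is the sum of all connectivity weights. The binary chain adjacency, the toy values, and the function names (morans_i, chain_adjacency) are assumptions for illustration; the study's actual connectivity measure is not stated in the abstract. The same function could be applied to regression residuals in place of raw values.

```python
def morans_i(values, weights):
    """Moran's I for `values` under an n-by-n connectivity matrix `weights`."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    s0 = sum(sum(row) for row in weights)        # total connectivity weight
    cross = sum(weights[i][j] * dev[i] * dev[j]
                for i in range(n) for j in range(n))
    denom = sum(d * d for d in dev)
    return (n / s0) * (cross / denom)

def chain_adjacency(n):
    """Binary contiguity for n regions in a row: neighbours share a border."""
    w = [[0.0] * n for _ in range(n)]
    for i in range(n - 1):
        w[i][i + 1] = 1.0
        w[i + 1][i] = 1.0
    return w

w = chain_adjacency(4)
smooth = morans_i([1.0, 2.0, 3.0, 4.0], w)         # similar neighbours
alternating = morans_i([1.0, -1.0, 1.0, -1.0], w)  # dissimilar neighbours
```

Here the smoothly increasing values give a positive coefficient (1/3), and the alternating values give a negative one (-1), matching the usual reading of positive and negative spatial autocorrelation.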

Estimating and Improving Generalization Error for Non-IID Data

Yoshua Bengio
Département d'Informatique et Recherche Opérationnelle and Centre de Recherches Mathématiques, Université de Montréal

May 4th 1999

The objective of machine learning is not necessarily to discover the underlying distribution of the observed data, but rather to infer a function from data coming from an unknown distribution, such that this function can be used to make predictions or take decisions on new data from the same unknown distribution. This talk addresses a particular class of machine learning problems in which, contrary to what is usually assumed in econometrics and machine learning, the data are not assumed to be stationary or IID (independent and identically distributed). First, we will discuss a framework in which the usual notions of generalization error can be extended to this case. Second, we will briefly discuss how different results can be obtained when testing hypotheses about the model rather than testing hypotheses about the generalization error.

In the second part of this talk, I will present some of our work on improving generalization error when the data are not IID, based on the optimization of so-called hyper-parameters. Many machine learning algorithms can be formulated as the minimization of a training criterion which involves both training errors on each training example and some hyper-parameters, which are kept fixed during this minimization. When there is only a single hyper-parameter, one can easily explore how its value affects a model selection criterion (which differs from the training criterion and is used to select hyper-parameters). We will briefly describe a new methodology to select many hyper-parameters, based on the computation of the gradient of a model selection criterion with respect to the hyper-parameters. We will then present an application of this methodology to the selection of hyper-parameters that control how much weight should be put on each training example in time-series prediction, putting more weight on more recent examples. Statistically significant improvements were obtained in predicting future volatility of Canadian stocks using this method, in a very simple setting.
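
The idea of differentiating a model selection criterion with respect to a hyper-parameter can be sketched on a deliberately tiny case. Everything here is an assumption made for illustration and is not the method presented in the talk: the model is a single weighted mean, there is one hyper-parameter lam giving each training example the weight exp(-lam * age), the selection criterion is mean squared error on held-out later observations, and the names (weighted_mean, selection_criterion, tune_decay) are hypothetical.

```python
import math

def weighted_mean(ys, lam):
    """Fit the model: a mean of ys with weight exp(-lam * age) per example.

    ys is ordered oldest to newest, so the newest example has age 0.
    Returns the fitted value theta and its derivative with respect to lam.
    """
    n = len(ys)
    ages = [n - 1 - i for i in range(n)]
    w = [math.exp(-lam * a) for a in ages]
    s = sum(w)
    theta = sum(wi * yi for wi, yi in zip(w, ys)) / s
    # Quotient rule: theta = num / s, so dtheta = (dnum - theta * ds) / s.
    d_num = sum(-a * wi * yi for a, wi, yi in zip(ages, w, ys))
    d_s = sum(-a * wi for a, wi in zip(ages, w))
    dtheta = (d_num - theta * d_s) / s
    return theta, dtheta

def selection_criterion(train, valid, lam):
    """Held-out mean squared error and its gradient with respect to lam."""
    theta, dtheta = weighted_mean(train, lam)
    loss = sum((theta - y) ** 2 for y in valid) / len(valid)
    grad = sum(2.0 * (theta - y) for y in valid) / len(valid) * dtheta
    return loss, grad

def tune_decay(train, valid, lam=0.0, step=0.01, iters=50):
    """Gradient descent on the selection criterion, keeping lam non-negative."""
    for _ in range(iters):
        _, grad = selection_criterion(train, valid, lam)
        lam = max(0.0, lam - step * grad)
    return lam

# Toy trending series (an assumption): older observations are less relevant.
train = [float(v) for v in range(1, 11)]
valid = [11.0, 12.0]
loss_before, _ = selection_criterion(train, valid, 0.0)
lam_hat = tune_decay(train, valid)
loss_after, _ = selection_criterion(train, valid, lam_hat)
```

With equal weights (lam = 0) the fitted mean lags far behind the trend; following the gradient raises lam, shifts weight toward recent observations, and lowers the held-out error. With many hyper-parameters the same gradient would be a vector, which is where the approach is meant to pay off over a one-dimensional search.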