The National Program on
Complex Data Structures


Workshop on Data Mining Methodology and Applications
October 28-30, 2004
at The Fields Institute, Toronto

Speaker Titles and Abstracts Index

Roberto Aldave and Simon Gluzman
David Banks
Yoshua Bengio
Merlise Clyde
Adele Cutler
Alex Depoutovitch
Jerome H Friedman
Wenxue Huang
Grigoris Karakoulas
Theodora Kourti
Helmut Kroeger
Xianping Liu
Joaquín Ordieres Meré
Saharon Rosset and Ji Zhu
Russell Steele
Godfried T. Toussaint
Steven Wang
S. Stanley Young, with Douglas M. Hawkins and Li Liu
Ruben Zamar
Ji Zhu
Mu Zhu
Djamel A. Zighed
Alex Zolotovitski

Roberto Aldave and Simon Gluzman, Generation 5 Mathematical Technologies Inc
Prediction of Real Variables with Non-Polynomial Approximants

Multivariate non-polynomial approximants have been applied to the numerical prediction of real variables that depend on other real variables. Calculations were performed for several databases. The accuracy of the method is compared with several known methods, including different local regressions. We argue that the method is effective for discovering non-linear dependencies.

David Banks, Duke University
Scalability of Models in Data Mining

The SAMSI data mining year examined a range of problems in the area, from the use of unlabeled samples in classification to issues of overcompleteness. A major theme concerned the analysis of high-dimensional, low-sample-size datasets. This talk reviews various strategies for handling these problems, which are closely related to the Curse of Dimensionality and also relevant to data quality. Our work explores the effect of combining smart algorithms, aggressive feature selection, and combinatorial search. The ideas are illustrated through several examples, and we draw conclusions and give advice for those who must analyze such data.

Keywords: Curse of Dimensionality; large p, small n; regression; MDS; data mining; selection

Yoshua Bengio, Université de Montréal
Statistical Learning from High Dimensional and Complex Data: Not a Lost Cause

This talk will start from a sociological perspective on current trends in statistical machine learning, with the claim that less and less research is attempting to address the really difficult question of generalization from high-dimensional and complex data sets, even as the mathematical sophistication of analyzing and devising statistical learning algorithms has grown steadily in recent years. To illustrate this issue, I will present arguments to the effect that a large class of manifold learning and spectral embedding algorithms (e.g. kernel PCA, spectral clustering, LLE, Isomap, Laplacian eigenmaps) that are essentially local suffer from at least four generic problems, associated with noise in the data, curvature of the manifold, dimensionality of the manifold, and an inability to take advantage of the presence of many manifolds with little data per manifold. This analysis suggests investigating non-local manifold learning algorithms, which attempt to discover structure shared among the tangent planes at different positions in data space. Unfortunately, this opens the door to non-convex optimization problems, which have acquired a strongly negative connotation in the statistical machine learning community in recent years. A simple (but non-convex) criterion for a non-local manifold learning algorithm is thus proposed, and experiments estimating a tangent-plane prediction function are presented, showing its advantages over local manifold learning algorithms. For example, from examples of rotated digits the algorithm generalizes to rotations of other characters, whereas a local method such as kernel PCA or Isomap fails.

Merlise Clyde, Duke University
Bayesian Perspectives on Combining Models

Consideration of multiple models is routine in statistical practice. With computational advances over the past decade, there has been increased interest in methods for making inferences based on combining models. Examples include boosting, bagging, stacking, and Bayesian Model Averaging (BMA), which often lead to improved performance over methods based on selecting a single model. Bernardo and Smith have described three Bayesian frameworks for model selection known as the M-closed, M-Complete, and M-open perspectives. The standard formulation of Bayesian Model Averaging arises as an optimal solution for combining models in the M-closed perspective where one believes that the ``true'' model is included in the list of models under consideration. In the M-complete and M-open perspectives the ``true'' model is outside the space of models to be combined, so that model averaging using posterior model probabilities is no longer applicable. Using a decision theoretic approach, we present optimal Bayesian solutions for combining models in these frameworks. We illustrate the methodology with an example of combining models representing two distinct classes, prospective classification trees and retrospective multivariate discriminant models applied to gene expression data in advanced stage serous ovarian cancers.
The goals of this analysis are two-fold: identifying molecular tumor characteristics associated with prognosis and determining if long-term
survival can be predicted by features intrinsic to the molecular biology of the tumor.
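The M-closed combination the abstract describes can be sketched concretely. Below is a minimal illustration (my sketch, not from the talk; all numbers are invented) of standard Bayesian Model Averaging: each model's prediction is weighted by its posterior model probability, computed here from log marginal likelihoods under equal prior model probabilities.

```python
import math

def bma_predict(preds, log_marginals):
    # Posterior model probabilities from log marginal likelihoods,
    # assuming equal prior probability for each model; the maximum is
    # subtracted before exponentiating for numerical stability.
    m = max(log_marginals)
    w = [math.exp(l - m) for l in log_marginals]
    z = sum(w)
    return sum(wi / z * p for wi, p in zip(w, preds))

preds = [0.80, 0.55, 0.30]             # P(Y=1 | x, model_k), illustrative
log_marginals = [-10.0, -11.0, -14.0]  # log p(data | model_k), illustrative
print(bma_predict(preds, log_marginals))
```

In the M-complete and M-open settings the talk addresses, these posterior-probability weights are exactly what is no longer available, which is why a decision-theoretic combination is needed instead.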


Adele Cutler, Utah State University
Random Forests: Proximity, Variable Importance, and Visualization*

The Random Forests algorithm, introduced by Breiman in 2001, has been shown to be competitive with some of the most accurate classification methods available. Unlike many other accurate methods, Random Forests is interpretable via proximities and variable importance measures. In this talk, I will describe the methods we use to compute these measures and illustrate their use with a special-purpose Java-based visualization package.

*Joint work with Leo Breiman
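The proximity measure mentioned above has a simple definition: two cases are proximate to the extent that the trees of the forest send them to the same terminal node. A minimal sketch (the per-tree leaf assignments below are invented for illustration, not produced by a real forest):

```python
# leaf_ids[t][i] is the leaf that tree t sends sample i to; the proximity
# of samples i and j is the fraction of trees placing them in the same leaf.

def proximity(leaf_ids, i, j):
    same = sum(1 for tree in leaf_ids if tree[i] == tree[j])
    return same / len(leaf_ids)

leaf_ids = [
    [0, 0, 1, 2],   # tree 1: samples 0 and 1 share a leaf
    [3, 1, 1, 3],   # tree 2: samples 1 and 2 share a leaf
    [5, 5, 5, 2],   # tree 3: samples 0, 1, and 2 share a leaf
]

print(proximity(leaf_ids, 0, 1))  # 2 of the 3 trees agree
```

Proximities computed this way can feed distance-based tools such as multidimensional scaling, which is the basis of the visualizations described in the talk.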

Alex Depoutovitch, Generation 5
The use of grid computing to speed up prediction

The rapidly growing size of the data available to data mining applications creates a real challenge: processing such volumes within a short period of time. CPU speeds are improving, but not enough to meet this challenge - it can take days or even weeks to run jobs such as the predictions required by analytical and marketing departments. To address this demand we developed a technology that, together with a highly scalable prediction method, allows calculations to be executed in parallel on a grid of computers. Our benchmarks on live systems show almost linear scalability as the number of computers in the grid grows.

Jerome H Friedman, Stanford University
Importance Sampling: An Alternative View of Ensemble Learning*

Learning a function of many arguments is viewed from the perspective of high-dimensional numerical integration. It is shown that many of the popular ensemble learning methods can be cast in this framework. In particular, bagging, boosting, and Bayesian model averaging are seen to correspond to Monte Carlo integration methods each based on different importance sampling strategies. This interpretation explains some of their properties and suggests modifications to them that can improve their accuracy and especially their computational performance.

*Joint work with Bogdan Popescu
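The numerical-integration view can be made concrete with a toy example (mine, not the authors'): estimating the tail probability P(X > 3) for X ~ Exp(1) by sampling from a heavier-tailed proposal and reweighting by the density ratio - exactly the f/g correction that importance sampling prescribes.

```python
import math
import random

def importance_estimate(n, seed=0):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        # Sample from the proposal g = Exp(rate 0.5), whose tail is
        # heavier than that of the target f = Exp(rate 1).
        x = rng.expovariate(0.5)
        weight = math.exp(-x) / (0.5 * math.exp(-0.5 * x))  # f(x) / g(x)
        if x > 3.0:
            total += weight
    return total / n

est = importance_estimate(100_000)
print(est, math.exp(-3))  # estimate vs. the exact tail probability
```

Choosing a proposal that puts more mass where the integrand matters reduces the variance of the estimate - the same role the sampling strategy plays for each ensemble method in the talk's framework.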

Wenxue Huang, Generation 5 Mathematical Technologies Inc.
Dependence Degree and Feature Selection for Categorical Data

Traditionally the measure of association for cross-clasification for categorical data takes a view of variance or divergence. We take an opposite angle of view: convergence. This point of view has certain advantages. It allows us to see more directly and clearly how a vairable is associated with another/others both locally (veritically) and globally (horizontally). From this point of view we introduce a new measure of association, referred to as dependence degree, and dicuss one of its applications in data mining technologies: feature selection.

Grigoris Karakoulas, Department of Computer Science, University of Toronto
ROC-based Learning for Imbalanced Class Problems

In general, modelling techniques for learning probabilistic models from data can be categorized as discriminative (e.g. logistic regression) or generative (e.g. normal discriminant analysis). In this paper we describe a new probabilistic modeling technique that combines the advantages of these two categories. More specifically, let X denote the vector of model input variables and Y (0/1) the target variable. To learn the probability distribution P{X|Y} as in a generative model, our technique recursively searches for discriminatory projections in the input space that maximize the area under the Receiver Operating Characteristic curve (AUROC). Although the latter is a well-accepted performance measure for evaluating probabilistic classification models, most generative and discriminative techniques in statistics and machine learning use other performance measures for building a model. In our case we use the same performance measure, AUROC, both for building and for evaluating a model. Our experiments show that this new technique is particularly advantageous on imbalanced datasets, owing to the use of AUROC as the objective function.
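For readers unfamiliar with the criterion, AUROC has a useful probabilistic reading: it is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. A minimal sketch via the Mann-Whitney statistic (scores and labels are invented):

```python
def auroc(scores, labels):
    # AUROC as the fraction of positive/negative pairs the scores order
    # correctly, counting ties as half a win (Mann-Whitney U / (n_pos * n_neg)).
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.6, 0.4, 0.3]
labels = [1,   1,   0,   1,   0]
print(auroc(scores, labels))
```

Because this pairwise count never compares positives with positives, the measure is insensitive to the class ratio, which is why it is attractive as an objective on imbalanced data.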

Theodora Kourti, Chemical Engineering Department, McMaster University
Data Mining in Industry for Process and Product Improvement

With process computers routinely collecting data from on-line sensors on hundreds to thousands of variables every few seconds, large databases are accumulated in industry. The exploitation of these data is a critical component in the successful operation of any industrial process. However the task is difficult because of the nature of such data:

1) Large Data Sets
2) The data are highly correlated (many variables being collinear) and non-causal in nature.
3) The information contained in any one variable is often very small due to the low signal / noise ratios.
4) There are often missing measurements on many variables.

In order to utilize these databases, an empirical modelling method must be able to deal effectively with all these difficulties.

Multivariate projection methods have gained rapid acceptance in industry for troubleshooting, on-line process monitoring, fault diagnosis and equipment maintenance. Successful applications are reported by many diverse industries, such as pharmaceuticals, semiconductor manufacturers, steel producers, pulp and paper producers, polymer plants and refineries.

This paper will discuss the latest developments on latent variable methods, speech recognition methods and missing data treatment for successful data mining in industry.

Helmut Kroeger, Laval University
Learning in neural networks with small-world architecture

I will introduce small-world (sw) and scale-free (sf) networks, giving a few examples from the organisation of society, the internet, and biology. I will discuss experimental observations in neuroscience: cat cortex, macaque visual cortex, and the human brain activity network from magnetic resonance imaging. Then I address computer simulations involving sw- and sf-architecture: (a) fast response and coherent oscillations in a Hodgkin-Huxley sw-network, (b) efficient associative memory and pattern retrieval in sw-architecture, and (c) supervised learning in multi-layer feed-forward networks.

Learning algorithms like backpropagation are commonly used in regular networks where neurons are organized in successive fully-connected layers. Here, the effect of learning is studied as a function of the connectivity architecture of neurons in multi-layer networks, going from regular to small-world to random connectivity. One observes that sw-connectivity offers great advantages in the learning speed of the backpropagation algorithm, especially when the number of layers is high. I will also discuss an example of generalisation.
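A minimal sketch of the small-world construction underlying these architectures, in the spirit of Watts and Strogatz (parameters are illustrative, not from the talk): start from a ring lattice where each node links to its k nearest neighbours, then rewire each edge with probability p, producing short average path lengths while retaining high clustering.

```python
import random

def small_world(n, k, p, seed=0):
    # Ring lattice: node i links to its k nearest neighbours (k even).
    rng = random.Random(seed)
    edges = set()
    for i in range(n):
        for j in range(1, k // 2 + 1):
            edges.add((i, (i + j) % n))
    # Rewire each edge with probability p to a random new endpoint,
    # avoiding self-loops and duplicate edges.
    rewired = set()
    for a, b in edges:
        if rng.random() < p:
            b = rng.randrange(n)
            while b == a or (a, b) in edges or (a, b) in rewired:
                b = rng.randrange(n)
        rewired.add((a, b))
    return rewired

g = small_world(20, 4, 0.1)
print(len(g))  # edge count is preserved: n * k / 2 = 40
```

Varying p from 0 to 1 interpolates between the regular and random connectivities whose learning behaviour the talk compares.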

Xianping Liu , Generation 5 Mathematical Technologies Inc.
Generation 5 Hybrid Clustering System and its Application

Clustering is used in data mining for finding useful patterns in large, high-dimensional databases. Challenges for large data sets include scalability, automation, handling arbitrary data types, input-order dependency, and understanding the end results.

Generation 5 is developing a fully automatic hybrid clustering system, Clus5, a part of the G5MWM data mining suite, which addresses all the challenges above. Clus5 employs hybrid on-line and off-line partitional clustering algorithms that are more effective and input-order-independent. It uses a two-stage approach. In stage 1, it clusters multiple samples with different initial partitions to produce a near-optimal initial seed, based on one of several built-in validity indexes. In stage 2, it clusters the whole dataset using the result from stage 1. Clus5 can automatically determine the optimal number of clusters and handle any type of data. It also has a special algorithm for marketing applications that prefer equal-size clusters. Clus5 has been successfully applied to solve many real-life problems of Gen5 clients.

Joaquín Ordieres Meré, University of La Rioja (Spain)
Data-Mining for industrial processes

The most common goal of the factory owner is to achieve better quality in the final product by means of improved process control. The significance and relevance of optimizing the existing control models is even greater in the open-loop control systems or in those governed by computational methods dependent on adjustable parameters.

This talk reviews some typical industrial environments and focuses on some parts of them in order to show the real interest of these improvements. We will identify some difficulties in obtaining these improvements and show how the optimal control model for the manufacturing process can be obtained from data provided by sensors. We will also discuss some technical problems that are related to the main goal, and will identify some topics concerning outliers, density and topology. Also, we will show how these techniques can be applied as an instrumental toolbox in addressing some environmental problems.

Keywords: industrial applications; neural networks; outliers; density; improvement process control; advanced quality control

Saharon Rosset and Ji Zhu, University of Michigan
l_1 regularization: efficient and effective

We consider the general regularized optimization problem of minimizing loss+penalty, where the loss depends on the data and the model, and the penalty on the model only. We illustrate that the choice of l_1 (lasso) penalty, in combination with appropriate loss, leads to several desirable properties: (1) Approximate or exact l_1 regularization has given rise to highly successful modeling tools: the lasso, boosting, wavelets and 1-norm support vector machines. (2) l_1 regularization creates sparse models, a property that is especially desirable in high-dimensional predictor spaces. We formulate and prove sparsity results. (3) l_1 regularization facilitates efficient methods for solving the regularized optimization problem. The LARS algorithm takes advantage of this property. We present a general formulation of regularized optimization problems for which efficient methods can be designed. We show how we can create modeling tools which are robust (because of the loss function selected), efficient (because we solve the regularized problem efficiently) and adaptive (because we select the regularization parameter adaptively).
(Joint work with Trevor Hastie and Rob Tibshirani.)
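The sparsity property in (2) can be seen in the simplest possible setting (an illustration of mine, not from the paper): with an orthonormal design, the lasso solution is obtained coefficient-by-coefficient by soft-thresholding the least-squares estimates, so every coefficient whose least-squares value falls inside [-lambda, lambda] is set exactly to zero.

```python
def soft_threshold(z, lam):
    # Minimizer of (b - z)**2 / 2 + lam * abs(b): shrink toward zero,
    # clipping exactly to zero inside the threshold.
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

ols = [2.5, -0.3, 0.1, -1.7]                    # illustrative least-squares fits
lasso = [soft_threshold(b, 0.5) for b in ols]
print(lasso)  # the two small coefficients are zeroed exactly
```

An l_2 (ridge) penalty in the same setting merely scales the coefficients and never produces exact zeros, which is the contrast the abstract draws.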

Russell Steele, McGill University
Algebraic Geometry and Model Selection for Naive Bayes Networks

Recent advancements in the machine learning community have been made in the application of algebraic geometry to Bayesian model selection for neural networks. Rusakov and Geiger (2003), using results from Watanabe (2001), present a new asymptotic approximation to the integrated likelihood for Naive Bayes networks (or, in more statistical language, finite mixture models). The key to the approximation is an attempt to correct the standard dimensionality penalty for the BIC to reflect the "true" effective dimensionality of the model. However, only limited work has been done to interpret these results with respect to current work on effective dimensionality in Bayesian research, particularly, in light of the currently widespread use of the DIC (Spiegelhalter, et al., 2002). In this talk, the speaker presents the basic idea behind the Rusakov/Geiger approximation, compares their approximation to the actual integrated likelihood, and links these results to what the DIC presumably is trying to estimate. The speaker will focus on how various standard information criteria perform for model selection for simple Naive Bayes networks.

Godfried T. Toussaint, School of Computer Science, McGill University
Proximity Graph Methods for Data Mining

In the typical approach to learning in data mining, random data (the training set of patterns) are collected and used to design a classification rule. One of the most well known such rules is the K-nearest-neighbor rule in which an unknown pattern is classified into the majority class among its K nearest neighbors in the training set.

Several questions related to this rule have received considerable attention over the past fifty years. Such questions include the following. How large should K be? How should a value of K be chosen? Should all K neighbors be equally weighted when used to decide the class of an unknown pattern? Should the features (measurements) be equally weighted? If not, how should these weights be chosen? What metric should be used?

How can the rule be made robust to outliers present in the training data? How can the storage of the training set be reduced without degrading the performance of the decision rule? How fast can the nearest neighbor of a query point be found? What is the smallest neural network that can implement the nearest neighbor rule? Geometric proximity graphs offer elegant solutions to most of these problems, in many cases, by discarding the classical methods altogether. In this talk we review and discuss recent work on the application of geometric proximity graphs in this area, propose some new solutions and mention some open problems.
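For reference, the rule itself is compact; a minimal sketch with a Euclidean metric and an invented toy training set:

```python
from collections import Counter
import math

def knn_classify(train, query, k):
    # train: list of (point, label) pairs; point is a tuple of floats.
    # Sort training points by distance to the query and take the
    # majority class among the k nearest.
    nearest = sorted(train, key=lambda pl: math.dist(pl[0], query))
    votes = Counter(label for _, label in nearest[:k])
    return votes.most_common(1)[0][0]

train = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
         ((5, 5), "b"), ((5, 6), "b")]
print(knn_classify(train, (0.5, 0.5), 3))  # "a"
```

Each question in the abstract targets one of the choices this sketch makes naively: the value of k, the unweighted vote, the unweighted features, the metric, the full scan over the stored training set, and the sensitivity to outlying training points.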

Steven Wang, York University
Clustering Categorical Data Based on Distance Vectors

We introduce a novel statistical procedure to cluster categorical data based on distance vectors. The proposed method is conceptually simple and computationally straightforward, as it does not require any specific statistical models or any convergence criteria. Moreover, unlike most existing algorithms, which compute the class membership or membership probability for every data point at each iteration, our algorithm sequentially extracts clusters from the given data set. That is, at each iteration our algorithm strives to identify only one cluster, which is then deleted from the data set at the next iteration; this procedure repeats until there are no more significant clusters in the remaining data. Consequently, the number of clusters can be determined automatically by the algorithm. To identify and extract a cluster, we first locate the cluster center by using a Pearson chi-square type statistic based on categorical distance vectors. The partition of the data set produced by our algorithm is unique and robust to the input order of data points. The performance of the proposed algorithm is examined on both simulated and real-world data sets. Comparisons with two well-known clustering algorithms, K-modes and AutoClass, show that the proposed algorithm substantially outperforms these competitors, with the classification rate or the information gain typically improved by several orders of magnitude.

S. Stanley Young, National Institute of Statistical Sciences
with Douglas M. Hawkins, U Minn and Li Liu, Aventis
Linking and pattern matching in multiple large two-way data tables

Drug discovery increasingly involves multiple large data sets: microarrays, protein arrays, and chemical structural descriptors. All of these data sets have many more columns than rows (n << p). We will explore various methods of pattern matching between multiple two-way tables. The benefit of these methods is the discovery and possible validation of biological targets for drug discovery.

Ruben Zamar, University of British Columbia
Yi Lin, Guohua Yan, Will Welch, Department of Statistics, UBC
Robust Methods and Data Mining

We consider the role of various performance measures for supervised and unsupervised learning. Real-world applications of data mining typically require optimizing non-standard performance measures that depend on the application at hand. For example, in spam mail filtering, error rate is a bad indicator of performance because there is a strong imbalance in cost between missing a good message and letting through a spam message.

In our experience, the choice of a good subset of variables for a data mining task may be at least as important as the choice of procedure. We will illustrate this situation with the Protein Homology Prediction task from this year's KDD Cup competition.

We will also discuss the problem of detecting a relatively small fraction of "interesting" cases (in the unsupervised learning setup) using robust Mahalanobis distances and distance-based outliers.
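To fix ideas, a classical (non-robust) Mahalanobis screen looks like the sketch below; the data are invented. The talk's point is that the robust version replaces the sample mean and covariance with robust estimates, precisely because the classical estimates can be distorted, and outliers masked, by the very cases one wants to detect.

```python
import statistics

def mahalanobis_2d(x, y, xs, ys):
    # Classical Mahalanobis distance of (x, y) from the sample, with the
    # 2x2 covariance inverse written out explicitly.
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = statistics.fmean([(a - mx) ** 2 for a in xs])
    syy = statistics.fmean([(b - my) ** 2 for b in ys])
    sxy = statistics.fmean([(a - mx) * (b - my) for a, b in zip(xs, ys)])
    det = sxx * syy - sxy ** 2
    dx, dy = x - mx, y - my
    return ((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det) ** 0.5

xs = [1.0, 1.2, 0.9, 1.1, 1.0, 8.0]
ys = [2.0, 2.1, 1.9, 2.0, 2.2, 9.0]
d = [mahalanobis_2d(x, y, xs, ys) for x, y in zip(xs, ys)]
print(max(d))  # the (8, 9) point stands out from the bulk
```

Note how modest the outlier's distance is here even though it is far from the bulk: it has inflated the covariance estimate used to measure it, which is exactly the masking effect that motivates robust estimates.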

Ji Zhu, University of Michigan
Piecewise linear SVM paths

The support vector machine is a widely used tool for classification. In this talk, we consider two types of support vector machine: the 1-norm SVM and the standard 2-norm SVM. Both can be written as regularized optimization problems. In all current implementations for fitting an SVM model, the user has to supply a value for the regularization parameter. In practice, to select an appropriate value, people usually pre-specify a finite set of candidate values covering a wide range, then use either a separate validation data set or cross-validation to pick the value that performs best. In this talk, we argue that the choice of the regularization parameter can be critical. We also argue that the 1-norm SVM may have advantages over the standard 2-norm SVM in certain situations, especially when there are redundant noise features. We show that the solution paths for both the 1-norm SVM and the 2-norm SVM are piecewise linear functions of the regularization parameter. We then derive two algorithms, one for each type, that fit the entire path of SVM solutions for every value of the regularization parameter, thereby facilitating adaptive selection of the regularization parameter for SVM models.

It turns out that the piecewise linear solution path property is not unique to the SVM models. We will propose some general conditions on the generic regularized optimization problem for the solution path to be piecewise linear, which suggest some new useful predictive statistical modeling tools.

This is joint work with Saharon Rosset (IBM T.J.Watson), Trevor Hastie (Stanford U.), and Rob Tibshirani (Stanford U.)
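The piecewise-linear property is easiest to see in a one-dimensional lasso analogue (my illustration, not the authors' SVM construction): under an orthonormal design each soft-thresholded coefficient is a piecewise-linear function of the regularization parameter, with one knot where it hits zero, so the entire path is described by a handful of line segments rather than a grid of refits.

```python
def path_knots(ols):
    # Under an orthonormal design, the lasso coefficient profiles are
    # piecewise linear in lam, with one knot per coefficient at
    # lam = |z|; the active set changes only at these values, so the
    # whole path consists of len(ols) linear segments.
    return sorted(abs(z) for z in ols)

def coef_at(z, lam):
    # Soft-thresholding: the one-dimensional lasso solution at penalty lam.
    return max(0.0, abs(z) - lam) * (1 if z > 0 else -1)

ols = [2.5, -0.3, 0.1, -1.7]
print(path_knots(ols))  # lam values where a coefficient reaches zero
print([coef_at(-1.7, lam) for lam in (0.0, 1.0, 1.7, 2.0)])
```

A path-following algorithm exploits exactly this structure: it solves for the segment endpoints and interpolates in between, making every value of the regularization parameter available at roughly the cost of a single fit.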

Mu Zhu, University of Waterloo
An Adaptive Radial Basis Function Network Model for Statistical Detection

We construct a special radial basis function (RBF) network model to detect items belonging to a rare class from a large database. Our primary example is a real drug discovery application. Our method can be viewed as modeling only the rare class but allowing for local adjustments depending on the density of the background class in local neighborhoods. We offer a statistical explanation of why such an approach is appropriate and efficient for the detection problem. Our statistical explanation together with our empirical success with this model has implications for a new paradigm for solving these detection problems in general. This work is joint with Wanhua Su and Hugh Chipman.

Djamel A. Zighed, University of Lyon 2, France
Constructing induction graphs

The presentation will focus on three issues concerning the construction of induction graphs from data.

The first one is how to obtain the optimal join-partitioning of the contingency tables. We propose a quasi-optimal algorithm for resizing contingency tables, which takes into account the type (nominal or ordinal) of the attributes crossed with the categorical variable to predict.

The second issue is how to use the results of the best join-partitioning to build a new type of decision tree. The algorithm we propose is called Arbogodaï. It leads to a new structure of decision trees that can be seen as a generalisation of the CART procedure. Indeed, instead of forming two super-classes for the predicted attribute, Arbogodaï defines the optimal number of super-classes. Also, the splits are not only binary, as in the CART algorithm; their multiplicity depends on the local situation. At each node, the algorithm looks for the best join-partitioning of the contingency tables crossing the predicted attribute with all predictive ones.

The third issue of this presentation is how to provide an axiomatic definition of a measure of partition quality. We propose one, leading to an algorithm for constructing induction graphs that can be seen as a generalisation of decision tree algorithms. The framework is common to both tree and lattice structures.

Alex Zolotovitski, Generation 5
Automated Trade Area Analysis: A Case Study of the G5 MWM Software Application

We created an automated tool for estimating, at the Zip+4 level, the dollar amount that each US household (approximately 35 million Zip+4's) spends in each US grocery store (approximately 50,000 stores) for a specified group of about 60 products. These amounts were aggregated to estimate sales in each US grocery store for each specified group of products. We used non-parametric local and global regression; as data sources we used 1) panel data containing consumption of 290 brands of products for 37,000 households in 717 groups of stores; 2) attributes of 50,000 stores; and 3) G5 Zip+4-level databases containing almost 10,000 demographic, expenditure, and behavioral variables.
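As a concrete (invented) illustration of the non-parametric local regression the abstract mentions, here is a one-dimensional Nadaraya-Watson smoother with a Gaussian kernel: each prediction is a kernel-weighted average of nearby responses, with the bandwidth controlling how local the fit is.

```python
import math

def local_regression(xs, ys, x0, bandwidth):
    # Nadaraya-Watson estimate at x0: Gaussian-kernel-weighted mean
    # of the responses, with weight decaying as x moves away from x0.
    w = [math.exp(-((x - x0) / bandwidth) ** 2 / 2) for x in xs]
    return sum(wi * yi for wi, yi in zip(w, ys)) / sum(w)

xs = [0, 1, 2, 3, 4, 5]
ys = [0, 1, 4, 9, 16, 25]  # toy data: y = x**2, noise-free for clarity
print(local_regression(xs, ys, 2.5, 0.5))
```

The production system described above would of course operate on many predictors at once; this sketch only shows the local-averaging idea.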

