




Workshop on Data Mining Methodology and Applications


Speaker Titles and Abstracts Index Roberto Aldave and Simon Gluzman,
Generation 5 Mathematical Technologies Inc Multivariate NonPolynomial Approximants have been applied for numerical prediction of real variables dependent on real variables. Calculations were performed for several atabases. Accuracy of the method is compared with some known methods including different local regressions. We argue that the method is effective for discovery of nonlinear dependencies. David Banks, Duke University The SAMSI data mining year examined a range of problems in the area,
from use of unlabeled sample in classification to issues in overcompleteness.
A major theme concerned the analysis of highdimension, low samplesize
datasets. This talk reviews various strategies for handling these problems,
which are closely related to the Curse of Dimensionality and also relevant
to data quality. Our work explores the effect of combining smart algorithms,
aggressive feature selection, and combinatorial search. The ideas are
illustrated through several examples, and we draw conclusions and give
advice for those who must analyze such data. Keywords: Curse of Dimensionality ; large p, small n ; regression; MDS
; data mining ; selection Yoshua Bengio, Université de Montréal This talk will start from a sociological perspective on current trends
in statistical machine learning, and with the claim that less and less
research is attempting to address the really difficult question of generalization
from highdimensional and complex data sets, while the mathematical sophistication
in analyzing and devising statistical learning algorithms has grown steadily
in the last few years. To illustrate that issue I will present arguments
to the effect that a large class of manifold Merlise Clyde, Duke University Bayesian Perspectives on Combining Models Consideration of multiple models is routine in statistical practice.
With computational advances over the past decade, there has been increased
interest in methods for making inferences based on combining models. Examples
include boosting, bagging, stacking, and Bayesian Model Averaging (BMA),
which often lead to improved performance over methods based on selecting
a single model. Bernardo and Smith have described three Bayesian frameworks
for model selection known as the Mclosed, MComplete, and Mopen perspectives.
The standard formulation of Bayesian Model Averaging arises as an optimal
solution for combining models in the Mclosed perspective where one believes
that the ``true'' model is included in the list of models under consideration.
In the Mcomplete and Mopen perspectives the ``true'' model is outside
the space of models to be combined, so that model averaging using posterior
model probabilities is no longer applicable. Using a decision theoretic
approach, we present optimal Bayesian solutions for combining models in
these frameworks. We illustrate the methodology with an example of combining
models representing two distinct classes, prospective classification trees
and retrospective multivariate discriminant models applied to gene expression
data in advanced stage serous ovarian cancers. Adele Cutler, Utah State University Random Forests: Proximity, Variable Importance, and Visualization* The Random Forests algorithm, introduced by Breiman in 2001, has been shown to be competitive with some of the most accurate classification methods available. Unlike many other accurate methods, Random forests is interpretable via proximities and variable importance measures. In this talk, I will describe the methods we use to compute these measures and illustrate their use with a specialpurpose javabased visualization package. *Joint work with Leo Breiman Alex Depoutovitch, Generation 5 The use of grid computing to speed up prediction Rapidly growing size of data becoming available to data mining applications
creates a real challenge to process such volume of data within short period
of time. CPU speeds are improving but yet not enough to address this challenge
 it may take days and even weeks to run such things like predictions
required by analytical and marketing departments. To address this demand
we developed a technology that together with highly scalable method of
prediction allows executing calculations in parallel by using a grid of
computers. Our benchmarks on life systems show almost linear scalability
as the number of computers in grid grows. Jerome H Friedman, Stanford University Learning a function of many arguments is viewed from the perspective of highdimensional numerical integration. It is shown that many of the popular ensemble learning methods can be cast in this framework. In particular, bagging, boosting, and Bayesian model averaging are seen to correspond to Monte Carlo integration methods each based on different importance sampling strategies. This interpretation explains some of their properties and suggests modifications to them that can improve their accuracy and especially their computational performance. *Joint work with Bogdan Popescu Wenxue Huang, Generation 5 Mathematical Technologies
Inc. Traditionally the measure of association for crossclasification for
categorical data takes a view of variance or divergence. We take an opposite
angle of view: convergence. This point of view has certain advantages.
It allows us to see more directly and clearly how a vairable is associated
with another/others both locally (veritically) and globally (horizontally).
From this point of view we introduce a new measure of association, referred
to as dependence degree, and dicuss one of its applications in
data mining technologies: feature selection. Grigoris Karakoulas, Computer Science
University of Toronto In general, modelling techniques for learning probabilistic models from
data can be categorized into discriminative (e.g. logistic regression)
and generative (e.g. Normal Discriminant Analysis). In this paper we describe
a new probabilistic modeling technique that combines the advantages of
these two categories of techniques. More specifically, let us denote by
X the vector of model input variables and by Y (0/1) the target variable.
To learn the probability distribution P{XY} as in a generative model
our technique is recursively searching for discriminatory projections
in the input space that maximize the area under the ReceiverOperatingCharacteristic
(AUROC) curve. Although the latter is a wellaccepted performance measure
for evaluating probabilistic classification models, most of the generative
and discriminative techniques in statistics and machine learning use other
performance measures for building a model. In our case we use the same
performance measure, AUROC, for both building and evaluating a model.
Our experiments show that this new technique is particularly advantageous
in imbalanced datasets due to the use of AUROC as the objective function. Theodora Kourti. Chemical Engineering Department,
McMaster University With process computers routinely collecting data from online sensors on hundreds to thousands of variables every few seconds, large databases are accumulated in industry. The exploitation of these data is a critical component in the successful operation of any industrial process. However the task is difficult because of the nature of such data: 1) Large Data Sets In order to utilize these databases, an empirical modelling method must be able to deal effectively with all these difficulties. Multivariate projection methods have gained rapid acceptance by industry for troubleshooting, online process monitoring, fault diagnosis and equipment maintenance. Successful applications are reported by many diverse industries such as pharmaceuticals, semiconductor manufactures, steel producers, pulp and paper producers, polymer plants and refineries This paper will discuss the latest developments on latent variable methods,
speech recognition methods and missing data treatment for successful data
mining in industry. Helmut Kroeger, Laval University I will introduce smallworld (sw) and scalefree (sf) networks, giving
a few examples from organisation of society, internet, and biology. I
will discuss experimental observations in neuroscience: cat cortex, macaque
visual cortex, human brain activity network from magnetic resonance imaging.
Then I address computer simulations involving sw and sfarchitecture:
(a) fast response and coherent oscillations in a HodgkinHuxley swnetwork,
(b) efficient associative memory and pattern retrieval in swarchitecture,
(c) supervised learning in multilayer feedforward networks. Learning algorithms like backpropagation are commonly used in regular
networks where neurons are organized in successive fullyconnected layers.
Here, the effect of learning is studied as a function of connectivity
architecture of neurons in multilayer networks going from regular to
smallworld to random connectivity. One observes that swconnectivity
offers great advantages on the learning speed of the backpropagation algorithms,
especially when the number of layers is high. I will also discuss an example
of generalisation. Xianping Liu , Generation 5 Mathematical Technologies
Inc. Clustering is used in data mining for .nding the useful patterns in large high dimensional databases. Some challenges for large data sets are scalability, automation, any types of data, input order dependency, and understanding endresults. Generation 5 is developing a fully automatic hybrid clustering system Clus5, a part of G5MWM data mining suite, which addresses all the challenges above. Clus5 employs hybrid online and o.line partitional clustering algorithms that are more e.ective and input orderindependent. It uses twostage approach. In stage 1, it clusteres multiple samples with di.erent initial partitions to produce a nearoptimal initial seed based on one of builtin validity indexes. In stage 2, it clusteres the whole dataset using result from stage 1. Clus5 can automatically determine the optimal number of clusters and handle any types of data. It also has a special algorithm for marketing applications which prefer equal size clusters. Clus5 has been successfully applied to solve many reallife problems of Gen5 clients. Joaquín Ordieres Meré, University
of la Rioja (Spain) The most common goal of the factory owner is to achieve better quality in the final product by means of improved process control. The significance and relevance of optimizing the existing control models is even greater in the openloop control systems or in those governed by computational methods dependent on adjustable parameters. This talk reviews some typical industrial environments and focuses on some parts of them in order to show the real interest of these improvements. We will identify some difficulties in obtaining these improvements and show how the optimal control model for the manufacturing process can be obtained from data provided by sensors. We will also discuss some technical problems that are related to the main goal, and will identify some topics concerning outliers, density and topology. Also, we will show how these techniques can be applied as an instrumental toolbox in addressing some environmental problems. Keywords: industrial applications; neural networks; outliers; density;
improvement process control; advanced quality control Saharon Rosset and Ji Zhu, University
of Michigan We consider the general regularized optimization problem of minimizing
loss+penalty, where the loss depends on the data and the model, and the
penalty on the model only. We illustrate that the choice of l_1 (lasso)
penalty, in combination with appropriate loss, leads to several desirable
properties: (1) Approximate or exact l_1 regularization has given rise
to highly successful modeling tools: the lasso, boosting, wavelets and
1norm support vector machines. (2) l_1 regularization creates sparse
models, a property that is especially desirable in highdimensional predictor
spaces. We formulate and prove sparsity results. (3) l_1 regularization
facilitates efficient methods for solving the regularized optimization
problem. The LARS algorithm takes advantage of this property. We present
a general formulation of regularized optimization problems for which efficient
methods can be designed. We show how we can create modeling tools which
are robust (because of the loss function selected), efficient (because
we solve the regularized problem efficiently) and adaptive (because we
select the regularization parameter adaptively). Russell Steele, McGill University Recent advancements in the machine learning community have been made in the application of algebraic geometry to Bayesian model selection for neural networks. Rusakov and Geiger (2003), using results from Watanabe (2001), present a new asymptotic approximation to the integrated likelihood for Naive Bayes networks (or, in more statistical language, finite mixture models). The key to the approximation is an attempt to correct the standard dimensionality penalty for the BIC to reflect the "true" effective dimensionality of the model. However, only limited work has been done to interpret these results with respect to current work on effective dimensionality in Bayesian research, particularly, in light of the currently widespread use of the DIC (Spiegelhalter, et al., 2002). In this talk, the speaker presents the basic idea behind the Rusakov/Geiger approximation, compares their approximation to the actual integrated likelihood, and links these results to what the DIC presumably is trying to estimate. The speaker will focus on how various standard information criteria perform for model selection for simple Naive Bayes networks. Godfried T. Toussaint,School of Computer
Science, McGill University In the typical approach to learning in data mining, random data (the
training set of patterns) are collected and used to design a classification
rule. One of the most well known such rules is the Knearestneighbor
rule in which an unknown pattern is classified into the majority class
among its K nearest neighbors in the training set. Steven Wang, York University We introduce a novel statistical procedure to cluster categorical data based on distance vectors. The proposed method is conceptually simple and computationally straightforward as it does not require any specific statistical models or any convergence criteria. Moreover, unlike most existing algorithms that compute the class membership or membership probability for every data point at each iteration, our algorithm sequentially extracts clusters from the given data set. That is, at each iteration our algorithm only strives to identify one cluster, which will then be deleted from the data set at the next iteration; this procedure repeats until there are no more significant clusters in the remaining data. Consequently, the number of clusters can be determined automatically by the algorithm. As for the identification and extraction of a cluster, we first locate the cluster center by using a Pearson Chisquare type statistic on the basis of categorical distance vectors. The partition of the data set produced by our algorithm is unique and robust to the input order of data points. The performance of the proposed algorithm is examined by both simulated and real world data sets. Comparisons with two wellknown clustering algorithms, Kmodes and AutoClass, show that the proposed algorithm substantially outperforms these competitors, with the classification rate or the information gain typically improved by several orders of magnitudes. S. Stanley Young, National Institute of Statistical
Sciences Drug discovery is coming into multiple large data sets, micro arrays, protein arrays, chemical structural descriptors. All of these data sets have many more columns than rows, n << p. We will explore various methods of pattern matching between multiple twoway tables. The benefit of the methods will be the discovery and possible validation of biological targets for drug discovery. Ruben. Zamar, University of British Columbia We consider the role of various performance measures for supervised and
unsupervised learning. Real world applications of datamining typically
require In our experience, the choice of a good subset of variables to perform
a data mining task may be at least as important as the choice of the procedure.
We We will also discuss the problem of detecting a relative small fraction
of "interesting" cases (in the unsupervised learning setup)
using robust Mahalanobis distances and distance based outliers. Ji Zhu, Univesity of Michigan The support vector machine is a widely used tool for classification. In this talk, we consider two types of the support vector machine: the 1norm SVM and the standard 2norm SVM. Both types can be written as regularized optimization problems. In all current implementations for fitting an SVM model, the user has to supply a value for the regularization parameter. To select an appropriate regularization parameter, in practice, people usually prespecify a finite set of values for the regularization parameter that covers a wide range, then either use a separate validation data set or use crossvalidation to select a value for the regularization parameter that gives the best performance among the given set. In this talk, we argue that the choice of the regularization parameter can be critical. We also argue that the 1norm SVM may have some advantage over the standard 2norm SVM under certain situations, especially when there are redundant noise features. We show that the solution paths for both the 1norm SVM and the 2norm SVM are piecewise linear functions of the regularization parameter. We then derive two algorithms, respectively for the 1norm SVM and the 2norm SVM, that can fit the entire paths of SVM solutions for every value of the regularization parameter, hence facilitate adaptive selection of the regularization parameter for SVM models. It turns out that the piecewise linear solution path property is not unique to the SVM models. We will propose some general conditions on the generic regularized optimization problem for the solution path to be piecewise linear, which suggest some new useful predictive statistical modeling tools. This is joint work with Saharon Rosset (IBM T.J.Watson), Trevor Hastie (Stanford U.), and Rob Tibshirani (Stanford U.) Mu Zhu, University of Waterloo We construct a special radial basis function (RBF) network model to detect
items belonging to a rare class from a large database. Our primary example
is a real drug discovery application. Our method can be viewed as modeling
only the rare class but allowing for local adjustments depending on the
density of the background class in local neighborhoods. We offer a statistical
explanation of why such an approach is appropriate and efficient for the
detection problem. Our statistical explanation together with our empirical
success with this model has implications for a new paradigm for solving
these detection problems in general. This work is joint with Wanhua Su
and Hugh Chipman. Djamel A. Zighed, University of Lyon 2, France The presentation will focus on tree issues concerning the construction of induction graphs from data. The first one is how to obtain the optimal joinpartitioning of the contingency
tables. We propose a quasioptimal algorithm for resizing contingency
tables, which takes into account the type (nominal or ordinal) of the
attributes crossed with the categorical variable to predict. Alex Zolotovitski, Generation 5 Automated Trade area analysis. Case study of G5 MWM software application We created an automatized tool for estimating, at the Zip+4 level, the
dollar amount that each US household (approximately 35 million zip+4's)


