mathematical medicine

June 11-12, 2010
Workshop on Optimization and Data Analysis in Biomedical Informatics

Organizing Committee:
Thomas F. Coleman, University of Waterloo
Sivabal Sivaloganathan, University of Waterloo
Panos Pardalos, University of Florida
Petros Xanthopoulos, University of Florida

Speaker Abstracts:

Richard Cook, University of Waterloo
Robust Estimation of Mean Functions and Treatment Effects for Recurrent Events under Event-Dependent Censoring and Termination: Application to Skeletal Complications in Cancer Metastatic to Bone

Recurrent events frequently arise as responses in clinical trials and other experimental settings. The definition and estimation of treatment effects involve a number of interesting issues, especially when loss to follow-up may be event-related and when there are terminal events, such as death, that preclude the occurrence of further events. We discuss a clinical trial of breast cancer patients with bone metastases, where the recurrent events are skeletal complications and where patients may die during the trial. A number of ways to specify treatment effects for recurrent event responses have been described. We argue here in favor of measures based on marginal rate and mean functions; estimates of these functions are very useful as data summaries and as a basis for measuring the effects of treatment or other baseline covariates. When recurrent event data are subject to event-dependent censoring, however, ordinary marginal methods may yield inconsistent estimates.

Incorporating inverse probability of censoring weights into the relevant estimating equations can protect against such censoring and yield consistent estimates of marginal features, provided the censoring process is correctly specified. An alternative approach is to obtain estimates of rate and mean functions from intensity-based models that render censoring conditionally independent, but this often complicates the interpretation of covariate effects. We consider three methods of estimating mean functions of recurrent event processes and examine the bias and efficiency of unweighted and inverse probability weighted versions of the methods. The methods are extended to deal with the presence of a terminating event. We compare the methods via simulation and apply them to the clinical trial.

This is joint work with Jerry Lawless, Lajmi Lakhal-Chaieb and Ker-Ai Lee.
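The inverse weighting idea can be sketched in a few lines. The following is a minimal simulation, not the authors' estimator: it assumes censoring independent of the event process (precisely the benign case; the talk addresses event-dependent censoring), a Poisson-type recurrent event process, and an empirical estimate of the censoring survivor function.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated recurrent events: each subject has exponential gap times
# (rate 2 events per unit time), observed only up to an independent
# uniform censoring time on (1, 3).
n = 200
cens = rng.uniform(1.0, 3.0, size=n)
events = [np.cumsum(rng.exponential(0.5, size=12)) for _ in range(n)]

def G_hat(s):
    """Estimated probability of remaining uncensored past time s
    (empirical survivor function of the censoring times)."""
    return np.mean(cens > s)

def ipcw_mean(t):
    """IPCW estimate of mu(t) = E[N(t)]: each event observed at time
    s <= min(t, C_i) gets weight 1 / G_hat(s), so events that survive
    heavy censoring stand in for the ones that were cut off."""
    total = 0.0
    for i in range(n):
        observed = events[i][events[i] <= min(t, cens[i])]
        total += sum(1.0 / G_hat(s) for s in observed)
    return total / n

# The true mean function here is mu(t) = 2t, so the estimate at
# t = 2.0 should be close to 4.
print(round(ipcw_mean(2.0), 2))
```

An unweighted version (dropping the `1 / G_hat(s)` factor) would systematically undercount events occurring after the earliest censoring times.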


Christodoulos A. Floudas, Princeton University
Advances in Biclustering Methods for Re-Ordering Data Matrices in Systems Biology, Drug Discovery, and Prediction of In Vivo Toxicities from In Vitro Experimental Data

Biclustering has emerged as an important problem in the analysis of gene expression data since genes may only jointly respond over a subset of conditions. Many of the methods for biclustering, and clustering algorithms in general, utilize simplified models or heuristic strategies for identifying the “best” grouping of elements according to some metric and cluster definition, and thus result in suboptimal clusters.

In the first part of the presentation, we present a rigorous approach to biclustering, OREO, which is based on the Optimal RE-Ordering of the rows and columns of a data matrix so as to globally minimize the dissimilarity metric [1,2]. The physical permutations of the rows and columns of the data matrix can be modeled as either a network flow problem or a traveling salesman problem. The performance of OREO is tested on several important data matrices arising in systems biology to validate the proposed method and compare it to existing biclustering and clustering methods.
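The re-ordering objective can be illustrated with a toy example. The sketch below uses a greedy nearest-neighbour heuristic for the row ordering; OREO itself solves the network-flow/TSP formulations to global optimality, which this code does not attempt.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data matrix: ten rows drawn from two latent patterns, then shuffled.
pattern_a = np.linspace(0.0, 1.0, 8)
pattern_b = np.linspace(1.0, 0.0, 8)
rows = np.array([p + 0.05 * rng.standard_normal(8)
                 for p in [pattern_a] * 5 + [pattern_b] * 5])
rows = rows[rng.permutation(len(rows))]

def tour_cost(X, order):
    """Sum of Euclidean dissimilarities between adjacent rows."""
    return sum(np.linalg.norm(X[a] - X[b]) for a, b in zip(order, order[1:]))

def greedy_reorder(X):
    """Nearest-neighbour tour over the rows: repeatedly append the
    unvisited row closest to the current one -- a cheap heuristic
    stand-in for the exact TSP / network-flow re-ordering models."""
    remaining = list(range(1, len(X)))
    order = [0]
    while remaining:
        nxt = min(remaining, key=lambda i: np.linalg.norm(X[i] - X[order[-1]]))
        remaining.remove(nxt)
        order.append(nxt)
    return order

order = greedy_reorder(rows)
# The greedy ordering groups rows with the same latent pattern,
# lowering the total adjacent-row dissimilarity versus the shuffle.
print(tour_cost(rows, order) < tour_cost(rows, list(range(len(rows)))))
```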

In the second part of the talk, we will focus on novel methods for clustering of data matrices that are very sparse [3]. These types of data matrices arise in drug discovery, where the x- and y-axes of a data matrix can correspond to different functional groups for two distinct substituent sites on a molecular scaffold. Each possible x and y pair corresponds to a single molecule which can be synthesized and tested for a certain property, such as percent inhibition of a protein function. For even moderate-size matrices, synthesizing and testing even a small fraction of the molecules is labor intensive and not economically feasible. Thus, it is of paramount importance to have a reliable method for guiding the synthesis process to select molecules that have a high probability of success. We introduce a new strategy to enable efficient substituent reordering and descriptor-free property estimation. Our approach casts substituent reordering as a special high-dimensional rearrangement clustering problem, eliminating the need for functional approximation and enhancing computational efficiency [4,5]. Deterministic optimization approaches based on mixed-integer linear programming can provide guaranteed convergence to the optimal substituent ordering. The proposed approach is demonstrated on a sparse data matrix (about 29% dense) of inhibition values for 14,043 unknown compounds provided by Pfizer Inc. It is shown that an iterative synthesis strategy is able to uncover a significant percentage of the lead molecules while using only a fraction of the total compound library, even when starting from a mere 3% of the total library space.

In the third part of the presentation, we combine the strengths of integer linear optimization and machine learning to predict in vivo toxicities for a library of pesticide chemicals using only in vitro data. Our approach utilizes a biclustering method based on iterative optimal re-ordering [1,2] to identify biclusters corresponding to subsets of chemicals that have similar responses over distinct subsets of the in vitro assays. This enables us to determine subsets of experimental assays that are most likely to be correlated with toxicity, according to the in vivo data set. An optimal method based on integer linear optimization (ILP) for re-ordering sparse data matrices [3] is also applied to the in vivo dataset (21.3% sparse) in order to cluster endpoints that have similar lowest effect level (LEL) values, where it is observed that endpoints are grouped according to similar physiological attributes. Based upon the clustering results of the in vitro and in vivo data sets, logistic regression is then utilized to (a) learn the correlation between the subsets of in vitro data and the in vivo responses, and (b) subsequently predict the toxicity signatures of the chemicals. Our approach aims to find the highest prediction accuracy using the minimum number of in vitro descriptors.

[1] DiMaggio, P. A., McAllister, S. R., Floudas, C. A., Feng, X. J., Rabinowitz, J. D., and Rabitz, H. A., "Biclustering via Optimal Re-ordering of Data Matrices in Systems Biology: Rigorous Methods and Comparative Studies", BMC Bioinformatics, 9, 458 (2008).
[2] DiMaggio, P. A., McAllister, S. R., Floudas, C. A., Feng, X. J., Rabinowitz, J. D., and Rabitz, H. A., "A Network Flow Model for Biclustering via Optimal Re-ordering of Data Matrices", J. Global Optimization, in press (2010).
[3] McAllister, S. R., DiMaggio, P. A., and Floudas, C. A., "Mathematical Modeling and Efficient Optimization Methods for the Distance-Dependent Rearrangement Clustering Problem", J. Global Optimization, 45, 111-129 (2009).
[4] McAllister, S. R., Feng, X. J., DiMaggio, P. A., Floudas, C. A., Rabinowitz, J. D., and Rabitz, H. A., "Descriptor-Free Molecular Discovery in Large Libraries by Adaptive Substituent Reordering", Bioorganic & Medicinal Chemistry Letters, 18, 5967-5970 (2008).
[5] DiMaggio, P. A., McAllister, S. R., Floudas, C. A., Feng, X. J., Rabinowitz, J. D., and Rabitz, H. A., "Enhancing Molecular Discovery Using Descriptor-Free Rearrangement Clustering Techniques for Sparse Data Sets", AIChE J., 56(2), 405-418 (2010).


Hamid Ghaffari, University of Toronto
A semi-infinite linear programming approach to Gamma Knife Perfexion

Perfexion, the latest model of Gamma Knife treatment units, is used primarily for radiosurgery treatments of brain tumors and lesions. Treatments for Perfexion must be designed entirely manually, which can lead to errors or sub-optimal treatments. We model the Perfexion treatment planning problem as a second-order conic optimization problem and solve it using a semi-infinite linear programming approach. The resulting treatment plans meet or exceed clinical guidelines for both radiosurgery and radiotherapy treatments, and can be obtained in a reasonable period of time.


Mario R. Guarracino, Italian National Research Council
Mathematical Models of Supervised Learning and their Application to Medical Diagnosis

Supervised learning refers to the ability of a system to learn from a set of examples, represented by a set of input/output pairs; this set of examples is referred to as the training set. A trained system is able to provide an answer (output) to each new question (input). The term supervised derives from the fact that the output for all training examples is provided by an expert. In the simplest case, when the response is one of only two possibilities, the problem is referred to as binary classification.

Supervised learning models are applicable in many fields of science and technology, such as economics, engineering and medicine. In this study we focus on the discrimination among different types of leukemia.

Among supervised learning algorithms are the so-called Support Vector Machines (SVMs), which exhibit accurate solutions and low training times. They are based on statistical learning theory and provide the solution by minimizing a quadratic cost function. SVMs, in conjunction with kernel methods, provide nonlinear classification models, namely separations that cannot be expressed using inequalities on linear combinations of parameters.
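The training problem behind linear SVMs can be sketched in a few lines. The following is a Pegasos-style stochastic subgradient method on synthetic data, offered as a minimal illustration of hinge-loss minimization rather than any of the solvers discussed in the talk.

```python
import numpy as np

rng = np.random.default_rng(2)

# Two well-separated classes in the plane, labels in {-1, +1}.
X = np.vstack([rng.normal(-2.0, 0.5, (50, 2)), rng.normal(2.0, 0.5, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

def train_linear_svm(X, y, lam=0.01, epochs=200):
    """Stochastic subgradient descent (Pegasos-style) on the SVM
    objective  lam/2 ||w||^2 + mean(max(0, 1 - y_i w.x_i)):
    shrink w each step, and push it toward margin-violating points."""
    w = np.zeros(X.shape[1])
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(len(y)):
            t += 1
            eta = 1.0 / (lam * t)          # decaying step size
            w *= (1.0 - eta * lam)         # regularization shrinkage
            if y[i] * (w @ X[i]) < 1:      # hinge loss is active
                w += eta * y[i] * X[i]
    return w

w = train_linear_svm(X, y)
accuracy = np.mean(np.sign(X @ w) == y)   # near 1.0 on separable data
print(accuracy)
```

A kernelized variant would replace the inner products `w @ X[i]` with kernel evaluations, giving the nonlinear separations mentioned above.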

There are some problems that may reduce the effectiveness of these methods. We are sometimes faced with very large training sets: in multi-center clinical trials, for example, experts from different institutions collect data on many patients. In this case, the techniques currently in use determine the model by considering all the available data. Although such models are well suited to the cases under consideration, they do not provide accurate answers in general. It is therefore necessary to identify a subset of the training set that contains all the available information, yielding a model that still generalizes to new testing data.

It is also possible that the training set varies over time, for example because data are added and modified as a result of new tests or new knowledge. In this case, current techniques are not able to capture the changes and must restart the learning process from the beginning. Techniques that extract only the new knowledge contained in the data and update the learning model incrementally have the advantage of taking into account only the genuinely useful experiments, and they speed up the analysis.

In this talk, we describe some solutions to these problems, with the support of numerical experiments.


Igor Jurisica, University of Toronto
Systematic identification and characterization of cancer biomarkers

Cancer development is a multi-step process that leads to uncontrolled tumor cell growth. Multiple pathways are involved, typically some signalling and regulatory pathways are activated, while others are suppressed. Existing prognostic and predictive biomarkers overlap only partially, and often do not validate on external cohorts or by other biological assays. Several technical reasons may explain this. Further evidence shows that several clinically identical gene signatures exist, and that the best signature may not comprise the most differential genes.

To address these challenges, we developed a system for an integrative analysis, prediction and characterization of signatures and relevant protein-protein interactions. These computational predictions will lead to improved human interactome coverage relevant to both basic and cancer biology, and, importantly, may accelerate the development of novel therapies for many cancers.


Ioannis A. Kakadiaris, University of Houston
Challenges and Opportunities for Extracting Cardiovascular Risk Biomarkers from non-contrast CT data

In this talk, I will first offer a short overview of the research activities of the Computational Biomedicine Laboratory, University of Houston. Then, I will present our research in the area of biomedical image computing: mining information from cardiovascular imaging data to detect persons with a high likelihood of developing a heart attack in the near future (vulnerable patients). Specifically, I will present methods for detection and segmentation of anatomical structures, and for shape and motion estimation of dynamic organs. The left ventricle in non-invasive cardiac MRI data is extracted using a new multi-class, multi-feature fuzzy connectedness method and deformable models for shape and volume estimation. In non-invasive cardiac CT data, the thoracic fat is detected using a relaxed version of the multi-class, multi-feature fuzzy connectedness method. Additionally, the calcified lesions in the coronary arteries are identified and quantified from the CT data using a hierarchical supervised learning framework. In non-invasive contrast-enhanced CT, the coronary arteries are detected using our tubular shape detection method for motion estimation and, possibly, for non-calcified lesion detection. In invasive IVUS imaging, our team has developed a unique IVUS acquisition protocol and novel signal/image analysis methods for the detection (for the first time in vivo) of the ‘vasa vasorum’ (VV). The VV are micro-vessels that commonly feed the walls of larger vessels; however, recent clinical evidence has uncovered their tendency to proliferate around areas of inflammation, including the inflammation associated with vulnerable plaques. In summary, our work is focused on developing innovative computational tools to mine quantitative parameters from imaging data for early detection of asymptomatic cardiovascular patients.
The expected impact of our work stems from the fact that sudden heart attack remains the number one cause of death in the US, and unpredicted heart attacks account for the majority of the $280 billion burden of cardiovascular diseases.


Panos M. Pardalos, University of Florida
Consistent Biclustering via Nonlinear 0-1 Optimization

Biclustering is a data mining technique that consists of simultaneously partitioning the set of samples and the set of their attributes (features) into subsets (classes). Samples and features classified together are supposed to have a high relevance to each other, which can be observed in the intensity of their expression. We have defined the notion of consistency for biclustering using the interrelation between centroids of sample and feature classes. We have proved that consistent biclustering implies separability of the classes by convex cones.
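A small numerical illustration of the consistency condition, using a toy matrix and hypothetical class labels (the centroid comparison below is a simplified reading of the condition, not the talk's formal definition):

```python
import numpy as np

# Toy expression matrix: rows = samples, columns = features, with a
# block structure of two sample classes and two feature classes.
A = np.array([[5., 5., 1., 1.],
              [6., 4., 0., 2.],
              [1., 0., 5., 6.],
              [2., 1., 4., 5.]])
sample_class = np.array([0, 0, 1, 1])
feature_class = np.array([0, 0, 1, 1])

def is_consistent(A, s_cls, f_cls):
    """Centroid-based consistency check: each sample's average
    expression must peak over the feature class matching its own
    class, and symmetrically for each feature over sample classes."""
    k = s_cls.max() + 1
    # per-sample averages over each feature class: n_samples x k
    S = np.stack([A[:, f_cls == c].mean(axis=1) for c in range(k)], axis=1)
    # per-feature averages over each sample class: n_features x k
    F = np.stack([A[s_cls == c, :].mean(axis=0) for c in range(k)], axis=1)
    return bool(np.all(S.argmax(axis=1) == s_cls)
                and np.all(F.argmax(axis=1) == f_cls))

print(is_consistent(A, sample_class, feature_class))  # True
```

Swapping the feature labels breaks the block structure, and the same check returns `False`; feature selection and outlier detection in the talk are what restore consistency on real data.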

In this talk, we discuss both supervised and unsupervised learning algorithms, where biclustering consistency is achieved by feature selection and outlier detection, as well as simulation studies investigating the probability that a consistent biclustering exists on random data. The developed models involve the solution of a fractional 0-1 programming problem. Computational results on DNA microarray data mining problems are reported.


Kostas Tsakalis, Arizona State University
Feedback Control for Epileptic Seizure Suppression

Experimental observations indicate that transitions to epileptic seizures are preceded by some form of increased synchronization. Similar behavior can be observed by varying the parameters in coupled chaotic oscillators, coupled neural populations, and cortical neural populations with inhibitory-excitatory inputs. Such models are very attractive tools for understanding the basic mechanisms producing seizure-like phenomena, as they explore general network properties and functionality and do not rely on specific mathematical models. Their simulation models also allow the quick investigation and preliminary evaluation of strategies for seizure prediction and seizure suppression.

Viewing feedback action as a functional model achieving homeostasis in various quantities, such as firing rate or neural output correlation, we model the generation of seizures as functional behavior of the populations rather than a generic connectivity property. We then examine the effectiveness of a variety of stimulation strategies in the suppression of seizures. Our objective is to maintain "normal activity", in terms of observed measures such as synchrony, correlation and amplitude, in neural systems that are subject to extrinsic and intrinsic perturbations. We evaluate several external seizure-control paradigms: open-loop periodic stimulation, closed-loop "modulating" control with predefined stimuli, and closed-loop feedback decoupling control. While the precise implementation of these seizure suppression strategies is an open issue, both simulation models and preliminary experimental data from animal models indicate that the use of multiple sites for data collection and stimulation is essential for handling a wide range of seizure types.
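A toy analogue of the feedback decoupling idea, using two coupled Kuramoto phase oscillators rather than the neural population models of this work: proportional feedback that cancels part of the coupling prevents the oscillators from phase-locking.

```python
import numpy as np

def simulate(K, g, steps=20000, dt=1e-3):
    """Two coupled phase oscillators; the feedback term u cancels
    part of the coupling K (a crude decoupling controller).
    Returns the time-averaged synchrony |cos((theta2 - theta1)/2)|,
    which is 1 for perfect phase locking."""
    w = np.array([1.0, 1.3])        # natural frequencies
    theta = np.array([0.0, 2.0])    # initial phases
    sync = 0.0
    for _ in range(steps):
        d = theta[1] - theta[0]
        u = g * np.sin(d)           # feedback decoupling action
        dtheta = np.array([w[0] + K * np.sin(d) - u,
                           w[1] - K * np.sin(d) + u])
        theta = theta + dt * dtheta
        sync += abs(np.cos(d / 2.0))
    return sync / steps

locked = simulate(K=2.0, g=0.0)     # open loop: oscillators lock
freed = simulate(K=2.0, g=2.0)      # feedback cancels the coupling
print(locked > freed)               # feedback lowers the synchrony
```

With `g = 0` the phase difference settles near a fixed point (synchrony near 1); with `g = K` the effective coupling vanishes and the phases drift apart, mimicking the desynchronizing goal of the closed-loop paradigms above.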


Stephen Wright, University of Wisconsin-Madison
Developments in Optimization Algorithms for Data Analysis

Data analysis problems from biomedical informatics and related areas often have large size and/or complexity, particularly when interactions between effects are considered as predictors along with main effects. Moreover, scientists often seek simple, approximate solutions, which can be obtained from optimization formulations with nonsmooth regularization terms. The demanding features of these optimization problems are driving development of algorithms that account for the difficulty of the problem, the size of the data sets, the presence of regularization terms, and the type of solutions needed. We review algorithmic developments for several classes of problems that arise in biomedical informatics and other areas.
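One canonical example of such a formulation is l1-regularized least squares, whose nonsmooth penalty is handled by proximal-gradient methods. The sketch below (ISTA on hypothetical synthetic data) is a minimal illustration of the structure these algorithms exploit, not a survey of the methods in the talk.

```python
import numpy as np

rng = np.random.default_rng(4)

# Sparse regression toy problem: only 3 of 50 predictors matter.
n, p = 100, 50
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[[3, 17, 40]] = [2.0, -3.0, 1.5]
y = X @ beta_true + 0.1 * rng.standard_normal(n)

def ista(X, y, lam=0.1, iters=500):
    """Proximal gradient (ISTA) for
        min_b  0.5/n ||y - X b||^2 + lam ||b||_1 :
    a gradient step on the smooth least-squares term, then
    soft-thresholding as the proximal map of the l1 penalty."""
    L = np.linalg.norm(X, 2) ** 2 / len(y)   # Lipschitz constant of grad
    b = np.zeros(X.shape[1])
    for _ in range(iters):
        g = X.T @ (X @ b - y) / len(y)
        z = b - g / L
        b = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
    return b

b = ista(X, y)
support = np.flatnonzero(np.abs(b) > 0.5)    # recovered predictors
print(support)
```

The soft-thresholding step is what produces the simple, approximate (sparse) solutions mentioned above: most coefficients are set exactly to zero.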


Petros Xanthopoulos, University of Florida
Data mining approaches for a multiclass cell death discrimination problem

Accurate cell death discrimination is a time-consuming and expensive process that can only be performed in biological laboratories. Nevertheless, it is very useful and arises in many biological and medical applications. We propose a data mining approach, applied to Raman spectroscopy data, for a multiclass cell death discrimination problem. We combine a regularized generalized eigenvalue algorithm for multiclass classification with a novel dimensionality reduction algorithm based on spectral clustering. Results show that the proposed algorithmic scheme can classify cells from three different classes (apoptotic death, necrotic death and control cells) with over 97% accuracy.
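The flavor of generalized-eigenvalue classification can be sketched in the binary case. The following ReGEC-style illustration on synthetic blobs is an assumption-laden toy (two classes, two dimensions, made-up data); the method in the talk handles multiple classes and real spectra.

```python
import numpy as np

rng = np.random.default_rng(3)

A = rng.normal([0.0, 0.0], 0.3, (60, 2))   # class A points
B = rng.normal([3.0, 3.0], 0.3, (60, 2))   # class B points

def gep_plane(P, Q, delta=1e-3):
    """Plane w.x = gamma closest (least squares) to the points P and
    farthest from Q: the eigenvector with the smallest eigenvalue of
    the regularized generalized eigenproblem G z = lambda H z, with
    G = [P -e]'[P -e] and H = [Q -e]'[Q -e], z = (w, gamma)."""
    Pa = np.hstack([P, -np.ones((len(P), 1))])
    Qa = np.hstack([Q, -np.ones((len(Q), 1))])
    G = Pa.T @ Pa + delta * np.eye(3)      # regularization keeps the
    H = Qa.T @ Qa + delta * np.eye(3)      # problem well conditioned
    vals, vecs = np.linalg.eig(np.linalg.solve(H, G))
    z = np.real(vecs[:, np.argmin(np.real(vals))])
    return z[:2], z[2]                     # (w, gamma)

wa, ga = gep_plane(A, B)
wb, gb = gep_plane(B, A)

def classify(x):
    """Assign x to the class whose fitted plane is nearer."""
    da = abs(wa @ x - ga) / np.linalg.norm(wa)
    db = abs(wb @ x - gb) / np.linalg.norm(wb)
    return 'A' if da < db else 'B'

acc = np.mean([classify(x) == 'A' for x in A] +
              [classify(x) == 'B' for x in B])
print(acc)
```

Each class gets its own plane, and classification is by distance to the nearer plane; the regularization term `delta` is what makes the eigenproblem well posed on (near-)degenerate data.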
