January 26 – 30, 2015 Workshop on Big Data and Statistical Machine Learning
Organizing Committee
Ruslan Salakhutdinov (Chair) Dale Schuurmans Yoshua Bengio	Hugh Chipman Bin Yu

Back to Workshop index

Speaker Abstracts

Samy Bengio, Google Inc
The Battle Against the Long Tail

The curse of the long tail, where a small number of objects/words/classes appear very often and are thus easy to model, while many many more are rare and thus hard to model, has always been a problem in machine learning. This presentation starts by stating the problem and explaining why representation sharing in general, and embedding approaches in particular, are good to represent tail objects. Several embedding approaches are then presented, in increasing levels of complexity, to show how to tackle the long tail problem, from rare classes to unseen classes in image classification, as well as rare words in speech recognition; Finally, we present our latest results on image description, which can be seen as an ultimate rare class problem since each image is attributed to a novel, yet structured, class in the form of a meaningful sentence.

Yoshua Bengio, Université de Montréal
Exploring alternatives to Boltzmann machines

Boltzmann machines and their variants (restricted or deep) have been the dominant model for generative neural network models for a long time and they are appealing among other things because of their relative biological plausibility (say, compared to back-prop). We start this presentation by discussing some of the difficulties we encountered in training them, and undirected graphical models in general, and ask the question of the existence of credible alternatives which avoid the issues with the partition function gradient and mixing between modes with MCMC methods. We review advances of recent years to train deep unsupervised models that capture the data distribution, all related to auto-encoders, and that avoid the partition function and MCMC issues. In particular recent theoretical and empirical work on denoising auto-encoders can be extended to train deep generative models with latent variables. The presentation will end with an introduction to on-going exploration of a framework that is meant to be more biologically plausible as well as an alternative to both Boltzmann machines and back-propagation.

David Blei, Princeton University
Probabilistic Topic Models and User Behavior

Traditional topic modeling algorithms analyze a document collection and estimate its latent thematic structure. However, many collections contain an additional type of data: how people use the documents. For example, readers click on articles in a newspaper website, scientists place articles in their personal libraries, and lawmakers vote on a collection of bills. Behavior data is essential both for making predictions about users (such as for a recommendation system) and for understanding how a collection and its users are organized.

In this talk, I will review the basics of topic modeling and describe our recent research on collaborative topic models, models that simultaneously analyze a collection of texts and its corresponding user behavior. We studied collaborative topic models on 80,000 scientists' libraries from Mendeley and 100,000 users' click data from the arXiv. Collaborative topic models enable interpretable recommendation systems, capturing scientists' preferences and pointing them to articles of interest. Further, these models can organize the articles according to the discovered patterns of readership. For example, we can identify articles that are important within a field and articles that transcend disciplinary boundaries.

Yura Burda, Fields Institute
Raising the Reliability of Estimates of Generative Performance of MRFs

Markov Random Fields are powerful and expressive models used in multiple tasks in computer vision and machine learning. They can encode complicated multidimensional distributions, like the distribution of pixel intensities in pictures of handwritten digits or characters. One cannot however evaluate the generative performance of these models, since exact computation of data log-likelihood is intractable in most cases and approximate methods, such as AIS, tend to provide overly optimistic estimates of the performance.

This talk will introduce the “RAISE” procedure for replacing a given MRF with a directed graphical model that can approximate the original MRF arbitrarily well. Being a directed model, this approximation allows one to draw exact samples from it and compute stochastic lower bounds on the data log-likelihood. This tractable approximation can be then used for model selection and model comparisons. As an example, it has been suggested that previously published Deep Belief Networks provide state-of-the-art generative models of the MNIST dataset of handwritten digits. Using the RAISE procedure we can now be sure that they are

Brendan Frey, University of Toronto
The infinite genome project: Using statistical induction to understand the genome and improve human health

A common $1-billion misconception in the genomics community is that sequencing a lot of genomes will be necessary for understanding and curing disease. In my talk, I'll explain how statistical induction using a single genome can be used to identify disease mutations more accurately than projects that have involved thousands of genomes. I'll describe my group's research on developing a good statistical model that can generalize across genomes and make accurate predictions for mutations that have never been seen before (Xiong et al, Science, January 9, 2015). Another workshop speaker, Cynthia Rudin, sensibly says "possibly the most important obstacle in the deployment of predictive models is the fact that humans simply do not trust them." However, my experience with high-stakes tasks leads me to the following paraphrase: "possibly the most important advantage of machine learning in science and medicine is the fact that humans simply do not trust other humans."

Roger Grosse, University of Toronto
Scaling up natural gradient by factorizing Fisher information

Training high-dimensional probabilistic models with natural gradient descent is difficult, because it requires solving linear systems involving a large Fisher information matrix. Approximations typically use efficiently invertible but potentially inaccurate approximations (e.g. diagonal) to the Fisher information, or require expensive iterative procedures to approximately solve the linear system. I'll present an algorithm called Factorized Natural Gradient (FaNG), where the Fisher information matrix is approximated with a Gaussian graphical model whose precision matrix can be computed efficiently. This approximation is expressive, yet allows for efficient updates. I analyze the Fisher information matrix for a small RBM and derive a graphical model which is a good match for the distribution over sufficient statistics. Using FaNG with this structure allows RBMs to be trained more efficiently compared with stochastic gradient descent. Additionally, this analysis helps explain the surprisingly good performance of the "centering trick."

John Langford, Microsoft Research
Learning to explore

In medicine, randomized clinical trials are used to determine treatments rather than machine learning. Here, I'll talk about why and about how to best combine machine learning with randomized exploration. The solution has very broad implications for many applications of interactive learning such as search, recommendation, and resource allocation where (critically) feedback is available for choices taken, but not those not taken.

Radford Neal, University of Toronto
Learning to Randomize and Remember in Partially-Observed Environments

Reinforcement learning methods are well developed for contexts where the complete state of the world is observed. However, most realistic learning problems for humans, animals, and robots require that we learn to act based on observations that provide only partial information about the state of the world. Remembering our past observations and actions can then be beneficial, but if we have limited memory and computation, we need to learn what to remember. Furthermore, if we do not remember everything, it can be beneficial to learn to act randomly. I will present new reinforcement learning methods for this context, and discuss how they might be extended to large-scale learning problems.

Joelle Pineau, McGill University
Practical kernel-based reinforcement learning

Kernel-based reinforcement learning (KBRL) stands out among reinforcement learning algorithms for its strong theoretical guarantees. Unfortunately, the model constructed by KBRL grows with the number of sample transitions, resulting in a computational cost that precludes its application to large-scale domains. In this talk I will present new results showing how to turn KBRL into practical reinforcement learning methods for large datasets by leveraging the stochastic factorization trick. I will discuss both offline and online variants of the kernel-based stochastic factorization (KBSF) algorithm, and describe upper bounds for the distance between the value functions computed by KBRL and KBSF using the same data. Finally, I will illustrate the potential of this method with several empirical results, including domains from robotics and optimal treatment design.

Daniel Roy, University of Toronto
Mondrian Forests: Efficient Online Random Forests

Ensembles of randomized decision trees, usually referred to as random forests, are widely used for classification and regression tasks in machine learning and statistics. Random forests achieve competitive predictive performance and are computationally efficient to train and test, making them excellent candidates for real-world prediction tasks. The most popular random forest variants (such as Breiman's random forest and extremely randomized trees) operate on batches of training data. Online methods are now in greater demand. Existing online random forests, however, require more training data than their batch counterpart to achieve comparable predictive performance. In this work, we use Mondrian processes (Roy and Teh, 2009) to construct ensembles of random decision trees we call Mondrian forests. Mondrian forests can be grown in an incremental/online fashion and remarkably, the distribution of online Mondrian forests is the same as that of batch Mondrian forests. Mondrian forests achieve competitive predictive performance comparable with existing online random forests and periodically re-trained batch random forests, while being more than an order of magnitude faster, thus representing a better computation vs accuracy tradeoff.

Cynthia Rudin, MIT CSAIL and Sloan School of Management
Thoughts on Interpretable Machine Learning

Possibly *the* most important obstacle in the deployment of predictive models is the fact that humans simply do not trust them. If it is known exactly which variables were important for the prediction and how they were combined, this information can be very powerful in helping to convince people to believe (or not believe) the prediction and make the right decision. In this talk I will discuss algorithms for making these non-black box predictions including:

1) "Bayesian Rule Lists" - This algorithm builds a decision list using a probabilistic model over permutations of IF-THEN rules. It competes with the CART algorithm for building accurate-yet-interpretable logical models. It is not a greedy algorithm like CART.

2) "Falling Rule Lists" - These are decision lists where the probabilities decrease monotonically along the list. These are really useful for medical applications because they stratify patients into risk categories from highest to lowest risk.

3) "Bayesian Or's of And's" - These are disjunctions of conjunction models (disjunctive normal forms). These models are natural for modeling customer preferences in marketing.

4) "The Bayesian Case Model" - This is a case-based reasoning clustering method. It provides a prototypical exemplar from each cluster along with the subspace that is important for the cluster.

I will show lots of applications of these models to healthcare and marketing.

links:

Building Interpretable Classifiers with Rules using Bayesian Analysis: Building a Better Stroke Prediction Model http://web.mit.edu/rudin/www/LethamRuMcMa14.pdf
This is joint work with my student Ben Letham and colleagues Tyler McCormick and David Madigan. This paper was the winner of Data Mining Best Student Paper Award, INFORMS 2013. It also won the Student Paper Competition sponsored by the Statistical Learning and Data Mining section (SLDM) of the American Statistical Association, 2014.

Falling Rule Lists.
http://arxiv.org/abs/1411.5899
This is joint work with my student Fulton Wang.
Proceedings of Artificial Intelligence and Statistics (AISTATS), 2015 This paper won the Student Paper Competition sponsored by the Statistical Learning and Data Mining section (SLDM) of the American Statistical Association, 2015.

Bayesian Or's of And's for Interpretable Classification with Application to Context Aware Recommender Systems. http://web.mit.edu/rudin/www/TechReportBOA_WangEtAl.pdf
This is joint work with my student Tong Wang, colleague Finale Doshi, and also Yimin Liu, Erica Klampfl and Perry MacNeille from Ford Motors.

The Bayesian Case Model: A Generative Approach for Case-Based Reasoning and Prototype Classification
http://web.mit.edu/rudin/www/KimRuSh14.pdf
Proceedings of NIPS, 2014.
This is joint work with student Been Kim and colleague Julie Shah.

Other links are on my webpage under "Interpretable Predictive Models" http://web.mit.edu/rudin/www/PredicticsPapers.html

Alexander Schwing, The Fields Institute
Deep Learning meets Structured Prediction

Many problems in real-world applications involve predicting several random variables which are statistically related. Markov random fields (MRFs) are a great mathematical tool to encode such dependencies. Within this talk we'll show how to combine MRFs with deep learning algorithms to estimate complex representations while taking into account the dependencies between the output variables. Towards this goal, we propose a training algorithm that is able to learn structured models jointly with deep features that form MRF potentials. In addition, considering the inference task, we'll present an algorithm which efficiently extracts the globally optimal solution of the commonly employed LP relaxation. The discussed algorithm finds the steepest descending direction from the set of epsilon-subdifferentials. We illustrate how the Frank-Wolfe procedure can be used to parallelize the resulting procedure.

Raquel Urtasun, University of Toronto

Hau-tieng Wu, University of Toronto
Structure massive data by graph connection Laplacian and its application
Spectral methods like graph Laplacian and diffusion maps in data analysis are based on eigenvectors and eigenvalues of graph Laplacians. Recently, we introduced a generalized framework called the graph connection Laplacian (GCL) to structure the massive dataset. In this talk, we show that under the principle bundle framework, the eigenvectors and eigenvalues of the GCL converge to the connection Laplacian of the associated vector bundle in the limit of infinitely many i.i.d. random samples. We also show how the noise impacts GCL and why this framework is numerically scalable. We will apply this framework to the manifold learning and ptychography imaging problem (a special phase retrieval problem aiming to obtain an atomic level resolution of a macro object) to show how it works. This is a joint work with Noureddine El Karoui, Stefano Marchesini, Yu-Chao Tu and Amit Singer.

Richard Zemel, University of Toronto
Learning Rich But Fair Representations

Information systems are becoming increasingly reliant on statistical inference and learning to render all sorts of decisions, including the issuing of bank loans, the targeting of advertising, and the provision of health care. This growing use of automated decision-making has sparked heated debate among philosophers, policy-makers, and lawyers, with critics voicing concerns with bias and discrimination. Bias against some specific groups may be ameliorated by attempting to make the automated decision-maker blind to some attributes, but this is difficult, as many attributes may be correlated with the particular one. The basic aim then is to make fair decisions, i.e., ones that are not unduly biased for or against specific subgroups in the population. We formulate fairness as an optimization problem of finding a good representation of the data with two competing goals: to encode the data as well as possible, while simultaneously obfuscating any information about membership in the specific group. This is a computationally challenging objective, with links to several problems and approaches, including anonymity and the information bottleneck. I will present models we have developed towards this goal, and show that they allow trade-offs between the system desiderata. I will also describe a direction we are currently exploring to use the same underlying computational model to learn unbiased and rich nonlinear representations.