Samy Bengio, Google
The Battle Against the Long Tail
The curse of the long tail, where a small number of objects/words/classes
appear very often and are thus easy to model, while many many more are rare
and thus hard to model, has always been a problem in machine learning. This
presentation starts by stating the problem and explaining why representation
sharing in general, and embedding approaches in particular, are good to represent
tail objects. Several embedding approaches are then presented, in increasing
levels of complexity, to show how to tackle the long tail problem, from rare
classes to unseen classes in image classification, as well as rare words in
speech recognition; Finally, we present our latest results on image description,
which can be seen as an ultimate rare class problem since each image is attributed
to a novel, yet structured, class in the form of a meaningful sentence.
Yoshua Bengio, Université de Montréal
Exploring alternatives to Boltzmann machines
Boltzmann machines and their variants (restricted or deep) have been the dominant
model for generative neural network models for a long time and they are appealing
among other things because of their relative biological plausibility (say,
compared to back-prop). We start this presentation by discussing some of the
difficulties we encountered in training them, and undirected graphical models
in general, and ask the question of the existence of credible alternatives
which avoid the issues with the partition function gradient and mixing between
modes with MCMC methods. We review advances of recent years to train deep
unsupervised models that capture the data distribution, all related to auto-encoders,
and that avoid the partition function and MCMC issues. In particular recent
theoretical and empirical work on denoising auto-encoders can be extended
to train deep generative models with latent variables. The presentation will
end with an introduction to on-going exploration of a framework that is meant
to be more biologically plausible as well as an alternative to both Boltzmann
machines and back-propagation.
David Blei, Princeton University
Probabilistic Topic Models and User Behavior
Traditional topic modeling algorithms analyze a document collection and estimate
its latent thematic structure. However, many collections contain an additional
type of data: how people use the documents. For example, readers click on
articles in a newspaper website, scientists place articles in their personal
libraries, and lawmakers vote on a collection of bills. Behavior data is essential
both for making predictions about users (such as for a recommendation system)
and for understanding how a collection and its users are organized.
In this talk, I will review the basics of topic modeling and describe our
recent research on collaborative topic models, models that simultaneously
analyze a collection of texts and its corresponding user behavior. We studied
collaborative topic models on 80,000 scientists' libraries from Mendeley and
100,000 users' click data from the arXiv. Collaborative topic models enable
interpretable recommendation systems, capturing scientists' preferences and
pointing them to articles of interest. Further, these models can organize
the articles according to the discovered patterns of readership. For example,
we can identify articles that are important within a field and articles that
transcend disciplinary boundaries.
Yura Burda, Fields Institute
Raising the Reliability of Estimates of Generative Performance of MRFs
Markov Random Fields are powerful and expressive models used in multiple tasks
in computer vision and machine learning. They can encode complicated multidimensional
distributions, like the distribution of pixel intensities in pictures of handwritten
digits or characters. One cannot however evaluate the generative performance
of these models, since exact computation of data log-likelihood is intractable
in most cases and approximate methods, such as AIS, tend to provide overly
optimistic estimates of the performance.
This talk will introduce the RAISE procedure for replacing a
given MRF with a directed graphical model that can approximate the original
MRF arbitrarily well. Being a directed model, this approximation allows one
to draw exact samples from it and compute stochastic lower bounds on the data
log-likelihood. This tractable approximation can be then used for model selection
and model comparisons. As an example, it has been suggested that previously
published Deep Belief Networks provide state-of-the-art generative models
of the MNIST dataset of handwritten digits. Using the RAISE procedure we can
now be sure that they are
Brendan Frey, University of Toronto
The infinite genome project: Using statistical induction to understand
the genome and improve human health
A common $1-billion misconception in the genomics community is that sequencing
a lot of genomes will be necessary for understanding and curing disease. In
my talk, I'll explain how statistical induction using a single genome can
be used to identify disease mutations more accurately than projects that have
involved thousands of genomes. I'll describe my group's research on developing
a good statistical model that can generalize across genomes and make accurate
predictions for mutations that have never been seen before (Xiong et al, Science,
January 9, 2015). Another workshop speaker, Cynthia Rudin, sensibly says "possibly
the most important obstacle in the deployment of predictive models is the
fact that humans simply do not trust them." However, my experience with
high-stakes tasks leads me to the following paraphrase: "possibly the
most important advantage of machine learning in science and medicine is the
fact that humans simply do not trust other humans."
Roger Grosse, University of Toronto
Scaling up natural gradient by factorizing Fisher information
Training high-dimensional probabilistic models with natural gradient descent
is difficult, because it requires solving linear systems involving a large
Fisher information matrix. Approximations typically use efficiently invertible
but potentially inaccurate approximations (e.g. diagonal) to the Fisher information,
or require expensive iterative procedures to approximately solve the linear
system. I'll present an algorithm called Factorized Natural Gradient (FaNG),
where the Fisher information matrix is approximated with a Gaussian graphical
model whose precision matrix can be computed efficiently. This approximation
is expressive, yet allows for efficient updates. I analyze the Fisher information
matrix for a small RBM and derive a graphical model which is a good match
for the distribution over sufficient statistics. Using FaNG with this structure
allows RBMs to be trained more efficiently compared with stochastic gradient
descent. Additionally, this analysis helps explain the surprisingly good performance
of the "centering trick."
John Langford, Microsoft Research
Learning to explore
In medicine, randomized clinical trials are used to determine treatments rather
than machine learning. Here, I'll talk about why and about how to best combine
machine learning with randomized exploration. The solution has very broad
implications for many applications of interactive learning such as search,
recommendation, and resource allocation where (critically) feedback is available
for choices taken, but not those not taken.
Radford Neal, University of Toronto
Learning to Randomize and Remember in Partially-Observed Environments
Reinforcement learning methods are well developed for contexts where the complete
state of the world is observed. However, most realistic learning problems
for humans, animals, and robots require that we learn to act based on observations
that provide only partial information about the state of the world. Remembering
our past observations and actions can then be beneficial, but if we have limited
memory and computation, we need to learn what to remember. Furthermore, if
we do not remember everything, it can be beneficial to learn to act randomly.
I will present new reinforcement learning methods for this context, and discuss
how they might be extended to large-scale learning problems.
Joelle Pineau, McGill University
Practical kernel-based reinforcement learning
Kernel-based reinforcement learning (KBRL) stands out among reinforcement
learning algorithms for its strong theoretical guarantees. Unfortunately,
the model constructed by KBRL grows with the number of sample transitions,
resulting in a computational cost that precludes its application to large-scale
domains. In this talk I will present new results showing how to turn KBRL
into practical reinforcement learning methods for large datasets by leveraging
the stochastic factorization trick. I will discuss both offline and online
variants of the kernel-based stochastic factorization (KBSF) algorithm, and
describe upper bounds for the distance between the value functions computed
by KBRL and KBSF using the same data. Finally, I will illustrate the potential
of this method with several empirical results, including domains from robotics
and optimal treatment design.
Daniel Roy, University of Toronto
Mondrian Forests: Efficient Online Random Forests
Ensembles of randomized decision trees, usually referred to as random forests,
are widely used for classification and regression tasks in machine learning
and statistics. Random forests achieve competitive predictive performance
and are computationally efficient to train and test, making them excellent
candidates for real-world prediction tasks. The most popular random forest
variants (such as Breiman's random forest and extremely randomized trees)
operate on batches of training data. Online methods are now in greater demand.
Existing online random forests, however, require more training data than their
batch counterpart to achieve comparable predictive performance. In this work,
we use Mondrian processes (Roy and Teh, 2009) to construct ensembles of random
decision trees we call Mondrian forests. Mondrian forests can be grown in
an incremental/online fashion and remarkably, the distribution of online Mondrian
forests is the same as that of batch Mondrian forests. Mondrian forests achieve
competitive predictive performance comparable with existing online random
forests and periodically re-trained batch random forests, while being more
than an order of magnitude faster, thus representing a better computation
vs accuracy tradeoff.
Cynthia Rudin, MIT CSAIL and Sloan School of Management
Thoughts on Interpretable Machine Learning
Possibly *the* most important obstacle in the deployment of predictive models
is the fact that humans simply do not trust them. If it is known exactly which
variables were important for the prediction and how they were combined, this
information can be very powerful in helping to convince people to believe
(or not believe) the prediction and make the right decision. In this talk
I will discuss algorithms for making these non-black box predictions including:
1) "Bayesian Rule Lists" - This algorithm builds a decision list
using a probabilistic model over permutations of IF-THEN rules. It competes
with the CART algorithm for building accurate-yet-interpretable logical models.
It is not a greedy algorithm like CART.
2) "Falling Rule Lists" - These are decision lists where the probabilities
decrease monotonically along the list. These are really useful for medical
applications because they stratify patients into risk categories from highest
to lowest risk.
3) "Bayesian Or's of And's" - These are disjunctions of conjunction
models (disjunctive normal forms). These models are natural for modeling customer
preferences in marketing.
4) "The Bayesian Case Model" - This is a case-based reasoning clustering
method. It provides a prototypical exemplar from each cluster along with the
subspace that is important for the cluster.
I will show lots of applications of these models to healthcare and marketing.
Building Interpretable Classifiers with Rules using Bayesian Analysis: Building
a Better Stroke Prediction Model http://web.mit.edu/rudin/www/LethamRuMcMa14.pdf
This is joint work with my student Ben Letham and colleagues Tyler McCormick
and David Madigan. This paper was the winner of Data Mining Best Student Paper
Award, INFORMS 2013. It also won the Student Paper Competition sponsored by
the Statistical Learning and Data Mining section (SLDM) of the American Statistical
Falling Rule Lists.
This is joint work with my student Fulton Wang.
Proceedings of Artificial Intelligence and Statistics (AISTATS), 2015 This
paper won the Student Paper Competition sponsored by the Statistical Learning
and Data Mining section (SLDM) of the American Statistical Association, 2015.
Bayesian Or's of And's for Interpretable Classification with Application
to Context Aware Recommender Systems. http://web.mit.edu/rudin/www/TechReportBOA_WangEtAl.pdf
This is joint work with my student Tong Wang, colleague Finale Doshi, and
also Yimin Liu, Erica Klampfl and Perry MacNeille from Ford Motors.
The Bayesian Case Model: A Generative Approach for Case-Based Reasoning and
Proceedings of NIPS, 2014.
This is joint work with student Been Kim and colleague Julie Shah.
Other links are on my webpage under "Interpretable Predictive Models"
Alexander Schwing, The Fields Institute
Deep Learning meets Structured Prediction
Many problems in real-world applications involve predicting several random
variables which are statistically related. Markov random fields (MRFs) are
a great mathematical tool to encode such dependencies. Within this talk we'll
show how to combine MRFs with deep learning algorithms to estimate complex
representations while taking into account the dependencies between the output
variables. Towards this goal, we propose a training algorithm that is able
to learn structured models jointly with deep features that form MRF potentials.
In addition, considering the inference task, we'll present an algorithm which
efficiently extracts the globally optimal solution of the commonly employed
LP relaxation. The discussed algorithm finds the steepest descending direction
from the set of epsilon-subdifferentials. We illustrate how the Frank-Wolfe
procedure can be used to parallelize the resulting procedure.
Raquel Urtasun, University of Toronto
Hau-tieng Wu, University of Toronto
Structure massive data by graph connection Laplacian and its application
Spectral methods like graph Laplacian and diffusion maps in data analysis
are based on eigenvectors and eigenvalues of graph Laplacians. Recently, we
introduced a generalized framework called the graph connection Laplacian (GCL)
to structure the massive dataset. In this talk, we show that under the principle
bundle framework, the eigenvectors and eigenvalues of the GCL converge to
the connection Laplacian of the associated vector bundle in the limit of infinitely
many i.i.d. random samples. We also show how the noise impacts GCL and why
this framework is numerically scalable. We will apply this framework to the
manifold learning and ptychography imaging problem (a special phase retrieval
problem aiming to obtain an atomic level resolution of a macro object) to
show how it works. This is a joint work with Noureddine El Karoui, Stefano
Marchesini, Yu-Chao Tu and Amit Singer.
Richard Zemel, University of Toronto
Learning Rich But Fair Representations
Information systems are becoming increasingly reliant on statistical inference
and learning to render all sorts of decisions, including the issuing of bank
loans, the targeting of advertising, and the provision of health care. This
growing use of automated decision-making has sparked heated debate among philosophers,
policy-makers, and lawyers, with critics voicing concerns with bias and discrimination.
Bias against some specific groups may be ameliorated by attempting to make
the automated decision-maker blind to some attributes, but this is difficult,
as many attributes may be correlated with the particular one. The basic aim
then is to make fair decisions, i.e., ones that are not unduly biased for
or against specific subgroups in the population. We formulate fairness as
an optimization problem of finding a good representation of the data with
two competing goals: to encode the data as well as possible, while simultaneously
obfuscating any information about membership in the specific group. This is
a computationally challenging objective, with links to several problems and
approaches, including anonymity and the information bottleneck. I will present
models we have developed towards this goal, and show that they allow trade-offs
between the system desiderata. I will also describe a direction we are currently
exploring to use the same underlying computational model to learn unbiased
and rich nonlinear representations.