Ejaz Ahmed, Brock University
Big Data Analysis: The Universe is not Sparse
In high-dimensional statistical settings where the number of variables is
greater than the number of observations, or where the number of variables increases with
the sample size, many penalized regularization strategies have been investigated
for simultaneous variable selection and post-estimation. A model at hand
may have sparse signals as well as a number of weak signals. In this scenario
aggressive variable selection procedures may not clearly distinguish predictors
with weak signals and sparse signals. The prediction based on a selected
submodel may not be preferable in such cases. For this reason, we propose
a high-dimensional shrinkage estimation strategy to improve the prediction
performance of a submodel. Such a high-dimensional shrinkage estimator (HDSE)
is constructed by shrinking a ridge estimator in the direction of a candidate
submodel. We demonstrate that the proposed HDSE performs uniformly better
than the ridge estimator. Interestingly, it improves the prediction performance
of a given candidate submodel generated from most existing variable selection
methods. The relative performance of the proposed HDSE strategy is appraised
by both simulation studies and a real data analysis.
Naomi Altman, The Pennsylvania State University
Generalizing Principal Components Analysis
Dimension reduction and feature selection methodologies are key to reducing
the complexity and volume of data while suppressing noise and retaining
informative dimensions and features. In this talk I discuss the attributes
of principal components analysis (PCA) that make it useful for the analysis
of high dimensional data. In elliptical families, many of the useful attributes
coincide, but this is not the case for other types of data. Hence, generalizations
need to be tailored to the particular use.
Robert Bell, AT&T
Big Data: It's Not the Data
The ability to collect ever larger and more complex data sets raises the
potential, often overhyped, for great breakthroughs in many application
areas. But big data per se cannot produce new insights. More of the wrong
data for the question at hand is simply a bigger headache. Reaping the promised
rewards requires careful analysis and solving challenges of a new class
of large and complex models. I will introduce some themes that I expect
to run throughout the workshop and illustrate them using examples from the
Netflix Prize competition.
David Buckeridge, McGill University
Using (Big) Data to Address Challenges in Public Health
A critical lack of population health information poses a barrier
to effective public health practice. Novel sources of data can help to address
information needs in practice, but these data must be combined appropriately
with existing data sources and considered in the context of existing knowledge.
In this presentation, I will present examples of how novel data sources are
being used to address pressing public health problems in the prevention and
control of chronic and infectious diseases.
Hugh Chipman, Acadia University
An overview of Statistical Learning
Like machine learning, the field of statistical learning seeks to "learn
from data". I will review some of the central ideas that identify statistical
learning, including regularization and the bias-variance trade-off, resampling
methods such as cross-validation for selecting the amount of regularization,
the role of a probability model for data, and quantification of uncertainty.
These ideas will be discussed in the context of popular recent approaches
to supervised and unsupervised learning.
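As a rough illustration of one idea named in this abstract, cross-validation for selecting the amount of regularization, the following Python sketch may help. It is not taken from the talk: the one-predictor ridge model, the simulated data, the penalty grid, and the function names ridge_slope and cv_error are all assumptions made for the example.

```python
import random

def ridge_slope(xs, ys, lam):
    """One-predictor ridge fit without intercept:
    minimizes sum((y - b*x)^2) + lam * b^2 over b."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def cv_error(xs, ys, lam, k=5):
    """Mean squared prediction error estimated by k-fold cross-validation."""
    n = len(xs)
    total = 0.0
    for fold in range(k):
        train = [i for i in range(n) if i % k != fold]
        test = [i for i in range(n) if i % k == fold]
        b = ridge_slope([xs[i] for i in train], [ys[i] for i in train], lam)
        total += sum((ys[i] - b * xs[i]) ** 2 for i in test)
    return total / n

# Hypothetical data: a weak signal (slope 0.3) buried in unit-variance noise.
rng = random.Random(0)
xs = [rng.gauss(0, 1) for _ in range(40)]
ys = [0.3 * x + rng.gauss(0, 1) for x in xs]

# Candidate penalties (an arbitrary grid chosen for illustration).
grid = [0.0, 0.1, 1.0, 10.0, 100.0]
errors = {lam: cv_error(xs, ys, lam) for lam in grid}
best_lam = min(errors, key=errors.get)
print("selected penalty:", best_lam)
```

Larger penalties shrink the fitted slope toward zero, trading variance for bias; the held-out error is what arbitrates between them.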
Sheelagh Carpendale, University of Calgary
Information Visualization: Making Data Accessible
Information Visualization has been defined by Card, Mackinlay and
Shneiderman as “The use of computer-supported, interactive, visual representations
of abstract data to amplify cognition”. This rather ambitious definition
has fueled burgeoning interest in and hope for the potential value of information
visualization. This interest has further been amplified by the growing deluge
of data, triggering questions about whether Information Visualization can
be part of the key to unlocking the possible benefits of big data. In this
talk I will discuss successes and challenges of current research in Information
Visualization.
Dianne Cook, Iowa State University
Data Visualization and Statistical Graphics in Big Data Analysis
In this information age we are drowning in data. Good data visualisation
helps us to swim, digest the data, and learn about our world. In statistical
data visualisation, plots are designed to support and enrich the statistical
processes of data exploration, modeling, and inference. As a result, statistical
data visualisation has some unique features that differentiate it from visualisations
made in other fields. Statisticians are always concerned with variability
in observations and error in measurements, both of which cause uncertainty
about conclusions drawn from data. Dealing with this uncertainty is at the
heart of classical statistics, and statisticians have developed a huge body
of inferential methods that help to quantify uncertainty. Statistical data
graphics cover a spectrum of methods, from elegant static data visualisations
to highly interactive and dynamic graphics used for exploratory data analysis.
Technology also changes at a rapid pace today, and statistical graphics need
to be rethought to adapt.
Data visualization gives statistical analysis a competitive edge, helping
to build better models and make more accurate predictions; we'll show two examples
where groups have won data mining competitions based on using visualization
to better understand the data provided. Visualization enables exploration
and discovery of unexpected features in data, which have been traditionally
thought of as orthogonal to statistical inference. But they are not mutually
exclusive, and it is possible to do inference with data graphics, which
will be explained.
Data examples from settings like the tech boom and bust of the late 1990s,
food quality control, atmospheric CO2 levels and temperature changes, stimulus
fund spending, university ranking, PISA education data, labor market wages,
stock market trends, or soybean breeding for agribusiness, will be used.
Charmaine Dean, Western University
Wildfire and Forest Disease Prediction to Inform Forest Management:
Statistical Science Challenges
Wildfire is an important system process of the earth that occurs across
a wide range of spatial and temporal scales. A variety of methods have been
used to predict wildfire phenomena during the past century to better our
understanding of fire processes and to inform fire and land management decision-making.
Statistical methods have an important role in wildfire prediction due to
the inherent stochastic nature of fire phenomena at all scales. Predictive
models have exploited several sources of data describing fire phenomena.
Experimental data are scarce; observational data are dominated by statistics
compiled by government fire management agencies, primarily for administrative
purposes and increasingly from remote sensing observations. Fires are rare
events at many scales. The data describing fire phenomena can be zero-heavy
and nonstationary over both space and time. Users of fire modeling methodologies
are mainly fire management agencies, often working under great time constraints,
so complex models have to be estimated efficiently. This talk focuses
on providing an understanding of some of the information needed for fire
management decision-making and of the challenges involved in predicting
fire occurrence, growth and frequency at regional, national and global scales.
The talk also considers the use of novel techniques in forest ecology such
as joint outcome modeling of multivariate spatial data, where outcomes include
count as well as zero-inflated count data. The framework utilized for the
joint spatial count outcome analysis reflects that which is now commonly
employed for the joint analysis of longitudinal and survival data, termed
shared frailty models, in which the outcomes are linked through a shared
latent spatial random risk term. We discuss these types of joint mapping
models and consider the benefits achieved and challenges of such joint modeling
in a specific ecological study of Comandra blister rust infection of lodgepole
pine trees.
Chad Gaffield, University of Ottawa
Big Data vs. Human Complexity: An early status report on the central
question of the 21st century
How can the so-called data tsunami enhance knowledge and understanding
of human thought and behavior? This question has unexpectedly moved to
the top of agendas across businesses, governments and public and non-profit
institutions. In the new societal landscape of customer-driven marketplaces,
citizen-engaged politics, student-centered schools, and patient-oriented
health services, data on people's thoughts and actions are the new
foundation for policy and practice. As a result, research fields previously
considered as secondary or "soft" are fast becoming central
to campus-wide and community-connected initiatives to study people through
data analytics. From literary scholars to historians and visual artists,
researchers across the humanities and social sciences are now collaborating
with computer scientists, neuroscientists, geneticists and other researchers
in the natural and health sciences as well as engineering. The early results
are stunning and disappointing, insightful and confusing, encouraging
and frustrating. This presentation will offer an early status report on
successes and setbacks in order to propose possible next steps in meeting
the 21st century challenge of making a better future through enhanced
knowledge and understanding of human thought and behavior.
Yulia Gel, University of Waterloo
The Role of Modern Social Media Data in Surveillance and Prediction of
Infectious Diseases: from Time Series to Networks
The rampant growth of digital technologies has revolutionized the volume,
velocity and variety of available information on all aspects of our life,
from consumer behavior to health, and one of the key driving forces in this
Big Data paradigm is expansion and mobility of social media data. At the
same time, despite many promising approaches in modern surveillance methodology,
the lack of observations for near real-time forecasting still remains the
key challenge obstructing operational prediction and control of highly virulent
infectious disease dynamics. For instance, even CDC data for well-monitored
areas are two weeks behind, as it takes time to confirm influenza-like illness
(ILI) as flu, while two weeks is a substantial time in terms of flu transmission.
These limitations coupled with the new possibilities brought by the Big
Data paradigm have ignited the recent interest in searching for alternative
near real-time data sources on the current epidemic state and, in particular,
in the wealth of health-related information offered by modern social media.
For example, Google Flu Trends uses flu-related searches to predict a future
epidemiological state at a local level, and more recently, Twitter has also
proven to be a very valuable resource for a wide spectrum of public health
applications. In this talk we will review capabilities and limitations of
such social media data as an early warning indicator of influenza dynamics
in conjunction with traditional time series epidemiological models and with
more recent random network approaches accounting for heterogeneous social
interaction patterns.
Mark Girolami, Warwick University
Differential Geometric Simulation Methods for Uncertainty Quantification
in Large Scale PDE Systems
Characterising uncertainty, from a Bayesian perspective, in computer models
comprised of large scale and stiff systems of Partial Differential Equations
(PDE) becomes challenging when fine meshes and distributed parameters have
to be defined and inferred in an inverse problem setting. This talk presents
recent work which exploits the use of geodesic flows on the manifold of
statistical models defined by the PDE sensitivity equations to sample from
the desired Bayesian posterior distribution over all unknowns. The talk
will consider the role that deterministic approximations have to play in
this scheme and will illustrate the ideas presented by considering system
models of elliptic and parabolic PDEs.
Adam Tauman Kalai, Microsoft Inc
Machine learning and crowdsourcing
We explore various ways crowdsourcing can help machine learning, ranging
from simply labeling data to gathering/creating data to selecting features
and even to choosing the ultimate problem to solve.
Sallie Keller, Virginia Tech
Building Resilient Cities: Harnessing the Power of Urban Analytics
The city lies at the heart of modern life. By 2030, 5 billion people, or
about 60% of the world’s population, are projected to live in cities,
up from the current 3.6 billion people. With population concentrations and
growth at that level, the policies and programs established in the management
of urban and peri-urban areas will have enormous impact on the physical,
mental, social, and financial well being of humanity and the quality and
diversity of the environment on the planet. Every policy implemented in
the city and every activity undertaken, with rare exception, is directed
toward achieving at least one of the following four paramount goals: survival
and fostering healthy growth, addressing and prevailing in competition for
resources, adapting to subtle and major catastrophes and disruptions, and
creating and exploiting innovations and opportunities for the benefit of
the city and all inhabitants. In this context, resources are critical to
the outcomes. Data has become a new and important resource – a new
asset class. The presentation focuses on the capture of data streams and
the development of analytics to provide guidance (quantitative statistical
evidence) for urban decision-making, policies and programs.
Sham Kakade, Microsoft Inc.
Non-convex approaches to learning representations
Learning how to represent data (often done via unsupervised methods) is
a central challenge in modern data analysis settings, where the objective
is to model the interactions between multiple observations utilizing possible
hidden causes (or latent factors). Parameter estimation of most natural
latent variable models --- hidden Markov models, Gaussian mixture models,
latent Dirichlet allocation, and sparse coding --- is often phrased as (seemingly
intractable) non-convex optimization procedures. In practice, this leads
to the use of various search heuristics (like the EM algorithm), sampling
based approaches, or convex relaxations. While in some cases these methods
work well in practice, there is little understanding of the behavior of
these algorithms in theory, or if the problem of parameter estimation in
these models is fundamentally intractable.
This talk will summarize a series of works showing that in fact many of
these models are learnable (with polynomial time algorithms), based on natural
non-degeneracy assumptions. We will examine notions of statistical, computational,
and informational complexity (the latter being akin to the question of what
statistics one needs to examine in order to identify the model). We will
also briefly examine connections to the methods used for dictionary learning.
Finally, implications for representation learning and optimization in deep
learning will be discussed.
Eric D. Kolaczyk, Boston University
A Whirlwind Tour of Statistical Analysis of Network Data
Over the past 15+ years, the study of so-called "complex networks"
— that is, network-based representations of complex systems —
has taken the sciences by storm. Researchers from biology to physics, from
economics to mathematics, and from computer science to sociology, are more
and more involved with the collection, modeling and analysis of network-indexed
data. With this enthusiastic embrace of networks across the disciplines
comes a multitude of statistical challenges of all sorts — many of
them decidedly non-trivial. In this talk I will present a (very!) brief
overview of foundations common to the statistical analysis of network data
across the disciplines, from a statistical perspective, in the context of
topics like network summary and visualization, network sampling, network
modeling and inference, and network processes. Focus will be on highlighting
the curious mix of similarities and differences in statistical challenges
encountered with network problems, as compared to more traditional data.
Concepts will be illustrated drawing on examples from various domain areas,
including bioinformatics, computer network traffic analysis, neuroscience,
and social media.
Bo Li, University of Illinois at Urbana-Champaign
Reconstructing Past Temperatures using Short- and Long-memory Models
Understanding the dynamics of climate change in its full richness requires
the knowledge of long temperature time series. Although long-term, widely
distributed temperature observations are not available, there are other
forms of data, known as climate proxies, that can have a statistical relationship
with temperatures and have been used to infer temperatures in the past before
direct measurements. We propose a Bayesian hierarchical model (BHM) to reconstruct
past temperatures that integrates information from different sources, such
as proxies with different temporal resolution and forcings acting as the
external drivers of large scale temperature evolution. The reconstruction
method is assessed, using a global climate model as the true climate system
and with synthetic proxy data derived from the simulation. Then we apply
the BHM to real datasets and produce new reconstructions of Northern Hemisphere
annually averaged temperature anomalies back to 1000 AD. We also explore
the effects of including external climate forcings within the reconstruction
and of accounting for short-memory and long-memory features. Finally, we
use posterior samples of model parameters to arrive at an estimate of the
transient climate response to greenhouse gas forcings of 2.5°C (95% credible
interval of [2.16, 2.92]°C), which is on the high end of, but consistent
with, the expert-assessment-based uncertainties given in the recent Fifth
Assessment Report of the IPCC.
Han Liu (presented by Ethan X. Fang), Princeton University
Testing and Confidence Intervals for High Dimensional Proportional
Hazards Model
We propose a decorrelation-based approach to test hypotheses and
construct confidence intervals for the low dimensional component of high dimensional
proportional hazards models. Motivated by the geometric projection principle,
we propose new decorrelated score, Wald and partial likelihood ratio statistics.
Without assuming model selection consistency, we prove the asymptotic normality
of these test statistics, establish their semiparametric optimality, and conduct
power analysis under Pitman alternatives. In addition, we also develop new
procedures for constructing pointwise confidence intervals for the baseline
hazard function and baseline survival function.
Lisa Lix, University of Manitoba
Chronic Disease Research and Surveillance: The Power of Big Databases,
the Challenges of Data Quality
Electronic health databases, including administrative health data, electronic
medical record data, and clinical registries, are used extensively in Canada
to conduct policy-relevant chronic disease research and surveillance. The
popularity of these databases has arisen because they are available in a
timely manner, contain information about large numbers of individuals, and
are relatively inexpensive to access and use. However, the quality of these
databases for research and surveillance has been questioned, and has led
to multiple studies on this topic, most of which have focused on the validity
of disease diagnostic information. In this talk, I will present a comprehensive
data quality framework, describe different statistical methods for evaluating
data quality, and illustrate these techniques with numeric examples.
Richard Lockhart, Simon Fraser University
Inference after LASSO -- limits and limitations
I will survey some work with Jon Taylor, Ryan Tibshirani and Rob Tibshirani
on inference for the parameters when LASSO or LARS is used to fit a linear
regression with many parameters. Under strong sparseness assumptions we
have both asymptotic distribution theory and exact finite sample theory
for those coefficients picked by some model selection procedures. One concern
is power; I will discuss some preliminary ideas about what is possible and
try to show how LeCam's contiguity ideas can be relevant at unusual distances
from null hypotheses in high dimensions.
Sofia Olhede, University College London
Understanding Large Networks
Networks have become pervasive in practical applications. Understanding
large networks is hard, especially because a number of typical features
present in such observations create technical analysis challenges.
I will discuss some basic network models that are tractable for analysis,
what sampling properties they can reproduce, and some results relating to
their inference.
C. Shane Reese, Brigham Young University
From Basis Expansions to Insurgency Prediction: Applications of Bayesian
Compressive Sensing
Characterization of data cardinality is no longer limited to numbers of
rows and columns as the data streams are complex and arrays are (usually)
insufficient to capture the complexity. As data complexity grows with the
size of the data, statistical methods must evolve to accommodate the richness
of the data as well as the demands for computation in (near) real time.
We present several applications of emerging statistical methods to data
of moderate complexity and size. The first application involves an infinite
near basis expansion of an intractable physics phenomenon with important
socio-economic implications in the formation of alloys. The second application
illustrates the simultaneous use of model selection and model estimation
techniques for the assessment of whether insurgencies in South American
nations can be predicted from social media activity. While both applications
are of moderate size, the computational efficiency of the algorithms and
data analytic tools illustrate the feasibility of application to truly big
data.
Ruslan Salakhutdinov, University of Toronto
Recent Advances in Deep Learning: Learning Structured, Robust, and Multimodal
Models
Building intelligent systems that are capable of extracting meaningful representations
from high-dimensional data lies at the core of solving many Artificial Intelligence
tasks, including visual object recognition, information retrieval, speech
perception, and language understanding.
In this talk I will first introduce a broad class of deep learning models
and show that they can learn useful hierarchical representations from large
volumes of high-dimensional data with applications in information retrieval,
object recognition, and speech perception. I will then describe a new class
of more complex models that combine deep learning models with structured
hierarchical Bayesian models and show how these models can learn a deep
hierarchical structure for sharing knowledge across hundreds of visual categories.
Finally, I will introduce deep models that are capable of extracting a unified
representation that fuses together multiple data modalities as well as discuss
models that can generate natural language descriptions of images. I will
show that on several tasks, including modelling images and text, video and
sound, these models significantly improve upon many of the existing techniques.
Alexandra M. Schmidt, IM-UFRJ, Brazil
An overview of covariance structures for spatial and spatio-temporal processes
An important aspect of statistical modeling of spatial or spatiotemporal
data is to determine the covariance function. It is a key part of spatial
prediction. The classical geostatistical approach uses an assumption of
isotropy, which yields circular isocorrelation curves. In the analysis of
most spatio-temporal processes underlying environmental studies there is
little reason to expect spatial covariance structures to be isotropic. This
is because there may be local influences in the correlation structure of
the spatial random process. Adding the temporal aspect, there is often interaction
between time and space, requiring classes of nonseparable covariance structures.
In this talk we will review some of the alternatives to the usual isotropic
models that have been proposed in the last 20 years. We will also discuss
some alternatives for massive datasets and models for multivariate and non-Gaussian
processes.
Dale Schuurmans, University of Alberta
Convex Methods for Latent Representation Learning
Automated feature discovery is a fundamental problem in data analysis.
Although classical feature learning methods fail to guarantee optimal solutions
in general, convex reformulations have been developed for a number of such
problems. Most of these reformulations are based on one of two key strategies:
relaxing pairwise representations, or exploiting induced matrix norms. Despite
their use of relaxation, convex reformulations can demonstrate improvements
in solution quality by eliminating local minima. I will discuss a few recent
convex reformulations for representative learning problems, including robust
regression, hidden-layer network training, and multi-view learning --- demonstrating
how latent representation discovery can co-occur with parameter optimization
while admitting globally optimal relaxed solutions. In some cases, meaningful
rounding guarantees can also be achieved.
Steve Scott, Google Inc
Bayes and Big Data: The Consensus Monte Carlo Algorithm
A useful definition of "big data" is data that is too big to comfortably
process on a single machine, either because of processor, memory, or disk
bottlenecks. Graphics processing units can alleviate the processor bottleneck,
but memory or disk bottlenecks can only be eliminated by splitting data
across multiple machines. Communication between large numbers of machines
is expensive (regardless of the amount of data being communicated), so there
is a need for algorithms that perform distributed approximate Bayesian analyses
with minimal communication. Consensus Monte Carlo operates by running a
separate Monte Carlo algorithm on each machine, and then averaging individual
Monte Carlo draws across machines. Depending on the model, the resulting
draws can be nearly indistinguishable from the draws that would have been
obtained by running a single machine algorithm for a very long time. Examples
of consensus Monte Carlo are shown for simple models where single-machine
solutions are available, for large single-layer hierarchical models, and
for Bayesian additive regression trees (BART).
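The averaging step described in this abstract can be sketched in a few lines of Python. This is only a toy under stated assumptions, not the speaker's implementation: the model is a Gaussian mean with known standard deviation and a flat prior (so each shard's posterior can be sampled directly), the "machines" are simulated in one process, and the draws are combined with precision weights (inverse sample variance of each shard's draws), which is one weighting choice assumed here rather than stated in the abstract. The function names are hypothetical.

```python
import random
import statistics

def shard_posterior_draws(shard, sigma, n_draws, rng):
    """Posterior draws for a Gaussian mean from one shard alone
    (flat prior, known sigma): N(shard mean, sigma^2 / shard size)."""
    mean = statistics.fmean(shard)
    sd = sigma / len(shard) ** 0.5
    return [rng.gauss(mean, sd) for _ in range(n_draws)]

def consensus_draws(per_shard_draws):
    """Average the g-th draw across shards, weighting each shard by the
    inverse sample variance of its draws (an assumed weighting scheme)."""
    weights = [1.0 / statistics.variance(d) for d in per_shard_draws]
    total = sum(weights)
    n_draws = len(per_shard_draws[0])
    return [
        sum(w * d[g] for w, d in zip(weights, per_shard_draws)) / total
        for g in range(n_draws)
    ]

rng = random.Random(0)
sigma = 2.0
data = [rng.gauss(1.5, sigma) for _ in range(4000)]

# Split the data across four simulated machines.
shards = [data[i::4] for i in range(4)]

# Each machine samples independently; no communication until the end.
draws = [shard_posterior_draws(s, sigma, 2000, rng) for s in shards]
combined = consensus_draws(draws)

print("consensus posterior mean:", statistics.fmean(combined))
print("full-data sample mean:  ", statistics.fmean(data))
```

For this Gaussian toy the combined draws match the full-data posterior closely; for non-Gaussian posteriors the averaging is only approximate, which is the "depending on the model" caveat in the abstract.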
Therese A Stukel, ICES
Innovative uses of big data for health policy research
Data in health care has been growing rapidly. ICES Ontario holds one
of the largest repositories of health administrative data, from physician
and hospital records to disease registries and clinical data from electronic
medical records. We illustrate innovative linkages of these data to produce
novel applications for health policy. These range from creation of Health
Service Region boundaries and multispecialty physician networks to online
calculators to predict life expectancy and hospital days.
Stephen Vavasis, University of Waterloo
Clique and Biclique: An example of using convex optimization for data mining
The clique and biclique problems are examples of data mining problems
that fall under the general heading of clustering. We explain how to
describe these problems as nonconvex optimization, how to convexify them,
what guarantees hold for the convexification, and how to solve the convex
versions. Parts of this talk represent joint work with B. Ames. In addition,
part of the talk will survey a paper by Chen and Xu.
Martin Wainwright, University of California, Berkeley
Statistics meets optimization: Rigorous guarantees for solving nonconvex
programs
In this talk, we provide an overview of an evolving line of work on rigorous
results for different types of nonconvex optimization problems that arise
in statistical machine learning. Nonconvex optimization is known to be intractable
in the general setting, but the problems arising in the statistical setting
are not adversarially constructed, but rather related to a population oracle
problem that can be well-behaved. We discuss how this intuition can be
made precise for a number of different problems, including high-dimensional
sparse regression with nonconvex regularization, as well as various applications
of the EM algorithm to latent variable models with nonconvex likelihoods.
Patrick J. Wolfe, University College London
Statistics meets optimization: Rigorous guarantees for solving nonconvex
programs
(joint work with Sharmodeep Bhattacharyya and Peter J. Bickel)
Exchangeable network models provide a general nonparametric class
of models for unlabeled random graphs, based on a latent variable formulation.
Typically these models are approximated by fitting stochastic blockmodels,
and the corresponding latent variable densities are estimated using piecewise-constant
or histogram-type estimators derived from the fitted blockmodel. We show that
if the latent variable density has certain smoothness properties and the fitted
blockmodel satisfies certain consistency properties, then it is possible to
obtain consistent estimators for the equivalence class of latent variable
densities under suitable metrics. We derive rates of convergence for a fitted
blockmodel to estimate the latent variable density, depending on the smoothness
properties of the density as well as the error rate of the method of blockmodel
fitting. We also propose a cross-validation method using count statistics
and their smooth functions to regularize the fitted blockmodel, and to choose
the number of blocks to approximate the latent variable density. The cross-validation
method provides us with a procedure to regularize the fitted blockmodel, independently
of the algorithm used to fit the blockmodel itself.
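To make the piecewise-constant (histogram-type) estimator concrete, here is a small Python sketch. It assumes the block labels have already been produced by some blockmodel-fitting method, which is not shown, and simply computes the observed edge density within and between blocks. The function name block_histogram and the toy graph are hypothetical and not from the talk.

```python
from itertools import combinations

def block_histogram(adj, labels, k):
    """Edge density for every pair of blocks, given a symmetric 0/1
    adjacency matrix and a block label in range(k) for each node."""
    edges = [[0] * k for _ in range(k)]
    pairs = [[0] * k for _ in range(k)]
    for i, j in combinations(range(len(adj)), 2):
        a, b = sorted((labels[i], labels[j]))
        pairs[a][b] += 1
        edges[a][b] += adj[i][j]
    hist = [[0.0] * k for _ in range(k)]
    for a in range(k):
        for b in range(a, k):
            if pairs[a][b]:
                # Symmetric fill: the graph is undirected.
                hist[a][b] = hist[b][a] = edges[a][b] / pairs[a][b]
    return hist

# Toy undirected graph on four nodes with edges 0-1, 2-3 and 0-2.
adj = [
    [0, 1, 1, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 1, 0],
]
labels = [0, 0, 1, 1]  # assumed output of a blockmodel fit
hist = block_histogram(adj, labels, 2)
print(hist)  # [[1.0, 0.25], [0.25, 1.0]]
```

The number of blocks k plays the role of a bin count, which is why the abstract treats its choice as a regularization question.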
Douglas G. Woolford, Wilfrid Laurier University
Exploratory data analysis, visualization and modelling methods for
large data in forest fire science
Exploratory data analysis uses visualization techniques to explore for
structure in data sets, as well as departures from such structure. It is
an important first step, prior to modelling and inference, when dealing
with environmental data sets which are commonly large, complex and spatio-temporal
in nature. For example, data on forest fire occurrences, fire-weather, and
cloud-to-ground lightning strikes are recorded as spatially-referenced time
series and such information is commonly recorded at a fine scale. For the
first part of this talk I will discuss several exploratory techniques for
visualizing large data in the context of forest fire science, including:
1) a data filtering technique used to reveal dominant data surfaces, track
the paths of storm centres, and visualize typical storm systems; 2) an interactive,
3D stalagmite plot for exploring patterns in slices of long time series;
and, 3) a singular value decomposition method that is useful for exploring
spatial patterns in areal data. In the context of predicting rare events
in large data, I will then discuss how sampling observations on a fine spatio-temporal
grid can create a smaller, more computationally manageable data set when
using a logistic generalized additive model to approximate the conditional
intensity function of an underlying point process. I will also illustrate
how a well-designed sampling scheme can lead to more efficient estimates
in such models.
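The sampling idea in the last part of this abstract can be illustrated with a deliberately simplified Python sketch. It is not the speaker's method: it assumes an intercept-only logistic model (no smooth terms), a scheme that keeps every event cell and each non-event cell with probability pi, and the standard case-control intercept correction of adding log(pi) to the fitted logit. All names and numbers are hypothetical.

```python
import math
import random

def logit(p):
    return math.log(p / (1 - p))

def inv_logit(z):
    return 1 / (1 + math.exp(-z))

rng = random.Random(1)

# Hypothetical fine grid: 500,000 space-time cells, events are rare.
true_p = 0.002
cells = [1 if rng.random() < true_p else 0 for _ in range(500_000)]

# Keep every event cell, but each non-event cell only with probability pi.
pi = 0.01
sample = [y for y in cells if y == 1 or rng.random() < pi]

# In the sample the events are heavily over-represented ...
p_sample = sum(sample) / len(sample)

# ... so shift the fitted logit by log(pi) to undo the known sampling rate.
p_corrected = inv_logit(logit(p_sample) + math.log(pi))

print("cells kept:", len(sample), "of", len(cells))
print("naive estimate:    ", p_sample)
print("corrected estimate:", p_corrected)
```

In a full model with covariates the same shift would typically enter as an offset term, so only the intercept is affected by the sampling.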