March 27, 2017


Thematic Program on Statistical Inference, Learning, and Models for Big Data, January to June, 2015

January 12 – 23, 2015
Opening Conference and Boot Camp

Speaker Abstracts


Ejaz Ahmed, Brock University
Big Data Analysis: The Universe is not Sparse

In high-dimensional statistical settings where the number of variables exceeds the number of observations, or where the number of variables increases with the sample size, many penalized regularization strategies have been investigated for simultaneous variable selection and post-estimation. A model at hand may have sparse signals as well as a number of weak signals. In this scenario, aggressive variable selection procedures may not clearly distinguish between predictors with weak signals and those with sparse signals. The prediction based on a selected submodel may not be preferable in such cases. For this reason, we propose a high-dimensional shrinkage estimation strategy to improve the prediction performance of a submodel. Such a high-dimensional shrinkage estimator (HDSE) is constructed by shrinking a ridge estimator in the direction of a candidate submodel. We demonstrate that the proposed HDSE performs uniformly better than the ridge estimator. Interestingly, it improves the prediction performance of a given candidate submodel generated from most existing variable selection methods. The relative performance of the proposed HDSE strategy is appraised by both simulation studies and real data analysis.

Naomi Altman, The Pennsylvania State University
Generalizing Principal Components Analysis

Dimension reduction and feature selection methodologies are key to reducing the complexity and volume of data while suppressing noise and retaining informative dimensions and features. In this talk I discuss the attributes of principal components analysis (PCA) that make it useful for the analysis of high dimensional data. In elliptical families, many of the useful attributes coincide, but this is not the case for other types of data. Hence, generalizations need to be tailored to the particular use.

Robert Bell, AT&T
Big Data: It's Not the Data

The ability to collect ever larger and more complex data sets raises the potential, often overhyped, for great breakthroughs in many application areas. But big data per se cannot produce new insights. More of the wrong data for the question at hand is simply a bigger headache. Reaping the promised rewards requires careful analysis and solving the challenges of a new class of large and complex models. I will introduce some themes that I expect to run throughout the workshop and illustrate them using examples from the Netflix Prize competition.

David Buckeridge, McGill University
Using (Big) Data to Address Challenges in Public Health

A critical lack of population health information poses a barrier to effective public health practice. Novel sources of data can help to address information needs in practice, but these data must be combined appropriately with existing data sources and considered in the context of existing knowledge. In this presentation, I will present examples of how novel data sources are being used to address pressing public health problems in the prevention and control of chronic and infectious diseases.

Hugh Chipman, Acadia University
An overview of Statistical Learning

Like machine learning, the field of statistical learning seeks to "learn from data". I will review some of the central ideas that identify statistical learning, including regularization and the bias-variance trade-off, resampling methods such as cross-validation for selecting the amount of regularization, the role of a probability model for data, and quantification of uncertainty. These ideas will be discussed in the context of popular recent approaches to supervised and unsupervised learning.
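
As an illustration only (not material from the talk), the idea of using cross-validation to select the amount of regularization can be sketched for the simplest possible case: ridge regression with a single predictor and no intercept, where the fit has a closed form. The function names, the simulated data, and the penalty grid below are all assumptions made for this sketch.

```python
import random


def ridge_fit(xs, ys, lam):
    """Closed-form ridge slope for one predictor and no intercept."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def cv_error(xs, ys, lam, k):
    """Mean squared prediction error of the ridge fit over k folds."""
    n = len(xs)
    total = 0.0
    for fold in range(k):
        # Fold membership: every k-th observation, starting at `fold`.
        test_idx = set(range(fold, n, k))
        train_x = [xs[i] for i in range(n) if i not in test_idx]
        train_y = [ys[i] for i in range(n) if i not in test_idx]
        beta = ridge_fit(train_x, train_y, lam)
        total += sum((ys[i] - beta * xs[i]) ** 2 for i in test_idx)
    return total / n


def select_lambda(xs, ys, grid, k=5):
    """Return the penalty in `grid` with the smallest cross-validated error."""
    return min(grid, key=lambda lam: cv_error(xs, ys, lam, k))


# Simulated data with a weak signal (assumed values, for illustration).
rng = random.Random(1)
xs = [rng.gauss(0.0, 1.0) for _ in range(40)]
ys = [0.3 * x + rng.gauss(0.0, 1.0) for x in xs]

grid = [0.0, 0.1, 1.0, 10.0, 100.0]
best = select_lambda(xs, ys, grid)
print("selected penalty:", best)
print("ridge slope at selected penalty:", ridge_fit(xs, ys, best))
```

Larger penalties shrink the slope toward zero (more bias, less variance); the cross-validated error is one way to choose where to sit on that trade-off.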

Sheelagh Carpendale, University of Calgary
Information Visualization: Making Data Accessible

Information Visualization has been defined by Card, Mackinlay and Shneiderman as “The use of computer-supported, interactive, visual representations of abstract data to amplify cognition”. This rather ambitious definition has fueled burgeoning interest in and hope for the potential value of information visualization. This interest has further been amplified by the growing deluge of data, triggering questions about whether Information Visualization can be part of the key to unlocking the possible benefits of big data. In this talk I will discuss successes and challenges of current research in Information Visualization.

Dianne Cook, Iowa State University
Data Visualization and Statistical Graphics in Big Data Analysis

In this information age we are drowning in data. Good data visualisation helps us to swim, digest the data, and learn about our world. In statistical data visualisation, plots are designed to support and enrich the statistical processes of data exploration, modeling, and inference. As a result, statistical data visualisation has some unique features that differentiate it from visualisations made in other fields. Statisticians are always concerned with variability in observations and error in measurements, both of which cause uncertainty about conclusions drawn from data. Dealing with this uncertainty is at the heart of classical statistics, and statisticians have developed a huge body of inferential methods that help to quantify uncertainty. Statistical data graphics cover a spectrum of methods, from elegant static data visualisations to highly interactive and dynamic graphics used for exploratory data analysis. Technology also changes at a rapid pace today, and statistical graphics need to be rethought to adapt.

Data visualization gives statistical analysis a competitive edge, helping to build better models and make more accurate predictions. We'll show two examples where groups have won data mining competitions by using visualization to better understand the data provided. Visualization enables exploration and discovery of unexpected features in data, activities that have traditionally been thought of as orthogonal to statistical inference. But they are not mutually exclusive, and it is possible to do inference with data graphics, as will be explained.

Data examples from settings like the tech boom and bust of the late 1990s, food quality control, atmospheric CO2 levels and temperature changes, stimulus fund spending, university ranking, PISA education data, labor market wages, stock market trends, or soybean breeding for agribusiness, will be used.

Charmaine Dean, Western University
Wildfire and Forest Disease Prediction to Inform Forest Management: Statistical Science Challenges

Wildfire is an important system process of the earth that occurs across a wide range of spatial and temporal scales. A variety of methods have been used to predict wildfire phenomena during the past century to better our understanding of fire processes and to inform fire and land management decision-making. Statistical methods have an important role in wildfire prediction due to the inherent stochastic nature of fire phenomena at all scales. Predictive models have exploited several sources of data describing fire phenomena. Experimental data are scarce; observational data are dominated by statistics compiled by government fire management agencies, primarily for administrative purposes and increasingly from remote sensing observations. Fires are rare events at many scales. The data describing fire phenomena can be zero-heavy and nonstationary over both space and time. Users of fire modeling methodologies are mainly fire management agencies, often working under great time constraints; thus, complex models have to be efficiently estimated. This talk focuses on providing an understanding of some of the information needed for fire management decision-making and of the challenges involved in predicting fire occurrence, growth and frequency at regional, national and global scales.

The talk also considers the use of novel techniques in forest ecology such as joint outcome modeling of multivariate spatial data, where outcomes include count as well as zero-inflated count data. The framework utilized for the joint spatial count outcome analysis reflects that which is now commonly employed for the joint analysis of longitudinal and survival data, termed shared frailty models, in which the outcomes are linked through a shared latent spatial random risk term. We discuss these types of joint mapping models and consider the benefits achieved and challenges of such joint modeling in a specific ecological study of Comandra blister rust infection of lodgepole pine trees.

Chad Gaffield, University of Ottawa
Big Data vs. Human Complexity: An early status report on the central question of the 21st century

How can the so-called data tsunami enhance knowledge and understanding of human thought and behavior? This question has unexpectedly moved to the top of agendas across businesses, governments and public and non-profit institutions. In the new societal landscape of customer-driven marketplaces, citizen-engaged politics, student-centered schools, and patient-oriented health services, data on people's thoughts and actions are the new foundation for policy and practice. As a result, research fields previously considered as secondary or "soft" are fast becoming central to campus-wide and community-connected initiatives to study people through data analytics. From literary scholars to historians and visual artists, researchers across the humanities and social sciences are now collaborating with computer scientists, neuroscientists, geneticists and other researchers in the natural and health sciences as well as engineering. The early results are stunning and disappointing, insightful and confusing, encouraging and frustrating. This presentation will offer an early status report on successes and setbacks in order to propose possible next steps in meeting the 21st century challenge of making a better future through enhanced knowledge and understanding of human thought and behavior.

Yulia Gel, University of Waterloo
The Role of Modern Social Media Data in Surveillance and Prediction of Infectious Diseases: from Time Series to Networks

The rampant growth of digital technologies has revolutionized the volume, velocity and variety of available information on all aspects of our life, from consumer behavior to health, and one of the key driving forces in this Big Data paradigm is the expansion and mobility of social media data. At the same time, despite many promising approaches in modern surveillance methodology, the lack of observations for near real-time forecasting still remains the key challenge obstructing operational prediction and control of highly virulent infectious disease dynamics. For instance, even CDC data for well-monitored areas are two weeks behind, as it takes time to confirm influenza-like illness (ILI) as flu, while two weeks is a substantial time in terms of flu transmission. These limitations, coupled with the new possibilities brought by the Big Data paradigm, have ignited the recent interest in searching for alternative near real-time data sources on the current epidemic state and, in particular, in the wealth of health-related information offered by modern social media. For example, Google Flu Trends uses flu-related searches to predict a future epidemiological state at a local level, and more recently, Twitter has also proven to be a very valuable resource for a wide spectrum of public health applications. In this talk we will review capabilities and limitations of such social media data as an early warning indicator of influenza dynamics, in conjunction with traditional time series epidemiological models and with more recent random network approaches accounting for heterogeneous social interaction patterns.

Mark Girolami, Warwick University
Differential Geometric Simulation Methods for Uncertainty Quantification in Large Scale PDE Systems

Characterising uncertainty, from a Bayesian perspective, in computer models comprised of large scale and stiff systems of Partial Differential Equations (PDE) becomes challenging when fine meshes and distributed parameters have to be defined and inferred in an inverse problem setting. This talk presents recent work which exploits the use of geodesic flows on the manifold of statistical models defined by the PDE sensitivity equations to sample from the desired Bayesian posterior distribution over all unknowns. The talk will consider the role that deterministic approximations have to play in this scheme and will illustrate the ideas presented by considering system models of elliptic and parabolic PDEs.

Adam Tauman Kalai, Microsoft Inc
Machine learning and crowdsourcing

We explore various ways crowdsourcing can help machine learning, ranging from simply labeling data to gathering/creating data to selecting features and even to choosing the ultimate problem to solve.

Sallie Keller, Virginia Tech
Building Resilient Cities: Harnessing the Power of Urban Analytics

The city lies at the heart of modern life. By 2030, 5 billion people, or about 60% of the world’s population, are projected to live in cities, up from the current 3.6 billion people. With population concentrations and growth at that level, the policies and programs established in the management of urban and peri-urban areas will have enormous impact on the physical, mental, social, and financial well-being of humanity and the quality and diversity of the environment on the planet. Every policy implemented in the city and every activity undertaken, with rare exception, is directed toward achieving at least one of the following four paramount goals: survival and fostering healthy growth, addressing and prevailing in competition for resources, adapting to subtle and major catastrophes and disruptions, and creating and exploiting innovations and opportunities for the benefit of the city and all inhabitants. In this context, resources are critical to the outcomes. Data has become a new and important resource – a new asset class. The presentation focuses on the capture of data streams and the development of analytics to provide guidance (quantitative statistical evidence) for urban decision-making, policies and programs.

Sham Kakade, Microsoft Inc.
Non-convex approaches to learning representations

Learning how to represent data (often done via unsupervised methods) is a central challenge in modern data analysis settings, where the objective is to model the interactions between multiple observations utilizing possible hidden causes (or latent factors). Parameter estimation of most natural latent variable models --- hidden Markov models, Gaussian mixture models, latent Dirichlet allocation, and sparse coding --- is often phrased as a (seemingly intractable) non-convex optimization problem. In practice, this leads to the use of various search heuristics (like the EM algorithm), sampling based approaches, or convex relaxations. While in some cases these methods work well in practice, there is little understanding of the behavior of these algorithms in theory, or of whether the problem of parameter estimation in these models is fundamentally intractable.

This talk will summarize a series of works showing that in fact many of these models are learnable (with polynomial time algorithms), based on natural non-degeneracy assumptions. We will examine notions of statistical, computational, and informational complexity (the latter being akin to the question of what statistics one needs to examine in order to identify the model). We will also briefly examine connections to the methods used for dictionary learning. Finally, implications for representation learning and optimization in deep learning will be discussed.

Eric D. Kolaczyk, Boston University
A Whirlwind Tour of Statistical Analysis of Network Data

Over the past 15+ years, the study of so-called "complex networks" — that is, network-based representations of complex systems — has taken the sciences by storm. Researchers from biology to physics, from economics to mathematics, and from computer science to sociology, are more and more involved with the collection, modeling and analysis of network-indexed data. With this enthusiastic embrace of networks across the disciplines comes a multitude of statistical challenges of all sorts — many of them decidedly non-trivial. In this talk I will present a (very!) brief overview of foundations common to the statistical analysis of network data across the disciplines, from a statistical perspective, in the context of topics like network summary and visualization, network sampling, network modeling and inference, and network processes. Focus will be on highlighting the curious mix of similarities and differences in statistical challenges encountered with network problems, as compared to more traditional data. Concepts will be illustrated drawing on examples from various domain areas, including bioinformatics, computer network traffic analysis, neuroscience, and social media.

Bo Li, University of Illinois at Urbana-Champaign
Reconstructing Past Temperatures using Short- and Long-memory Models

Understanding the dynamics of climate change in its full richness requires knowledge of long temperature time series. Although long-term, widely distributed temperature observations are not available, there are other forms of data, known as climate proxies, that can have a statistical relationship with temperatures and have been used to infer temperatures in the past before direct measurements. We propose a Bayesian hierarchical model (BHM) to reconstruct past temperatures that integrates information from different sources, such as proxies with different temporal resolution and forcings acting as the external drivers of large scale temperature evolution. The reconstruction method is assessed using a global climate model as the true climate system, with synthetic proxy data derived from the simulation. Then we apply the BHM to real datasets and produce new reconstructions of Northern Hemisphere annually averaged temperature anomalies back to 1000 AD. We also explore the effects of including external climate forcings within the reconstruction and of accounting for short-memory and long-memory features. Finally, we use posterior samples of model parameters to arrive at an estimate of the transient climate response to greenhouse gas forcings of 2.5°C (95% credible interval of [2.16, 2.92]°C), which is on the high end of, but consistent with, the expert-assessment-based uncertainties given in the recent Fifth Assessment Report of the IPCC.

Han Liu (presented by Ethan X. Fang), Princeton University
Testing and Confidence Intervals for High Dimensional Proportional Hazards Model

We propose a decorrelation-based approach to test hypotheses and construct confidence intervals for the low dimensional component of high dimensional proportional hazards models. Motivated by the geometric projection principle, we propose new decorrelated score, Wald and partial likelihood ratio statistics. Without assuming model selection consistency, we prove the asymptotic normality of these test statistics, establish their semiparametric optimality, and conduct power analysis under Pitman alternatives. In addition, we also develop new procedures for constructing pointwise confidence intervals for the baseline hazard function and baseline survival function.

Lisa Lix, University of Manitoba
Chronic Disease Research and Surveillance: The Power of Big Databases, the Challenges of Data Quality

Electronic health databases, including administrative health data, electronic medical record data, and clinical registries, are used extensively in Canada to conduct policy-relevant chronic disease research and surveillance. The popularity of these databases has arisen because they are available in a timely manner, contain information about large numbers of individuals, and are relatively inexpensive to access and use. However, the quality of these databases for research and surveillance has been questioned, and has led to multiple studies on this topic, most of which have focused on the validity of disease diagnostic information. In this talk, I will present a comprehensive data quality framework, describe different statistical methods for evaluating data quality, and illustrate these techniques with numeric examples.


Richard Lockhart, Simon Fraser University
Inference after LASSO -- limits and limitations

I will survey some work with Jon Taylor, Ryan Tibshirani and Rob Tibshirani on inference for the parameters when LASSO or LARS is used to fit a linear regression with many parameters. Under strong sparseness assumptions we have both asymptotic distribution theory and exact finite sample theory for those coefficients picked by some model selection procedures. One concern is power; I will discuss some preliminary ideas about what is possible and try to show how LeCam's contiguity ideas can be relevant at unusual distances from null hypotheses in high dimensions.

Sofia Olhede, University College London
Understanding Large Networks

Networks have become pervasive in practical applications. Understanding large networks is hard, especially because a number of typical features present in such observations create technical analysis challenges. I will discuss some basic network models that are tractable for analysis, what sampling properties they can reproduce, and some results relating to their inference.


C. Shane Reese, Brigham Young University
From Basis Expansions to Insurgency Prediction: Applications of Bayesian Compressive Sensing

Characterization of data cardinality is no longer limited to numbers of rows and columns as the data streams are complex and arrays are (usually) insufficient to capture the complexity. As data complexity grows with the size of the data, statistical methods must evolve to accommodate the richness of the data as well as the demands for computation in (near) real time. We present several applications of emerging statistical methods to data of moderate complexity and size. The first application involves an infinite near basis expansion of an intractable physics phenomenon with important socio-economic implications in the formation of alloys. The second application illustrates the simultaneous use of model selection and model estimation techniques for the assessment of whether insurgencies in South American nations can be predicted from social media activity. While both applications are of moderate size, the computational efficiency of the algorithms and data analytic tools illustrate the feasibility of application to truly big data.

Ruslan Salakhutdinov, University of Toronto
Recent Advances in Deep Learning: Learning Structured, Robust, and Multimodal Models

Building intelligent systems that are capable of extracting meaningful representations from high-dimensional data lies at the core of solving many Artificial Intelligence tasks, including visual object recognition, information retrieval, speech perception, and language understanding.

In this talk I will first introduce a broad class of deep learning models and show that they can learn useful hierarchical representations from large volumes of high-dimensional data with applications in information retrieval, object recognition, and speech perception. I will then describe a new class of more complex models that combine deep learning models with structured hierarchical Bayesian models and show how these models can learn a deep hierarchical structure for sharing knowledge across hundreds of visual categories. Finally, I will introduce deep models that are capable of extracting a unified representation that fuses together multiple data modalities as well as discuss models that can generate natural language descriptions of images. I will show that on several tasks, including modelling images and text, video and sound, these models significantly improve upon many of the existing techniques.

Alexandra M. Schmidt, IM-UFRJ, Brazil
An overview of covariance structures for spatial and spatio-temporal processes

An important aspect of statistical modeling of spatial or spatio-temporal data is to determine the covariance function. It is a key part of spatial prediction. The classical geostatistical approach uses an assumption of isotropy, which yields circular isocorrelation curves. In the analysis of most spatio-temporal processes underlying environmental studies there is little reason to expect spatial covariance structures to be isotropic. This is because there may be local influences in the correlation structure of the spatial random process. Adding the temporal aspect, there is often interaction between time and space, requiring classes of nonseparable covariance structures. In this talk we will review some of the alternatives to the usual isotropic models that have been proposed in the last 20 years. We will also discuss some alternatives for massive datasets and models for multivariate and non-Gaussian processes.
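
As an illustration only (not material from the talk), the contrast between an isotropic covariance and one simple non-isotropic alternative can be sketched in a few lines. The sketch assumes an exponential covariance function and uses geometric anisotropy (rotating and stretching the lag vector) as the alternative; the function names and parameter values are hypothetical.

```python
import math


def exp_cov(s, t, sigma2=1.0, phi=1.0):
    """Isotropic exponential covariance: depends only on Euclidean distance."""
    h = math.dist(s, t)
    return sigma2 * math.exp(-h / phi)


def aniso_exp_cov(s, t, angle, ratio, sigma2=1.0, phi=1.0):
    """Geometrically anisotropic version: rotate the lag by `angle`,
    then stretch its second component by `ratio` before measuring distance."""
    dx, dy = t[0] - s[0], t[1] - s[1]
    u = math.cos(angle) * dx + math.sin(angle) * dy
    v = -math.sin(angle) * dx + math.cos(angle) * dy
    h = math.hypot(u, ratio * v)
    return sigma2 * math.exp(-h / phi)


origin = (0.0, 0.0)
east, north = (1.0, 0.0), (0.0, 1.0)

# Isotropic: equal covariance in every direction at the same distance,
# so contours of equal correlation are circles.
print(exp_cov(origin, east), exp_cov(origin, north))

# Anisotropic: correlation decays faster along the stretched axis,
# so contours of equal correlation become ellipses.
print(aniso_exp_cov(origin, east, 0.0, 3.0), aniso_exp_cov(origin, north, 0.0, 3.0))
```

Setting `ratio` to 1 recovers the isotropic model for any rotation angle.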


Dale Schuurmans, University of Alberta
Convex Methods for Latent Representation Learning

Automated feature discovery is a fundamental problem in data analysis. Although classical feature learning methods fail to guarantee optimal solutions in general, convex reformulations have been developed for a number of such problems. Most of these reformulations are based on one of two key strategies: relaxing pairwise representations, or exploiting induced matrix norms. Despite their use of relaxation, convex reformulations can demonstrate improvements in solution quality by eliminating local minima. I will discuss a few recent convex reformulations for representative learning problems, including robust regression, hidden-layer network training, and multi-view learning --- demonstrating how latent representation discovery can co-occur with parameter optimization while admitting globally optimal relaxed solutions. In some cases, meaningful rounding guarantees can also be achieved.

Steve Scott, Google Inc
Bayes and Big Data: The Consensus Monte Carlo Algorithm

A useful definition of "big data" is data that is too big to comfortably process on a single machine, either because of processor, memory, or disk bottlenecks. Graphics processing units can alleviate the processor bottleneck, but memory or disk bottlenecks can only be eliminated by splitting data across multiple machines. Communication between large numbers of machines is expensive (regardless of the amount of data being communicated), so there is a need for algorithms that perform distributed approximate Bayesian analyses with minimal communication. Consensus Monte Carlo operates by running a separate Monte Carlo algorithm on each machine, and then averaging individual Monte Carlo draws across machines. Depending on the model, the resulting draws can be nearly indistinguishable from the draws that would have been obtained by running a single machine algorithm for a very long time. Examples of consensus Monte Carlo are shown for simple models where single-machine solutions are available, for large single-layer hierarchical models, and for Bayesian additive regression trees (BART).
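
As an illustration only (not the speaker's implementation), the split-then-average step described above can be sketched for a toy model: a Gaussian mean with known variance and a Gaussian prior. The sketch makes several assumptions: each "machine" is just a list of data, each shard's posterior is sampled exactly rather than by MCMC, each shard uses the prior raised to the power 1/(number of shards), and draws are combined with weights equal to each shard's posterior precision. All function names are hypothetical.

```python
import random
import statistics


def subposterior_draws(shard, n_shards, sigma, tau, n_draws, rng):
    """Draw from one shard's posterior for the mean, using the N(0, tau^2)
    prior raised to the power 1/n_shards, i.e. N(0, n_shards * tau^2)."""
    precision = len(shard) / sigma**2 + 1.0 / (n_shards * tau**2)
    mean = (sum(shard) / sigma**2) / precision
    sd = precision ** -0.5
    return [rng.gauss(mean, sd) for _ in range(n_draws)], precision


def consensus_draws(shards, sigma, tau, n_draws, rng):
    """Combine shard draws by a precision-weighted average, draw by draw."""
    results = [
        subposterior_draws(shard, len(shards), sigma, tau, n_draws, rng)
        for shard in shards
    ]
    total_weight = sum(weight for _, weight in results)
    return [
        sum(weight * draws[g] for draws, weight in results) / total_weight
        for g in range(n_draws)
    ]


def full_posterior(data, sigma, tau):
    """Exact single-machine posterior mean and standard deviation."""
    precision = len(data) / sigma**2 + 1.0 / tau**2
    return (sum(data) / sigma**2) / precision, precision ** -0.5


rng = random.Random(0)
data = [rng.gauss(2.0, 1.0) for _ in range(2000)]
shards = [data[i::4] for i in range(4)]  # four simulated "machines"

draws = consensus_draws(shards, sigma=1.0, tau=10.0, n_draws=5000, rng=rng)
exact_mean, exact_sd = full_posterior(data, sigma=1.0, tau=10.0)

print("consensus mean and sd:", statistics.fmean(draws), statistics.stdev(draws))
print("exact mean and sd:    ", exact_mean, exact_sd)
```

In this Gaussian case the weighted average of shard draws has exactly the single-machine posterior distribution, which is why the two lines of output should agree closely; for other models the agreement is only approximate.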


Therese A Stukel, ICES
Innovative uses of big data for health policy research

Data in health care has been growing rapidly. ICES Ontario holds one of the largest repositories of health administrative data, from physician and hospital records to disease registries and clinical data from electronic medical records. We illustrate innovative linkages of these data to produce novel applications for health policy. These range from creation of Health Service Region boundaries and multispecialty physician networks to online calculators to predict life expectancy and hospital days.


Stephen Vavasis, University of Waterloo
Clique and Biclique: An example of using convex optimization for data mining

The clique and biclique problems are examples of data mining problems that fall under the general heading of clustering. We explain how to describe these problems as nonconvex optimization, how to convexify them, what guarantees hold for the convexification, and how to solve the convex versions. Parts of this talk represent joint work with B. Ames. In addition, part of the talk will survey a paper by Chen and Xu.

Martin Wainwright, University of California, Berkeley
Statistics meets optimization: Rigorous guarantees for solving nonconvex programs

In this talk, we provide an overview of an evolving line of work on rigorous results for different types of nonconvex optimization problems that arise in statistical machine learning. Nonconvex optimization is known to be intractable in the general setting, but the problems arising in the statistical setting are not adversarially constructed; rather, they are related to a population oracle problem that can be well-behaved. We discuss how this intuition can be made precise for a number of different problems, including high-dimensional sparse regression with nonconvex regularization, as well as various applications of the EM algorithm to latent variable models with nonconvex likelihoods.

Patrick J. Wolfe, University College London
Statistics meets optimization: Rigorous guarantees for solving nonconvex programs
(joint work with Sharmodeep Bhattacharyya and Peter J. Bickel)

Exchangeable network models provide a general nonparametric class of models for unlabeled random graphs, based on a latent variable formulation. Typically these models are approximated by fitting stochastic blockmodels, and the corresponding latent variable densities are estimated using piecewise-constant or histogram-type estimators derived from the fitted blockmodel. We show that if the latent variable density has certain smoothness properties and the fitted blockmodel satisfies certain consistency properties, then it is possible to obtain consistent estimators for the equivalence class of latent variable densities under suitable metrics. We derive rates of convergence for a fitted blockmodel to estimate the latent variable density, depending on the smoothness properties of the density as well as the error rate of the method of blockmodel fitting. We also propose a cross-validation method using count statistics and their smooth functions to regularize the fitted blockmodel, and to choose the number of blocks to approximate the latent variable density. The cross-validation method provides us with a procedure to regularize the fitted blockmodel, independently of the algorithm used to fit the blockmodel itself.

Douglas G. Woolford, Wilfrid Laurier University
Exploratory data analysis, visualization and modelling methods for large data in forest fire science

Exploratory data analysis uses visualization techniques to explore for structure in data sets, as well as departures from such structure. It is an important first step, prior to modelling and inference, when dealing with environmental data sets which are commonly large, complex and spatio-temporal in nature. For example, data on forest fire occurrences, fire-weather, and cloud-to-ground lightning strikes are recorded as spatially-referenced time series and such information is commonly recorded at a fine scale. For the first part of this talk I will discuss several exploratory techniques for visualizing large data in the context of forest fire science, including: 1) a data filtering technique used to reveal dominant data surfaces, track the paths of storm centres, and visualize typical storm systems; 2) an interactive, 3D stalagmite plot for exploring patterns in slices of long time series; and, 3) a singular value decomposition method that is useful for exploring spatial patterns in areal data. In the context of predicting rare events in large data, I will then discuss how sampling observations on a fine spatio-temporal grid can create a smaller, more computationally manageable data set when using a logistic generalized additive model to approximate the conditional intensity function of an underlying point process. I will also illustrate how a well-designed sampling scheme can lead to more efficient estimates in such models.