July 20, 2024


Thematic Program on Statistical Inference, Learning, and Models for Big Data, January to June, 2015

February 23 – 27, 2015
Workshop on Visualization for Big Data: Strategies and Principles
Organizing Committee

Nancy Reid (Chair), Susan Holmes, Snehalata Huzurbazar, Hadley Wickham, Leland Wilkinson

Speaker Abstracts




Katy Börner, Indiana University
Envisioning Science, Technology, and Innovation

Recent developments in data mining, information visualization, and science of science studies make it possible to study science, technology, and innovation (STI) at multiple levels using a systems science approach. At the micro-level, the impact of single individuals, specific works, or legal frameworks can be examined. At the meso-level, the expertise profiles of institutions can be compared or the trajectories of student cohorts can be modeled. The macro-level provides a 10,000-foot view of the continuously evolving geospatial and topical landscape of science and technology and the global import/export activities, innovation diffusion, and brain circulation unfolding over both spaces.

This talk features STI visualizations including maps from the international Places & Spaces: Mapping Science exhibit, the Atlas of Science, and the Atlas of Knowledge. To empower many to not only read but also make data visualizations, I will then present a visualization framework that was developed to guide the design of complex data analysis and visualization workflows as well as the development of macroscope tools. Both the framework and the macroscope tools are taught in the Information Visualization MOOC (now in its 3rd year), which empowers students from 100+ countries to convert data into insights.


· Börner, Katy. 2010. Atlas of Science: Visualizing What We Know. The MIT Press.

· Scharnhorst, Andrea, Katy Börner, and Peter van den Besselaar, eds. 2012. Models of Science Dynamics: Encounters Between Complexity Theory and Information Science. Springer Verlag.

· Börner, Katy, and David E. Polley. 2014. Visual Insights: A Practical Guide to Making Sense of Data. Cambridge, MA: The MIT Press.

· Börner, Katy. 2015. Atlas of Knowledge: Anyone Can Map. The MIT Press.


Jenny Bryan, University of British Columbia
New tools and workflows for data analysis

In the past several years, there have been exciting additions to the toolkit for statisticians and data analysts who work in R. Examples include RStudio, the R markdown format for dynamic reports, the Shiny web application framework, and improved integration with Git(Hub). The downside, of course, is the potential agony associated with mastering new tools and developing new workflows. Change is hard. I will give an overview of these developments and describe the costs and benefits associated with adopting new approaches to data exploration and analysis. I will also share my very positive and illuminating experiences from teaching these new tools in several graduate courses.

Sheelagh Carpendale, University of Calgary
Information Visualization: Exploring New Options

Much of the excitement in the early 1990s about information visualization originated in the idea of creating new visual, spatial representations that would allow people to ‘see’ their data. Much was said about the amount of the brain that is devoted to spatial and visual reasoning and how visualizations might have the power to utilize these relatively untapped resources. However, as information visualization research has progressed, a degree of practicality has emerged, heightening a focus on usability and task enablement. As important as this focus may be, there may still be something worth investigating in the notion of alternate representations. In this talk, I will explore the possible power of alternate interactive visual representations by considering ideas around innovation.

Remco Chang, Tufts University
Big Data Visual Analytics: A User-Centric Approach

Modern visualization systems often assume that the data can fit within the computer's memory. With such an assumption, visualizations can quickly slice and dice the data and help the users examine and explore the data in a wide variety of ways. However, as we enter the age of Big Data, the assumption that data can fit within memory no longer applies. One critical challenge in designing visual analytics systems today is therefore to allow the users to explore large and remote datasets at an interactive rate. In this talk, I will present our research in approaching this problem in a user-centric manner. In the first half of the talk, I will present preliminary work with the database group at MIT on developing a big data visualization system based on the idea of predictive prefetching and precomputation. In the second half of the talk, I will present mechanisms and approaches for performing prefetching that are based on users' past interaction histories and their perceptual abilities.

Ta Chiraphadhanakul, Facebook Inc.
Visualizing Big Data at Facebook

There are over 300 petabytes of data at Facebook. It is difficult, if at all possible, to visualize every single data point in these gigantic data sets. I will provide a brief overview of Facebook data infrastructure and present the key technologies that enable us to query subsets of data quickly and perform interactive visual analysis. Through many examples from our work, I will discuss how data scientists and researchers transform big data into valuable information and insightful visualizations.


Christopher Collins, UOIT
Semantics and Sentiment in Visual Text Analytics

How do people feel about my product? What are the main themes in the news today? These are examples of the questions people ask about large scale text data. Visual text analytics tools are being created to help address these challenging questions. In this talk I will review recent research advances for exploring and analyzing sentiment in text, and extracting meaning and relationships between entities in text. For those visualization designers who want to take advantage of semantics and sentiment, this session will also cover natural language processing toolkits and data resources.

Dianne Cook, Iowa State University
A Kaleidoscope of Statistical Graphics Research Projects

This will be a series of short talks displaying a range of current projects in data visualization research from Iowa State University and Purdue University.
- Adding a new geom to ggplot2, in order to extend the widely used software's network plotting capacity. Samantha Tyner
- Exploring networks using D3 and shiny in the software gravicom. Andee Kaplan
- Creating interactive web graphics from ggplot2 using the R package animint. Carson Sievert
- Exploring temporal data by interactively slicing and dicing, using the R package cranvastime. Xiaoyue Cheng
- Trelliscoping big data. Barret Schloerke


Michael Friendly, York University
Big Data and Big Questions: Vignettes and lessons from the history of data visualization

This talk traces some key developments in the history of data visualization to important scientific and social questions of their time and the availability of relevant data: the Big Questions and Big Data of a given era. I try to present this rich history in terms of a few vignettes. They range from problems of geodesy and navigation that led to the first statistical graph, to problems of crime and other "moral variables" that led to the foundation of modern social science, to graphs in the Golden Age of statistical graphics designed for state planning and leading to what is arguably the most notable scientific discovery of all time based on a purely visual analysis.

Alex Goncalves, Columbia University
New Trends in Data Visualization and Journalism

In a moment of tectonic shifts in the media industry, news media has been looking for new strategies to captivate and engage the public. The combination of data, visualization, and storytelling is certainly one of the most promising areas for media innovation. I will present some successful examples that suggest the potential of data visualization for informing public opinion and promoting social debate.

Susan Holmes, Stanford University
Transformations before Visualization of Heterogeneous Data

Modern data presents many layers of heterogeneity. I will speak about how careful data transformations can make visualization more effective. Drawing on current research on the human microbiome, I will show several examples of data transformations that result in more meaningful graphics for complex heterogeneous data that combine phylogenetic trees, OTU contingency tables, clinical data, and community networks.

I will show an implementation of these ideas in a Bioconductor package (phyloseq) and its browser based extension (Shiny-phyloseq).

This talk contains joint work with Joey McMurdie.

Ekaterina Smirnova, University of Wyoming
Visualization of Multidimensional Data with Different Structures

Recent advances in various "omics" technologies allow for comprehensive examination of microbial communities, together with other data such as those obtained on lipids and cytokines, along with the usual clinical (covariate) data, all collected on the same individual. Often the goal is to relate these data and explore their connections to various diseases. A characteristic of these studies is that the data are often quite sparse but collected on a large number of variables. As a start, exploring any structure present in the data is essential. Eigendecomposition-based methods not only provide tools for representing data in lower dimensions, but also enable systematic integration and comparison of multi-omic data sets.

Using data from a vaginal microbiome project, we address statistical methods for big data visualization, for exploring temporal dynamics of the vaginal microbiome, and for integrating three types of "omics" data, namely, microbial data obtained from 16S sequencing, data on some cytokines and some lipids collected on the same women. In particular, correspondence analysis allows for exploring whether women with bacterial vaginosis can be grouped based on the taxa present in their vaginal microbiome samples obtained via 16S sequencing, co-inertia analysis provides a tool for coupling the taxa measurements over two subsequent clinic visits, and multiple co-inertia analysis helps visualize connections between the three different data types.

Ramnath Vaidyanathan, McGill University
Interactive Visualizations from R

In this talk, I will discuss an approach to create, customize and share interactive visualizations straight from R, using a consistent interface, leveraging existing javascript visualization libraries. The advent of visualization frameworks like d3.js and raphaeljs has made it easier to create sophisticated interactive visualizations. However, they still require an intermediate knowledge of javascript and web development tools, making them harder to use for data scientists, who often spend a lot of their time analyzing data using languages like R/Python/Julia.
The main motivation behind this work is to provide data scientists a seamless workflow that allows them to execute all steps of the data visualization process, from acquiring data to exploring it, visualizing it, and sharing their results, without having to leave the comfort of their primary language for data analysis (R/Python).

As a part of this talk, I will discuss three R packages: htmlwidgets, a framework to create R bindings to javascript libraries (jointly authored with RStudio and Kenton Russell), rCharts, a consistent plotting interface to leverage several popular interactive visualization libraries (jointly authored with Kenton Russell and Thomas Reinholdsson), and rMaps.

Hadley Wickham, RStudio
Pipelines for big data

When thinking about visualising big data, the visualisation challenges play a surprisingly small role. Instead, much of the challenge is ingesting the data, tidying it into a workable format and transforming it to the appropriate level of aggregation (so you can see the forest for the trees). In this talk, I'll discuss my recent work that ties together R packages for working with big data, first loading it, then tidying it, then transforming it and finally visualising it.

Leland Wilkinson, Tableau Software
Exploring huge collections of images

Tuan Nhon Dang and Leland Wilkinson

We introduce a method for guiding interactive exploration of a huge corpus of images. The method is based on scagnostics - nine characterizations of the 2D distributions of orthogonal pairwise projections on a set of points in multidimensional Euclidean space. These characterizations include measures such as density, skewness, shape, outliers, and texture. Working directly with these measures, we can locate anomalies for further analysis or search for similar distributions in a large corpus of images.

Bowei Xi, Purdue University
Divide and Recombine (D&R) for Larger and More Complex Data

D&R is a statistical approach to the analysis of large complex data. The goals are the following: (1) Provide the data analyst with methods and a computational environment that enable study of large data with almost the same comprehensiveness and detail that we can achieve with small data. (2) The analyst uses an interactive language for data analysis that is both highly flexible and enables highly time-efficient programming with the data. (3) Underneath the language, a distributed database and parallel compute engine make computation feasible and practical, and are easily addressable from within the language. (4) The environment provides access to the 1000s of analytic methods of statistics, machine learning, and visualization. (5) Get a reasonable first system going right away.

In D&R, the analyst divides the data into subsets. Computationally, each subset is a small dataset. The analyst applies analytic methods to each of the subsets, and the outputs of each method are recombined to form a result for the entire data. Computations can be run in parallel with almost no communication among them, making them nearly embarrassingly parallel, the simplest possible parallel processing.
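The divide, apply, and recombine steps described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the Tessera implementation: the division is a simple round-robin split, the analytic method is a least-squares slope, and the recombination is a plain average of the per-subset estimates. The function names (`divide`, `fit_slope`, `recombine`, `divide_and_recombine`) are hypothetical, and a thread pool stands in for a distributed compute engine.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import fmean


def divide(rows, n_subsets):
    """Division: split rows into n_subsets subsets (round-robin, an assumption)."""
    return [rows[i::n_subsets] for i in range(n_subsets)]


def fit_slope(subset):
    """Analytic method applied to one subset: least-squares slope of y on x."""
    xs = [x for x, _ in subset]
    ys = [y for _, y in subset]
    x_bar, y_bar = fmean(xs), fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in subset)
    return sxy / sxx


def recombine(estimates):
    """Recombination: here simply the average of the per-subset estimates."""
    return fmean(estimates)


def divide_and_recombine(rows, n_subsets=4):
    subsets = divide(rows, n_subsets)
    # Each subset is processed independently, with no communication
    # between tasks; a thread pool stands in for a parallel back end.
    with ThreadPoolExecutor() as pool:
        estimates = list(pool.map(fit_slope, subsets))
    return recombine(estimates)


# Toy data lying exactly on the line y = 2x + 1.
data = [(float(x), 2.0 * x + 1.0) for x in range(100)]
dr_slope = divide_and_recombine(data)
```

Averaging is only one possible recombination method, and it does not in general reproduce the all-data estimate exactly. How close the D&R result comes depends on the choice of division and recombination methods.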

One of our D&R research thrusts uses statistics to develop "best" division and recombination methods for analytic methods. This is critical because the division and recombination methods have an immense impact on the statistical accuracy of the D&R result for an analytic method.

Another thrust is the D&R computational environment Tessera. The front end is R. The back end is the Hadoop distributed database and parallel compute engine. Our Tessera software manages the communication between front and back, enabling the analyst to program D&R wholly from within R, insulated from the complexity of distributed database management and parallel computation.

