SCIENTIFIC PROGRAMS AND ACTIVITIES

THE FIELDS INSTITUTE FOR RESEARCH IN MATHEMATICAL SCIENCES

Thematic Program on Statistical Inference, Learning, and Models for Big Data, January to June, 2015

April 13 – 16, 2015
Workshop on Big Data for Social Policy

Organizing Committee
Sallie Keller (chair),
Robert Groves,
Mary Thompson

Abstracts

Ana Aizcorbe, Virginia Tech
Leveraging "big data" to improve official measures of health care: Lessons from a new Health Satellite Account for the US

In January, the Bureau of Economic Analysis released a set of statistics for health care spending for the US to supplement those in the official national accounts. These estimates are viewed as experimental in large part because the data sources are novel and not completely understood. This talk will detail some of the challenges faced when trying to combine the large number of observations available in health insurance claims databases with the sound statistical properties of government surveys.

 

 

Hélène Bérard, Statistics Canada
A Suggested Framework for National Statistics Offices for Assessing the Quality of Big Data

In April 2013, the UNECE Expert Group on the Management of Statistical Information Systems (MSIS) identified Big Data as a key challenge for official statistics, and called for the High-Level Group for the Modernisation of Statistical Production and Services (HLG) to focus on the topic in its plans for future work. As a consequence, the project The Role of Big Data in the Modernisation of Statistical Production was undertaken in 2014. The project comprised four 'task teams', addressing different aspects of Big Data issues relevant for official statistics: the Privacy Task Team, the Partnerships Task Team, the Sandbox Task Team, and the Quality Task Team.
The Quality Task Team was asked to develop a preliminary framework for national statistical offices to conceptualise Big Data quality. Building on dimensions and concepts from existing statistical data quality frameworks, the resulting Big Data Quality framework provides a structured view of quality at three phases of the business process:
" Input - acquisition, or pre-acquisition analysis of the data;
" Throughput -transformation, manipulation and analysis of the data;
" Output - the reporting of quality with statistical outputs derived from big data sources.
The framework uses a hierarchical structure composed of three hyperdimensions, with quality dimensions nested within each hyperdimension. The three hyperdimensions are the source, the metadata and the data. The concept of hyperdimensions has been borrowed from the administrative data quality framework developed by Statistics Netherlands.
This talk describes the findings of the Big Data Quality Task Team, the general principles proposed, the approach taken for the development of quality indicators, as well as suggested future directions.
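
To make this structure concrete, the Python sketch below arranges quality dimensions within the three hyperdimensions at each of the three phases. The specific dimension names are illustrative assumptions only, not the Quality Task Team's actual list.

    # Minimal sketch of the hierarchy described above: quality dimensions nested
    # within three hyperdimensions (source, metadata, data), assessed at each
    # phase of the business process. Dimension names are illustrative only.
    BIG_DATA_QUALITY_FRAMEWORK = {
        "input": {        # acquisition, or pre-acquisition analysis of the data
            "source":   ["institutional environment", "privacy and security"],
            "metadata": ["completeness", "interpretability"],
            "data":     ["coverage", "accuracy"],
        },
        "throughput": {   # transformation, manipulation and analysis of the data
            "source":   ["processing transparency"],
            "metadata": ["lineage", "documentation of edits"],
            "data":     ["consistency", "validity of derived variables"],
        },
        "output": {       # reporting of quality with the statistical outputs
            "source":   ["fitness for use"],
            "metadata": ["disclosure of methods"],
            "data":     ["accuracy", "coherence", "timeliness"],
        },
    }

    def quality_checklist(phase):
        """Return the (hyperdimension, dimension) pairs to assess at a phase."""
        return [(hyper, dim)
                for hyper, dims in BIG_DATA_QUALITY_FRAMEWORK[phase].items()
                for dim in dims]

    print(quality_checklist("input"))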

 

 

Pedro Ferreira, Carnegie Mellon University
Large-Scale Organic In-vivo Randomized Experiments in Network Settings

This talk will describe a series of real-world experiments, involving more than half a million households, in which individuals participate organically. These experiments have been designed, deployed, and analyzed in collaboration with a major European cable provider. In one experiment I show that likes help correct social bias in the consumption of video-on-demand. In another experiment, I show that word-of-mouth propels the sales of video offered at discounted prices. In the same context, I will also show that firms have incentives to manipulate recommendations in ways that may not maximize consumer welfare. In a similar context, another experiment shows that video streaming services over the Internet substitute for TV consumption at roughly a 1:1 rate, and that users offered free access to movies for a period do not necessarily stop pirating copyrighted content. Finally, I will also show results from an experiment measuring the economic benefits of proactive churn management that targets groups of users instead of single individuals.
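
As a rough illustration of this kind of household-level randomized experiment (not the speaker's actual design or data), the Python sketch below assigns synthetic households to treatment and control at random and estimates the effect with a simple difference in means.

    # Illustrative only: synthetic data, hypothetical outcome definition.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 500_000                                  # households
    treated = rng.integers(0, 2, size=n)         # 50/50 random assignment
    # Hypothetical outcome: weekly video-on-demand purchases per household.
    outcome = rng.poisson(lam=1.0 + 0.05 * treated)

    ate = outcome[treated == 1].mean() - outcome[treated == 0].mean()
    se = np.sqrt(outcome[treated == 1].var(ddof=1) / (treated == 1).sum()
                 + outcome[treated == 0].var(ddof=1) / (treated == 0).sum())
    print(f"estimated effect = {ate:.4f} +/- {1.96 * se:.4f}")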

 

Stephen Fienberg, Carnegie Mellon University
Overview of the Living Analytics Research Centre and its Activities

The Living Analytics Research Centre is a joint research initiative between Singapore Management University (SMU) and Carnegie Mellon University (CMU) focused on consumer and social analytics for the network-centric world. Working with a set of commercial partners and in an on-campus testbed in Singapore, LARC researchers are developing new concepts, methods, and tools that are experiment-driven, closed-loop, more real-time, and practical at societal scale. I will provide an introduction to the LARC research activities and the types of data with which we work. I will also describe a series of “big data” statistical challenges arising in our work.

 


Robert R. Groves, Georgetown University
Moving from the Sample Survey Paradigm to a Blended World with High-Dimensional Data

For the last 75 years, most information about human societies and their economies has been generated from probability samples of near-universal frames of population units. These samples were subjected to standardized measurement, the data were cleaned and assembled, and statistical analyses were performed for inference to the full population.
In most developed countries of the world, these sample surveys are becoming increasingly expensive, owing to the difficulty of contacting and gaining the cooperation of the households, businesses, or groups being measured. Given their high costs, few countries are investing in new survey infrastructure.
Simultaneous with the threats to sample surveys is the rise of new data resources, vast in numbers of observations but not under the control of the researcher. The weaknesses of these new data sources are often the strengths of the sample survey method, and vice versa. These data, labeled "organic data," when combined with traditional sample survey data, might offer a way forward for the social sciences.
The presentation discusses the key principles of, and obstacles to, needed research in this area.

 

Mike Holland, Center for Urban Science and Progress, New York University
Privacy Challenges Arising from the Use of Big Data for Urban Science

The ability to merge administrative data, in situ sensor data, images, and citizen-generated data promises to improve our understanding of cities with unprecedented granularity, coverage, and timeliness. Local governments are exploring how to make better use of such data to help them deliver services more efficiently, increase the precision and accuracy of enforcement actions, set more informed policies, or more effectively plan infrastructure improvements. Privacy concerns arise not only from how the data are generated, but also from where the data are generated, collected, and contributed, and from how they are correlated with other data – whether by agencies, corporations, researchers, or individuals. This talk will focus on some of the practical challenges facing the use of big data for urban science.

 

Gizem Korkmaz, Virginia Tech
Insurgency Prediction Using Multiple High Volume Social Media Data Sources

Civil unrest events (protests, strikes, etc.) unfold through complex mechanisms that cannot be fully understood without capturing social, political, and economic contexts. People revolt for economic reasons, for democratic rights, or to express their discontent; others are mobilized by unions, political groups, or social media. Modern cultures use a variety of communication tools (e.g., traditional and social media) to coordinate and gather a sufficient number of people to raise their voice. By combining a number of diverse data sources, we aim to explore which sources of information carry signals around key events in countries of Latin America. To accommodate dynamic features of social media feeds and their impacts on insurgency prediction, we develop a dynamic linear model based on daily keyword combinations. Because of the large number of so-called n-grams, we employ a sparseness-encouraging prior distribution for the coefficients governing the dynamic linear model. Included among the predictors are significant sets of keywords (~1000) extracted from Twitter, news, and blogs. We also include the volume of requests to Tor, a widely used anonymity network, economic indicators, and two political event databases (GDELT and ICEWS). Insurgency prediction is further complicated by the difficulty of assessing the exact nature of the conflict. Our study is further enhanced by the existence of a ground-truth measure of conflicts compiled by an independent group of social scientists and experts on Latin America.
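
The model in the talk is a dynamic linear model with a sparseness-encouraging prior on the keyword coefficients. As a crude static stand-in (synthetic data, and an L1 penalty in place of the prior), the Python sketch below only illustrates how a sparse subset of roughly a thousand keyword counts can be selected as predictors of daily event counts.

    # Simplified stand-in, not the authors' model: a static Lasso regression
    # selecting a sparse subset of synthetic daily keyword counts.
    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(42)
    days, n_keywords = 365, 1000
    X = rng.poisson(lam=2.0, size=(days, n_keywords)).astype(float)  # keyword counts
    true_beta = np.zeros(n_keywords)
    true_beta[:10] = 0.5                                  # only a few keywords matter
    y = X @ true_beta + rng.normal(scale=1.0, size=days)  # synthetic event counts

    model = Lasso(alpha=0.1).fit(X, y)
    selected = np.flatnonzero(model.coef_)
    print(f"{selected.size} keywords retained out of {n_keywords}")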

 

Julia Lane, American Institutes for Research
Big Data, Privacy, and the Public Good: Frameworks for Engagement

The recent Cambridge University Press book "Privacy, Big Data, and the Public Good: Frameworks for Engagement" addresses privacy concerns over the use of big data for commercial and intelligence purposes, and describes how these data can be harnessed for the public good. This presentation summarizes the legal, economic, and statistical thought that frames the many privacy issues associated with big data use and that is highlighted in the book. The presentation will also summarize the book's practical suggestions for protecting privacy and confidentiality, which can help researchers, policymakers, and practitioners.

 


Pierre Lavallée, Statistics Canada
Sample Matching: Toward a probabilistic approach for Web surveys and big data?

More and more, we see survey firms using respondents from Web panels to conduct their surveys. Most of the time, these panels cannot be considered probabilistic samples, and consequently inference from them is often questionable. In order to introduce a probabilistic component to Web panels, Rivers (2007) proposed Sample Matching. This method consists of selecting a probabilistic sample from a sampling frame and then associating this sample with the respondents of the Web panel by statistical matching. With statistical matching, each individual of the probabilistic sample is matched to one of the respondents of the panel according to some given characteristics, without requiring that the two individuals be exactly the same person. In this presentation, we will first describe Sample Matching in detail. Second, the method will be compared to Indirect Sampling (Lavallée, 2007). Third, we will present the hypotheses and theoretical justifications underlying Sample Matching. Finally, we will present some examples of the application of the method.
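
A minimal sketch of the matching step, with invented covariates: each unit of a probability sample is matched to its nearest Web-panel respondent on a few shared characteristics. The Euclidean nearest-neighbour rule below is an illustrative assumption, not necessarily the matching rule of Rivers (2007).

    # Illustrative only: synthetic covariates, Euclidean nearest-neighbour match.
    import numpy as np
    from scipy.spatial import cKDTree

    rng = np.random.default_rng(1)
    # Shared characteristics (e.g., standardized age, income, education scores).
    prob_sample = rng.normal(size=(1_000, 3))    # probabilistic sample from the frame
    web_panel = rng.normal(size=(50_000, 3))     # opt-in Web-panel respondents

    tree = cKDTree(web_panel)
    _, matched_idx = tree.query(prob_sample, k=1)   # nearest respondent per sampled unit
    matched_respondents = web_panel[matched_idx]
    print(matched_respondents.shape)                # (1000, 3): one match per unit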

 

Eric Miller, University of Toronto
Agent-Based Microsimulation of Urban Spatial Socio-Economic Processes: Current Status, Future Prospects and the Role of Big Data

Over the past ten years, agent-based microsimulation has increasingly become the state of the art in large-scale urban travel demand forecasting model systems, as well as (to a lesser extent) in integrated urban (land-use) modelling. At the same time, “big data” with transportation and urban form applications are becoming increasingly available. This paper begins by providing an introduction to and overview of urban systems modelling. It then discusses the current and emerging state of big data sources and applications in transportation planning. This leads to a discussion of the opportunities and challenges involved in big data applications within urban systems modelling and some recommendations for directions for future research.

 

Archan Misra, Singapore Management University
Mobile Analytics@LiveLabs: Studying Human Behavior in Urban Public Spaces

This talk will describe the analytics opportunities and challenges arising from the LiveLabs testbed on the SMU campus, which seeks to utilize longitudinal traces of mobile sensing and usage data to understand human behavior in the physical world. I shall first outline our work on inferring a variety of on-campus physical-world behaviors, including indoor movement, group-based interactions, and queuing episodes. Additionally, I will show how the inferred physical-world group interaction context proves useful for multiple higher-level objectives, including (a) establishing the social relationships ("strength of ties") among the participants and (b) identifying anomalous or unusual events in such public venues. Next, I shall demonstrate an operational platform that supports context-based in-situ experimental studies on LiveLabs participants (e.g., understanding the efficacy of promotions targeted to users who are "queuing as a group"). The uncertainty in the inferred context, however, poses a fundamental challenge to establishing the validity of both the analytical insights and the experimental outcomes. I shall conclude by summarizing our current approaches and open issues related to this challenge.

 

Ron Jarmin, US Census Bureau
The Value Chain and Impact of University Research: A Prototype for Modernizing Economic Statistics

This presentation describes a new project at the U.S. Census Bureau, in collaboration with academic researchers, that links university administrative data on federally funded research to Census Bureau data on businesses and workers. The university data are far richer than those typically obtained from organizations on surveys or via government administrative data (e.g., tax records). I’ll describe these data and how they are linked to Census Bureau data on the universe of businesses and workers, and discuss practical and methodological challenges. I’ll present some preliminary results, including measures of the entrepreneurial behavior of scientists and engineers. Finally, I describe how this model might be generalized to other sectors of the economy to radically transform and modernize economic measurement.
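
Purely to illustrate the kind of linkage involved (the fields, cleaning rules, and threshold below are invented, not the Census Bureau's methodology), the Python sketch matches a university personnel record to a business register entry by blocking on state and fuzzy-matching a cleaned employer name.

    # Illustrative only: invented records and a toy name-matching rule.
    from difflib import SequenceMatcher

    university_records = [
        {"person": "A. Researcher", "employer": "Acme Robotics LLC", "state": "PA"},
    ]
    business_register = [
        {"firm_id": 101, "name": "ACME ROBOTICS, L.L.C.", "state": "PA"},
        {"firm_id": 202, "name": "Beta Analytics Inc.",   "state": "PA"},
    ]

    def clean(name):
        kept = "".join(ch for ch in name.upper() if ch.isalnum() or ch == " ")
        return kept.replace(" LLC", "").replace(" INC", "").strip()

    def link(record, register, threshold=0.9):
        candidates = [b for b in register if b["state"] == record["state"]]  # blocking
        if not candidates:
            return None
        score, best = max(
            ((SequenceMatcher(None, clean(record["employer"]), clean(b["name"])).ratio(), b)
             for b in candidates),
            key=lambda pair: pair[0])
        return best["firm_id"] if score >= threshold else None

    print(link(university_records[0], business_register))  # -> 101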

 


Nathaniel Osgood, University of Saskatchewan
Cross-leveraging Systems, Data and Computational Science for Health Behavioural Insight

While agent-based and hybrid dynamic modeling supports the formulation of highly articulated theories of human behavior – including textured dynamic hypotheses concerning social and environmental influences on decision-making, "through-the-skin" coupling of physiology and decision-making, and environmental interactions – traditional data collection tools and survey instruments all too frequently provide insufficient evidence to thoroughly inform and ground – much less fully parameterize or calibrate – such models. Fortunately, the rise of big data – and especially its enhanced velocity, variety, veracity, and volume – provides remarkable volumes of data for researcher, clinical, or self-insight into health behavior, conditions, and health care service delivery processes. Unfortunately, absent rich models that connect the observations to decision-making needs, the value of such data will be severely constrained. In this talk, we highlight the synergistic character of dynamic models and data drawn from such novel data sources, including our iEpi smartphone-based sensor, survey, and crowdsourcing health platform. We also emphasize particular avenues by which such data allow for articulated theory building regarding difficult-to-observe aspects of human behavior. In this regard, we highlight the synergistic role that dynamic modeling can play in interacting with a range of machine learning and computational statistics approaches, including batch Markov chain Monte Carlo methods, particle filtering, Bayesian conditional models, and hidden Markov models.
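
As one concrete example of the methods listed above, the sketch below implements a basic bootstrap particle filter that tracks a single latent behavioural state from noisy daily observations. The dynamics, noise levels, and data are invented for illustration and bear no relation to the iEpi platform's actual models.

    # Illustrative only: bootstrap particle filter on a synthetic random walk.
    import numpy as np

    rng = np.random.default_rng(7)
    T, n_particles = 100, 2000
    true_state = np.cumsum(rng.normal(scale=0.1, size=T))   # latent daily state
    obs = true_state + rng.normal(scale=0.5, size=T)        # noisy sensor observations

    particles = rng.normal(scale=1.0, size=n_particles)
    estimates = []
    for t in range(T):
        particles += rng.normal(scale=0.1, size=n_particles)        # propagate dynamics
        weights = np.exp(-0.5 * ((obs[t] - particles) / 0.5) ** 2)   # observation likelihood
        weights /= weights.sum()
        estimates.append(np.sum(weights * particles))                # filtered mean
        idx = rng.choice(n_particles, size=n_particles, p=weights)   # resample
        particles = particles[idx]

    print(f"final tracking error: {abs(estimates[-1] - true_state[-1]):.3f}")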

 

Jonathan Ozik, Argonne National Laboratory
Agent-based Modeling and Big Data

With the increased availability of micro-level data and large computing resources, disaggregated modeling approaches such as agent-based modeling (ABM) have the potential to transform the way we understand critical complex systems. In this talk I will cover two sides of ABM and Big Data. First, ABMs can make very good use of micro-data, both to derive agent attributes and to infer agent behaviors. In this way ABMs can be seen as ideal Big Data ingesters. I will present some examples of how micro-data are being used to seed large-scale ABMs.
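
A minimal sketch of this "ABMs as Big Data ingesters" idea, with invented fields: agent attributes are drawn directly from micro-level person records rather than from aggregate distributions.

    # Illustrative only: seeding agents from a synthetic micro-data extract.
    import random
    from dataclasses import dataclass

    @dataclass
    class Agent:
        age: int
        employed: bool
        household_size: int

    # Stand-in for a micro-data extract (e.g., census or survey person records).
    micro_records = [
        {"age": 34, "employed": True,  "household_size": 3},
        {"age": 71, "employed": False, "household_size": 1},
        {"age": 19, "employed": True,  "household_size": 5},
    ]

    def seed_agents(records, n_agents, seed=0):
        """Create agents by sampling micro-data records with replacement."""
        rng = random.Random(seed)
        return [Agent(**rng.choice(records)) for _ in range(n_agents)]

    population = seed_agents(micro_records, n_agents=10)
    print(population[0])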

Second, ABMs are increasingly run as ensembles on large computational resources. Large-scale scientific workflows are being developed to implement capabilities for methods such as adaptive parametric studies, large-scale sensitivity analyses, scaling studies, optimization/metaheuristics, uncertainty quantification, data assimilation, and multi-scale model integration. These methods target both heterogeneous computing resources and more specialized supercomputing environments, enabling many thousands to millions of ABM simulations to be run and analyzed concurrently. From this perspective, ABMs can be seen as generators of Big Data. I will present work on the development of large-scale scientific workflows for use with ABMs.

 

Jerry Reiter, Duke University
Making Large-Scale, Confidential Data Available for Secondary Analysis

Large-scale databases from the social, behavioral, and economic sciences offer enormous potential benefits to society. However, as most stewards of social science data are acutely aware, wide-scale dissemination of such data can result in unintended disclosures of data subjects' identities and sensitive attributes, thereby violating promises--and in some instances laws--to protect data subjects' privacy and confidentiality. In this talk, I describe a vision for an integrated system for disseminating large-scale social science data. The system includes (i) the capability to generate highly redacted, synthetic data intended for wide access, coupled with (ii) means for approved researchers to access the confidential data via secure remote access solutions, glued together by (iii) a verification server that allows users to assess the quality of their analyses with the redacted data so as to be more efficient in their use of remote data access. I describe some of the methodological challenges of releasing synthetic data and allowing automated verification.
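
A minimal sketch of the verification idea, under assumptions of my own (synthetic data, and a confidence-interval overlap score as the fidelity measure): the server runs the user's analysis on both the confidential and the redacted data and returns only a coarse quality score, never the confidential estimates themselves.

    # Illustrative only: synthetic data and a simple interval-overlap measure.
    import numpy as np

    def ci(sample, z=1.96):
        m = sample.mean()
        s = sample.std(ddof=1) / np.sqrt(len(sample))
        return m - z * s, m + z * s

    def interval_overlap(ci_a, ci_b):
        """Average share of each interval covered by their intersection."""
        lo, hi = max(ci_a[0], ci_b[0]), min(ci_a[1], ci_b[1])
        if hi <= lo:
            return 0.0
        return 0.5 * ((hi - lo) / (ci_a[1] - ci_a[0]) + (hi - lo) / (ci_b[1] - ci_b[0]))

    rng = np.random.default_rng(3)
    confidential = rng.normal(loc=50_000, scale=20_000, size=5_000)  # e.g., incomes
    synthetic = rng.normal(loc=51_000, scale=22_000, size=5_000)     # redacted release

    print(f"overlap score: {interval_overlap(ci(confidential), ci(synthetic)):.2f}")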

 

Stephanie Shipp, Virginia Tech
Policy meets Social and Decision Informatics

The exponential growth of digital data has created an all data revolution that is allowing us to view the world at an unprecedented scale and level of granularity. A similar revolution occurred in the 1930s with the emergence of regularly conducted surveys and probability sampling, primarily by federal statistical agencies. Though the timeframe for these survey data is constrained to a monthly, quarterly, or annual basis, the surveys became the primary source of data for large-scale social science research, and over the past 80 years we have developed principles for managing, analyzing, interpreting, and applying these data that have come to feel intuitive. Digital data are now providing a highly detailed window into our lives on a daily and even minute-by-minute basis. The implications for policy are both exciting and challenging. We are offered opportunities to inform social policy development through new insights into (1) how individuals and organizations make choices, using, for example, a combination of credit card transactions, GPS tracking, and demographic data, and (2) how the opinions, preferences, and interests of individuals interact in collective decisions, using social media data. We are challenged to manage the transparency, quality, and representativeness of the data. As the all data revolution provides new sources of data to inform policy, it simultaneously requires policy changes. These policy changes will need strategic statistical thinking and innovation to develop pragmatic solutions for using these data for social policy. We lack the 80 years of principles to guide the reasonable (i.e., scientific), objective, and sensitive management and application of these data to social policy development. These closing remarks to this workshop on Big Data for Social Policy will summarize the opportunities and pitfalls of these data and challenge participants to contribute to the development of those principles.

 

Aleksandra (Sesa) Slavkovic, Penn State
Differentially Private Exponential Random Graph Models and Synthetic Networks

We propose methods to release and analyze synthetic graphs in order to protect the privacy of individual relationships captured by a social network. The proposed techniques aim at fitting and estimating a wide class of exponential random graph models (ERGMs) in a differentially private manner, and thus offer rigorous privacy guarantees. More specifically, we use the randomized response mechanism to release networks under ε-edge differential privacy. To maintain utility for statistical inference, treating the original graph as missing, we propose a way to use likelihood-based inference and Markov chain Monte Carlo (MCMC) techniques to fit ERGMs to the produced synthetic networks. We demonstrate the usefulness of the proposed techniques on a real data example. (Joint work with V. Karwa and P. Krivitsky.)
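
The release step can be sketched as follows (synthetic graph; the sketch covers only the randomized response release, not the likelihood-based ERGM fitting): each potential edge of the adjacency matrix is flipped independently with probability 1 / (1 + exp(ε)), which yields ε-edge differential privacy.

    # Illustrative only: randomized response release of a synthetic graph.
    import numpy as np

    def randomized_response_graph(adj, epsilon, rng):
        """Release a noisy undirected adjacency matrix under edge differential privacy."""
        n = adj.shape[0]
        flip_prob = 1.0 / (1.0 + np.exp(epsilon))
        flips = rng.random((n, n)) < flip_prob
        flips = np.triu(flips, k=1)              # decide each undirected edge once
        flips = flips | flips.T
        noisy = np.where(flips, 1 - adj, adj)
        np.fill_diagonal(noisy, 0)
        return noisy

    rng = np.random.default_rng(5)
    n = 200
    adj = (rng.random((n, n)) < 0.05).astype(int)
    adj = np.triu(adj, k=1)
    adj = adj + adj.T                            # symmetric, no self-loops
    noisy_adj = randomized_response_graph(adj, epsilon=1.0, rng=rng)
    print(adj.sum() // 2, "edges before release;", noisy_adj.sum() // 2, "after")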

 


Mary Thompson, University of Waterloo
Big data, official statistics and survey science

Following a brief discussion of each of the terms in the title, the talk will focus on the aim of timely and accurate descriptions of survey or census populations. Three issues will be discussed in some detail: (i) the combination of data from surveys and other sources, for example to describe the emerging market for Vaporized Nicotine Products; (ii) the promise and perils of population registries as frames; and (iii) the challenges of using network structures in population sampling.


Stanley Wasserman, Indiana University
Using Correspondence Analysis to Attack Big Network Data

In networks, nodes represent individuals, and relational ties among individuals can indicate joint participation in one or more teams. This representation captures overlapping team membership but unfortunately fails to preserve the team structures. We propose a promising alternative approach: using affiliation networks to represent teams and individuals, with "links" representing team membership. There are multiple models for representing affiliation networks, and this approach is easily adapted to "big network data".

The past decade has seen considerable progress in the development of p* (also known as Exponential Random Graph) models to model these relationships. However, given the plethora of parameters afforded by these models, it is increasingly evident that specification of these models is more of an art than a science. Ideally, social science theory should guide the identification of parameters that map onto specific hypotheses. However, in the preponderance of cases, extant theories are not sufficiently nuanced to narrow down the selection of specific parameters. Hence there is a pressing need for exploratory techniques to help guide the specification of theoretically sound hypotheses.

In order to address this issue, we propose the use of correspondence analysis, which enables us to incorporate multiple relations and attributes at both the individual and team levels. This preempts concerns about independence assumptions. The results from correspondence analysis can be presented visually. With these advantages, correspondence analysis can be used as an important exploratory tool to examine the features of the dataset and the relations among variables of interest. We present the theory for this approach and illustrate it with an example focusing on combat teams from a fantasy-based online game, EverQuest II - a very large network. We explore the impact of various individual-level and team-level attributes on team performance while considering team affiliations as well as social relations among individuals. The results of our analysis, in addition to offering important multilevel insights, also serve as a stepping stone for more focused analysis using techniques such as p*/ERGMs.
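
As a small illustration of such an exploratory step, the sketch below runs a plain correspondence analysis on a tiny invented individual-by-team affiliation matrix (not the EverQuest II data), via the singular value decomposition of the standardized residuals.

    # Illustrative only: correspondence analysis of a toy affiliation matrix.
    import numpy as np

    # Rows = individuals, columns = teams; 1 indicates team membership.
    A = np.array([[1, 1, 0, 0],
                  [1, 0, 1, 0],
                  [0, 1, 1, 0],
                  [0, 0, 1, 1],
                  [0, 0, 0, 1]], dtype=float)

    P = A / A.sum()                          # correspondence matrix
    r, c = P.sum(axis=1), P.sum(axis=0)      # row and column masses
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))   # standardized residuals
    U, sing, Vt = np.linalg.svd(S, full_matrices=False)

    row_coords = (U * sing) / np.sqrt(r)[:, None]        # individuals, principal coords
    col_coords = (Vt.T * sing) / np.sqrt(c)[:, None]     # teams, principal coords
    print("share of inertia on the first axis:",
          round(sing[0] ** 2 / (sing ** 2).sum(), 3))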

 

Michael Wolfson, University of Ottawa
Using Agent-Based Models for Social Policy - from LifePaths to THIM

Agent-Based Models (ABMs) form an important tool for social and related policy analysis. In this presentation, we focus on two models at opposite ends of a spectrum: empirical, applied, and complicated but not complex at one end, and more theoretical and complex at the other. The first, data-intensive example is the Statistics Canada LifePaths longitudinal microsimulation model, which has recently been used to inform the current policy debate about the adequacy of Canada's retirement income system. The second is a formally much simpler complex-systems model, the Theoretical Health Inequality Model (THIM). Both models are realized as dynamic computer microsimulation models. While LifePaths is highly detailed and complicated, weaving together the results of many diverse statistical analyses, it is not complex in the sense of possibly generating "emergent" phenomena. It provides, for example, detailed estimates of the distribution of disposable income Canada's baby boomers can expect 20 years hence. THIM, on the other hand, can be described by only a handful of equations. But because these equations relate variables at the individual, family, neighbourhood, and city-wide levels, the behaviour of the system can be far less intuitive. Still, THIM can be used to explore aspects of social policy where much of the needed data simply do not exist. An example is the possible explanation of the observed contingent correlation between income inequality and mortality rates among cities - a correlation observed in the US, but not in Canada.
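
To give a flavour of this multi-level microsimulation structure, the toy sketch below (invented equations, not THIM's) evolves individual incomes, computes neighbourhood averages each period, and lets individual mortality risk depend on both.

    # Illustrative only: a toy multi-level dynamic microsimulation.
    import numpy as np

    rng = np.random.default_rng(11)
    n_people, n_hoods, years = 10_000, 20, 30
    hood = rng.integers(0, n_hoods, size=n_people)       # neighbourhood of each person
    income = rng.lognormal(mean=10.5, sigma=0.5, size=n_people)
    alive = np.ones(n_people, dtype=bool)

    for _ in range(years):
        income[alive] *= rng.lognormal(mean=0.0, sigma=0.1, size=alive.sum())
        hood_mean = np.array([
            income[alive & (hood == h)].mean() if (alive & (hood == h)).any() else 0.0
            for h in range(n_hoods)])
        # Toy hazard: lower own and neighbourhood income means higher mortality risk.
        risk = 0.01 + 0.02 / (1 + income / 30_000) + 0.01 / (1 + hood_mean[hood] / 30_000)
        alive &= rng.random(n_people) > risk

    print(f"{alive.mean():.1%} of the simulated cohort survives {years} years")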
