SCIENTIFIC PROGRAMS AND ACTIVITIES


THE FIELDS INSTITUTE FOR RESEARCH IN MATHEMATICAL SCIENCES

Thematic Program on Statistical Inference, Learning, and Models for Big Data, January to June, 2015

March 23-27, 2015
Workshop on Big Data in Health Policy
Organizing Committee

Lisa Lix (Chair)
Constantine Gatsonis
Sharon-Lise Normand

Therese Stukel

Abstracts

Arlene Ash, University of Massachusetts Medical School
Seeking needles of health wisdom in haystacks of big data

My purpose will be to imagine a world in which we have overcome the many political and technical obstacles that my co-panelists will be discussing, and to try to envision the kinds of data views and interactions that could enable individuals, clinicians, health systems, communities and governments to participate in a "learning laboratory" where we simultaneously draw on, and contribute to, an evolving body of evidence about how to achieve and maintain human health. I will consider points of comparison and contrast with the extraordinary success of Google, Amazon and others, in mining huge, unstructured streams of data to extract valuable business intelligence.

Peter Austin, Institute for Clinical Evaluative Sciences
Propensity score methods for estimating treatment effects using observational data

Participants will be introduced to the concept of the propensity score, which allows for estimation of the effect of treatment when using observational data. Four different methods of using the propensity score will be discussed: matching, weighting, stratification, and covariate adjustment. The strengths and limitations of the four approaches will be highlighted and their use will be illustrated using a case study. We will contrast the use of propensity score-based methods with conventional regression-based approaches.
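As a rough illustration of two of the four approaches named above (weighting and stratification), the sketch below uses a small hypothetical cohort with one binary confounder. The data, the function names, and the choice to estimate the propensity score as a simple treated fraction per confounder level are all assumptions made for this example; they are not taken from the workshop material, and a real analysis would typically fit a regression model for the propensity score.

```python
from collections import defaultdict


def make_rows():
    """Hypothetical cohort of (x, t, y): confounder, treatment, outcome.

    Outcome is y = 1*t + 2*x, so the true treatment effect is 1.
    Treatment is more common when x = 1, which confounds a naive comparison.
    """
    rows = []
    for x, n_treated, n_control in [(0, 2, 6), (1, 6, 2)]:
        rows += [(x, 1, 1.0 + 2.0 * x)] * n_treated
        rows += [(x, 0, 2.0 * x)] * n_control
    return rows


def propensity_by_level(rows):
    """Estimate P(T=1 | X=x) as the treated fraction at each level of x."""
    treated, total = defaultdict(int), defaultdict(int)
    for x, t, _ in rows:
        treated[x] += t
        total[x] += 1
    return {x: treated[x] / total[x] for x in total}


def naive_difference(rows):
    """Unadjusted difference in mean outcome, treated minus control."""
    treated = [y for _, t, y in rows if t == 1]
    control = [y for _, t, y in rows if t == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)


def ipw_effect(rows):
    """Inverse-probability-weighted estimate of the average treatment effect."""
    ps = propensity_by_level(rows)
    n = len(rows)
    treated_mean = sum(t * y / ps[x] for x, t, y in rows) / n
    control_mean = sum((1 - t) * y / (1 - ps[x]) for x, t, y in rows) / n
    return treated_mean - control_mean


def stratified_effect(rows):
    """Average the within-stratum differences, weighting by stratum size."""
    ps = propensity_by_level(rows)
    effect = 0.0
    for level in set(ps.values()):
        stratum = [r for r in rows if ps[r[0]] == level]
        effect += naive_difference(stratum) * len(stratum) / len(rows)
    return effect


if __name__ == "__main__":
    rows = make_rows()
    print("naive:     ", naive_difference(rows))
    print("weighting: ", ipw_effect(rows))
    print("stratified:", stratified_effect(rows))
```

In this constructed example the naive comparison overstates the effect because treated patients are concentrated at the higher-risk confounder level, while both propensity-score estimates recover the effect built into the data.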


Dan Chateau, University of Manitoba
Implementing a research program using data from multiple jurisdictions: The Canadian Network for Observational Drug Effect Studies (CNODES)


Prescription drugs constitute the most common form of treatment used in clinical practice. Despite the pre-approval randomized controlled trials, many questions remain unanswered about their effects, whether unintended, harmful or beneficial. In the absence of randomization, the epidemiologic approach lends itself to the study of the real-life effects of these drugs in the natural setting of clinical practice, often quite different from the experimental setting in which they were developed. Individual initiatives in single jurisdictions cannot always address some new challenges that require larger databases for the study of rare and serious adverse events, for the study of drugs used for infrequent diseases, and to study medications early after they enter the market. To address these challenges, the Canadian Network for Observational Drug Effect Studies (CNODES) was created, with funding from the Drug Safety and Effectiveness Network (DSEN). A pan-Canadian collaboration assembling over 60 scientists from across the country, we use existing healthcare databases on over 30 million people along with powerful analytical methods, to rapidly evaluate the risks and benefits of medications. The CNODES network’s structure has given rise to several methodological issues and challenges in estimating effects of medications from such a distributed network. The presentation will provide an overview of the CNODES network and a typical study, present results of several CNODES studies, describe methodological problems relevant to the CNODES network, and ongoing methodological research and potential solutions.



Constantine Gatsonis, Brown University
Diagnostic test assessment using large databases

Modalities for diagnosis and prediction are used widely and account for a significant fraction of health care expenditures. The study of the impact of these modalities on subsequent process and outcomes is increasingly turning to the use of registries and large administrative and clinical databases. In this presentation we will discuss the experience from a large national registry, the National Oncology PET Registry, and current research combining information from registries and large administrative databases. In keeping with the "big data" theme of the workshop, we will also discuss current research on developing large "in silico" trials to assess the performance of imaging modalities. These trials utilize simulated imaging scans representing populations of interest, and modalities for image display and interpretation.


David Henry, Dalla Lana School of Public Health, University of Toronto, and Institute for Clinical Evaluative Sciences
Being realistic about how Big Data can support health and health policy research

Much rhetoric has been directed at the ‘transformative’ benefits of Big Data on health care delivery evaluation and planning. Claims have ranged from complete genome sequencing-guided ultra-precise medical treatments to real-time evaluation and management of health system performance. In between these are forecasts that we will see the demise of the randomised clinical trial – replaced by bias-free analyses of treatment effects in large population databases. Unquestionably, access to high-throughput ‘multi-omics’ analyses, linked population health data sets, and a range of novel data sources is providing new and important insights, but the effects are incremental, not transformative. New medical interventions may be guided by genomic analyses but they must also be subjected to the tenets of evidence-based medicine. Propensity matching, instrumental variable analyses and a range of other analytical approaches cannot be relied on to provide completely unbiased estimates of efficacy. On the other hand, access to multiply linked biological and population data sets augmented by data from personal monitoring devices opens up new opportunities to define and link the phenome and genome for research on a scale not previously contemplated. This creates challenges for methodologists who have tended to train and work in silos and cannot easily span the full range of analytical techniques that are used. What are the roles of the data scientist of the future and how will we build the necessary capacity in training programs at research universities? The speaker does not have answers to these questions but would like to provoke a discussion of the underlying issues.

 

Patrick J. Heagerty, University of Washington
Pragmatic Trials and the Learning Health Care System

Health care delivery, quality improvement, and clinical research should operate in a coordinated fashion so that patient care and patient outcomes can improve. Pragmatic trials are an important class of research designs that are conducted within health care delivery systems. The context of the delivery system is associated with unique issues regarding appropriate study design, data collection and quality control, and statistical analysis. Statistical leadership is needed so that high-quality, definitive studies are conducted. In this presentation we review select initiatives within the US, and give an overview of key challenges associated with the multilevel structure of the delivery system. Methods are needed to provide large-scale monitoring of data quality, to design and analyze longitudinal studies, and to provide impactful information to patients and providers. We will use our recent experience as a demonstration project within the NIH Collaboratory to illustrate key issues.

 

Xiachun Li, Indiana University
EMR² : Evidence Mining Research in Electronic Medical Records Towards Better Patient Care

The Indianapolis Network for Patient Care (INPC) was created in 1995 with the goal of providing clinical information at the point of patient care. It houses clinical data from over 80 hospitals, public health departments, local laboratories and imaging centers, surgical centers, and a few large-group practices closely tied to hospital systems, for approximately 13.4 million unique patients. This wealth of data provides great opportunities for comparative effectiveness and pharmaco-epidemiology research leading to knowledge discovery. In this talk, I will present three representative projects with the ultimate goal of better patient care:

-Record Linkage
This is the requisite step before better patient care and research. Health information exchanges (HIEs) are increasingly distributed across many sources as our nation moves into an era of electronic health record systems. But HIE data are often from independent databases without a common patient identifier, the lack of which impedes data aggregation, causes waste (e.g., tests repeated unnecessarily), affects patient care and hinders research.

-Benchmark an Electronic Medical Record database
In typical database studies to investigate drug-outcome associations, risk measures are calculated after adjusting for an extended list of possible confounders, and then the strength of drug-outcome association is obtained by comparing the risk estimates against a theoretical null distribution. It has been recognized that electronic health record (EHR) databases are created for routine clinical care and administrative purposes rather than for research, and thus they may have more hidden biases. If a list of medications is known not to cause an outcome under study, their association with the outcome (or, more precisely, the lack of one) can be used to estimate a null distribution. This estimated null distribution is then used to calibrate the strength of the risk estimate.

-Predictive modeling for clinical decision support (CDS)
This research is towards the ultimate goal of the meaningful use of EMR to achieve better clinical outcomes, improved population health outcomes and increased transparency and efficiency.

I will discuss the statistical approaches, results and the challenges encountered.
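The benchmarking idea in the second project (estimating a null distribution from medications known not to cause the outcome) can be sketched roughly as follows. Everything here is an assumption made for illustration: the negative-control estimates are invented numbers, the function names are hypothetical, and the empirical null is simplified to a normal distribution fitted to the negative-control log relative risks, without accounting for each control's own sampling error. It is not the speaker's method.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

STANDARD_NORMAL = NormalDist()

# Hypothetical log relative risks for drugs believed NOT to cause the outcome.
# Their shift away from zero reflects systematic bias in the database.
NEGATIVE_CONTROLS = [0.10, 0.25, 0.30, 0.15, 0.40, 0.20, 0.35, 0.05]


def theoretical_p(log_rr, se):
    """Two-sided p-value against the usual theoretical null N(0, se^2)."""
    z = log_rr / se
    return 2 * (1 - STANDARD_NORMAL.cdf(abs(z)))


def fit_empirical_null(negative_control_log_rrs):
    """Fit a normal null distribution to the negative-control estimates."""
    return NormalDist(mean(negative_control_log_rrs),
                      stdev(negative_control_log_rrs))


def calibrated_p(log_rr, se, null):
    """Two-sided p-value against the empirical null, widened by the
    sampling error of the estimate being tested."""
    z = (log_rr - null.mean) / sqrt(null.variance + se ** 2)
    return 2 * (1 - STANDARD_NORMAL.cdf(abs(z)))


if __name__ == "__main__":
    null = fit_empirical_null(NEGATIVE_CONTROLS)
    # Hypothetical drug of interest: log RR 0.40 with standard error 0.10.
    print("theoretical p:", theoretical_p(0.40, 0.10))
    print("calibrated p: ", calibrated_p(0.40, 0.10, null))
```

With these invented inputs, an estimate that looks highly significant against the theoretical null is unremarkable once the bias visible in the negative controls is taken into account.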

Erica Moodie, McGill University
Marginal Structural Models

In this workshop, we will consider the definition of a marginal structural model and the assumptions required for its identification. Three approaches to estimation will be presented: inverse probability weighting, g-computation, and g-estimation. The workshop will consist of lecturing, demonstrations, and in-class exercises. A laptop computer will be required for the exercises; participants may wish to complete the exercises individually or in groups of 2-3 people.
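For orientation, the sketch below applies two of the three estimation approaches (g-computation and inverse probability weighting) to a single time-point treatment with one binary confounder. The cell counts and function names are hypothetical and chosen for this example only; the time-varying treatments that motivate marginal structural models, and g-estimation, are beyond this sketch.

```python
# Hypothetical cell counts: (confounder level x, treatment a) -> (people, events)
CELLS = {
    (0, 1): (20, 4),
    (0, 0): (60, 6),
    (1, 1): (60, 30),
    (1, 0): (20, 8),
}


def stratum_size(cells, x):
    """Number of people at confounder level x, over both treatment arms."""
    return sum(n for (level, _arm), (n, _events) in cells.items() if level == x)


def crude_risk(cells, a):
    """Unadjusted risk of the outcome among people who received treatment a."""
    people = sum(n for (_x, arm), (n, _e) in cells.items() if arm == a)
    events = sum(e for (_x, arm), (_n, e) in cells.items() if arm == a)
    return events / people


def g_computation(cells, a):
    """Standardize the stratum-specific risks under treatment a to the
    confounder distribution of the whole population."""
    total = sum(n for n, _events in cells.values())
    risk = 0.0
    for x in sorted({level for level, _arm in cells}):
        n, events = cells[(x, a)]
        risk += (events / n) * (stratum_size(cells, x) / total)
    return risk


def ipw(cells, a):
    """Weight each person who received a by 1 / P(A=a | X=x) and take the
    weighted risk in the resulting pseudo-population."""
    weighted_events = 0.0
    weighted_people = 0.0
    for x in sorted({level for level, _arm in cells}):
        n, events = cells[(x, a)]
        weight = stratum_size(cells, x) / n  # inverse of P(A=a | X=x)
        weighted_events += weight * events
        weighted_people += weight * n
    return weighted_events / weighted_people


if __name__ == "__main__":
    print("crude risk difference:", crude_risk(CELLS, 1) - crude_risk(CELLS, 0))
    print("g-computation:        ", g_computation(CELLS, 1) - g_computation(CELLS, 0))
    print("IPW:                  ", ipw(CELLS, 1) - ipw(CELLS, 0))
```

Because the example is saturated (one binary confounder, no modelling), the two adjusted estimates coincide; with parametric models for the outcome or the treatment they would generally differ.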

Jonas Peters, Max Planck Institute for Intelligent Systems
Three Ideas for Causal Inference

In causal discovery, we are trying to infer the causal structure of the underlying data generating process from some observational data. We review three different ideas that aim at solving this problem: (i) additive noise models, which assume that the involved functions are of a particularly simple form, (ii) constraint-based methods which relate conditional independences in the distribution with a graphical criterion called d-separation and (iii) invariant prediction, which makes use of observing the data generating process in different "environments". We discuss these methods in the context of a gene expression data set (yeast), in which observational and interventional data are available. This talk is meant as a short tutorial. It concentrates on ideas and concepts and does not require any prior knowledge about causality.


Mark Smith, Manitoba Centre for Health Policy
Recent challenges (and opportunities) of working with “Big Data” in health research.

Adding new data to a large existing population-based data repository can present many challenges (in addition to new opportunities). I will explore both sides of this equation using four “big data” case studies: in-hospital pharmaceutical, justice, laboratory and EMR data. Challenges include commitment and trust building with new organizations, size and complexity of data, adequacy of documentation and the ongoing human resource challenges for both uptake and provision of data. Opportunities include filling important gaps in existing knowledge and new kinds of research questions, including moving beyond questions of health care to the social determinants of health.

Mark Smith, Manitoba Centre for Health Policy, Mahmoud Azimaee, Institute for Clinical Evaluative Sciences
Emerging data quality assessments of administrative data for use in research

Broadly defined, data quality means "fitness for use". Administrative data were gathered for a particular purpose - running a program - and can have qualities that are well-suited for that purpose. When the data are adapted to a new purpose, such as research, issues of data quality become especially salient. In broad terms we will outline a data quality framework applicable to the use of administrative data in research and explore its implementation at MCHP (Manitoba) and ICES (Ontario). Topics to be discussed, among others, include completeness, correctness, internal and external validity, stability, link-ability, interpretability and automation.


Elizabeth Stuart, Johns Hopkins University
Using big data to estimate population treatment effects

With increasing attention being paid to the relevance of studies for real-world practice (especially in comparative effectiveness research), there is also growing interest in external validity and assessing whether the results seen in randomized trials would hold in target populations. While randomized trials yield unbiased estimates of the effects of interventions in the sample of individuals (or physician practices or hospitals) in the trial, they do not necessarily inform about what the effects would be in some other, potentially somewhat different, population. While there has been increasing discussion of this limitation of traditional trials, relatively little statistical work has been done developing methods to assess or enhance the external validity of randomized trial results. In addition, new “big data” resources offer the opportunity to utilize data on broad target populations. This talk will discuss design and analysis methods for assessing and increasing external validity, as well as general issues that need to be considered when thinking about external validity. The primary analysis approach discussed will be a reweighting approach that equates the sample and target population on a set of observed characteristics. Underlying assumptions, performance in simulations, and limitations will be discussed. Implications for how future studies should be designed in order to enhance the ability to assess generalizability will also be discussed, including particular considerations in “big data.”
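A minimal sketch of the reweighting idea described above, assuming a single categorical characteristic (age group) and invented data: trial participants are weighted by the ratio of the target-population share to the trial share of their group, so that the weighted trial resembles the target population on that characteristic. The function names and numbers are hypothetical, and in practice the weights would usually come from a model of trial participation given many covariates.

```python
from collections import Counter

# Hypothetical trial records: (age group, treatment arm, outcome).
# The treatment effect is 2 in the young and 5 in the old.
TRIAL = (
    [("young", 1, 2.0)] * 4 + [("young", 0, 0.0)] * 4
    + [("old", 1, 5.0)] * 1 + [("old", 0, 0.0)] * 1
)

# Hypothetical target population: only the age group of each member is needed.
TARGET = ["young"] * 4 + ["old"] * 6


def population_weights(trial, target_groups):
    """Weight for each group: target-population share divided by trial share."""
    trial_counts = Counter(x for x, _t, _y in trial)
    target_counts = Counter(target_groups)
    return {
        x: (target_counts[x] / len(target_groups)) / (trial_counts[x] / len(trial))
        for x in trial_counts
    }


def weighted_effect(trial, weights):
    """Weighted difference in mean outcome, treated minus control."""
    def arm_mean(arm):
        num = sum(weights[x] * y for x, t, y in trial if t == arm)
        den = sum(weights[x] for x, t, _y in trial if t == arm)
        return num / den
    return arm_mean(1) - arm_mean(0)


if __name__ == "__main__":
    unit_weights = {x: 1.0 for x, _t, _y in TRIAL}
    print("trial-sample effect:     ", weighted_effect(TRIAL, unit_weights))
    print("target-population effect:",
          weighted_effect(TRIAL, population_weights(TRIAL, TARGET)))
```

The trial over-represents the young group, where the effect is smaller, so the unweighted trial estimate understates the effect expected in the older target population. The sketch relies on the effect differing only by the observed characteristic, which corresponds to the underlying assumptions the abstract mentions.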

 

Michael Wolfson, University of Ottawa
Pretty Big Data and Analysis to Understand the Relative Importance of Health Determinants - HealthPaths

While "health" is often associated with hospitals and other high tech interventions, there is a long tradition in social epidemiology of trying to understand the broader determinants of health - from proximal risk factors like obesity and smoking, to distal, especially socio-economic status - sometimes called "the causes of the causes". However, disentangling, let alone quantifying, the various strands in this web of causality is a major challenge. In this presentation, we describe an intensive analysis of Canada's premier longitudinal data set for these questions, Statistics Canada's National Population Health Survey (NPHS). A unique aspect of the statistical analysis is that it has been tightly coupled with the development of a dynamic longitudinal microsimulation model, HealthPaths. The statistical analysis of the NPHS uses elastic net and cross validation methods, and generates millions of coefficients. While these coefficients for the estimated dynamics of risk factors and health status cannot be understood by inspection, their implications can be drawn out and understood using microsimulation. One surprising result is the relatively low importance of obesity, and the relatively high importance of psychological factors like pain, in the overall burden of ill health among Canadians.

 

Hautieng Wu, University of Toronto
Online and adaptive analysis of dynamic periodicity and trend with heteroscedastic and dependent errors -- with clinical applications

Periodicity and trend are features describing an observed sequence, and extracting these features is an important issue in many scientific fields, for example, epidemiology. However, it is not easy for existing methods to analyze dynamic periodicity and trend simultaneously, and the adaptivity of the analysis to such dynamics and its robustness to heteroscedastic, dependent errors are in general not guaranteed. These tasks become even more challenging when there are multiple periodic components.

We propose the "adaptive harmonic model" to integrate these features, and propose a time-frequency analysis technique called the "synchrosqueezing transform" (SST) to analyze the model in the presence of a trend and heteroscedastic, dependent errors. The adaptivity and robustness properties of the SST and relevant issues are theoretically justified, and real-time analysis and implementation have been accomplished. Consequently, we have a new technique for decoupling the trend, periodicity and heteroscedastic, dependent error process in a general setup. In this talk, we will show its application to seasonality analysis of varicella and herpes zoster. The data are obtained from Taiwan's National Health Insurance Research Database. Several dynamical behaviors extracted by SST will be reported.


Cory Zigler, Harvard University
Uncertainty and treatment effect heterogeneity in comparative effectiveness research

Comparative effectiveness research depends heavily on the analysis of a rapidly expanding universe of observational data made possible by the integration of health care delivery, the availability of electronic medical records, and the development of clinical registries. Despite extraordinary opportunities for research aimed at improving value in health care, a critical barrier to progress relates to the lack of sound statistical methods that can address the multiple facets of estimating treatment effects in large, process-of-care data bases with little a priori knowledge about confounding and treatment effect heterogeneity. When attempting to make causal inferences with such large observational data, researchers are frequently confronted with decisions regarding which of a high-dimensional covariate set are necessary to properly adjust for confounding or define subgroups experiencing heterogeneous treatment effects. To address these barriers, we discuss methods for estimating treatment effects that account for uncertainty in: 1) which of a high-dimensional set of observed covariates are confounders required to estimate causal effects; 2) which (if any) subgroups of the study population experience treatment effects that are heterogeneous with respect to observed factors. We outline two methods rooted in the tenets of Bayesian model averaging. The first prioritizes relevant variables to include in a propensity score model for confounding adjustment while acknowledging uncertainty in the propensity score specification. The second characterizes heterogeneous treatment effects by estimating subgroup-specific causal effects while accounting for uncertainty in the subgroup identification. Causal effects are averaged across multiple model specifications according to empirical support for confounding adjustment and existence of heterogeneous effects. We illustrate with a comparative effectiveness investigation of treatment strategies for brain tumor patients.

