Networks and cybersecurity are producing varied, rich, complex and big data. Great research opportunities are opening up in the statistical, computational and mathematical sciences. The workshop held at CRM in early May showcased the latest statistical and machine learning research for social networks (e.g., Facebook) and cybersecurity. I will provide an overview of some of the most interesting problems.

Jean-Francois Plante, HEC Montréal,

Challenges, Tools and Examples for Big Data Inference

The Opening Conference and Boot Camp of the Thematic Program on Statistical Inference, Learning, and Models for Big Data was held at the Fields Institute in Toronto from January 12th to January 23rd. A total of 35 scientific talks were presented, providing an overview of the main themes of the Program. Although big data problems from numerous fields were covered, common challenges emerged and some tools were seen in many different contexts. A number of successful applications of big data inference were also presented. In this talk, I will describe the challenges and tools that stood out most frequently and will summarize some examples of applications that were presented during the Opening Conference and Boot Camp. This work is based on a manuscript in preparation by the postdoctoral fellows and long-term visitors of the Fields Institute who participated in the Big Data Program.

Lisa Lix, University of Manitoba,

How Big Data and Causal Inference Work Together in Health Policy

Population-based administrative, clinical, and survey databases have long been used to conduct policy-relevant research about population health and health service use. However, there has been an increasing emphasis in recent years on person-specific linkages of multiple, complex databases to address novel questions in such areas as drug and medical product safety, chronic disease risk prediction, and comparative effectiveness of medical treatments. Causal inference techniques are routinely applied to observational health databases because of issues of cost, ethics, and selection bias in randomized trials. The March workshop on Big Data in Health Policy explored causal inference methods and applications in the presentation, design, and analysis of health policy research. Related topics were also explored: enabling and disabling factors in the use of Big Data in health policy contexts, methods to combine or synthesize databases or results, and the interdisciplinary nature of the health policy research environment. This presentation will provide an overview of the sessions, key learnings from participants, and future directions for collaborative research and training.

Stephanie Shipp, Virginia Tech,

Policy meets Social and Decision Informatics

The exponential growth of digital data has created an all-data revolution that is allowing us to view the world at a scale and level of granularity that is unprecedented. A similar revolution occurred in the 1930s with the emergence of regularly conducted surveys and probability sampling, primarily by federal statistical agencies. Though the timeframe for these survey data is constrained to a monthly, quarterly, or annual basis, the surveys became the primary source of data for large-scale social science research, and over the past 80 years we have developed principles for managing, analyzing, interpreting, and applying these data that have come to feel intuitive. Digital data are now providing a highly detailed window into our lives on a daily and even minute-by-minute basis. The implications for policy are both exciting and challenging. We are offered opportunities to inform social policy development through new insights into (1) how individuals and organizations make choices using, for example, a combination of credit card transactions, GPS tracking, and demographic data and (2) how the opinions, preferences, and interests of individuals interact in collective decisions using social media data. We are challenged to manage transparency, quality, and representativeness of the data. As the all-data revolution provides new sources of data to inform policy, it simultaneously requires policy changes. These policy changes will need strategic statistical thinking and innovation to develop pragmatic solutions for using these data in social policy. We lack the 80 years of principles to guide the reasonable (i.e., scientific), objective, and sensitive management and application of these data to social policy development. The opportunities and challenges for developing these principles are outlined.

Stan Matwin, Dalhousie University,

Big Data meets Big Water: Mining Ocean Vessel Trajectory Data

In this presentation we will focus on ongoing work in the exploration and analysis of data on ocean vessel movements, using Automatic Identification System (AIS) data. We will discuss some of the challenges and benefits related to the large-scale exploration and analysis of AIS data. We will look at the detection of anomalous trajectories of ships in mid-ocean and in port vicinity, and at the ecologically oriented detection and analysis of data related to fishing activities. We will discuss our early results in these select applications, including data representation and data modeling techniques, particularly the clustering, classification, and attribute engineering techniques used in our work. We will wrap up with a discussion of potential future work with AIS data.

Evangelos Milios, Dalhousie University,

Exploiting Semantic Analysis of Documents for the Domain User

Many document organization tasks, such as a student writing the related work chapter of a thesis, a professor surveying the state of the art in a proposal or planning a reading course, or a conference chair organizing sessions, would be performed more efficiently through the use of document clustering. In this work, we present (a) interactive document clustering algorithms that allow the user to steer clustering to her point of view, including an ensemble algorithm based on Wikipedia concepts; (b) named entity recognition and disambiguation using the multilingual Wikipedia category structure; and (c) a simple but effective computation of semantic relatedness between words and documents based on the Google n-gram corpus, which is competitive with human performance on standard word pair data sets.

This is joint work with H. Nourashraf, D. Arnold, M. Lipczak, A. Koushkestani, A. Islam and V. Keselj.

Andrew Rau-Chaplin, Dalhousie University,

Scaling up to Big Data: Algorithmic Engineering + HPC

Big data analytics projects apply machine learning techniques to the analysis of large data sets to help uncover relationships and predict outcomes and behaviors. From a research perspective, these projects typically start by using small data sets and focus on identifying those machine learning techniques that are best suited to the problem. Once a promising approach has been identified, the next key challenges are performance and scalability – can the method be made to work on truly big data sets in a timely manner?

This talk focusses on the application of algorithmic engineering and high performance computing (HPC) techniques to big data analytics. It draws on practical experience in a range of projects, from text analytics to catastrophic risk analysis, and tries to highlight algorithmic engineering and HPC approaches that are both widely applicable and often lead to fast, scalable applications.

Rosane Minghim, Dalhousie University and University of São Paulo

Multidimensional Projections and Tree-based Techniques for Visualization and Mining

A Multidimensional Projection is a type of technique ultimately aimed at mapping data onto a visual space, usually two- or three-dimensional. Many algorithms for that task have been developed in recent years, aimed at user control as well as precision and scalability improvement. Tree-based techniques are also widely used in the visualization of abstract data with or without hierarchical content. The types of data that can be mapped using these strategies vary widely, and are usually represented either by a set of attributes or by a similarity matrix. In this talk, we approach these two types of algorithms, illustrate their applications for interpretation of complex data, and discuss their capabilities and drawbacks. Additionally, we show how these types of visual approaches to data analysis can be used to support tasks in data mining, such as clustering and classification. We illustrate most of the concepts by applying them to the visual analysis of text and image collections.

Rob Beiko, Dalhousie University

Microbial genomics for rapid investigation of infectious disease

In Canada, several agencies carry out surveillance activities to monitor for new infectious disease outbreaks, and coordinate responses to control and eliminate them. These activities are time-critical, and delays in infectious agent identification and outbreak mapping can have serious public health consequences. Sequencing the DNA of pathogens will accelerate this response, both by providing rapid and complete information about which specific strain is responsible for a clinical case, and by providing a fine-scale view of the origin and spread of an outbreak. The Integrated Rapid Infectious Disease Analysis (IRIDA) project aims to automate genome sequencing, processing, and pattern inference during a potential outbreak. Realizing the potential of these new approaches requires advances on several fronts, and in my presentation I will focus on the bioinformatic challenges of analyzing thousands of genomes to generate the relevant outbreak data as quickly, reliably, and securely as possible.

Roger Grosse, University of Toronto

Highlights from the deep learning workshop

I will give an overview of some highlights from the Deep Learning Workshop in the Big Data Thematic Program. Deep learning has seen much success recently at automatically finding hierarchical representations of complex, high-dimensional datasets, and has revolutionized application areas from computer vision to speech recognition. Some topics from the workshop include scalable optimization methods for deep learning, interpretable representations, learning fair representations, and applications to reinforcement learning. I will finish by discussing some recent advances in evaluating restricted Boltzmann machines and other Markov random fields as generative models.

Einat Gil, University of Toronto

Learning about Big Data among Secondary School Students in a technology-supported collaborative learning environment

Alongside the thematic program at the Fields Institute, a short unit on learning about Big Data was designed and implemented in a Toronto secondary school. This three-week interdisciplinary informal statistics unit was developed to allow students in a 12th grade Mathematics for Data Management course to explore both small and Big Data using inquiry and collaborative approaches. In one of the activities, the learning trajectory was guided through an Interactive Orchestrated Learning Space (IOLS; Gil & Slotta, 2015), inspired by recent smart classroom and knowledge community approaches (Slotta, 2010; Slotta, Tissenbaum & Lui, 2013). The design and pedagogical approach that allowed ideas related to the use of Big Data to be introduced in secondary school will be discussed, and initial findings about student learning from this mixed methods study will be presented.
