# SCIENTIFIC PROGRAMS AND ACTIVITIES

October 19, 2018
THE FIELDS INSTITUTE FOR RESEARCH IN MATHEMATICAL SCIENCES
May 22-23, 2014 CANSSI--SAMSI Workshop: Geometric Topological and Graphical Model Methods in Statistics at the Fields Institute, 222 College St.
SPEAKER ABSTRACTS

Ejaz Ahmed, Brock (Mathematics and Statistics)

Emanuel Ben-David (Columbia)
On the minimum number of observations that guarantees the existence of MLE for Gaussian graphical models

In this talk I will discuss the conditions under which the existence of the MLE of the covariance parameter in Gaussian graphical models is guaranteed. These conditions are given in terms of an upper bound and a lower bound on the number of observations that ensures the existence of the MLE. These bounds are in general tighter than the best known bounds found in Buhl [1993] and Lauritzen [1996]. In fact, for many graphs, such as grids, the new bounds are tight enough to exactly determine the minimum number of needed observations.

Joseph Beyene, McMaster (Biostatistics)
Constrained Dirichlet-Multinomial Mixture Models for Human Microbiome Analysis

Human microbiomes are microscopic populations of organisms which live in or on a single human being and may impact the health of the host. Studying these populations has become more prevalent because of Next-Generation Sequencing, which can approximately characterize a microbiome. The discrete, skewed nature of the data makes understanding microbiomes difficult, creating a need for new statistical methods. We propose an evolutionary algorithm that takes a proposed Dirichlet-Multinomial model and determines which parameters can be constrained to be equal, thereby creating a more robust model. This choice of constraint is particularly interesting because of its implications in characterizing relative abundances throughout a microbiome. We will illustrate the algorithm on simulated and real microbiome data and discuss ongoing challenges.
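As a minimal sketch of the model class named above, the following Python evaluates the Dirichlet-Multinomial log-probability of a vector of taxon counts and shows one way an equality constraint on parameters could be represented. The function names and the group-label encoding are assumptions made for illustration; the evolutionary search described in the abstract is not reproduced here.

```python
from math import lgamma


def dirichlet_multinomial_logpmf(counts, alpha):
    """Log-probability of a count vector under a Dirichlet-Multinomial model.

    counts: non-negative integer counts, one per taxon.
    alpha:  positive Dirichlet parameters, one per taxon.
    """
    n = sum(counts)
    a = sum(alpha)
    # Normalising term: log( n! * Gamma(a) / Gamma(n + a) )
    log_p = lgamma(n + 1) + lgamma(a) - lgamma(n + a)
    # Per-taxon terms: log( Gamma(x + alpha) / (x! * Gamma(alpha)) )
    for x, al in zip(counts, alpha):
        log_p += lgamma(x + al) - lgamma(al) - lgamma(x + 1)
    return log_p


def constrained_alpha(group_of, group_values):
    """Expand per-group values into a full parameter vector.

    Taxa that share a group label are constrained to have equal parameters
    (hypothetical encoding of the equality constraints).
    """
    return [group_values[g] for g in group_of]
```

For example, `constrained_alpha(["a", "a", "b"], {"a": 0.5, "b": 2.0})` ties the first two taxa to a common parameter, so a candidate constraint pattern can be scored by summing `dirichlet_multinomial_logpmf` over samples.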

Peter Bubenik, Cleveland State (Mathematics)
Statistical topological data analysis using persistence landscapes (slides)

In this talk I will define a topological summary for data that I call the persistence landscape. Since this summary lies in a vector space, it is easy to calculate averages of such summaries, and distances between them. Viewed as a random variable with values in a Banach space, this summary obeys a Strong Law of Large Numbers and a Central Limit Theorem. I will show how a number of standard statistical tests can be used for statistical inference using this summary.
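The abstract does not spell out the construction, so the sketch below assumes the usual definition of the persistence landscape: each (birth, death) pair contributes a tent function max(0, min(t - b, d - t)), and the k-th landscape function is the k-th largest tent value at each t. The function names and the grid-based evaluation are illustrative choices. Because the result is an ordinary vector, averaging across samples is a pointwise mean.

```python
import numpy as np


def persistence_landscape(diagram, k, grid):
    """Evaluate the k-th landscape function (k = 1, 2, ...) on a grid of t values.

    diagram: sequence of (birth, death) pairs.
    """
    diagram = np.asarray(diagram, dtype=float)
    grid = np.asarray(grid, dtype=float)
    births = diagram[:, 0][:, None]
    deaths = diagram[:, 1][:, None]
    # One tent function per pair: max(0, min(t - birth, death - t)).
    tents = np.maximum(0.0, np.minimum(grid[None, :] - births, deaths - grid[None, :]))
    if k > tents.shape[0]:
        # Fewer than k pairs: the k-th landscape is identically zero.
        return np.zeros_like(grid)
    # Sort each column in descending order and take the k-th largest value.
    return -np.sort(-tents, axis=0)[k - 1]


def average_landscape(diagrams, k, grid):
    """Pointwise mean of the k-th landscapes of several diagrams."""
    return np.mean([persistence_landscape(d, k, grid) for d in diagrams], axis=0)
```

Distances between two summaries can then be approximated with any vector norm on the gridded values, for instance `np.linalg.norm(l1 - l2)`.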

David Dunson, Duke University (Statistics)
Robust and scalable Bayes via the median posterior

Bayesian methods have great promise in big data sets, but this promise has not been fully realized due to the lack of scalable computational methods. Usual MCMC and SMC algorithms bog down as the size of the data and number of parameters increase. For massive data sets, it has become routine to rely on penalized optimization approaches implemented on distributed computing systems. The most popular scalable approximation algorithms rely on variational Bayes, which lacks theoretical guarantees and badly under-estimates posterior covariance. Another problem with Bayesian inference is the lack of robustness; data contamination and corruption is particularly common in large data applications and cannot easily be dealt with using traditional methods. We propose to solve both the robustness and the scalability problem using a new alternative to exact Bayesian inference we refer to as the median posterior. Data are divided into subsets and stored on different computers prior to analysis. For each subset, we obtain a stochastic approximation to the full data posterior, and run MCMC to generate samples from this approximation. The median posterior is defined as the geometric median of the subset-specific approximations, and can be rapidly approximated. We show several strong theoretical results for the median posterior, including general theorems on concentration rates and robustness. The methods are illustrated through a simulation and application to nonparametric modeling of contingency table data from social surveys.

Joint work with Stas Minsker, Lizhen Lin and Sanvesh Srivastava
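To illustrate the geometric median step, here is a minimal sketch using Weiszfeld's algorithm on finite-dimensional vectors. This is a simplification: the abstract takes the geometric median of subset-specific posterior distributions and does not specify the metric, so the Euclidean vectors below (for example, one summary vector per subset) are only a stand-in, and the function name and tolerances are assumptions.

```python
import numpy as np


def geometric_median(points, tol=1e-9, max_iter=1000):
    """Approximate the geometric median of the rows of `points` (Weiszfeld iterations)."""
    points = np.asarray(points, dtype=float)
    estimate = points.mean(axis=0)  # start from the ordinary mean
    for _ in range(max_iter):
        dist = np.linalg.norm(points - estimate, axis=1)
        # Guard against division by zero if the estimate lands on a data point.
        weights = 1.0 / np.maximum(dist, 1e-12)
        new_estimate = (weights[:, None] * points).sum(axis=0) / weights.sum()
        if np.linalg.norm(new_estimate - estimate) < tol:
            return new_estimate
        estimate = new_estimate
    return estimate
```

With four subset summaries near the unit square and one corrupted summary at (100, 100), the mean is pulled to about (20.4, 20.4) while the geometric median stays inside the square, which is the robustness property the abstract appeals to.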

Subhashis Ghosal, North Carolina State University (Statistics)
Bayesian estimation of sparse precision matrices (slides)

We consider the problem of estimating the sparsity structure of the precision matrix for a multivariate Gaussian distribution, especially for dimension p exceeding the sample size n. Gaussian graphical models serve as an important tool in representing the sparsity structure through the presence or absence of the edges in the underlying graph. Some novel methods for Bayesian analysis of graphical models have been explored in recent times using a Bayesian analog of the graphical LASSO algorithm with suitable priors. In this talk, we use priors which put point mass on the zero elements of the precision matrix along with absolutely continuous priors on the non-zero elements, and hence the resulting posterior distribution can be used for graphical structure learning. The posterior distribution of the different graphical models is intractable, and we propose a fast computational method for approximating the posterior probabilities of the graphical structures by a Laplace approximation that uses the graphical LASSO solution as the posterior mode. We also theoretically assess the quality of the Laplace approximation. We study the asymptotic behavior of the posterior distribution of sparse precision matrices and show that it converges at the oracle rate with respect to the Frobenius norm on matrices. The proposed Bayesian method is studied by extensive simulation experiments and is found to be extremely fast and to give very sensible results. The method is applied to a real dataset on stocks and is able to find sensible relations between different stock types.
This talk is based on joint work with Sayantan Banerjee.

Elizabeth Gross, NCSU
Goodness-of-fit testing for log-linear network models

Social networks and other large sparse data sets pose significant challenges for statistical inference, as many standard statistical methods for testing model/data fit are not applicable in such settings. Algebraic statistics offers an approach to goodness-of-fit testing that relies on the theory of Markov bases and is intimately connected with the geometry of the model as described by its fibers.
Most current practices require the computation of the entire basis, which is infeasible in many practical settings. In this talk, we present a dynamic approach to explore the fiber of a model, which bypasses this issue, and is based on the combinatorics of hypergraphs arising from the toric algebra structure of log-linear models.
We demonstrate the approach on the Holland-Leinhardt p1 model for random directed graphs that allows for reciprocated edges.

Giseon Heo, University of Alberta (Dentistry, Statistics)
Beyond Mode Hunting (slides)

The scale space has been studied in the context of blurring in computer vision, smooth curve estimation in statistics, and persistent feature detection in computational topology. We review the background of three approaches and discuss how persistent homology can be useful in high dimensions.

Stephan Huckemann, Goettingen (Stochastiks)
Circular Scale Spaces and Mode Persistence for Measuring Early Stem Cell Differentiation (slides)

We generalize the SiZer of Chaudhuri and Marron (1999, 2000) for the detection of shape parameters of densities on the real line to the case of circular data. It turns out that only the wrapped Gaussian kernel gives a symmetric, strongly Lipschitz semi-group satisfying "circular" causality, i.e. not introducing possibly artificial modes with increasing levels of smoothing. Based on this we provide an asymptotic theory for inference on the persistence of shape features. The resulting circular mode persistence diagram is applied to the analysis of early mechanically induced differentiation in adult human stem cells from their actin-myosin filament structure. As a consequence, the circular SiZer based on the wrapped Gaussian kernel (WiZer) allows discrimination, at a controlled error level, between three different micro-environments impacting early stem cell differentiation. Joint work with Kwang-Rae Kim, Axel Munk, Florian Rehfeld, Max Sommerfeld, Joachim Weickert and Carina Wollnik.
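For readers unfamiliar with the wrapped Gaussian kernel, the sketch below evaluates it by summing a Gaussian density over shifts by multiples of 2π, and uses it for a kernel density estimate of angular data. The truncation level `n_wraps`, the function names, and the bandwidth parametrisation by `sigma` are assumptions made for illustration; the SiZer/WiZer inference itself is not reproduced.

```python
import numpy as np


def wrapped_gaussian_kernel(theta, sigma, n_wraps=10):
    """Wrapped Gaussian density at angle(s) theta (radians), centred at 0.

    The infinite sum over wraps is truncated to 2 * n_wraps + 1 terms,
    which is accurate when sigma is small relative to n_wraps * 2 * pi.
    """
    theta = np.asarray(theta, dtype=float)
    k = np.arange(-n_wraps, n_wraps + 1)
    shifted = theta[..., None] + 2.0 * np.pi * k
    gauss = np.exp(-shifted**2 / (2.0 * sigma**2))
    return gauss.sum(axis=-1) / (sigma * np.sqrt(2.0 * np.pi))


def circular_kde(grid, angles, sigma):
    """Kernel density estimate on the circle at the points in `grid`."""
    grid = np.asarray(grid, dtype=float)
    angles = np.asarray(angles, dtype=float)
    diffs = grid[:, None] - angles[None, :]
    return wrapped_gaussian_kernel(diffs, sigma).mean(axis=1)
```

Evaluating `circular_kde` over a range of `sigma` values gives the family of smoothed densities whose modes a scale-space analysis would track.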

Georges Michailidis, Michigan (Statistics)
Estimation in High-Dimensional Vector Autoregressive Models (slides)

Vector Autoregression (VAR) is a widely used method for learning complex interrelationships among the components of multiple time series. Over the years it has gained popularity in the fields of control theory, statistics, economics, finance, genetics and neuroscience. We consider the problem of estimating stable VAR models in a high-dimensional setting, where both the number of time series and the VAR order are allowed to grow with sample size. In addition to the "curse of dimensionality" introduced by a quadratically growing dimension of the parameter space, VAR estimation poses considerable challenges due to the temporal and cross-sectional dependence in the data. Under a sparsity assumption on the model transition matrices, we establish estimation and prediction consistency of $\ell_1$-penalized least squares and likelihood based methods. Exploiting spectral properties of stationary VAR processes, we develop novel theoretical techniques that provide deeper insight into the effect of dependence on the convergence rates of the estimates. We study the impact of error correlations on the estimation problem and develop fast, parallelizable algorithms for penalized likelihood based VAR estimates.
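For orientation, a common way to write the model and the penalized least squares estimator is sketched below. The notation ($p$ series, order $d$, transition matrices $A_k$, sample length $T$, tuning parameter $\lambda$) is assumed for illustration and is not taken from the talk; the likelihood-based variant additionally weights the residuals by an estimate of the error precision matrix.

```latex
% VAR(d) model for a p-dimensional series X_t, with sparse transition matrices A_k
X_t = \sum_{k=1}^{d} A_k X_{t-k} + \varepsilon_t,
\qquad \varepsilon_t \sim N(0, \Sigma_\varepsilon), \quad t = d+1, \ldots, T.

% \ell_1-penalized least squares estimator, with N = T - d effective observations
(\hat{A}_1, \ldots, \hat{A}_d)
  = \arg\min_{A_1, \ldots, A_d}
    \frac{1}{N} \sum_{t=d+1}^{T}
      \Bigl\| X_t - \sum_{k=1}^{d} A_k X_{t-k} \Bigr\|_2^2
    + \lambda \sum_{k=1}^{d} \| A_k \|_1 ,
\qquad \| A \|_1 = \sum_{i,j} | A_{ij} | .
```

The parameter space has $d p^2$ entries, which is the quadratic growth in the number of series that the abstract refers to.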

Washington Mio, FSU (Mathematics)
On Genetic Determinants of Facial Shape Variation (slides)

Mapping genetic determinants of phenotypic variation is a major challenge in biology and medicine. The problem arises in contexts such as investigation of development, inheritance and evolution of phenotypic traits, and studies of the role of genetics in diseases. Shape is a ubiquitous trait whose biological relevance spans multiple scales – from organelles to cells through organs and tissues to entire organisms. Accurate and biologically interpretable shape quantification enables investigation of fundamental questions about the genetic underpinnings of normal and pathological morphological variation. In this talk, I will discuss an ongoing collaborative genome wide association study of human facial shape variation with an emphasis on the morphometric aspects of the study, which uses geometric and topological methods to model facial shape.

Sayan Mukherjee, Duke (Statistical Sciences)

Victor Patrangenaru, FSU (Statistics)
Data Analysis on Manifolds (slides)

While seeking answers to the fundamental question of what data analysis should be all about, it is useful to go back to the basic notion of variability, which separates Statistics from all other sciences; one soon realizes that there are two inescapable theoretical ideas in data analysis. Firstly, one may quantify variability within or between samples only in terms of a certain distance on the sample space, telling how far observed sample points are from each other. Secondly, the distance, as a function of the two data points separated by it, has to have some continuity property, to make possible any consistency statement justifying why the larger the sample, the closer the sample variability measure is to its population counterpart. In addition, since an asymptotic theory based on random observations is necessary to estimate the population variance from a large sample, such a theory can be formulated only under the additional assumption of differentiability of some power of the squared distance function.
In summary, data analysis imposes some sort of differentiable structure on the sample space, which consequently has to be either a manifold or have some manifold-related structure, no matter what the nature of the objects is. However, the overwhelming majority of Statistics users specialize in understanding the nature of the objects themselves, having little or no exposure to the basics of the geometry and topology of manifolds needed to develop appropriate methodology for data analysis. At the same time, theoretical mathematicians who have a reasonable knowledge of manifolds might be unfamiliar with nonparametric multivariate statistics, while computational grad students and computational data analysts involved with nonlinear data sometimes ask for a sound statistical methodology, or for some sort of multidimensional differential geometry or topology toolkit that may help them design fast algorithms for data analysis. To answer such demands, we would like to structure our presentation as follows. Firstly, we introduce the basics of the three "pillars of data analysis": (i) nontrivial examples of data, (ii) nonparametric multivariate statistics and (iii) geometry and topology of manifolds. Secondly, we develop a general methodology based on (i) and (ii), and "translate" this methodology to the context of certain manifolds arising in statistics. Finally, we apply this methodology to concrete examples of data analysis.

Thanh Mai Pham Ngoc, Paris-Orsay (Mathematics)
Goodness-of-fit test for Noisy Directional Data (abstract image)

We consider spherical data $X_i$ noised by a random rotation $\varepsilon_i \in \mathrm{SO(3)}$ so that only the sample $Z_i = \varepsilon_i X_i$, $i = 1,\ldots,N$, is observed. We define a nonparametric test procedure to distinguish $H_0$: "the density $f$ of $X_i$ is the uniform density $f_0$ on the sphere" and $H_1$: "$\| f-f_0 \|^2_2 \geq \mathcal{C}\psi_N$ and $f$ is in a Sobolev space with smoothness $s$". For a noise density $f_\varepsilon$ with smoothness index $\nu$, we show that an adaptive procedure (i.e. $s$ is not assumed to be known) cannot have a faster rate of separation than $\psi^{ad}_N(s) = (N/\sqrt{\log\log(N)})^{-2s/(2s+2\nu+1)}$ and we provide a procedure which reaches this rate. We also deal with the case of super smooth noise. We illustrate the theory by implementing our test procedure for various kinds of noise on $\mathrm{SO(3)}$ and by comparing it to other procedures. Finally, we apply our test to real data in astrophysics and paleomagnetism.

Bala Rajaratnam, Stanford (Statistics)
Methods for Robust High Dimensional Graphical Model Selection (slides)

Learning high dimensional correlation and partial correlation graphical network models is a topic of contemporary interest. A popular approach is to use L1 regularization methods to induce sparsity in the inverse covariance estimator, leading to sparse partial covariance/correlation graphs. Such approaches can be grouped into two classes: (1) regularized likelihood methods and (2) regularized regression-based, or pseudo-likelihood, methods. Regression based methods have the distinct advantage that they do not explicitly assume Gaussianity. One major gap in the area is that none of the popular methods proposed for solving regression based objective functions have provable convergence guarantees, and hence it is not clear if these methods lead to estimators which are always computable. It is also not clear if resulting estimators actually yield correct partial correlation/partial covariance graphs. To this end, we propose a new regression based graphical model selection method that is both tractable and has provable convergence guarantees. In addition we also demonstrate that our approach yields estimators that have good large sample and finite sample properties. The methodology is successfully illustrated on both real and simulated data with a view to applications to big data problems. We also present a novel unifying framework that places various pseudo-likelihood graphical model selection methods as special cases of a more general formulation, leading to important insights. (Joint work with S. Oh and K. Khare).

Elena Villa, Universita' degli Studi di Milano (Mathematics)
Different Kinds of Estimators of the mean density of random closed sets: Theoretical Results and Numerical experiments (slides)

Many real phenomena may be modelled as random closed sets in $\mathbb{R}^d$, of different Hausdorff dimensions. Of particular interest are cases in which their Hausdorff dimension, say $n$, is strictly less than $d$, such as fiber processes, boundaries of germ-grain models, and $n$-facets of random tessellations. The mean density, say $L_{Q_n}$, of a random closed set $Q_n$ in $\mathbb{R}^d$ with Hausdorff dimension $n$ is defined to be the density of the measure $E[H^n(Q_n \cap \cdot\,)]$ on $\mathbb{R}^d$, whenever it is absolutely continuous with respect to $H^d$. A crucial problem is the pointwise estimation of $L_{Q_n}$. In this talk we present three different kinds of estimators of $L_{Q_n}(x)$; the first one will follow as a natural consequence of the Besicovitch derivation theorem; the second one will follow as a generalization to the $n$-dimensional case of the classical kernel density estimator of random vectors; the last one will follow by a local approximation of $L_{Q_n}$ based on a stochastic version of the $n$-dimensional Minkowski content of $Q_n$.
We will study the unbiasedness and consistency properties, and identify optimal bandwidths for all proposed estimators, under sufficient regularity conditions. Finally, we will provide a set of simulations via typical examples of lower dimensional random sets.