Fields Institute - Missing Data


SCIENTIFIC PROGRAMS AND ACTIVITIES
		January 30, 2026
Workshop on Missing Data Problems
August 5-6, 2004

Speaker Abstracts

Jinbo Chen (NIH)
Semiparametric efficiency and optimal estimation for missing data problems, with application to auxiliary outcomes

This expository talk emphasizes the link between semiparametric efficient estimation and optimal estimating functions in the sense of Godambe and Heyde. We consider models where the linear span of influence functions for regular, asymptotically linear estimators of a Euclidean parameter may be identified as a space of unbiased estimating functions indexed by another class of functions. Determination of the optimal estimating function within this (largest possible) class, which is facilitated by a theorem of Newey and McFadden, then identifies the efficient influence function. The approach seems particularly useful for missing data problems due to a key result by Robins and colleagues: all influence functions for the missing data problem may be constructed from influence functions for the corresponding full data problem. It yields well-known results for the conditional mean model in situations where the covariates, the outcome or both are missing at random. When only the outcome is missing, but a surrogate or auxiliary outcome is always observed, the efficient influence function takes a simple closed form. Moreover, if the covariates and auxiliary outcomes are all discrete, a Horvitz-Thompson estimator with empirically estimated weights is semiparametric efficient.

Shelley B. Bull
University of Toronto, Dept of Public Health Sciences and Samuel Lunenfeld Research Institute
Missing Data in Family-Based Genetic Association Studies

A standard design for family-based association studies involves genotyping of an affected child and their parents, and is often referred to as the case-parent design. More generally, multiple affected children and their unaffected siblings (i.e., the whole nuclear family) may also be genotyped.
Genotyping provides data concerning which allele at a specific genetic locus is transmitted from each parent to a child, and excess transmission of an allele to affected children provides evidence for genetic association. Immunity to the effects of population stratification is generally achieved by conditioning on parental genotypes in the analysis. Knowledge of transmission may be incomplete, however, when one or both parents are unavailable for genotyping, or when parents are genotyped but the genetic marker is less than fully informative, so that transmission is ambiguous. A number of approaches to address this form of missing data have been proposed, ranging from exclusion of families with incomplete data, to reconstruction of parental genotypes using genotypes from unaffected children, to maximum-likelihood-based missing data methods (such as EM). A conceptually different approach to handling missing data in this setting relies on conditioning on a sufficient statistic for the null hypothesis. Formally, the phenotypes (affected/unaffected) of all family members and the genotypes of the parents constitute a sufficient statistic for the null hypothesis of no excess transmission when parental genotypes are observed. When parental genotypes are missing, a sufficient statistic can still be found, such that under the null, the conditional distribution does not depend on the unknown mode of inheritance or the allele distribution in the population. In this talk we will compare and contrast this alternative approach with some existing methods with respect to test efficiency and robustness to population stratification.

Nilanjan Chatterjee
National Cancer Institute
Missing Data Problems in Statistical Genetics

Missing data problems are ubiquitous in statistical genetics.
In this talk, I will review the missing data problems posed in a range of topics including segregation analysis, kin-cohort analysis, haplotype-based association studies and use of genomic controls to account for population stratification. In each area, I will briefly review the scientific problem, the data structure, the required assumptions and the computational tools that are currently being used. If time permits, I will further describe some work in progress in the area of gene-environment interaction where it may be desirable to use genomic controls to adjust for population stratification.

Richard Cook and Grace Yi
University of Waterloo
Weighted Generalized Estimating Equations for Incomplete Clustered Longitudinal Data

Estimating equations are widely used for the analysis of longitudinal data when interest lies in estimating regression or association parameters based on marginal models. Inverse probability weighted estimating equations (e.g. Robins et al., 1995) have been developed to deal with biases that may result from incomplete data that are missing at random (MAR). We consider the problem in which the longitudinal responses arise in clusters, generating a cross-sectional and longitudinal correlation structure. Structures of this type are quite common and arise, for example, in any cluster randomized study with repeated measurements of the response over time. When data are incomplete in such settings, however, the inverse probability weights must be obtained from a model which allows estimation of the pairwise joint probability of missing data for individuals from the same cluster, conditional on their respective histories. We describe such an approach and consider the importance of modeling such a within-cluster association in the missing data process. The methods are applied to data from a motivating cluster-randomized school-based study called the Waterloo Smoking Prevention Project.
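
As a rough illustration of the inverse probability weighting idea in the abstract above, the following Python sketch solves the weighted estimating equation sum_i (R_i / pi_i)(y_i - mu) = 0 for a single mean mu. It is not the authors' method: it ignores clustering and the longitudinal structure entirely, and the function names, the single discrete covariate x, and the toy data are all assumptions made for illustration.

```python
def stratum_weights(x, observed):
    """Estimate P(observed | x) within each level of a discrete x and
    return the inverse of that probability for every unit (hypothetical helper)."""
    counts, seen = {}, {}
    for xi, ri in zip(x, observed):
        counts[xi] = counts.get(xi, 0) + 1
        seen[xi] = seen.get(xi, 0) + (1 if ri else 0)
    # Weight is n_stratum / n_observed_in_stratum; 0.0 if nobody was observed.
    return [counts[xi] / seen[xi] if seen[xi] else 0.0 for xi in x]


def ipw_mean(y, observed, weights):
    """Solve sum_i R_i * w_i * (y_i - mu) = 0 for mu (normalized IPW mean)."""
    num = sum(w * yi for yi, ri, w in zip(y, observed, weights) if ri)
    den = sum(w for ri, w in zip(observed, weights) if ri)
    return num / den


# Toy data (assumed): responses in stratum x = 1 are mostly missing.
x = [0, 0, 0, 0, 1, 1, 1, 1]
y = [1, 1, 1, 1, 3, None, None, None]
observed = [1, 1, 1, 1, 1, 0, 0, 0]

weights = stratum_weights(x, observed)
complete_case = sum(yi for yi, ri in zip(y, observed) if ri) / sum(observed)
weighted = ipw_mean(y, observed, weights)
print(complete_case, weighted)
```

In this toy example the complete-case mean is pulled toward the well-observed stratum, while the weighted estimate gives each stratum its full-sample share.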

Joe DiCesare
University of Waterloo
Estimating Diffusions with Missing Data

In this talk the challenges associated with imputation methods for general diffusion processes will be discussed. A method for imputing the values of a square root diffusion process is then presented along with some applications to financial data.

Grigoris Karakoulas
University of Toronto, Department of Computer Science
Mixture-of-Experts Classification under Different Missing Label Mechanisms

There has been increased interest in devising classification techniques that combine unlabeled data with labeled data for various domains. There are different mechanisms that could explain why labels might be missing. It is possible for the labeling process to be associated with a selection bias such that the distributions of data points in the labeled and unlabeled sets are different. Not correcting for such bias results in biased function approximation with potentially poor performance. In this paper we introduce a mixture-of-experts technique that is a generalization of mixture modeling techniques previously suggested for learning from labeled and unlabeled data. We empirically show how this technique performs under the different missing label mechanisms and compare it with existing techniques. We use the bias-variance decomposition to study the effects of adding unlabeled data when learning a mixture model. Our empirical results indicate that the biggest gain from using unlabeled data comes from the reduction of the model variance, whereas the behavior of the bias error term heavily depends on the correctness of the underlying model assumptions and the missing label mechanism.

Jerry Lawless
Department of Statistics and Act. Sci., University of Waterloo
Some Problems Concerning Missing Data in Survival and Event History Analysis

There has been considerable recent development of estimation methodology for incomplete survival and event history data.
This talk will discuss some areas which deserve attention. They include (i) the assessment and treatment of non-independent losses to follow-up in longitudinal surveys and other studies with widely spaced inspection times, (ii) the treatment of censoring, loss to follow-up, and delayed ascertainment in observational cohort studies based on clinic databases, and (iii) simulation-based model assessment, requiring simulation of the observation process.

Alan Lee
Department of Statistics, University of Auckland
Asymptotic Efficiency Bounds in Semi-Parametric Regression Models

We outline an extension of the Bickel, Klaassen, Ritov and Wellner theory of semi-parametric efficiency bounds to the multi-sample case. The theory is then applied to derive efficient scores and information bounds for several standard choice-based sampling situations, including case-control and two-phase outcome-dependent sampling designs.

Roderick Little
University of Michigan
Robust likelihood-based analysis of multivariate data with missing values

The model-based approach to inference from multivariate data with missing values is reviewed. Regression prediction is most useful when the covariates are predictive of the missing values and the probability of being missing, and in these circumstances predictions are particularly sensitive to model misspecification. The use of penalized splines of the propensity score is proposed to yield robust model-based inference under the missing at random (MAR) assumption, assuming monotone missing data. Simulation comparisons with other methods suggest that the method works well in a wide range of populations, with little loss of efficiency relative to parametric models when the latter are correct. Extensions to more general patterns are outlined.

KEYWORDS: double robustness, incomplete data, penalized splines, regression imputation, weighting.
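
The keywords above mention double robustness. As a minimal sketch of that property only (it does not implement the penalized spline of propensity approach described in the abstract), the following Python function computes an augmented inverse probability weighted estimate of a mean: a regression prediction for every unit plus a weighted residual correction for the observed units. The function name, argument names, and noise-free toy data are assumptions chosen for illustration.

```python
def aipw_mean(y, observed, propensity, prediction):
    """Augmented IPW estimate of E[Y]:
    mean over units of  m_i + R_i * (y_i - m_i) / pi_i,
    where m_i is a regression prediction and pi_i a response propensity."""
    total = 0.0
    for yi, ri, pi, mi in zip(y, observed, propensity, prediction):
        total += mi                      # regression prediction for every unit
        if ri:
            total += (yi - mi) / pi      # weighted residual, observed units only
    return total / len(y)


# Toy data (assumed): eight units, the last three responses are missing.
y = [1, 1, 1, 1, 3, None, None, None]
observed = [1, 1, 1, 1, 1, 0, 0, 0]

true_propensity = [1.0, 1.0, 1.0, 1.0, 0.25, 0.25, 0.25, 0.25]
true_prediction = [1, 1, 1, 1, 3, 3, 3, 3]

# Correct propensity model paired with a deliberately wrong outcome model.
est_a = aipw_mean(y, observed, true_propensity, [0.0] * 8)
# Deliberately wrong propensity model paired with a correct outcome model.
est_b = aipw_mean(y, observed, [0.5] * 8, true_prediction)
print(est_a, est_b)
```

On this noise-free toy example both calls recover the full-data mean, although in each call one of the two working models is wrong; with real data the corresponding statement is about consistency rather than exact equality.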

Don McLeish and Cyntha Struthers
Regression with Missing Covariates: Importance Sampling and Imputation

In regression, it is common for one or more covariates to be unobserved for some of the experimental subjects, either by design or by some random censoring mechanism. Specifically, suppose Y is a response variable, possibly multivariate, with a density function $f(y \mid x, v; \theta)$ conditional on the covariates (x,v), where x and v are vectors and $\theta$ is a vector of unknown parameters. We consider the problem of estimating the parameters when data on the covariate vector v are available for all observations while data on the covariate x are missing for some of the observations. We assume MAR, i.e. $\Delta_i = 1$ or $0$ according to whether the covariate x is observed or not, and $E(\Delta_i \mid Y, X, V) = p(Y, V)$, where p is a known function depending only on the observable quantities (Y,V). Variations on this problem have been considered by a number of authors, including Chatterjee et al. (2003), Lawless et al. (1999), Reilly and Pepe (1995), Robins et al. (1994, 1995), Carroll and Wand (1991), and Pepe and Fleming (1991). We motivate many of these estimators from the point of view of importance sampling and compare estimators and algorithms for bias and efficiency with the profile estimator when the observations and covariates are discrete or continuous.

Bin Nan
University of Michigan
A new look at some efficiency results for semiparametric models with missing data

Missing data problems arise very often in practice. Many useful ad hoc tools have been developed for estimating finite-dimensional parameters from semiparametric regression models with data missing at random. Meanwhile, efficient estimation has received more and more attention, especially after the landmark paper of Robins, Rotnitzky, and Zhao (1994). We review several examples of information bound calculations. Our main purpose is to show how the general result derived by Robins, Rotnitzky, and Zhao (1994) can be applied to different models.

Anastesia Nwankwo
Enugu State University
Missing multivariate data in banking computations

In processing data emanating from multiple files in financial markets, ranking methods are called into play if set probability indices are to be maintained. Horizontal computations yield much evidence of missing entries from nonresponse and other factors.

James L. Reilly
Department of Statistics, University of Auckland
Multiple Imputation and Complex Survey Data

Multiple imputation is a powerful and widely used method for handling missing data. Following imputation, analysis results for the imputed datasets can easily be combined to estimate sampling variances that include the effect of imputation. However, situations have been identified where the usual combining rules can overestimate these variances. More recently, variance underestimates have also been shown to occur. A new multiple imputation method based on estimating equations has been developed to address these concerns, although this method requires more information about the imputation model than just the analysis results from each imputed dataset. Furthermore, the new method only handles i.i.d. data, which means it would not be appropriate for many surveys. In this talk, this method is extended to accommodate complex sample designs, and is applied to two complex surveys with substantial amounts of missing data. Results will be compared with those from the traditional multiple imputation variance estimator, and the implications for survey practice will be discussed.
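
For readers unfamiliar with the usual combining rules referred to above, here is a minimal Python sketch of Rubin's rules for a scalar estimate: the pooled point estimate is the average over imputed datasets, and the total variance adds the average within-imputation variance to an inflated between-imputation variance. The function name and the example numbers are assumptions for illustration, and the estimating-equation-based method discussed in the abstract is not shown.

```python
from statistics import mean, variance


def rubin_combine(estimates, variances):
    """Combine m completed-data analyses of a scalar quantity.

    estimates: point estimate from each imputed dataset
    variances: estimated sampling variance from each imputed dataset
    Returns (pooled estimate, total variance).
    """
    m = len(estimates)
    q_bar = mean(estimates)          # pooled point estimate
    within = mean(variances)         # average within-imputation variance
    between = variance(estimates)    # between-imputation variance (m - 1 divisor)
    total = within + (1 + 1 / m) * between
    return q_bar, total


# Three hypothetical imputed-data analyses.
pooled, total_var = rubin_combine([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(pooled, total_var)
```

The factor (1 + 1/m) accounts for using a finite number m of imputations; at least two imputed datasets are needed for the between-imputation variance to be defined.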

James Robins
Professor of Epidemiology and Biostatistics, Harvard School of Public Health
(this talk is based on joint work with Aad van der Vaart)
Application of a Unified Theory of Parametric, Semi, and Nonparametric Statistics Based On Higher Dimensional Influence Functions to Coarsened at Random Missing Data Models

The standard theory of semi-parametric inference provides conditions under which a finite dimensional parameter of interest can be estimated at root-n rates in models with finite or infinite dimensional nuisance parameters. The theory is based on likelihoods, first order scores, and first order influence functions and is very geometric in character, often allowing results to be obtained without detailed probabilistic epsilon and delta calculations. The modern theory of non-parametric inference determines optimal rates of convergence and optimal estimators for parameters (whether finite or infinite dimensional) that cannot be estimated at rate root-n or better. This theory is based largely on merging minimax theory with measures of the size of the parameter space (e.g., its metric entropy) and makes little reference to the likelihood function for the data. It often makes great demands on the mathematical and probabilistic skills of its practitioners. In this talk I extend earlier work by Small and McLeish (1994) and Waterman and Lindsay (1996) and present a theory based on likelihoods, higher order scores (i.e., derivatives of the likelihood), and higher order influence functions that applies equally to both the root-n and non-root-n regimes, reproduces the results previously obtained by the modern theory of non-parametric inference, produces many new non-root-n results, and most importantly is very geometric, opening up the ability to perform optimal non-root-n inference in complex high dimensional models without detailed probabilistic calculation.
The theory is applied to estimation of functionals of the full data distribution in coarsened at random missing data models.

Andrea Rotnitzky
Doubly-robust estimation of the area under the operating characteristic curve in the presence of non-ignorable verification bias.

The area under the receiver operating characteristic curve (AUC) is a popular summary measure of the efficacy of a medical diagnostic test to discriminate between healthy and diseased subjects. A frequently encountered problem in studies that evaluate a new diagnostic test is that not all patients undergo disease verification because the verification test is expensive, invasive or both. Furthermore, the decision to send patients to verification often depends on the new test and on other predictors of true disease status. In such cases, the usual estimators of the AUC based on verified patients only are biased. In this talk we develop estimators of the AUC of markers measured on any scale that adjust for selection to verification that may depend on measured patient covariates and diagnostic test results and additionally adjust for an assumed degree of residual selection bias. Such estimators can then be used in a sensitivity analysis to examine how the AUC estimates change when different plausible degrees of residual association are assumed. As with other missing data problems, due to the curse of dimensionality, a model for disease or a model for selection is needed in order to obtain well-behaved estimators of the AUC when the marker and/or the measured covariates are continuous. We describe estimators that are consistent and asymptotically normal (CAN) for the AUC under each model. More interestingly, we describe a doubly robust estimator that has the attractive feature of being CAN if either the disease or the selection model (but not necessarily both) is correct.
We illustrate our methods with data from a study run by the Nuclear Imaging Group at Cedars Sinai Medical Center on the efficacy of electron beam computed tomography to detect coronary artery disease.

Donald B. Rubin
John L. Loeb Professor of Statistics, Department of Statistics
Multiple Imputation for Item Nonresponse: Some Current Theory and Application to Anthrax Vaccine Experiments at CDC

Multiple imputation has become, since its proposal a quarter of a century ago (Rubin 1978), a standard tool for dealing with item nonresponse. Free and commercial software is now widely available both for the analysis of multiply-imputed data sets and for their construction. The methods for their analysis are very straightforward and many evaluations of their frequentist properties, both with artificial and real data, have supported the broad validity of multiple imputation in practice, at least relative to competing methods. The methods for the construction of a multiply-imputed data set, however, either (1) assume theoretically clean situations, such as monotone patterns of missing data or a convenient multivariate distribution, such as the general location model or t-based extensions of it; or (2) use theoretically less well justified, fully conditional "chained equations," which can lead to "incompatible" distributions in theory, although these often seem to be harmless in practice. Thus, there remains the challenge of constructing multiply-imputed data sets in situations where the missing data pattern is not monotone or the distribution of the complete data is complex in the sense of being poorly approximated by standard analytic multivariate distributions. An example that illustrates current work on this issue involves the multiple imputation of missing immunogenicity and reactogenicity measurements in ongoing randomized trials at the US CDC, which compare different versions of vaccinations for protection against lethal doses of inhalation anthrax.
The method used to create the imputations involves capitalizing on approximately monotone patterns of missingness to help implement the chained equation approach, thereby attempting to minimize incompatibility; this method extends the approach in Rubin (2003) used to multiply impute the US National Medical Expenditure Survey.

Daniel Scharfstein
Johns Hopkins Bloomberg School of Public Health
Sensitivity Analysis for Informatively Interval-Censored Discrete Time-to-Event Data
Coauthors: Michelle Shardell, Noya Galai, David Vlahov, Samuel A. Bozzette

In many prospective studies, subjects are evaluated for the occurrence of an absorbing event of interest (e.g., HIV infection) at baseline and at a common set of pre-specified visit times after enrollment. Since subjects often miss scheduled visits, the underlying visit of first detection may be interval censored, or more generally, coarsened. Interval-censored data are usually analyzed using the non-identifiable coarsening at random (CAR) assumption. In some settings, the visit compliance and underlying event time processes may be associated, in which case CAR is violated. To examine the sensitivity of inference, we posit a class of models that express deviations from CAR. These models are indexed by nonidentifiable, interpretable parameters, which describe the relationship between visit compliance and event times. Plausible ranges for these parameters require eliciting information from scientific experts. For each model, we use the EM algorithm to estimate marginal distributions and proportional hazards model regression parameters. The performance of our method is assessed via a simulation study. We also present analyses of two studies: AIDS Clinical Trial Group (ACTG) 181, a natural history study of cytomegalovirus shedding among advanced AIDS patients, and AIDS Link to the Intravenous Experience (ALIVE), an observational study of HIV infection among intravenous drug users.
A sensitivity analysis of study results is performed using information elicited from substantive experts who worked on ACTG 181 and ALIVE.

Alastair Scott
University of Auckland
Fitting family-specific models to retrospective family data

Case-control studies augmented by the values of responses and covariates from family members allow investigators to study the association of the response with genetics and environment by relating differences in the response directly to within-family differences in the covariates. Most existing approaches to case-control family data parametrize covariate effects in terms of the marginal probability of response, the same effects that one estimates from standard case-control studies. This paper focuses on the estimation of family-specific effects. We note that the profile likelihood approach of Neuhaus, Scott & Wild (2001) can be applied in any setting where one has a fully specified model for the vector of responses in a family and, in particular, to family-specific models such as binary mixed-effects models. We illustrate our approach using data from a case-control family study of brain cancer and consider the use of conditional and weighted likelihood methods as alternatives.

Tulay Koru-Sengul
Department of Statistics at the University of Pittsburgh
The Time-Varying Autoregressive Model With Covariates For Analyzing Longitudinal Data With Missing Values

Researchers are frequently faced with the problem of analyzing data with missing values. Missing values are practically unavoidable in large longitudinal studies, and incomplete data sets make the statistical analyses very difficult. A new composite method for handling missing values on both the outcome and the covariates has been developed by combining the methods known as multiple imputation and stochastic regression imputation.
The composite imputation method also uses a new modeling approach for longitudinal data, called the time-varying autoregressive model with time-dependent and/or time-independent covariates. The new model can be thought of as a version of the transition general linear model used to analyze longitudinal data. Simulation results will be discussed to compare the traditional methods with the new composite method of handling missing values on both the outcome and the covariates. Application of the model and the composite method will be studied by using a dataset from a longitudinal epidemiological study, the Maternal Health Practices and Child Development Project, that has been conducted at the Magee-Womens Hospital in Pittsburgh and the Western Psychiatric Institute and Clinic at the University of Pittsburgh Medical Center Health System.

Jamie Stafford
Department of Public Health Sciences, University of Toronto
ICE: Iterated Conditional Expectations

The use of local likelihood methods (Loader 1999) in the presence of data that are either interval censored or have been aggregated into bins leads naturally to the consideration of EM-type strategies. We focus primarily on a class of local likelihood density estimates where one member of this class retains the simplicity and interpretive appeal of the usual kernel density estimate for completely observed data. It is computed using a fixed point algorithm that generalizes the self-consistency algorithms of Efron (1967), Turnbull (1976), and Li et al. (1997) by introducing kernel smoothing at each iteration. Numerical integration permits a local EM algorithm to be implemented as a global Newton iteration where the latter's excellent convergence properties can be exploited. The method requires an explicit solution of the local likelihood equations at the M-step and this can always be found through the use of symbolic Newton-Raphson (Andrews and Stafford 2000).
Iteration is thus rendered on the E-step only where a conditional expectation operator is applied, hence ICE. Other local likelihood classes considered include those for intensity estimation, local regression in the context of a generalized linear model, and so on.

Mary Thompson
University of Waterloo
Interval censoring of event times in the National Population Health Survey

The longitudinal nature of the National Population Health Survey (NPHS) allows the use of event history analysis techniques in order to study relationships among events. For example, let T1 and T2 be the times of becoming pregnant and smoking cessation respectively. Thompson and Pantoja Galicia (2002) propose a formal nonparametric test for a partial order relationship. This test involves the estimation of the survivor functions of T1 and T2, as well as the joint distribution of (T1, T2-T1). However, with longitudinal survey data, the times of occurrence of the events are interval censored in general. For example, starting at the second cycle of the NPHS, it is possible to know within an interval of length at most a year whether a smoker has ceased smoking with respect to the previous cycle. Also, information about the date of becoming pregnant can be inferred within a time interval from cycle to cycle. Therefore, estimating the joint densities from the interval censored times becomes an important issue and our current problem. We propose a mode of attack based on extending the ideas of Duchesne and Stafford (2001) and Braun, Duchesne and Stafford (2003) to the bivariate case. Our method involves uniform sampling of the censored areas, using a technique described by Tang (1993).

Chris Wild
Dept of Statistics, University of Auckland
Some issues of efficiency and robustness

We investigate some issues of efficiency, robustness and study design affecting semiparametric maximum likelihood and survey-weighted analyses for linear regression under two-phase sampling and bivariate binary regressions, especially those occurring in secondary analyses of case-control data.

Grace Yi
Dept. of Stat. and Act. Sci., University of Waterloo
Median Regression Models for Longitudinal Data with Missing Observations

Recently median regression models have received increasing attention. The models are attractive because they are robust and easy to interpret. In this talk I will discuss using median regression models to deal with longitudinal data with missing observations. The inverse probability weighted generalized estimating equations (GEE) approach is proposed to estimate the median parameters for incomplete longitudinal data, where the inverse probability weights are estimated from the regression model for the missing data indicators. The consistency and asymptotic distribution of the resultant estimator are established. Numerical studies for the proposed method will be discussed in this talk.

Yang Zhao
Maximum likelihood Methods for Regression Problems with Missing Data

Parametric regression models are widely used for the analysis of a response given a vector of covariates. However, in many settings certain variable values may go unobserved either by design or happenstance. For the case in which some covariates are missing at random (MAR) we discuss maximum likelihood methods for estimating the regression parameters using an EM algorithm. Profile likelihood methods for estimating variances and confidence intervals are also given. The case in which the response is MAR can be treated similarly.
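
To make the EM idea in the last abstract concrete, here is a small Python sketch for one deliberately simple case: a normal linear regression of y on a single binary covariate x, where x is missing (coded None) for some units and the missingness may depend on y only. The Bernoulli model for x, the function name, and the toy data are all assumptions made for illustration; this is not the speaker's implementation and it omits the profile likelihood variance estimates.

```python
import math


def em_missing_binary_covariate(x, y, n_iter=50):
    """EM for y = b0 + b1 * x + noise, noise ~ N(0, s2), x ~ Bernoulli(p),
    where some x values are None (missing). Assumes the complete cases
    contain both x = 0 and x = 1, which are used for starting values."""
    complete = [(xi, yi) for xi, yi in zip(x, y) if xi is not None]
    y1 = [yi for xi, yi in complete if xi == 1]
    y0 = [yi for xi, yi in complete if xi == 0]
    mu1, mu0 = sum(y1) / len(y1), sum(y0) / len(y0)
    p = len(y1) / len(complete)
    s2 = sum((yi - (mu1 if xi == 1 else mu0)) ** 2 for xi, yi in complete)
    s2 = max(s2 / len(complete), 1e-12)   # floor avoids division by zero
    n = len(y)
    for _ in range(n_iter):
        # E-step: w_i = P(x_i = 1 | y_i) for missing x, the observed x otherwise.
        w = []
        for xi, yi in zip(x, y):
            if xi is not None:
                w.append(float(xi))
            else:
                a = p * math.exp(-((yi - mu1) ** 2) / (2 * s2))
                b = (1 - p) * math.exp(-((yi - mu0) ** 2) / (2 * s2))
                w.append(a / (a + b))
        # M-step: weighted group means, pooled variance, covariate prevalence.
        sw = sum(w)
        mu1 = sum(wi * yi for wi, yi in zip(w, y)) / sw
        mu0 = sum((1 - wi) * yi for wi, yi in zip(w, y)) / (n - sw)
        s2 = sum(wi * (yi - mu1) ** 2 + (1 - wi) * (yi - mu0) ** 2
                 for wi, yi in zip(w, y)) / n
        s2 = max(s2, 1e-12)
        p = sw / n
    return {"b0": mu0, "b1": mu1 - mu0, "sigma2": s2, "p": p}


# Toy data (assumed): the covariate is missing for the last unit.
fit = em_missing_binary_covariate([0, 0, 1, 1, None], [1.0, 2.0, 4.0, 5.0, 4.5])
print(fit)
```

With a binary covariate the M-step has a closed form, which keeps the sketch short; for a general parametric regression the M-step would typically be a weighted fit of the regression model itself.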