## Workshop on Missing Data Problems

August 5-6, 2004

## Speaker Abstracts

**Jinbo Chen (NIH)**
*Semiparametric efficiency and optimal estimation for missing data problems,
with application to auxiliary outcomes*

This expository talk emphasizes the link between semiparametric efficient
estimation and optimal estimating functions in the sense of Godambe and
Heyde. We consider models where the linear span of influence functions
for regular, asymptotically linear estimators of a Euclidean parameter
may be identified as a space of unbiased estimating functions indexed
by another class of functions. Determination of the optimal estimating
function within this (largest possible) class, which is facilitated by
a theorem of Newey and McFadden, then identifies the efficient influence
function. The approach seems particularly useful for missing data problems
due to a key result by Robins and colleagues: all influence functions
for the missing data problem may be constructed from influence functions
for the corresponding full data problem. It yields well known results
for the conditional mean model in situations where the covariates, the
outcome or both are missing at random. When only the outcome is missing,
but a surrogate or auxiliary outcome is always observed, the efficient
influence function takes a simple closed form. If the covariates and auxiliary
outcomes are all discrete, moreover, a Horvitz-Thompson estimator with
empirically estimated weights is semiparametric efficient.
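
As a rough illustration of the last point, the sketch below computes a Horvitz-Thompson estimate when the outcome is sometimes missing and the observation probabilities are estimated empirically within discrete strata. It is a minimal sketch under stated assumptions, not the estimator for the conditional mean model discussed in the talk: the target here is a plain population mean, and the function name `ht_mean` and the record layout are hypothetical.

```python
from collections import defaultdict


def ht_mean(records):
    """Horvitz-Thompson estimate of a mean with empirically estimated weights.

    records: list of (stratum, observed, y) tuples. `stratum` stands for a
    discrete covariate/auxiliary-outcome cell (hypothetical layout); `y` is
    ignored when `observed` is False.
    """
    n_total = defaultdict(int)
    n_obs = defaultdict(int)
    for stratum, observed, _ in records:
        n_total[stratum] += 1
        if observed:
            n_obs[stratum] += 1

    total = 0.0
    for stratum, observed, y in records:
        if observed:
            # Empirical observation probability within the stratum.
            pi_hat = n_obs[stratum] / n_total[stratum]
            total += y / pi_hat
    return total / len(records)
```

With discrete strata this reduces to a weighted average of the within-stratum means of the observed outcomes, which is why every stratum needs at least one observed outcome.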

**Shelley B. Bull**

University of Toronto, Dept of Public Health Sciences and Samuel Lunenfeld
Research Institute

*Missing Data in Family-Based Genetic Association Studies*

A standard design for family-based association studies involves genotyping
of an affected child and their parents, and is often referred to as
the case-parent design. More generally, multiple affected children and
their unaffected siblings (i.e., the whole nuclear family) may also be
genotyped. Genotyping provides data concerning what allele at a specific
genetic locus is transmitted from each parent to a child, and excess
transmission of an allele to affected children provides evidence for
genetic association. Immunity to the effects of population stratification
is generally achieved by conditioning on parental genotypes in the analysis.
Knowledge of transmission may be incomplete, however, when one or both
parents are unavailable for genotyping, or when parents are genotyped
but the genetic marker is less than fully informative, so that transmission
is ambiguous. A number of approaches to address this form of missing
data have been proposed, ranging from exclusion of families with incomplete
data to reconstruction of parental genotypes with use of genotypes from
unaffected children to maximum-likelihood-based missing data methods
(such as E-M). A conceptually different approach to handling missing
data in this setting relies on conditioning on a sufficient statistic
for the null hypothesis. Formally, the phenotypes (affected/unaffected)
of all family members and the genotypes of the parents constitute a
sufficient statistic for the null hypothesis of no excess transmission
when parental genotypes are observed. When parental genotypes are missing,
a sufficient statistic can still be found, such that under the null,
the conditional distribution does not depend on the unknown mode of
inheritance or the allele distribution in the population. In this talk
we will compare and contrast this alternative approach to some existing
methods with respect to test efficiency and robustness to population
stratification.

**Nilanjan Chatterjee**

National Cancer Institute

*Missing Data Problems in Statistical Genetics*

Missing data problems are ubiquitous in statistical genetics. In
this talk, I will review the missing data problems posed in a range
of topics including segregation analysis, kin-cohort analysis, haplotype-based
association studies and use of genomic controls to account for population
stratification. In each area, I will briefly review the scientific problem,
the data structure, required assumptions and the computational tools
that are currently being used. If time permits, I will further describe
some work in progress in the area of gene-environment interaction where
it may be desirable to use genomic controls to adjust for population
stratification.

**Cook, Richard and Yi, Grace**

University of Waterloo

*Weighted Generalized Estimating Equations for Incomplete Clustered
Longitudinal Data*

Estimating equations are widely used for the analysis of longitudinal
data when interest lies in estimating regression or association parameters
based on marginal models. Inverse probability weighted estimating equations
(e.g. Robins et al., 1995) have been developed to deal with biases that
may result from incomplete data which are missing at random (MAR). We
consider the problem in which the longitudinal responses arise in clusters,
generating a cross-sectional and longitudinal correlation structure.
Structures of this type are quite common and arise, for example, in
any cluster randomized study with repeated measurements of the response
over time. When data are incomplete in such settings, however, the inverse
probability weights must be obtained from a model which allows estimation
of the pairwise joint probability of missing data for individuals from
the same cluster, conditional on their respective histories. We describe
such an approach and consider the importance
of modeling such a within cluster association in the missing data process.
The methods are applied to data from a motivating cluster-randomized
school-based study called the Waterloo Smoking Prevention Project.

**Joe DiCesare**

University of Waterloo

*Estimating Diffusions with Missing Data*

In this talk the challenges associated with imputation methods for
general diffusion processes will be discussed. A method for imputing
the values of a square root diffusion process is then presented along
with some applications to financial data.

**Grigoris Karakoulas**

University of Toronto, Department of Computer Science

*Mixture-of-Experts Classification under Different Missing Label Mechanisms*

There has been increased interest in devising classification techniques
that combine unlabeled data with labeled data for various domains. There
are different mechanisms that could explain why labels might be missing.
It is possible for the labeling process to be associated with a selection
bias such that the distributions of data points in the labeled and unlabeled
sets are different. Not correcting for such bias results in biased function
approximation with potentially poor performance. In this paper we introduce
a mixture-of-experts technique that is a generalization of mixture modeling
techniques previously suggested for learning from labeled and unlabeled
data. We empirically show how this technique performs under the different
missing label mechanisms and compare it with existing techniques. We
use the bias-variance decomposition to study the effects from adding
unlabeled data when learning a mixture model. Our empirical results
indicate that the biggest gain from using unlabeled data comes from
the reduction of the model variance, whereas the behavior of the bias
error term heavily depends on the correctness of the underlying model
assumptions and the missing label mechanism.

**Jerry Lawless**

Department of Statistics and Act. Sci., University of Waterloo

*Some Problems Concerning Missing Data in Survival and Event History
Analysis*

There has been considerable recent development of estimation methodology
for incomplete survival and event history data. This talk will discuss
some areas which deserve attention. They include (i) the assessment
and treatment of non-independent losses to followup in longitudinal
surveys and other studies with widely spaced inspection times, (ii)
the treatment of censoring, loss to followup, and delayed ascertainment
in observational cohort studies based on clinic data bases, and (iii)
simulation-based model assessment, requiring simulation of the observation
process.

**Alan Lee**

Department of Statistics, University of Auckland

*Asymptotic Efficiency Bounds in Semi-Parametric Regression Models*

We outline an extension of the Bickel, Klaassen, Ritov and Wellner theory
of semi-parametric efficiency bounds to the multi-sample case. The theory
is then applied to derive efficient scores and information bounds for
several standard choice-based sampling situations, including case-control
and two-phase outcome-dependent sampling designs.

**Roderick Little**

University of Michigan

*Robust likelihood-based analysis of multivariate data with missing
values*

The model-based approach to inference from multivariate data with missing
values is reviewed. Regression prediction is most useful when the covariates
are predictive of the missing values and the probability of being missing,
and in these circumstances predictions are particularly sensitive to
model misspecification. The use of penalized splines of the propensity
score is proposed to yield robust model-based inference under the missing
at random (MAR) assumption, assuming monotone missing data. Simulation
comparisons with other methods suggest that the method works well in
a wide range of populations, with little loss of efficiency relative
to parametric models when the latter are correct. Extensions to more
general patterns are outlined. KEYWORDS: double robustness, incomplete
data, penalized splines, regression imputation, weighting.

**McLeish, Don and Struthers, Cyntha**

*Regression with Missing Covariates: Importance Sampling and Imputation*

In regression, it is common for one or more covariates to be unobserved
for some of the experimental subjects, either by design or by some random
censoring mechanism. Specifically, suppose $Y$ is a response variable,
possibly multivariate, with a density function $f(y \mid x, v; \theta)$ conditional
on the covariates $(x, v)$, where $x$ and $v$ are vectors and $\theta$ is a vector of
unknown parameters. We consider the problem of estimating the parameters
when data on the covariate vector $v$ are available for all observations
while data on the covariate $x$ are missing for some of the observations.
We assume MAR, i.e. $\Delta_i = 1$ or $0$ as the covariate $x$ is observed or not and
$E(\Delta_i \mid Y, X, V) = p(Y, V)$, where $p$ is a known function depending only on
the observable quantities (Y,V). Variations on this problem have been
considered by a number of authors, including Chatterjee et al. (2003),
Lawless et al. (1999), Reilly and Pepe (1995), Robins et al. (1994,
1995), Carroll and Wand (1991), and Pepe and Fleming (1991). We motivate
many of these estimators from the point of view of importance sampling
and compare estimators and algorithms for bias and efficiency with the
profile estimator when the observations and covariates are discrete
or continuous.

**Bin Nan**

University of Michigan

*A new look at some efficiency results for semiparametric models with
missing data*

Missing data problems arise very often in practice. Many useful ad hoc
tools have been developed for estimating finite dimensional parameters
in semiparametric regression models with data missing at random.
Meanwhile, efficient estimation has received more and more attention,
especially after the landmark paper of Robins, Rotnitzky, and Zhao (1994).
We review several examples of information bound calculations. Our main
purpose is to show how the general result derived by Robins, Rotnitzky,
and Zhao (1994) can be applied to different models.

**Anastesia Nwankwo**

Enugu State University

*Missing multivariate data in banking computations*

In processing data emanating from multiple files in financial markets,
ranking methods are called into play if set probability indices are
to be maintained. Horizontal computations yield much evidence of missing
entries from nonresponse and other factors.

**James L. Reilly**

Department of Statistics, University of Auckland

*Multiple Imputation and Complex Survey Data*

Multiple imputation is a powerful and widely used method for handling
missing data. Following imputation, analysis results for the imputed
datasets can easily be combined to estimate sampling variances that
include the effect of imputation. However, situations have been identified
where the usual combining rules can overestimate these variances. More
recently, variance underestimates have also been shown to occur. A new
multiple imputation method based on estimating equations has been developed
to address these concerns, although this method requires more information
about the imputation model than just the analysis results from each
imputed dataset. Furthermore, the new method only handles i.i.d. data,
which means it would not be appropriate for many surveys. In this talk,
this method is extended to accommodate complex sample designs, and is
applied to two complex surveys with substantial amounts of missing data.
Results will be compared with those from the traditional multiple imputation
variance estimator, and the implications for survey practice will be
discussed.
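
For readers unfamiliar with the usual combining rules mentioned above, the sketch below implements them in their standard scalar form (Rubin's rules): the pooled estimate is the average of the per-dataset estimates, and the total variance is the average within-imputation variance plus (1 + 1/m) times the between-imputation variance. This is a minimal sketch of the traditional estimator only, not of the new estimating-equation method described in the talk; the function name `rubin_combine` is hypothetical.

```python
from statistics import mean, variance


def rubin_combine(estimates, variances):
    """Pool scalar results from m >= 2 multiply-imputed datasets.

    estimates: point estimates, one per imputed dataset.
    variances: their estimated sampling variances.
    Returns (pooled estimate, total variance).
    """
    m = len(estimates)
    q_bar = mean(estimates)        # pooled point estimate
    u_bar = mean(variances)        # within-imputation variance
    b = variance(estimates)        # between-imputation variance (divisor m - 1)
    total_var = u_bar + (1.0 + 1.0 / m) * b
    return q_bar, total_var
```

The over- and underestimation discussed in the abstract concerns `total_var`; the pooled point estimate itself is just the simple average.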

**James Robins**, Professor of Epidemiology
and Biostatistics

Harvard School of Public Health

(this talk is based on joint work with Aad van der Vaart)

*Application of a Unified Theory of Parametric, Semi, and Nonparametric
Statistics Based On Higher Dimensional Influence Functions to Coarsened
at Random Missing Data Models*

The standard theory of semi-parametric inference provides conditions
under which a finite dimensional parameter of interest can be estimated
at root-n rates in models with finite or infinite dimensional nuisance
parameters. The theory is based on likelihoods, first order scores,
and first order influence functions and is very geometric in character,
often allowing results to be obtained without detailed probabilistic
epsilon and delta calculations.

The modern theory of non-parametric inference determines optimal rates
of convergence and optimal estimators for parameters (whether finite
or infinite dimensional) that cannot be estimated at rate root-n or
better. This theory is largely based on merging minimax theory
with measures of the size of the parameter space (e.g., its metric entropy)
and makes little reference to the likelihood function for the data.
It often makes great demands on the mathematical and probabilistic skills
of its practitioners.

In this talk I extend earlier work by Small and McLeish (1994) and
Waterman and Lindsay (1996) and present a theory based on likelihoods,
higher order scores (i.e., derivatives of the likelihood), and higher
order influence functions that applies equally to both the root-n and
non-root-n regimes, reproduces the results previously obtained by the
modern theory of non-parametric inference, produces many new non-root-n
results, and most importantly is very geometric, opening up the ability
to perform optimal non-root-n inference in complex high dimensional
models without detailed probabilistic calculation.

The theory is applied to estimation of functionals of the full data
distribution in coarsened at random missing data models.

**Andrea Rotnitzky**

*Doubly-robust estimation of the area under the operating characteristic
curve in the presence of non-ignorable verification bias.*

The area under the receiver operating characteristic curve (AUC) is
a popular summary measure of the efficacy of a medical diagnostic test
to discriminate between healthy and diseased subjects. A frequently
encountered problem in studies that evaluate a new diagnostic test is
that not all patients undergo disease verification because the verification
test is expensive, invasive or both. Furthermore, the decision to send
patients to verification often depends on the new test and on other
predictors of true disease status. In such cases, the usual estimators of
the AUC based on verified patients only are biased. In this talk we
develop estimators of the AUC of markers measured on any scale that
adjust for selection to verification that may depend on measured patient
covariates and diagnostic test results and additionally adjust for an
assumed degree of residual selection bias. Such estimators can then
be used in a sensitivity analysis to examine how the AUC estimates change
when different plausible degrees of residual association are assumed.
As with other missing data problems, due to the curse of dimensionality,
a model for disease or a model for selection is needed in order to obtain
well behaved estimators of the AUC when the marker and/or the measured
covariates are continuous. We describe estimators that are consistent
and asymptotically normal (CAN) for the AUC under each model. More interestingly,
we describe a doubly robust estimator that has the attractive feature
of being CAN if either the disease or the selection model (but not necessarily
both) is correct. We illustrate our methods with data from a study
run by the Nuclear Imaging Group at Cedars Sinai Medical Center on the
efficacy of electron beam computed tomography to detect coronary artery
disease.
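
To make the selection-model route concrete, the sketch below computes an inverse-probability-weighted AUC from verified patients only, weighting each diseased/healthy pair by the product of the inverse verification probabilities. It is a minimal sketch under stated assumptions: the verification probabilities are taken as given (in practice they would come from a fitted selection model), selection is treated as ignorable given the measured data, and neither the doubly robust estimator nor the residual-selection-bias adjustment from the talk is implemented. The function name `ipw_auc` is hypothetical.

```python
def ipw_auc(markers, disease, probs):
    """Inverse-probability-weighted AUC using verified patients only.

    markers: marker values for the verified patients.
    disease: verified disease status (1 = diseased, 0 = healthy).
    probs:   each patient's probability of having been verified
             (assumed known or previously estimated).
    """
    cases = [(x, 1.0 / p) for x, d, p in zip(markers, disease, probs) if d == 1]
    controls = [(x, 1.0 / p) for x, d, p in zip(markers, disease, probs) if d == 0]

    num = 0.0
    den = 0.0
    for x_case, w_case in cases:
        for x_ctrl, w_ctrl in controls:
            w = w_case * w_ctrl
            den += w
            if x_case > x_ctrl:
                num += w
            elif x_case == x_ctrl:
                num += 0.5 * w  # ties count one half
    return num / den
```

When every verification probability equals one, this reduces to the usual empirical (Mann-Whitney) AUC.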

**Donald B. Rubin**

John L. Loeb Professor of Statistics, Department of Statistics

*Multiple Imputation for Item Nonresponse: Some Current Theory and
Application to Anthrax Vaccine Experiments at CDC*

Multiple imputation has become, since its proposal a quarter of a century
ago (Rubin 1978), a standard tool for dealing with item nonresponse.
There is now widely available free and commercial software for both
the analysis of multiply-imputed data sets and for their construction.
The methods for their analysis are very straightforward and many evaluations
of their frequentist properties, both with artificial and real data,
have supported the broad validity of multiple imputation in practice, at
least relative to competing methods. The methods for the construction
of a multiply-imputed data set, however, either (1) assume theoretically
clean situations, such as monotone patterns of missing data or a convenient
multivariate distribution, such as the general location model or t-based
extensions of it; or (2) use theoretically less well justified, fully
conditional "chained equations," which can lead to "incompatible"
distributions in theory, which often seem to be harmless in practice.
Thus, there remains the challenge of constructing multiply-imputed data
sets in situations where the missing data pattern is not monotone or
the distribution of the complete data is complex in the sense of being
poorly approximated by standard analytic multivariate distributions.
A current example that illustrates current work on this issue involves
the multiple imputation of missing immunogenicity and reactogenicity
measurements in ongoing randomized trials at the US CDC, which compare
different versions of vaccinations for protection against lethal doses
of inhalation anthrax. The method used to create the imputations involves
capitalizing on approximately monotone patterns of missingness to help
implement the chained equation approach, thereby attempting to minimize
incompatibility; this method extends the approach in Rubin (2003) used
to multiply impute the US National Medical Expenditure Survey.

**Daniel Scharfstein**

Johns Hopkins Bloomberg School of Public Health

*Sensitivity Analysis for Informatively Interval-Censored Discrete
Time-to-Event Data*

Coauthors: Michelle Shardell, Noya Galai, David Vlahov, Samuel A. Bozzette

In many prospective studies, subjects are evaluated for the occurrence
of an absorbing event of interest (e.g., HIV infection) at baseline
and at a common set of pre-specified visit times after enrollment. Since
subjects often miss scheduled visits, the underlying visit of first
detection may be interval censored, or more generally, coarsened. Interval-censored
data are usually analyzed using the non-identifiable coarsening at random
(CAR) assumption. In some settings, the visit compliance and underlying
event time processes may be associated, in which case CAR is violated.
To examine the sensitivity of inference, we posit a class of models
that express deviations from CAR. These models are indexed by nonidentifiable,
interpretable parameters, which describe the relationship between visit
compliance and event times. Plausible ranges for these parameters require
eliciting information from scientific experts. For each model, we use
the EM algorithm to estimate marginal distributions and proportional
hazards model regression parameters. The performance of our method is
assessed via a simulation study. We also present analyses of two studies:
AIDS Clinical Trial Group (ACTG) 181, a natural history study of cytomegalovirus
shedding among advanced AIDS patients, and AIDS Link to the Intravenous
Experience (ALIVE), an observational study of HIV infection among intravenous
drug users. A sensitivity analysis of study results is performed using
information elicited from substantive experts who worked on ACTG 181
and ALIVE.

**Alastair Scott**

University of Auckland

*Fitting family-specific models to retrospective family data*

Case-control studies augmented by the values of responses and covariates
from family members allow investigators to study the association of
the response with genetics and environment by relating differences in
the response directly to within-family differences in the covariates.
Most existing approaches to case-control family data parametrize covariate
effects in terms of the marginal probability of response, the same effects
that one estimates from standard case-control studies. This paper focuses
on the estimation of family-specific effects. We note that the profile
likelihood approach of Neuhaus, Scott & Wild (2001) can be applied
in any setting where one has a fully specified model for the vector
of responses in a family and, in particular, to family-specific models
such as binary mixed-effects models. We illustrate our approach using
data from a case-control family study of brain cancer and consider the
use of conditional and weighted likelihood methods as alternatives.

**Tulay Koru-Sengul**

Department of Statistics at the University of Pittsburgh

*The Time-Varying Autoregressive Model With Covariates For Analyzing
Longitudinal Data With Missing Values*

Researchers are frequently faced with the problem of analyzing data
with missing values. Missing values are practically unavoidable in large
longitudinal studies and incomplete data sets make the statistical analyses
very difficult.

A new composite method for handling missing values on both the outcome
and the covariates has been developed by combining the methods known
as multiple imputation and stochastic regression imputation.
The composite imputation method also uses a new modeling approach for longitudinal
data called the time-varying autoregressive model with time-dependent
and/or time-independent covariates. The new model can be thought of
as a version of the transition general linear model used to analyze
longitudinal data. Simulation results will be discussed to compare the
traditional methods to the new composite method of handling missing
values on both the outcome and the covariates.

Application of the model and the composite method will be studied by
using a dataset from a longitudinal epidemiological study of the Maternal
Health Practices and Child Development Project that has been conducted
at the Magee-Womens Hospital in Pittsburgh and the Western Psychiatric
Institute and Clinic at the University of Pittsburgh Medical Center
Health System.

**Jamie Stafford**

Department of Public Health Sciences, University of Toronto

*ICE: Iterated Conditional Expectations*

The use of local likelihood methods (Loader 1999) in the presence of
data that is either interval censored, or has been aggregated into bins,
leads naturally to the consideration of EM-type strategies. We focus
primarily on a class of local likelihood density estimates where one
member of this class retains the simplicity and interpretive appeal
of the usual kernel density estimate for completely observed data. It
is computed using a fixed point algorithm that generalizes the self-consistency
algorithms of Efron (1967), Turnbull (1976), and Li et al. (1997) by
introducing kernel smoothing at each iteration.

Numerical integration permits a local EM algorithm to be implemented
as a global Newton iteration where the latter's excellent convergence
properties can be exploited. The method requires an explicit solution
of the local likelihood equations at the M-step and this can always
be found through the use of symbolic Newton-Raphson (Andrews and Stafford
2000). Iteration is thus rendered on the E-step only where a conditional
expectation operator is applied, hence ICE. Other local likelihood classes
considered include those for intensity estimation, local regression
in the context of a generalized linear model, and so on.

**Mary Thompson**

University of Waterloo

*Interval censoring of event times in the National Population Health
Survey*

The longitudinal nature of the National Population Health Survey allows
the use of event history analysis techniques in order to study relationships
among events. For example, let T1 and T2 be the times of becoming pregnant
and smoking cessation respectively. Thompson and Pantoja Galicia (2002)
propose a formal nonparametric test for a partial order relationship.
This test involves the estimation of the survivor functions of T1 and
T2, as well as the joint distribution of (T1, T2-T1). However, with
longitudinal survey data, the times of occurrence of the events are
interval censored in general. For example, starting at the second cycle
of the NPHS, it is possible to know within an interval of length at
most a year whether a smoker has ceased smoking with respect to the
previous cycle. Also information about the date of becoming pregnant
can be inferred within a time interval from cycle to cycle. Therefore,
estimating the joint densities from the interval censored times becomes
an important issue and our current problem. We propose a mode of attack
based on extending the ideas of Duchesne and Stafford (2001) and Braun,
Duchesne and Stafford (2003) to the bivariate case. Our method involves
uniform sampling of the censored areas, using a technique described
by Tang (1993).

**Chris Wild**

Dept of Statistics, University of Auckland

*Some issues of efficiency and robustness*

We investigate some issues of efficiency, robustness and study design
affecting semiparametric maximum likelihood and survey-weighted analyses
for linear regression under two-phase sampling and bivariate binary
regressions, especially those occurring in secondary analyses of case-control
data.

**Grace Yi**

Dept. of Stat. and Act. Sci., University of Waterloo

*Median Regression Models for Longitudinal Data with Missing Observations*

Recently median regression models have received increasing attention.
The models are attractive because they are robust and easy to interpret.
In this talk I will discuss using median regression models to deal with
longitudinal data with missing observations. The inverse probability
weighted generalized estimating equations (GEE) approach is proposed
to estimate the median parameters for incomplete longitudinal data,
where the inverse probability weights are estimated from the regression
model for the missing data indicators. The consistency and asymptotic
distribution of the resultant estimator are established. Numerical studies
for the proposed method will be discussed in this talk.
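
The weighting idea can be seen in its simplest form by dropping the covariates and the longitudinal structure: an inverse-probability-weighted median of the observed responses. The sketch below is only that special case, offered as an assumption-laden illustration rather than the GEE procedure of the talk. The observation probabilities are taken as given (in the talk they are estimated from a regression model for the missing data indicators), and the function name `ipw_median` is hypothetical.

```python
def ipw_median(values, probs):
    """Weighted median of observed responses with weights 1 / prob.

    values: observed responses.
    probs:  each response's probability of being observed (assumed given).
    Returns the smallest observed value whose cumulative weight reaches
    half of the total weight.
    """
    pairs = sorted((y, 1.0 / p) for y, p in zip(values, probs))
    half = sum(w for _, w in pairs) / 2.0
    cumulative = 0.0
    for y, w in pairs:
        cumulative += w
        if cumulative >= half:
            return y
    raise ValueError("no observations supplied")
```

Responses that were unlikely to be observed receive large weights, so they stand in for the similar responses that went missing.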

**Yang Zhao**

*Maximum Likelihood Methods for Regression Problems with Missing Data*

Parametric regression models are widely used for the analysis of a
response given a vector of covariates. However, in many settings certain
variable values may go unobserved either by design or happenstance.
For the case that some covariates are missing at random (MAR) we discuss
maximum likelihood methods for estimating the regression parameters
using an EM algorithm. Profile likelihood methods for estimating variances
and confidence intervals are also given. The case when response is MAR
can be treated similarly.
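
A toy version of the EM approach may help fix ideas. The sketch below assumes a single binary covariate X that is sometimes missing, a binary response Y that is always observed, and a saturated model with parameters P(X = 1), P(Y = 1 | X = 1) and P(Y = 1 | X = 0). These modeling choices, the starting values, and the function name `em_missing_binary_covariate` are all assumptions made for illustration; the talk's models are more general, and the profile likelihood variance methods are not shown.

```python
def em_missing_binary_covariate(data, n_iter=200):
    """EM for a binary response with a binary covariate missing at random.

    data: list of (x, y) with x in {0, 1, None} and y in {0, 1};
          None marks a missing covariate value.
    Returns estimates of (P(X=1), P(Y=1 | X=1), P(Y=1 | X=0)).
    """
    pi, p1, p0 = 0.5, 0.6, 0.4  # arbitrary starting values
    n = len(data)
    for _ in range(n_iter):
        # E-step: expected value of X for each subject given the observed data.
        weights = []
        for x, y in data:
            if x is None:
                like1 = pi * (p1 if y == 1 else 1.0 - p1)
                like0 = (1.0 - pi) * (p0 if y == 1 else 1.0 - p0)
                weights.append(like1 / (like1 + like0))
            else:
                weights.append(float(x))
        # M-step: complete-data maximum likelihood with X replaced by its expectation.
        s1 = sum(weights)
        s0 = n - s1
        pi = s1 / n
        p1 = sum(w * y for w, (_, y) in zip(weights, data)) / s1
        p0 = sum((1.0 - w) * y for w, (_, y) in zip(weights, data)) / s0
    return pi, p1, p0
```

With no missing covariates the first M-step already returns the complete-data maximum likelihood estimates, which is a convenient sanity check on the update formulas.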
