SCIENTIFIC PROGRAMS AND ACTIVITIES

April 21, 2014
May 30 - June 1, 2012
Fields Institute Symposium on the Analysis of Survey Data and Small Area Estimation
in honour of the 75th Birthday of Professor J.N.K. Rao

Held at the Schools of Computer Science and Mathematics and Statistics, Herzberg Building (HP)

Organizers
Patrick Farrell (Carleton)
David Haziza (Montreal)
Mike Hidiroglou (Statistics Canada)
Jason Nielsen (Carleton)
Sanjoy Sinha (Carleton) sinha@math.carleton.ca
Wesley Yung (Statistics Canada)

 

Abstracts

Jean-François Beaumont*, Statistics Canada
The Analysis of Survey Data Using the Bootstrap

The bootstrap is a convenient tool for estimating design variances of finite population parameters or model parameters. It is typically implemented by producing design bootstrap weights that are made available to survey analysts. When analysts are interested in making inferences about model parameters, two sources of variability are normally taken into account: the model that generates data of the finite population and the sampling design. When the overall sampling fraction is negligible, the model variability can be ignored and standard bootstrap techniques that account for the sampling design variability can be used (e.g., Rao and Wu, 1988; Rao, Wu and Yue, 1992). However, there are many practical cases where the model variability cannot be ignored, as evidenced by an empirical study. We show how to modify design bootstrap weights in a simple way to account for the model variability. The analyst may also be interested in testing hypotheses about model parameters. This can be achieved by replicating a simple weighted model-based test statistic using the bootstrap weights. Our approach can be viewed as a bootstrap version of the Rao-Scott test (Rao and Scott, 1981). We illustrate through a simulation study that both methods perform better than the standard Wald or Bonferroni test.
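As a rough illustration of what "design bootstrap weights" can look like, the following minimal sketch generates rescaled bootstrap weights in the spirit of Rao and Wu (1988) and Rao, Wu and Yue (1992), resampling n_h - 1 primary sampling units with replacement in each stratum. It covers only the standard design bootstrap, not the modification for model variability described in the talk; the function and variable names are hypothetical.

```python
import random
from collections import Counter

def rao_wu_bootstrap_weights(strata, weights, n_reps, seed=0):
    """Return n_reps lists of rescaled bootstrap weights.

    strata  : stratum label for each sampled PSU
    weights : design weight for each sampled PSU
    Assumes every stratum contains at least 2 sampled PSUs.
    """
    rng = random.Random(seed)
    by_stratum = {}
    for idx, h in enumerate(strata):
        by_stratum.setdefault(h, []).append(idx)

    replicates = []
    for _ in range(n_reps):
        w_star = [0.0] * len(weights)
        for members in by_stratum.values():
            n_h = len(members)
            # Draw n_h - 1 PSUs with replacement from the n_h sampled PSUs.
            draws = Counter(rng.choices(members, k=n_h - 1))
            for idx in members:
                # Rescale by n_h / (n_h - 1) times the number of times drawn.
                w_star[idx] = weights[idx] * n_h / (n_h - 1) * draws[idx]
        replicates.append(w_star)
    return replicates

def bootstrap_variance(y, weights, replicates):
    """Bootstrap variance estimate for the weighted total of y."""
    theta_hat = sum(w * v for w, v in zip(weights, y))
    thetas = [sum(w * v for w, v in zip(rep, y)) for rep in replicates]
    return sum((t - theta_hat) ** 2 for t in thetas) / len(thetas)
```

An analyst would then recompute the statistic of interest once per replicate weight vector, as bootstrap_variance does for a weighted total.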

David Bellhouse*, University of Western Ontario
The Teaching of a First Course in Survey Sampling Motivated by Survey Data Analysis

Survey sampling is often seen to be out of sync with many of the statistics courses that students take over their undergraduate or early graduate programs. Historically, most topics in sampling courses were motivated by the production of estimates from surveys run by government agencies. Compared to the time when I studied from William Cochran's Sampling Techniques, the situation in survey sampling has changed substantially, and for the better, with the current use of Sharon Lohr's book Sampling: Design and Analysis. In one sense, she has followed a traditional approach: a general discussion of survey design, followed by estimation techniques for means, totals and proportions and then some topics in survey data analysis late in the book. This approach differs from many courses in statistics taught today, for example regression and experimental design, where the statistical theory is often motivated by problems in data analysis constrained by the study design. Over the past three years I have experimented with an approach to teaching a first course in survey sampling by motivating survey estimation through survey data analysis. Essentially, I asked the question: What if I begin the course with some of the analysis topics that appear late in Lohr's book thus bringing the course more in line with other statistics courses? My talk will focus on the techniques that I used and the results of this experiment.

Gauri S. Datta*, University of Georgia
Benchmarking Small Area Estimators

In this talk, we consider benchmarking issues in the context of small area estimation. We find optimal estimators within the class of benchmarked linear estimators under either external or internal benchmark constraints. This extends existing results for both external and internal benchmarking, and also provides some links between the two. In addition, necessary and sufficient conditions for self-benchmarking are found for an augmented model. Most of our results are found using ideas of orthogonal projection. To illustrate the results of this paper, we present an example using a model and data from the Census Bureau's Small Area Income and Poverty Estimates (SAIPE) program.
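To make the notion of an external benchmark constraint concrete, here is a minimal sketch of simple ratio benchmarking, in which all small area estimates are scaled by a common factor so that their weighted sum matches a given total. This is a basic textbook adjustment, not the optimal benchmarked linear estimators derived in the talk; the function name and inputs are hypothetical.

```python
def ratio_benchmark(estimates, weights, target):
    """Scale small area estimates so that sum(w_i * estimate_i) equals target."""
    current = sum(w * e for w, e in zip(weights, estimates))
    if current == 0:
        raise ValueError("weighted sum of estimates is zero; ratio adjustment undefined")
    factor = target / current
    return [e * factor for e in estimates]
```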

This is a joint work with W.R. Bell of U.S. Census Bureau, Washington, D.C. 20233, U.S.A., and Malay Ghosh, Department of Statistics, University of Florida, Gainesville, Florida 32611, U.S.A.

Patrick J. Farrell*, Carleton University, Brenda MacGibbon*, Université du Québec à Montréal, Gillian Bartlett, McGill University, Thomas J. Tomberlin, Carleton University
The Estimating Function Jackknife Variance Estimator in a Marginal Logistic Regression Model with Repeated Measures and a Complex Sample Design: Its Small Sample Properties and an Application

One of the most important aspects of modeling binary data with repeated measures under a complex sampling design is to obtain efficient variance estimators in order to test for covariates, and to perform overall goodness-of-fit tests of the model to the data. Influenced by the work of Rao (Rao 1998, Rao and Tausi 2004, Roberts et al. 2009), we use his adaptation of estimating functions and the estimating function bootstrap to the marginal logistic model with repeated measures and complex survey data in order to obtain estimating function jackknife variance estimators. In particular, we conduct Monte Carlo simulations in order to study the level and power of tests using this estimator proposed by Rao. The method is illustrated on an interesting data set based on questionnaires concerning the willingness of patients to allow their e-health data to be used for research.

Robert E. Fay*, Westat
The Multiple Facets of Small Area Estimation: A Case Study

Prof. Rao's many contributions to the literature on small area estimation are widely recognized and acknowledged. This talk selects just one example from his work as an illustration. The Rao-Yu model and subsequent variants provide methods for small area estimation that incorporate time-series aspects. The models are suitable for a set of sample observations for a characteristic observed at multiple time points. Unlike some other proposals, the Rao-Yu model accommodates sample observations correlated across time, as would be typical of panel or longitudinal surveys. This talk describes an application and some extensions of this model to the National Crime Victimization Survey (NCVS) in the U.S. Until now, the almost exclusive focus of the NCVS has been to produce national estimates of victimizations by type of crime annually. The talk describes the attraction, but also some of the challenges, of applying the Rao-Yu model to produce annual state estimates of crime from the NCVS. The talk will note how small area applications are typically multi-faceted, in the sense that often elements of science, technology, engineering, and mathematics (STEM) must be brought together for an effective result.

Moshe Feder*, University of Southampton
State-Space Modelling of U.K. Labour Force Survey Rolling Quarterly Wave Data

We propose a multivariate state space model for the U.K. Labour Force Survey rolling quarterly wave data. The proposed model is based upon a basic structural model with seasonality, an auto-regressive survey error model and a mode effect (person-to-person vs. telephone interviews). The proposed approach takes advantage of the temporal structure of the data to improve estimates and to extract its unobserved components. Two alternatives for modelling the seasonal component will also be discussed. Finally, we'll present some simulation results.

Wayne A. Fuller*, Iowa State University
Small Area Prediction: Some Comments

Small area prediction using complex sample survey data is reviewed, emphasizing those aspects of estimation impacted by the survey design. Variance models, nonlinear models, benchmarking, and parameter estimation are discussed.

Malay Ghosh*, University of Florida, Rebecca Steorts, University of Florida
Two-Stage Bayesian Benchmarking as Applied to Small Area Estimation


The paper considers two-stage benchmarking. We consider a single weighted squared error loss that combines the loss at the domain-level and the area-level. We benchmark the weighted means at each level or both the weighted means and the weighted variability, the latter only at the domain-level. We provide also multivariate versions of these results. Finally, we illustrate our methods using a study from the National Health Interview Survey (NHIS) in the year 2000. The goal was to estimate the proportion of people without health insurance for many domains of the Asian subpopulation.

David Haziza*, Université de Montréal, Audrey Béliveau, Simon Fraser University, Jean-François Beaumont, Statistics Canada
Simplified Variance Estimation in Two-Phase Sampling

Two-phase sampling is often used in practice when the sampling frame contains little or no information. In two-phase sampling, the total variance of an estimator can be expressed as the sum of two terms: the first-phase variance and the second-phase variance. Estimating the total variance can be achieved by estimating each term separately. However, the resulting variance estimator may not be easy to compute in practice because it requires the second-order inclusion probabilities for the second-phase sampling design, which may not be tractable. Also, it requires specialized software for variance estimation for two-phase sampling designs. In this presentation, we consider a simplified variance estimator that does not depend on the second-order inclusion probabilities of the second-phase sampling design and that can be computed using software designed for variance estimation in single-phase sampling designs. The simplified variance estimator is design-biased, in general. We present strategies under which the bias is small, where a strategy consists of the choice of a sampling design and a point estimator. Results of a limited simulation study that investigates the performance of the proposed simplified estimator in terms of relative bias will be shown.
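The decomposition referred to above is usually written as follows, where the subscripts 1 and 2 denote expectation and variance with respect to the first-phase and second-phase designs and s_1 is the first-phase sample. The notation is an assumption chosen for illustration, not taken from the presentation.

```latex
V(\hat{\theta})
  = \underbrace{V_1\!\left\{ E_2\!\left( \hat{\theta} \mid s_1 \right) \right\}}_{\text{first-phase variance}}
  + \underbrace{E_1\!\left\{ V_2\!\left( \hat{\theta} \mid s_1 \right) \right\}}_{\text{second-phase variance}}
```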

M.A. Hidiroglou*, Statistics Canada, V. Estevao, Statistics Canada, Y. You, Statistics Canada
Unit Level Small Area Estimation for Business Survey Data

Direct estimators of parameters for small areas of interest use regression estimators relying on well-correlated auxiliary data. As the domains get smaller, such estimators will become unreliable, and one option is to use small-area procedures. The one that we will illustrate in this presentation extends the unit level procedure originally proposed by Battese, Harter, and Fuller (1988) to include the following additional requirements. Firstly, the parameter that we estimate is a weighted mean of observations; secondly, the errors associated with the nested error regression model are heteroskedastic; and lastly, the survey weights are included in the estimation.

Three point estimators and their associated estimated mean squared errors are given in this presentation. The first one does not use the survey weights (EBLUP) but the last two (Pseudo-EBLUP) do make use of the survey weights. One of the Pseudo-EBLUP estimators was developed originally by Rubin-Bleuer et al. (2007). These three estimators are all implemented in the small-area prototype being developed at Statistics Canada (Estevao, Hidiroglou and You 2012). We illustrate the application of the proposed models and methods to real business survey data.
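For reference, the nested error regression model of Battese, Harter, and Fuller (1988) for unit j in area i is commonly written as below. The heteroskedastic form shown, with known constants k_{ij}, is one standard way of allowing unequal error variances and is an assumption here; the exact specification used in the presentation may differ.

```latex
y_{ij} = x_{ij}^{\top}\beta + v_i + e_{ij}, \qquad
v_i \overset{\text{iid}}{\sim} N(0, \sigma_v^2), \qquad
e_{ij} = k_{ij}\,\tilde{e}_{ij}, \quad
\tilde{e}_{ij} \overset{\text{iid}}{\sim} N(0, \sigma_e^2)
```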

Jiming Jiang*, University of California, Davis
The EMAF and E-MS Algorithms: Model Selection with Incomplete Data

In this talk, I will present two computer-intensive strategies of model selection with incomplete data that we recently developed. The E-M algorithm is well-known for parameter estimation in the presence of missing data. On the other hand, model selection, as another key component of model identification, may also be viewed as parameter estimation, with the parameter being [the identification (ID) number of] the model and the parameter space being the (ID numbers of the) model space. From this point of view, it is intuitive to consider extension of the E-M to model selection in the presence of missing data. Our first strategy, called the EMAF algorithm, is motivated by a recently developed procedure for model selection, known as the adaptive fence (AF), which we combine with the E-M algorithm to handle the missing-data situation. Our second strategy, called the E-MS algorithm, is a more direct extension of the E-M algorithm to model selection problems with missing data. This work is joint with Thuan Nguyen of the Oregon Health and Science University and J. Sunil Rao of the University of Miami.


Graham Kalton*, Westat
Design and Analysis of Venue-Based Samples

Venue-based sampling, also known as location sampling, center sampling, or intercept sampling, samples a population by collecting data from individuals contacted at a sample of locations and time periods. This method, which is often used for sampling hard-to-reach populations, involves multiple frames. This paper describes the 2008 venue-based survey of men who have sex with men (MSM) in the U.S. Centers for Disease Control and Prevention's National HIV Behavioral Surveillance (NHBS) system. The venues were mostly places such as bars, clubs, and social organizations where MSM congregate and the time periods were parts of days when the venues were open, sampled on a monthly basis over a six-month period. The paper discusses the calculation of survey weights for such samples, using the 2008 NHBS survey as an example. Implications for the design of venue-based samples are also discussed.

Jae-Kwang Kim*, Iowa State University, Sixia Chen, Iowa State University
Two-Phase Sampling Approach for Propensity Score Estimation in Voluntary Samples

Voluntary sampling is a non-probability sampling design whose sample inclusion probabilities are unknown. When the sample inclusion probability depends on the study variables being observed, the popular approach of the propensity score adjustment using the auxiliary information available for the population may lead to biased estimation. In this paper, we propose a novel application of the two-phase sampling idea to estimate the parameters in the propensity model. To apply the proposed method, we conduct an experiment in which a second attempt at data collection is made on the original sample, yielding a subset of the original sample called the second-phase sample. Under this two-phase sampling experiment, we can estimate the parameters in the propensity score model using calibration, and the propensity score adjustment can then be used to estimate the population parameters from the original voluntary sample. Once the propensity scores are estimated, we can incorporate additional auxiliary variables from the reference distribution by a calibration method. Results from some simulation studies are also presented.

Phillip Kott*, Research Triangle Institute
One Step or Two? Calibration Weighting when Sampling from a Complete List Frame

When a random sample drawn from a complete list frame suffers from unit nonresponse, calibration weighting can be used to remove nonresponse bias under either an assumed response or an assumed prediction model. Not only can this provide double protection against nonresponse bias, it can also decrease variance. By employing a simple trick one can simultaneously estimate the variance under the assumed prediction model and the mean squared error under the combination of an assumed response model and the probability-sampling mechanism in a relatively simple manner. Unfortunately, there is a practical limitation on what response model can be assumed when calibrating in a single step. In particular, the response function cannot always be logistic. This limitation does not hinder calibration weighting when performed in two steps: one to remove the response bias and one to decrease variance. There are efficiency advantages from using the two-step approach as well. Simultaneous linearized mean-squared-error estimation, although still possible, is not as straightforward.
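As a minimal sketch of what calibration weighting does in its simplest form, the following adjusts design weights linearly so that the weighted total of a single auxiliary variable matches a known frame total. It illustrates the basic linear calibration step only, under the assumption of one auxiliary variable; it does not implement the response-model or two-step procedures discussed in the talk, and the function name is hypothetical.

```python
def linear_calibration(design_weights, x, x_total):
    """Return weights w_i = d_i * (1 + lam * x_i) with sum(w_i * x_i) == x_total."""
    weighted_x = sum(d * xi for d, xi in zip(design_weights, x))
    denom = sum(d * xi * xi for d, xi in zip(design_weights, x))
    if denom == 0:
        raise ValueError("auxiliary variable is identically zero; cannot calibrate")
    # Solve the single calibration equation for the Lagrange multiplier lam.
    lam = (x_total - weighted_x) / denom
    return [d * (1 + lam * xi) for d, xi in zip(design_weights, x)]
```

With several auxiliary variables, the scalar lam becomes a vector obtained by solving a small linear system, but the structure of the adjustment is the same.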


Snigdhansu Chatterjee, University of Minnesota, Partha Lahiri*, University of Maryland, College Park
Parametric Bootstrap Methods in Small Area Estimation Problems

In small area estimation, empirical best prediction (EBP) methods are routinely used in combining information from various data sources. One major challenge for this approach is the estimation of an accurate mean squared error (MSE) of EBP that captures all sources of variations. But, the basic requirements of second-order unbiasedness and non-negativity of the MSE estimator of an EBP have led to different complex analytical adjustments in different MSE estimation techniques. We suggest a parametric bootstrap method to replace laborious analytical calculations by computer-oriented simple techniques, without sacrificing the basic requirements in an MSE estimator. The method works for a general class of mixed models and different techniques of parameter estimation.

Isabel Molina*, Universidad Carlos III de Madrid, Balgobin Nandram, Worcester Polytechnic Institute, J.N.K. Rao, Carleton University
Hierarchical Bayes Estimation of Poverty Indicators in Small Areas

A new methodology is proposed for estimation of general nonlinear parameters in small areas, based on the Hierarchical Bayes approach. Only non-informative priors are considered and, with these priors, Markov chain Monte Carlo procedures and the convergence problems therein are avoided. At the same time, less computational effort is required. Results are compared in simulations with those obtained using the empirical Bayes methodology under a frequentist point of view. This methodology is illustrated through the problem of estimation of poverty indicators as particular cases of nonlinear parameters. Finally, an application to poverty mapping in Spanish provinces by gender is carried out.


Esther López-Vizcaíno, María José Lombardía, Domingo Morales*, Universidad Miguel Hernandez de Elche
Small Area Estimation of Labour Force Indicators under Multinomial Mixed Models with Time and Area Effects

The aim of this paper is the estimation of small area labour force indicators like totals of employed and unemployed people and unemployment rates. Small area estimators of these quantities are derived from a multinomial logit mixed model with time and area random effects. Mean squared errors are used to measure the accuracy of the proposed estimators and they are estimated by explicit formulas and bootstrap methods. The behavior of the introduced estimators is empirically investigated in simulation experiments. The introduced methodology is applied to real data from the Spanish Labour Force Survey of Galicia.

Ralf T. Münnich*, Ekkehard W. Sachs, Matthias Wagner, University of Trier, Germany
Calibration Benchmarking for Small Area Estimates: An Application to the German Census 2011

In the 2010/11 Census round in Europe, several countries introduced new methodologies. Countries like Germany and Switzerland decided to apply a register-assisted method. In addition to using the population register data, a sample is drawn in order to allow for estimating values which are not available in the register. Generally, these estimates have to be considered on different (hierarchical) aggregation levels. Additionally, several cross classifications in different tables with overlapping marginal distributions are of interest. On higher aggregation levels classical design-based methods may be preferable whereas in lower aggregation levels small area techniques are more appropriate. The variety of aggregation levels in connection with different estimation methods may then lead to severe coherence problems. The present paper focuses on a specialized calibration problem which takes into account the different kinds of estimates for areas and domains while using penalization procedures. The procedure allows understanding possible problematic constraints in order to enable the end-user to relax certain boundary conditions to achieve better overall results. An application to the German Census 2011 will be given.

Balgobin Nandram*, Worcester Polytechnic Institute
A Bayesian Analysis of a Two-Fold Small Area Model for Binary Data

We construct a hierarchical Bayesian model for binary data, obtained from a number of small areas, and we use it to make inference for the finite population proportion of each area. Within each area there is a two-stage cluster sampling design and a two-fold model incorporates both an intracluster correlation (between two units in the same cluster) and an intercluster correlation (between two units in different clusters). The intracluster correlation is important because it is used to accommodate the increased variability due to the clustering effect and we study it in detail. Using two goodness of fit Bayesian procedures, we compare our two-fold model with a standard one-fold model which does not include the intracluster correlation. Although the Gibbs sampler is the natural way to fit the two-fold model, we show that random samples can be used, thereby providing a faster and more efficient algorithm. We describe an example on the Third International Mathematics and Science Study and a simulation study to compare the two models. While the one-fold model gives estimates of the proportions with smaller posterior standard deviations, our goodness of fit procedures show that the two-fold model is to be preferred and the simulation study shows that the two-fold model has much better frequentist properties than the one-fold model.

Danny Pfeffermann*, Hebrew University of Jerusalem and Southampton Statistical Sciences Research Institute
Model Selection and Diagnostics for Model-Based Small Area Estimation

(Joint paper with Dr. Natalie Shlomo and Mr. Yahia El-Horbaty)
Model selection and diagnostics is one of the difficult aspects of model-based small area estimation (SAE) because the models usually contain random effects at one or more levels of the model, which are not observable. Careful model testing is required under both the frequentist approach and the Bayesian approach. It is important to emphasize also that misspecification of the distribution of the random effects may affect the specification of the functional (fixed) part of the model and conversely, misspecification of the functional relationship may affect the specification of the distribution of the random effects.

In the first part of my talk I shall review recent developments in model checking for model-based SAE under the Bayesian and frequentist approaches and propose a couple of new model testing procedures. A common feature of articles studying model specification is that they usually only report the performance of the newly developed methods. In the second part of my talk I shall compare empirically several frequentist procedures of model testing in terms of the computations involved and the powers achieved in rejecting mis-specified models.

Shu Jing Gu, University of Alberta, N.G.N. Prasad*, University of Alberta
Estimation of Median for Successive Sampling

Survey practitioners widely use successive sampling methods to estimate characteristic changes over time. The main concern in such methods is nonresponse, due to the fact that the same individuals are sampled repeatedly to observe responses. To overcome this problem, partial replacement sampling (rotation sampling) schemes are adopted. However, most of the methods available in the literature are focused on estimation of linear parameters such as a mean or a total. Recently, some attempts have been made to estimate population quantiles under repeated sampling schemes. However, these methods are restricted to simple random sampling and also require estimation of the density function of the underlying characteristics. The present work uses an estimating approach to obtain estimates for population quantiles under repeated sampling when units are selected under an unequal probability sampling scheme. Performance of the proposed method to estimate a finite population median is examined through a simulation study as well as using real data sets.

J. Sunil Rao*, University of Miami
Fence Methods for Mixed Model Selection

This talk reviews a body of work related to some ideas for mixed model selection. It's meant to provide a bridge to some new work that will also be presented at the symposium on model selection with missing data.

Many model search strategies involve trading off model fit with model complexity in a penalized goodness of fit measure. Asymptotic properties for these types of procedures in settings like linear regression and ARMA time series have been studied, but these do not naturally extend to non-standard situations such as mixed effects models. I will detail a class of strategies known as fence methods which can be used quite generally including for linear and generalized linear mixed model selection. The idea involves a procedure to isolate a subgroup of what are known as correct models (of which the optimal model is a member). This is accomplished by constructing a statistical fence, or barrier, to carefully eliminate incorrect models. Once the fence is constructed, the optimal model is selected from amongst those within the fence according to a criterion which can be made flexible. A variety of fence methods can be constructed, based on the same principle but applied to different situations, including clustered and non-clustered data, linear or generalized linear mixed models, and Gaussian or non-Gaussian random effects. I will illustrate some via simulations and real data analyses. In addition, I will also show how we used an adaptive version of the fence method for fitting non-parametric small area estimation models and quite differently, how we developed an invisible fence for gene set analysis in genomic problems.

This is joint work with Jiming Jiang of UC-Davis and Thuan Nguyen of Oregon Health and Science University.
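The fence idea described above can be sketched in a few lines: models whose lack-of-fit measure is within a margin of the best-fitting model lie inside the fence, and the most parsimonious of those is selected. This is a simplified illustration that assumes a single common margin c * sigma for all models, whereas the actual procedures use model-specific standard deviation estimates and an adaptively chosen cut-off; all names are hypothetical.

```python
def fence_select(candidates, c, sigma):
    """Select a model by a simplified fence rule.

    candidates : list of (name, dimension, lack_of_fit) tuples
    c, sigma   : cut-off constant and a common scale for the fence width
                 (a simplifying assumption for this sketch)
    """
    q_best = min(q for _, _, q in candidates)
    # Keep the models whose lack of fit is within the fence.
    inside = [m for m in candidates if m[2] <= q_best + c * sigma]
    # Among those, prefer the smallest dimension, then the smallest lack of fit.
    return min(inside, key=lambda m: (m[1], m[2]))
```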

Louis-Paul Rivest*, Université Laval
Copula-Based Small Area Predictions for Unit Level Models

Small area predictions are typically constructed with multivariate normal models. We suggest a decomposition of the standard normal unit level small area model in terms of a normal copula for the dependency within a small area and marginal normal distributions for the unit observations. It turns out that the copula is a key ingredient for the construction of small area predictions. This presentation introduces copula based small area predictions. They are semi-parametric in the sense that they do not make any assumption on the marginal distribution of the data. They are valid as long as the assumed copula captures the residual dependency between the unit observations within a small area. Besides the normal copula, multivariate Archimedean copulas can also be used to construct small area predictions. This provides a new perspective on small area predictions when the normality assumption fails as studied by Sinha & Rao (2009). This reports joint work with François Verret of Statistics Canada.


A.K.Md. Ehsanes Saleh*, Carleton University
Model Assisted Improved Estimation of Population Totals

Consider a finite population $P$ of size $N$ partitioned into $k$ strata $P_1, P_2, \ldots, P_k$, with sizes $N_1, N_2, \ldots, N_k$, respectively. Each stratum is then subdivided into $p$ sub-strata. We estimate the population total, $T = T_1 + T_2 + \cdots + T_k$, by estimating the $h$-th stratum total $T_h$ using model-assisted methodology based on the regression models for each of the sub-strata $P_{hj}$, namely $y_{hj} = \theta_{hj} 1_{N_{hj}} + \beta_{hj} x_{hj} + e_{hj}$, $e_{hj} \sim N_{N_{hj}}(0, \sigma^2 I_{N_{hj}})$, when it is suspected that $\beta_{h1} = \beta_{h2} = \cdots = \beta_{hp} = \beta_0$ (unknown). Let $S_{hj}$ be a random sample of size $n_{hj}$ from $P_{hj}$ so that $n_{h1} + n_{h2} + \cdots + n_{hp} = n_h$. Accordingly, we define five estimators, namely, (i) the unrestricted estimator (UE), (ii) the restricted estimator (RE), (iii) the preliminary test estimator (PTE), (iv) the James-Stein-type estimator (SE), and (v) the positive-rule Stein-type estimator (PRSE), and compare their dominance properties. It is shown that one ordering of these estimators holds uniformly for $p = 3$, while a further ordering holds under the equality of the slopes. The PTE dominance depends on the level of significance of the PT-test. For $p = 2$, we provide alternative estimators.

Fritz Scheuren*, Human Rights Data Analysis Group
Indigenous Population: Small Domain Issues

Small domain issues exist in nearly all settings. Some, like in the work of JNK Rao, can be addressed by statistical models and sound inferences obtained. Some require that additional issues, such as potential misclassifications and record linkage errors, be addressed.

For many years there has been a small group of researchers focused on aboriginal issues working internationally (in Canada, Australia, New Zealand, and the United States). Now in all these countries a modest fraction of the Indigenous peoples still live in close proximity in relatively homogeneous (usually rural) communities. However, many of the Indigenous, maybe half or more, depending on the country, live widely dispersed among the general population.

Many Indigenous people suffer from lifestyle issues, like diabetes, that were not native to them. Many, too, despite racial prejudice, have intermarried and in some cases are indistinguishable from the general population. Given these diverse circumstances, measuring differential indigenous mortality and morbidity is extremely difficult. Still, it is to these latter concerns that the beginnings of an international plan of action are addressed in this paper.

Junheng Ma, National Institute of Statistical Sciences, J. Sedransk*, Case Western Reserve University
Bayesian Predictive Inference for Finite Population Quantities under Informative Sampling

We investigate Bayesian predictive inference for finite population quantities when there are unequal probabilities of selection. Only limited information about the sample design is available, i.e., only the first-order selection probabilities corresponding to the sample units are known. Our probabilistic specification is similar to that of Chambers, Dorfman and Wang (1998). We make inference for finite population quantities such as the mean and quantiles and provide credible intervals. Our methodology, using Markov chain Monte Carlo methods, avoids the necessity of using asymptotic approximations. A set of simulated examples shows that the informative model provides improved precision over a standard ignorable model, and corrects for the selection bias.

Jeroen Pannekoek, Statistics Netherlands, Natalie Shlomo*, Southampton Statistical Sciences Research Institute, Ton de Waal, Statistics Netherlands
Calibrated Imputation of Numerical Data under Linear Edit Restrictions

A common problem faced by statistical offices is that data may be missing from collected datasets. The typical way to overcome this problem is to impute the missing data. The problem of imputing missing data is complicated by the fact that statistical data often have to satisfy certain edit rules and that values of variables sometimes have to sum up to known totals. The edit rules are most often formulated as linear restrictions on the variables that have to be satisfied by the imputed data. For example, for data on enterprises, edit rules could be that the profit and costs of an enterprise should sum up to its turnover and that the turnover should be at least zero. The totals of some variables may already be known from administrative data (turnover from a tax register) or estimated from other sources. Standard imputation methods for numerical data as described in the literature generally do not take such edit rules and totals into account. We describe algorithms for imputation of missing numerical data that take edit restrictions into account and ensure that sums are calibrated to known totals. These algorithms are based on a sequential regression approach that uses regression predictions to impute the variables one by one. For each missing value to be imputed we first derive a feasible interval in which the imputed value must lie in order to make it possible to impute the remaining missing values in the same unit in such a way that the imputed data for that unit satisfy the edit rules and sum constraints. To assess the performance of the imputation methods, a simulation study is carried out as well as an evaluation study based on a real dataset.

Chris Skinner*, London School of Economics and Political Science
Extending Missing Data Methods to a Measurement Error Setting

We consider a measurement error setting, where y denotes the true value of a variable of interest, y* denotes the value of y measured in a survey and z denotes an observed ordinal indicator of the accuracy with which y* measures y. This generalizes the classical missing data setting where z is binary, z = 1 denotes fully accurate measurement with y* = y and z = 0 denotes missing data with the value of y* set equal to a missing value code. The more general setting is motivated by an application where y is gross pay rate and z is an interviewer assessment of respondent accuracy or an indicator of whether the respondent consulted a pay slip when responding. We discuss possible approaches to inference about the finite population distribution function of y. We focus on a parametric modelling approach and the use of pseudo maximum likelihood. Modelling assumptions are discussed in the context of data from the British Household Panel Survey and an associated validation study of earnings data. We indicate a possible extension using parametric fractional imputation. This paper is joint work with Damiao da Silva (Southampton) and Jae-Kwang Kim (Iowa State).


Mary Thompson*, University of Waterloo
Bootstrap Methods in Complex Surveys

This talk presents a review of the bootstrap in the survey sampling context, with emphasis on the Rao and Wu (1988) paper, which is still the main reference on this topic, and on more recent work of Rao and others on bootstrapping with estimating functions. Some interesting problems, addressable in part by bootstrapping, arise in inference for the parameters of multilevel models when the sampling design is a multistage design and the inclusion probabilities are possibly informative, and in inference for network parameters under network sampling.

Mahmoud Torabi*, University of Manitoba
Spatio-temporal Modeling of Small Area Rare Events

In this talk, we use generalized Poisson mixed models for the analysis of geographical and temporal variability of small area rare events. In this class of models, spatially correlated random effects and temporal components are adopted. Spatio-temporal models that use conditionally autoregressive smoothing across the spatial dimension and B-spline smoothing over the temporal dimension are considered. Our main focus is to make inference for smooth estimates of spatio-temporal small areas. We use data cloning, which yields maximum likelihood, to conduct frequentist analysis of spatio-temporal modeling of small area rare events. The performance of the proposed approach is evaluated by applying it to a real dataset and through a simulation study.


Lola Ugarte*, University of Navarre
Deriving Small Area Estimates from the Information Technology Survey

Knowledge of the current state of the art in information technology (IT) of businesses in small areas is very important for Central and Local Governments, markets, and policy-makers because information technology makes it possible to collect information, to improve access to information, and to be more competitive. Information about IT is obtained through the Information Technology Survey, which is a common survey in Western countries. In this talk, we focus on developing small area estimators based on a double logistic model with categorical explanatory variables to obtain information about the penetration of IT in the Basque Country establishments in 2010. Auxiliary information for population totals is taken from a Business Register. A model-based bootstrap procedure is also given to provide the prediction MSE.

Changbao Wu*, University of Waterloo, Jiahua Chen, University of British Columbia and Jae-Kwang Kim, Iowa State University
Semiparametric Fractional Imputation for Complex Survey Data

Item nonresponses are commonly encountered in complex surveys. However, it is also common that certain baseline auxiliary variables can be observed for all units in the sample. We propose a semiparametric fractional imputation method for handling item nonresponses. Our proposed strategy combines the strengths of conventional single imputation and multiple imputation methods, and is easy to implement even with a large number of auxiliary variables available, which is typically the case for large scale complex surveys. A general theoretical framework will be presented and results from a comprehensive simulation study will be reported. This is joint work with Jiahua Chen of University of British Columbia and Jae-Kwang Kim of Iowa State University.

Yong You*, Statistics Canada, Mike Hidiroglou, Statistics Canada
Sampling Variance Smoothing Methods for Small Area Proportions

Sampling variance smoothing is an important topic in small area estimation. In this paper, we study sampling variance smoothing methods for small area estimators for proportions. We propose two methods to smooth the direct sampling variance estimators for proportions, namely, the design effects method and the generalized variance function (GVF) method. In particular, we will show that the proposed smoothing methods based on the design effects and GVF are equivalent. The smoothed sampling variance estimates will then be treated as known in the area level models for small area estimation. We evaluate and compare the smoothed variance estimates based on the proposed methods through the analysis of different survey data from Statistics Canada including LFS, CCHS, and PALS. The proposed sampling variance smoothing methods can also be applied and extended to more general estimation problems, including the estimation of proportions and counts.
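
One common form of design-effect smoothing can be sketched as follows. This is an assumption-laden illustration, not necessarily the method of the talk: it takes the area design effect to be the direct variance divided by the simple-random-sampling variance p(1 - p)/n, averages the design effects across areas, and rescales. The function name `smooth_variances` and the numbers are hypothetical.

```python
def smooth_variances(p, v, n):
    """Smooth direct sampling variances of area proportions.

    p: direct proportion estimates (each strictly between 0 and 1)
    v: direct sampling variance estimates
    n: area sample sizes
    Returns deff_bar * p_i * (1 - p_i) / n_i for each area, where
    deff_bar is the average estimated design effect (an assumed choice).
    """
    srs = [pi * (1.0 - pi) / ni for pi, ni in zip(p, n)]
    if any(s <= 0 for s in srs):
        raise ValueError("proportions must lie strictly between 0 and 1")
    deffs = [vi / si for vi, si in zip(v, srs)]
    deff_bar = sum(deffs) / len(deffs)
    return [deff_bar * si for si in srs]


# Hypothetical inputs: two areas with equal proportions and sample sizes
# but different direct variance estimates.
smoothed = smooth_variances([0.5, 0.5], [0.005, 0.0025], [100, 100])
print(smoothed)
```

Averaging the design effects pools information across areas, so the smoothed variances are more stable than the direct ones; in practice a model-based or synthetic proportion is sometimes used in place of the direct estimate in the p(1 - p)/n term.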

Susana Rubin-Bleuer*, Statistics Canada, Wesley Yung*, Statistics Canada, Sébastien Landry*, Statistics Canada
Variance Component Estimation through the Adjusted Maximum Likelihood Method

Estimation of variance components is a fundamental part of small area estimation. Unfortunately, standard estimation methods frequently produce negative estimates of the strictly positive model variances. As a result, the Empirical Best Linear Unbiased Predictor (EBLUP) could be subject to significant bias. Adjusted maximum likelihood estimators, which always yield positive variance estimators, have been recently studied by Li and Lahiri (2010) for the classical Fay-Herriot (1979) small area model. In this presentation, we propose an extension of Li and Lahiri to the time series and cross-sectional small area model proposed by Rao-Yu (1994). Theoretical properties are discussed, along with some empirical results.

Rebecca Steorts*, University of Florida and Malay Ghosh, University of Florida
On Estimation of Mean Squared Errors of Benchmarked Empirical Bayes Estimators

We consider benchmarked empirical Bayes (EB) estimators under the basic area-level model of Fay and Herriot while requiring the standard benchmarking constraint. In this paper we determine how much mean squared error (MSE) is lost by constraining the estimates through benchmarking. We show that the increase due to benchmarking is O(m^{-1}), where m is the number of small areas. Furthermore, we find an asymptotically unbiased estimator of this MSE and compare it to the second-order approximation of the MSE of the EB estimator or equivalently of the MSE of the empirical best linear unbiased predictor (EBLUP), which was derived by Prasad and Rao (1990). Moreover, using methods similar to those of Butar and Lahiri (2003), we compute a parametric bootstrap estimate of the MSE of the benchmarked EB estimate under the Fay-Herriot model and compare it to the MSE of the benchmarked EB estimate found by a second-order approximation. Finally, we illustrate our methods using SAIPE data from the U.S. Census Bureau.
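
For readers less familiar with the setting, the basic area-level model and a benchmarking constraint are usually written as below. The notation is an assumption, not taken from the abstract: \hat{\theta}_i is the direct estimate for area i, D_i its sampling variance (treated as known), x_i a vector of covariates, w_i benchmarking weights, \hat{\theta}_i^{BM} the benchmarked estimate, and t the benchmark target, often taken to be the weighted sum of the direct estimates.

```latex
\begin{align*}
\hat{\theta}_i &= \theta_i + e_i, & e_i &\sim N(0, D_i), \\
\theta_i &= x_i^{\top}\beta + u_i, & u_i &\sim N(0, \sigma_u^2), \qquad i = 1, \dots, m, \\
\sum_{i=1}^{m} w_i \hat{\theta}_i^{\mathrm{BM}} &= t .
\end{align*}
```

The first two lines are the sampling and linking models of Fay and Herriot; the last line forces the benchmarked estimates to aggregate to the target t.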

 

 

