Content - 4722e0051590d7c9a97183d6e8c3ae7d84fadcab - 08b39e3/vignettes/laeken-variance.Rnw

visit type:
https://github.com/cran/laeken

10 February 2024, 08:56:05 UTC
Tip revision: b68d7cd5e97fcc9879ea1767c72c4c0ef125e9c1 authored by Andreas Alfons on 25 January 2024, 12:30:08 UTC
version 0.5.3
Tip revision: b68d7cd
laeken-variance.Rnw
\documentclass[a4paper,10pt]{scrartcl}
\usepackage[OT1]{fontenc}
\usepackage{Sweave}

%% additional packages
\usepackage{natbib}
\bibpunct{(}{)}{,}{a}{}{,}
\usepackage{amsmath, amssymb}
\usepackage{hyperref}
\hypersetup{colorlinks, citecolor=blue, linkcolor=blue, urlcolor=blue}
\usepackage[top=30mm, bottom=30mm, left=30mm, right=30mm]{geometry}
\usepackage{enumerate}
\usepackage{engord}

%% additional commands
\newcommand{\code}[1]{\texttt{#1}}
\newcommand{\pkg}[1]{\mbox{\textbf{#1}}}
\newcommand{\proglang}[1]{\mbox{\textsf{#1}}}


%%\VignetteIndexEntry{Variance Estimation of Indicators on Social Exclusion and Poverty using the R Package laeken}
%%\VignetteDepends{laeken}
%%\VignetteKeywords{social exclusion, poverty, indicators, variance estimation}
%%\VignettePackage{laeken}


\begin{document}


\title{Variance Estimation of Indicators on Social Exclusion and Poverty using the \proglang{R} Package \pkg{laeken}}
\author{Matthias Templ$^{1}$, Andreas Alfons$^{2}$}
\date{}

\maketitle

\setlength{\footnotesep}{11pt}
\footnotetext[1]{
  \begin{tabular}[t]{l}
  Zurich University of Applied Sciences\\
  E-mail: \href{mailto:matthias.templ@zhaw.ch}{matthias.templ@zhaw.ch}
  \end{tabular}
}
\footnotetext[2]{
  \begin{tabular}[t]{l}
  Erasmus School of Economics, Erasmus University Rotterdam\\
  E-mail: \href{mailto:alfons@ese.eur.nl}{alfons@ese.eur.nl}
  \end{tabular}
}


% change R prompt
<<echo=FALSE, results=hide>>=
options(prompt="R> ")
@


\paragraph{Abstract}
This vignette illustrates the application of variance estimation procedures to
indicators on social exclusion and poverty using the \proglang{R} package
\pkg{laeken}. To be more precise, it describes a general framework for
estimating variance and confidence intervals of indicators under complex
sampling designs. Currently, the package is focused on bootstrap approaches.
While the naive bootstrap does not modify the weights of the bootstrap samples,
a calibrated version allows to calibrate each bootstrap sample on auxiliary
information before deriving the bootstrap replicate estimate.


% ------------
% introduction
% ------------

\section{Introduction}

When point estimates of indicators are computed from samples, it is important
to also obtain variance estimates and confidence intervals in order to account
for variability due to sampling. Other sources of variability such as data
editing or imputation may need to be considered as well, but this is not
further discussed in this paper. While this vignette targets the topic of
variance and confidence interval estimation for the indicators on social
exclusion and poverty according to \citet{EU-SILC04, EU-SILC09}, the aim is not
to describe and evaluate the different approaches that have been proposed to
date. Instead, the aim is to present the functionality for the statistical
environment \proglang{R} \citep{RDev} implemented in the add-on package
\pkg{laeken} \citep{laeken}.

It should be noted that the basic design of the package, as well as standard
point estimation of the indicators on social exclusion and poverty, is
discussed in detail in vignette \code{laeken-standard} \citep{templ11a}. In
addition, vignette \code{laeken-pareto} \citep{alfons11a} presents more
sophisticated methods for point estimation of the indicators, which are less
influenced by outliers. Those documents can be viewed from within \proglang{R}
with the following commands:
<<eval=FALSE>>=
vignette("laeken-standard")
vignette("laeken-pareto")
@
Morover, a general introduction to package \pkg{laeken} is published as
\citet{alfons13b}.

The data basis for the estimation of the indicators on social exclusion and
poverty is the \emph{European Union Statistics on Income and Living Conditions}
(EU-SILC), which is an annual panel survey conducted in EU member states and
other European countries. Package \pkg{laeken} provides the synthetic example
data \code{eusilc} consisting of $14\,827$ observations from $6\,000$
households. Furthermore, the data were generated from Austrian EU-SILC survey
data from 2006 using the data simulation methodology proposed by
\citet{alfons11c} and implemented in the \proglang{R} package
\pkg{simPopulation} \citep{simPopulation}. The data set \code{eusilc} is used in the
code examples throughout the paper.
% -----
<<>>=
library("laeken")
data("eusilc")
@

The rest of the paper is organized as follows. Section~\ref{sec:variance}
presents the general wrapper function for estimating variance and confidence
intervals of indicators in package \pkg{laeken}. The naive and calibrated
bootstrap approaches are discussed in Sections~\ref{sec:naive}
and~\ref{sec:calib}, respectively. Section~\ref{sec:concl} concludes.


% ---------------
% general wrapper
% ---------------

\section{General wrapper function for variance estimation} \label{sec:variance}

The function \code{variance()} provides a flexible framework for estimating the
variance and confidence intervals of indicators such as the
\emph{at-risk-of-poverty rate}, the \emph{Gini coefficient}, the \emph{quintile
share ratio} and the \emph{relative median at-risk-of-poverty gap}. For a
mathematical description and details on the implementation of these indicators
in the \proglang{R} package \pkg{laeken}, the reader is referred to vignette
\code{laeken-standard} \citep{templ11a}. In any case, \code{variance()} acts as
a general wrapper function for computing variance and confidence interval
estimates of indicators on social exclusion and poverty with package
\pkg{laeken}. The arguments of function \code{variance()} are shown in the
following:
<<>>=
args(variance)
@

All these arguments are fully described in the \proglang{R} help page of
function \code{variance()}. The most important arguments are:
\begin{description}
  \item[inc:] the income vector.
  \item[weights:] an optional vector of sample weights.
  \item[breakdown:] an optional vector giving different domains in which
  variances and confidence intervals should be computed.
  \item[design:] an optional vector or factor giving different strata for
  stratified sampling designs.
  \item[data:] an optional \code{data.frame}. If supplied, each of the above
    arguments should be specified as a character string or an integer or
    logical vector specifying the corresponding column.
  \item[indicator:] an object inheriting from the class \code{"indicator"} that
    contains the point estimates of the indicator, such as \code{"arpr"} for
    the at-risk-of-poverty rate, \code{"qsr"} for the quintile share ratio,
    \code{"rmpg"} for the relative median at-risk-of-poverty gap, or
    \code{"gini"} for the Gini coefficient.
  \item[type:] a character string specifying the type of variance estimation
    to be used.  Currently, only \code{"bootstrap"} is implemented for
    variance estimation based on bootstrap resampling.
\end{description}

In the following sections, two bootstrap methods for estimating the variance
and confidence intervals of point estimates for complex survey data are
described. Furthermore, their application using the function \code{variance()}
from package \pkg{laeken} is demonstrated.


% ---------------
% naive bootstrap
% ---------------

\section{Naive bootstrap} \label{sec:naive}

Let $\boldsymbol{X} := (\boldsymbol{x}_{1}, \ldots, \boldsymbol{x}_{n})'$
denote a survey sample with $n$ observations and $p$ variables. Then the
\emph{naive bootstrap algorithm} for estimating the variance and confidence
interval of an indicator can be summarized as follows:
\begin{enumerate}
  \item Draw $R$ independent bootstrap samples $\boldsymbol{X}_{1}^{*}, \ldots,
  \boldsymbol{X}_{R}^{*}$ from $\boldsymbol{X}$.
  \item Compute the bootstrap replicate estimates $\hat{\theta}_{r}^{*} :=
  \hat{\theta}(\boldsymbol{X}_{r}^{*})$ for each bootstrap sample
  $\boldsymbol{X}_{r}^{*}$, $r = 1, \ldots, R$, where $\hat{\theta}$ denotes an
  estimator for a certain indicator of interest. Of course the sample weights
  always need to be considered for the computation of the bootstrap replicate
  estimates.
  \item Estimate the variance $V(\hat{\theta})$ by the variance of the $R$
  bootstrap replicate estimates:
  \begin{equation}
  \hat{V}(\hat{\theta}) := \frac{1}{R-1} \sum_{r=1}^{R} \left(
  \hat{\theta}_{r}^{*} - \frac{1}{R} \sum_{s=1}^{R} \hat{\theta}_{s}^{*}
  \right)^{2}.
  \end{equation}
  \item Estimate the confidence interval at confidence level $1 - \alpha$ by
  one of the following methods \citep[for details, see][]{davison97}:
  \begin{description}
    \item[Percentile method:] $\left[ \hat{\theta}_{((R+1)
    \frac{\alpha}{2})}^{*}, \hat{\theta}_{((R+1)(1-\frac{\alpha}{2}))}^{*}
    \right]$, as suggested by \cite{efron93}.
    \item[Normal approximation:] $\hat{\theta} \pm z_{1-\frac{\alpha}{2}} \cdot
    \hat{V}(\hat{\theta})^{1/2}$ with $z_{1-\frac{\alpha}{2}} = \Phi^{-1}(1 -
    \frac{\alpha}{2})$.
    \item[Basic bootstrap method:] $\left[ 2\hat{\theta} -
    \hat{\theta}_{((R+1)(1-\frac{\alpha}{2}))}^{*}, 2\hat{\theta} -
    \hat{\theta}_{((R+1)\frac{\alpha}{2})}^{*} \right]$.
  \end{description}
  For the percentile and the basic bootstrap method, $\hat{\theta}_{(1)}^{*}
  \leq \ldots \leq \hat{\theta}_{(R)}^{*}$ denote the order statistics of the
  bootstrap replicate estimates.
\end{enumerate}

In the following example, the variance and confidence interval of the
at-risk-of-poverty rate are estimated with the naive bootstrap procedure. The
output of function \code{variance()} is an object of the same class as the
point estimate supplied as the \code{indicator} argument, but with additional
components for the variance and confidence interval. In addition to the point
estimate, the income and the sample weights need to be supplied. Furthermore,
a stratified sampling design can be considered by specifying the \code{design}
argument, in which case observations are resampled separately within the
strata. To ensure reproducibility of the results, the seed of the random number
generator is set.
<<>>=
a <- arpr("eqIncome", weights = "rb050", data = eusilc)
variance("eqIncome", weights = "rb050", design = "db040",
    data = eusilc, indicator = a, bootType = "naive", seed = 123)
@

One of the most convenient features of package \pkg{laeken} is that indicators
can be evaluated for different subdomains using a single command. This also
holds for variance estimation. Using the \code{breakdown} argument, the example
below produces variance and confidence interval estimates for each NUTS2 region
in addition to the overall values.
<<>>=
b <- arpr("eqIncome", weights = "rb050", breakdown = "db040", data = eusilc)
variance("eqIncome", weights = "rb050", breakdown = "db040", design = "db040",
    data = eusilc, indicator = b, bootType = "naive", seed = 123)
@

It should be noted that the workhorse function \code{bootVar()} is called
internally by \code{variance()} for bootstrap variance and confidence interval
estimation. The function \code{bootVar()} could also be called directly by the
user in exactly the same manner. Moreover, variance and confidence interval
estimation for any other indicator implemented in package \pkg{laeken} is
straightforward---the application using function \code{variance()} or
\code{bootVar()} remains the same.


% --------------------
% calibrated bootstrap
% --------------------

\section{Calibrated bootstrap} \label{sec:calib}

\cite{rao88} showed that the naive bootstrap is biased when used in the complex
survey context. They propose to increase the variance estimate in the $h$-th
stratum by a factor of $\frac{n_{h} - 1}{n_{h}}$ (if the bootstrap sample is of
the same size). In addition, they describe extensions to sampling without
replacement, unequal probability sampling, and two-stage cluster sampling with
equal probabilities and without replacement.

\cite{deville92} and \cite{deville93} provide a general description on how to
calibrate sample weights to account for known population totals. The naive
bootstrap does not include the recalibration of bootstrap samples in order to
fit known population totals and therefore is, strictly formulated, not suitable
for many practical applications. However, even though a bias might be
introduced, the naive bootstrap works well in many situations and is faster to
compute than the calibrated version. Hence it is a popular method often used in
practice.

In real-world data, the inclusion probabilities for observations in the
population are in general not all equal, resulting in different \emph{design
weights} for the observations in the sample. Furthermore, the initial design
weights are in practice often adjusted by calibration, e.g., to account for
non-response or so that certain known population totals can be precisely
estimated from the survey sample. To give a simplified example, if the
population sizes in different regions are known, the sample weights may be
calibrated so that the Horvitz-Thompson estimates \citep{horvitz52} of the
population sizes equal the known true values. However, when bootstrap samples
are drawn from survey data, resampling observations has the effect that such
known population totals can no longer be precisely estimated. As a remedy, the
sample weights of each bootstrap sample should be calibrated.

The calibrated version of the bootstrap thus results in more precise variance
and confidence interval estimation, but comes with higher computational costs
than the naive approach. In any case, the \emph{calibrated bootstrap algorithm}
is obtained by adding the following step between Steps~1 and~2 of the naive
bootstrap algorithm from Section~\ref{sec:naive}:
\begin{itemize}
  \item[1b.] Calibrate the sample weights for each bootstrap sample
  $\boldsymbol{X}_{r}^{*}$, $r = 1, \ldots, R$. Generalized raking procedures
  are thereby used for calibration: either a multiplicative method known as
  \emph{raking}, an additive method or a logit method \citep[see][]{deville92,
  deville93}.
\end{itemize}

The function call to \code{variance()} for the calibrated bootstrap is very
similar to its counterpart for the naive bootstrap. A matrix of auxiliary
calibration variables needs to be supplied via the argument \code{X}. In
addition, the argument \code{totals} can be used to supply the corresponding
population totals. If the \code{totals} argument is omitted, as in the
following example, the population totals are computed from the sample weights
of the original sample. This follows the assumption that those weights are
already calibrated on the supplied auxiliary variables.
% -----
<<>>=
variance("eqIncome", weights = "rb050", design = "db040",
    data = eusilc, indicator = a, X = calibVars(eusilc$db040),
    seed = 123)
@
% -----
Note that the function \code{calibVars()} transforms a factor into a matrix of
binary variables, as required by the calibration function \code{calibWeights()},
which is called internally. While the default is to use raking for calibration,
other methods can be specified via the \code{method} argument.


% -----------
% conclusions
% -----------

\section{Conclusions} \label{sec:concl}
Both bootstrap procedures for variance and confidence interval estimation of
indicators on social exclusion and poverty currently implemented in the
\proglang{R} package \pkg{laeken} have their strengths. While the naive
bootstrap is faster to compute, the calibrated bootstrap in general leads to
more precise results. The implementation of other procedures such as
linearization techniques \citep{kovacevic97, deville99, hulliger06, osier09} or
the delete-a-group jackknife \citep{kott01} is future work.

Furthermore, \citet{alfons09} demonstrated how the variance of indicators
computed from data with imputed values may be underestimated in bootstrap
procedures, depending on the indicator itself and the imputation procedure
used. They proposed to use the method described in \cite{little02}, which
consists of drawing bootstrap samples from the original data with missing
values, and to impute the missing data for each bootstrap sample before
computing the corresponding bootstrap replicate estimate. Of course, this
results in an additional increase of the computation time. The implementation
of this procedure in package \pkg{laeken} is future work. It should also be
noted that multiple imputation is a further possibility to consider the
additional uncertainty from imputation when estimating the variance of an
indicator \citep[see][]{little02}.


% ---------------
% acknowledgments
% ---------------

\section*{Acknowledgments}
This work was partly funded by the European Union (represented by the European
Commission) within the 7$^{\mathrm{th}}$ framework programme for research
(Theme~8, Socio-Economic Sciences and Humanities, Project AMELI (Advanced
Methodology for European Laeken Indicators), Grant Agreement No. 217322). Visit
\url{http://ameli.surveystatistics.net} for more information on the project.


% ------------
% bibliography
% ------------

\bibliographystyle{plainnat}
\bibliography{laeken}

\end{document}