cplm.Rnw
%\VignetteIndexEntry{cplm}
% document class from R2WinBUGS
% To reduce checking time, we include model results directly as verbatim code
\documentclass{Z}
\usepackage{Rd,thumbpdf}
\usepackage{mathrsfs, amsmath, amsfonts, amssymb, amsthm}
\usepackage{bbm}
\usepackage{verbatim}
\usepackage{moreverb}
\usepackage[utf8]{inputenc}
\usepackage[margin = 1in]{geometry}
\setlength{\tabcolsep}{1.5pt}
\setlength{\textheight}{23.5cm}
\newcommand{\cplm}{\textbf{\texttt{cplm}} }
\newcommand{\bs}[1]{\boldsymbol{#1}}
\newcommand{\ind}{\mathbbm{1}}
\newcommand{\pr}{\mbox{pr}}
\renewcommand{\R}{{\proglang{R}}{}}
\hypersetup{
colorlinks = true, pagebackref = true,
pdftitle={cplm: Compound Poisson Linear Models},
pdfauthor={Wayne Zhang}
}
\SweaveOpts{engine = R, eps = FALSE, keep.source = TRUE}
<<options, echo = FALSE>>=
options(prompt = "R> ", digits = 4, show.signif.stars = FALSE)
@
\title{\cplm: Compound Poisson Linear Models}
\author{Yanwei (Wayne) Zhang \thanks{\email{actuary\_zhang@hotmail.com}}\\ Department of Marketing\\Marshall Business School\\University of Southern California
}
\Plainauthor{Wayne Zhang}
\Plaintitle{Compound Poisson Linear Models}
\Abstract{
The Tweedie compound Poisson distribution is a mixture of a degenerate distribution at the origin and a continuous distribution on the positive real line. It has been applied in a wide range of fields in which continuous data with exact zeros regularly arise. Nevertheless, statistical inference based on full likelihood and Bayesian methods is not available in most statistical software, largely because the distribution has an intractable density function and numerical methods that allow fast and accurate evaluation of the density did not appear until fairly recently. The \cplm package provides likelihood-based and Bayesian procedures for fitting common Tweedie compound Poisson linear models. In particular, models with hierarchical structures, penalized splines or extra zero inflation can be handled. Further, the package implements the Gini index based on an ordered version of the Lorenz curve as a robust model comparison tool involving zero-inflated and highly skewed distributions.
}
\Keywords{Bayesian, Gini, splines, maximum likelihood, mixed models, Tweedie compound Poisson distribution}
\begin{document}
% set default plot options
<<eval = TRUE, echo = FALSE>>=
mypar <- function(){
parold <- par(mar = c(3.5, 2.5, 2, 0.5), tck = -0.02,
mgp = c(1.3, 0.2, 0), cex.axis = 0.7, cex.lab = 0.7)
parold
}
# no margin in lattice
theme.novpadding <-
list(layout.heights =
list(top.padding = 0,
main.key.padding = 0,
key.axis.padding = 0,
axis.xlab.padding = 0,
xlab.key.padding = 0,
key.sub.padding = 0,
bottom.padding = 0),
layout.widths =
list(left.padding = 0,
key.ylab.padding = 0,
ylab.axis.padding = 0,
axis.key.padding = 0,
right.padding = 1))
@
\SweaveOpts{concordance=TRUE}
\textcolor{Red}{{\Large If you use this software, please cite:}\\[2ex]
Zhang, Y (2013). Likelihood-based and Bayesian Methods for Tweedie Compound Poisson Linear Mixed Models, {\em Statistics and Computing}, 23, 743-757.} \\[1ex]
\url{https://github.com/actuaryzhang/cplm/files/144051/TweediePaper.pdf}
\section{Introduction}
In many fields, the data of interest have a continuous distribution on positive values but often with some observation being exact zeros. One standard method in analyzing such data is using a special form of the compound distribution, in which the response is assumed to be generated as a random sum of individual random variables with a positive support. That is, $Y = \sum_{i}^T X_i$, where $T$ is the number of events following a discrete count distribution and $X_i$ is the magnitude/severity of the outcome in the $i_{th}$ event. When $T = 0$, we have $Y = 0$, thereby allowing the distribution to have a probability mass at the origin. Indeed, for certain applications, there is some intuitive appeal to exploit this distribution, because the underlying data can be considered as generated by a compound process. For example,
\begin{itemize}
\item in actuarial science, $Y$ is the aggregate claim amount for a covered risk, $T$ the number of reported claims and $X_i$ the insurance payment for the $i_{th}$ claim;
\item in rainfall modeling, $Y$ is the total amount of precipitation in a given period, $T$ the number of rain events and $X_i$ the intensity of the precipitation for the $i_{th}$ rain event.
\item in ecological studies of stock abundance, $Y$ is the total biomass in a certain area, $T$ the number of patches of organisms and $X_i$ the measured biomass for the $i_{th}$ patch.
\end{itemize}
Of special interest is the Tweedie compound Poisson distribution (henceforth the compound Poisson distribution) in which the distributions of $T$ and $X_i$ are specified as:
\begin{equation}
\label{cpois}
T \sim Pois(\lambda), \; X_i \stackrel{\mathrm{iid}}{\sim} Gamma(\alpha, \gamma), \; T \perp X_i.
\end{equation}
For a given $\alpha$, this distribution can be expressed in the form of the exponential dispersion model \citep{Jorgensen:1987} with a power variance function $V(\mu) = \mu^p$, where the power index $p = (\alpha + 2)/(\alpha + 1) \in (1, 2)$. As a result, estimation procedures developed for the exponential dispersion model are directly applicable to the compound Poisson distribution. Indeed, applications of this particular compound distribution, primarily in the form of generalized linear models [GLM], have been found in actuarial science \citep{Smyth;Jorgensen:2002}, animal studies \citep{Perry:1981}, assay analysis \citep{Davidian:1990}, botany studies \citep{Dunn;Smyth:2005}, survival analysis \citep{Hougaard;Harvald;Holm:1992}, rainfall modeling \citep{Dunn:2004} and fishery research \citep{Shono:2008}.
An often neglected part of the analysis is the estimation of the unknown variance function, i.e., the index parameter $p$. This parameter has a significant impact on hypothesis tests and predictive uncertainty measures \citep{Davidian;Carroll:1987, Peters;Shevchenko;Wuthrich:2009, Zhang:2012}, which is of independent interest in many applications. One approach in estimating the variance function is using the profile likelihood \citep{Cox;Reid:1987}. For the compound Poisson distribution, such an approach must be implemented based on the true likelihood rather than the extended quasi-likelihood \citep{Nelder;Pregibon:1987}. The extended quasi-likelihood is not well suited to this task in that it involves a term $\log(V(y))$ which becomes infinite for $y=0$, and thus requires adding a small positive constant to the observed zeros which, unfortunately, is highly influential on parameter estimation.
Like most other compound distributions, the density function of the compound Poisson distribution is not analytically tractable. But unlike other compound distributions whose density must be approximated via the slow recursive approach \citep{Klugman;Panjer;Willmot:2008}, methods that enable fast and accurate numerical evaluation of compound Poisson density function are available \citep{Dunn;Smyth:2005, Dunn;Smyth:2008}. These methods are provided by the \texttt{tweedie} package \citep{tweedie}. With the density approximation methods, we can carry out not only maximum likelihood estimation but also Bayesian inference using Markov chain Monte Carlo methods \citep{Gelman:2003}.
The \cplm package provides both likelihood-based and Bayesian methods for various compound Poisson linear models. The following features of the package may be of special interest to the users:
\begin{enumerate}
\item All methods available in the package enable the index parameter (i.e., the unknown variance function) to be estimated from the data.
\item The compound Poisson generalized linear model handles large data set using the bounded memory regression facility in \texttt{biglm}.
\item For mixed models, we provide likelihood-based methods using Laplace approximation and adaptive Gauss-Hermit quadrature.
\item A convenient interface is offered to fit additive models (penalized splines) using the mixed model estimation procedure.
\item Self-tuned Markov chain Monte Carlo procedures are available for both GLM-type and mixed models.
\item The package also implements a zero-inflated compound Poisson model, in which the observed frequency of zeros can generally be more adequately modeled.
\item We provide the Gini index based on an ordered Lorenz curve, which is better suited for model comparison involving the compound Poisson distribution.
\end{enumerate}
\section{Models and Examples}
\subsection{Overview}
Table \ref{tab:cplm} shows the mapping of the user-visible functions to the underlying model structures they handle. Currently, all Bayesian methods are integrated in the \texttt{bcplm} function and Bayesian estimation of the zero-inflated model is not yet implemented. A call of each function creates an object of a class that has the same name as the function. The \cplm package is written using S4 classes and methods \citep{Chambers:1998}. Most of the user-visible classes are derived from a parent class \texttt{"cplm"}, which specifies common slots used in the package. Several methods are defined for this class, some of which allow users to manipulate attributes and slots of an S4 object in a similar way as a list. For example,
<<>>=
library(cplm)
obj <- new("cplm") # create an object of class "cplm"
names(obj) # or slotNames(obj)
obj$inits # or obj@inits
obj[["inits"]] # or slot(obj, "inits")
@
To find out what methods are available for a particular class, say \texttt{"cpglm"}, use the following:
<<eval = FALSE>>=
showMethods(classes = "cpglm")
@
The help documentation of the available classes and methods can be assessed in the following fashion:
<<eval = FALSE>>=
class ? bcplm
method ? resid("cpglm") # the "resid" method for class "cpglm"
method ? resid("cpglmm")
@
\begin{table}[t]
\begin{center}
\setlength{\tabcolsep}{6pt}
\begin{tabular}{rcc}
\hline\hline
Model & Likelihood-based & Bayesian \\
\hline
GLM-type Models & \texttt{cpglm} & \texttt{bcplm}\\
Mixed Models & \texttt{cpglmm} & \texttt{bcplm} \\
Additive Models & \texttt{cpglmm} & \texttt{bcplm} \\
Zero-inflated Models & \texttt{zcpglm} & N/A \\
\hline\hline
\end{tabular}
\caption{Correspondence between models and functions in \cplm.}
\label{tab:cplm}
\end{center}
\end{table}
\subsection{Generalized linear models}
Given the index parameter $p$, the Tweedie compound Poisson distribution can be written in the form of the exponential dispersion model with $E(Y) = \mu$, $Var(Y) = \phi \cdot V(\mu) =\phi \cdot \mu^p$, where $\phi > 0 $ is the dispersion parameter. Using this representation, we denote the compound Poisson distribution as $Y \sim CPois(\mu, \phi, p)$.
In the following, we assume the observed response vector is $\bs{Y} = (Y_1, \cdots, Y_n)'$, where $Y_i \sim CPois(\mu_i, \phi, p)$. In the compound Poisson GLM, the mean $\bs{\mu} = E(\bs{Y})$ is stipulated as a function of some linear predictors through a link function $\eta$ as
\begin{align}
\label{glm_mu}
\eta(\bs{\mu}) = \bs{X\beta},
\end{align}
where $\bs{X}$ is the design matrix and $\bs{\beta}$ is the vector of fixed effects.
If the index parameter $p$ is known, the compound Poisson GLM can be estimated using the Fisher's scoring algorithm \citep{McCullagh;Nelder:1989}. In \R, this can be easily fitted using \texttt{glm} with \texttt{family = tweedie()}. For unknown $p$, parameter estimation can be proceeded using the profile likelihood approach \citep{Cox;Reid:1987}. We denote $\bs{\sigma} = (\phi, p)'$. Because for a given value of $\bs{\sigma}$, the maximum likelihood estimation $\hat{\bs{\beta}}(\bs{\sigma})$ can be determined using the scoring algorithm, we can profile $\bs{\beta}$ out of the likelihood and maximize the profile likelihood to estimate $\bs{\sigma}$ as
\begin{align}
\hat{\bs{\sigma} } =\arg \max\limits_{\bs{\sigma} } \ell(\bs{\sigma} | \bs{y}, \hat{\bs{\beta}}(\bs{\sigma} )),
\end{align}
where $\ell(\bs{\sigma} | \bs{y}, \hat{\bs{\beta}}(\bs{\sigma} ))$ is the loglikelihood evaluated at $\hat{\bs{\beta}}(\bs{\sigma})$ and $\bs{\sigma}$. This likelihood must be approximated using the compound Poisson density evaluation methods provided by \cite{Dunn;Smyth:2005, Dunn;Smyth:2008} . The approximated profile loglikelihood is optimized numerically to locate the estimate $\hat{\bs{\sigma}}$, subject to the constraints $\phi >0 $ and $p \in (1, 2)$. The estimate of the regression parameters is then $\hat{\bs{\beta}}(\hat{\bs{\sigma}})$.
The profile likelihood approach is implemented by the \texttt{cpglm} function. Indeed, the function \texttt{tweedie.profile} in the \texttt{tweedie} package also makes available the profile likelihood approach. The \texttt{cpglm} function here differs from \texttt{tweedie.profile} in two aspects. First, the user does not need to specify the grid of possible values the index parameter can take. Rather, the optimization of the profile likelihood is automated. Second, big data sets can be handled where the \texttt{bigglm} function from the \texttt{biglm} package \citep{biglm} is used in fitting GLMs. The \texttt{bigglm} function is invoked when the argument \texttt{chunksize} is greater than 0.
We use the auto insurance claim data from \cite{Yip;Yau:2005} as an example to illustrate the use of the \texttt{cpglm} function. We first load and transform the data:
<<eval = FALSE>>=
da <- subset(AutoClaim, IN_YY == 1) # use data in the Yip and Yau paper
da <- transform(da, CLM_AMT5 = CLM_AMT5/1000,
INCOME = INCOME/10000)
@
We are interested at the variable \texttt{CLM\_AMT5}, which is the aggregate insurance claim amount from an auto insurance policy. A summary of this distribution shows that it has a spike at zero (61\%) while the rest follows a continuous distribution on positive values:
<<eval = FALSE>>=
summary(da$CLM_AMT5); sum(da$CLM_AMT5 == 0)/nrow(da)
@
\begin{Verbatim}
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 0.000 0.000 4.112 4.740 57.040
[1] 0.6066856
\end{Verbatim}
We would like to build a model that is predictive of the insurance loss based on some policy-related and demographic variables. For example, we include indicators of whether the car is for commercial use (\texttt{CAR\_USE}), whether the policyholder is married (\texttt{MARRIED}), and whether the car is driven in the urban area (\texttt{AREA}), as well as a continuous covariate- the MVR violation points (\texttt{MVR\_PTS}).
<<eval = FALSE>>=
P1 <- cpglm(CLM_AMT5 ~ CAR_USE + MARRIED + AREA + MVR_PTS, data = da)
summary(P1)
@
\begin{Verbatim}
Call:
cpglm(formula = CLM_AMT5 ~ CAR_USE + MARRIED + AREA + MVR_PTS,
data = da)
Deviance Residuals:
Min 1Q Median 3Q Max
-4.4951 -2.5154 -1.9716 -0.1866 12.5669
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.05648 0.14710 0.384 0.701
CAR_USECommercial 0.12521 0.09299 1.347 0.178
MARRIEDYes -0.14730 0.08985 -1.639 0.101
AREAUrban 1.00957 0.14135 7.142 1.16e-12 ***
MVR_PTS 0.21683 0.01738 12.474 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Estimated dispersion parameter: 7.1348
Estimated index parameter: 1.4021
Residual deviance: 18332 on 2807 degrees of freedom
AIC: 10804
Number of Fisher Scoring iterations: 6
\end{Verbatim}
The \texttt{cpglm} function has the same syntax as the \texttt{glm} function except that there is no \texttt{family} argument. Desired link functions can be directly specify through the \texttt{link} argument, which uses the log link by default. The output shows the estimates for the index and the dispersion parameter in addition to those for the fixed effects. These coefficients can also be extracted directly from the object using
<<eval = FALSE>>=
c(coef(P1), p = P1$p, phi = P1$phi)
@
Similarly, most utility extraction methods defined for a linear model object such as \texttt{fitted}, \texttt{resid}, \texttt{model.matrix} and so on can be used.
By default, the optimization of the profile likelihood in \texttt{cpglm} calls the \texttt{glm} function in estimating the fixed effects. However, it is known that the \texttt{glm} function can run into memory issues for large data sets. To overcome this problem, the \texttt{cpglm} function implements an alternative method that relies on the bounded memory regression provided by the \texttt{biglm} package \citep{biglm}. The \texttt{bigglm} function divides the data frame into chunks and regression estimates are updated chunk by chunk. This method is invoked when the argument \texttt{chunksize} is greater than 0, which specifies the size of chunks for processing the data frame.
Most arguments in the \texttt{bigglm} function can be directly passed to \texttt{cpglm} through the \texttt{...} mechanism. When using this functionality, the users are warned that
\begin{itemize}
\item when factors are present, the levels of each factor must be the same across all data chunks;
\item \texttt{bigglm} is substantially slower than \texttt{glm} when applied on the same data.
\end{itemize}
As an example, we can refit the previous model using \texttt{bigglm} with chunks of 500 data points as follows:
<<eval = FALSE>>=
P1.big <- cpglm(CLM_AMT5 ~ CAR_USE + MARRIED + AREA + MVR_PTS,
data = da, chunksize = 500)
@
\subsection{Mixed models}
In the mixed model, the mean of the response is specified as
\begin{align}
\eta(\bs{\mu}) &= \bs{X} \bs{\beta} + \bs{Zb}, \\
\bs{b} &\sim N(\bs{0}, \bs{\Sigma}),
\end{align}
where $\bs{\beta}$ and $\bs{b}$ are the vector of fixed and random effects, $\bs{X}$ and $\bs{Z}$ are the associated design matrices and $\bs{\Sigma}$ is the variance component. The goal is to make inference of the parameters $(\bs{\beta}, \phi, p, \bs{\Sigma})$. The \cplm package implements maximum likelihood estimation in the mixed model, in which the underlying marginal likelihood can be derived as:
\begin{equation}
p(\bs{y}|\bs{\beta}, \phi, p, \bs{\Sigma}) = \int p(\bs{y}|\bs{\beta}, \phi, p, \bs{b}) \cdot p(\bs{b}| \bs{\Sigma}) d\bs{b}.
\end{equation}
The above integral is often intractable, but can be approximated using the Laplace method or the adaptive Gauss-Hermite quadrature approach. The implementation of these methods is based on the old \texttt{lme4} package (version < 1.0), with changes made on the updating of the mean, the variance function and the marginal loglikelihood. For the Laplace method, the contribution of the dispersion parameter to the approximated loglikelihood is explicitly accounted for, which should be more accurate and more consistent with the quadrature estimate. Indeed, both the dispersion parameter and the index parameter are included as a part of the optimization process. In computing the marginal loglikelihood, the density of the compound Poisson distribution is approximated using numerical methods provided by the \texttt{tweedie} package \citep{tweedie}. For details of the Laplace approximation and adaptive Gauss-Hermite quadrature methods for generalized linear mixed models, see the documentation associated with \texttt{lme4}.
We use the fine root data set \citep{Silva;Hall;Tustin;Gandarl:1999} to illustrate various features of the \texttt{cpglmm} function. We are interested at how the distribution of the fine root length density [RLD] is affected by the geometry of the structural root system and the type of the root stock.
<<>>=
head(FineRoot)
@
Following \cite{Zhang:2012}, we specify a model that includes both main effects of stock and zone and their interactions, with the intercept varying by plant:
<<echo = FALSE>>=
load("cpglmm.RData")
@
<<eval = FALSE>>=
f0 <- cpglmm(RLD ~ Stock * Zone + (1|Plant), data = FineRoot)
summary(f0)
@
\begin{Verbatim}
Compound Poisson linear mixed model fit by the Laplace approximation
Formula: RLD ~ Stock * Zone + (1 | Plant)
Data: FineRoot
AIC BIC logLik deviance
-172.5 -138.6 94.27 -188.5
Random effects:
Groups Name Variance Std.Dev.
Plant (Intercept) 0.0077274 0.087906
Residual 0.3286284 0.573261
Number of obs: 511, groups: Plant, 8
Fixed effects:
Estimate Std. Error t value
(Intercept) -2.09823 0.16528 -12.695
StockMM106 -0.06637 0.21888 -0.303
StockMark -0.46346 0.20234 -2.290
ZoneOuter -0.44690 0.25546 -1.749
StockMM106:ZoneOuter 0.02563 0.31241 0.082
StockMark:ZoneOuter -1.16566 0.32468 -3.590
Estimated dispersion parameter: 0.3286
Estimated index parameter: 1.4131
\end{Verbatim}
The output is similar to that from a call of \texttt{glmer}, but reports additionally the estimates of the dispersion and the index parameters.
<<>>=
inherits(f0, "mer")
@
As a result, most of the methods defined in (the old) \texttt{lme4} are directly applicable:
<<>>=
fixef(f0) #coefficients
ranef(f0)
VarCorr(f0) #variance components
@
Models with crossed and nested random effects can also be handled. We illustrate this with a slightly different model structure (we explicitly convert plant to a factor because this is required when nested effects are present):
<<eval = FALSE>>=
FineRoot$Plant <- factor(FineRoot$Plant)
f1 <- cpglmm(RLD ~ Stock + Spacing + (1|Plant), data = FineRoot)
@
We can add another random effect from zone as crossed or nested within plant as follows:
<<eval = FALSE>>=
f2 <- update(f1, . ~ . + (1|Zone))
f3 <- update(f1, . ~ . + (1|Plant:Zone))
# test the additional random effect
anova(f1, f2)
@
\begin{Verbatim}
Data: FineRoot
Models:
f1: RLD ~ Stock + Spacing + (1 | Plant)
f2: RLD ~ Stock + Spacing + (1 | Plant) + (1 | Zone)
Df AIC BIC logLik Chisq Chi Df Pr(>Chisq)
f1 6 -107.78 -82.367 59.893
f2 7 -145.58 -115.924 79.789 39.794 1 2.823e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
\end{Verbatim}
<<eval = FALSE>>=
anova(f1, f3)
@
\begin{Verbatim}
Data: FineRoot
Models:
f1: RLD ~ Stock + Spacing + (1 | Plant)
f3: RLD ~ Stock + Spacing + (1 | Plant) + (1 | Plant:Zone)
Df AIC BIC logLik Chisq Chi Df Pr(>Chisq)
f1 6 -107.78 -82.367 59.893
f3 7 -147.04 -117.390 80.523 41.26 1 1.333e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
\end{Verbatim}
In addition, the \texttt{cpglmm} function allows users to choose alternative optimizers that may perform better in certain circumstances. The \texttt{optimizer} argument determines which optimization routine is to be used. Possible choices are \texttt{"nlminb"} (the default), \texttt{"bobyqa"} and \texttt{"L-BFGS-B"}. We can refit the first example with a different optimizer as follows:
<<eval = FALSE>>=
f4 <- cpglmm(RLD ~ Stock * Zone + (1|Plant),
data = FineRoot, optimizer = "L-BFGS-B",
control = list(trace = 2, PQL.init = FALSE))
@
\begin{Verbatim}
N = 9, M = 5 machine precision = 2.22045e-16
iterations 21
function evaluations 27
segments explored during Cauchy searches 23
BFGS updates skipped 0
active bounds at final generalized Cauchy point 0
norm of the final projected gradient 0.0424422
final function value -188.534
final value -188.534323
converged
\end{Verbatim}
By default, the \texttt{cpglmm} function implements a penalized quasi-likelihood procedure before optimizing the approximated likelihood. In the above, we have specified \texttt{PQL.init = FALSE} to skip that step.
The default fitting method is the Laplace approximation. The quadrature method can be invoked by setting \texttt{nAGQ} to be greater than 1, which specifies the number of quadrature points to be used for the numerical evaluation. Note that this method only works when there is one single grouping factor in the model:
<<eval = FALSE>>=
f5 <- cpglmm(RLD ~ Stock * Zone + (1|Plant),
data = FineRoot, nAGQ = 10)
@
\subsection{Additive models}
Further, similar to the \texttt{amer} package \citep{amer}, we provide convenient interfaces for fitting additive models using penalized splines \citep{Ruppert;Wand;Carroll:2003}. All the spline constructors available in \texttt{amer} or those defined by the users following the \texttt{amer} requirement can be directly used here.
For example, we update the auto insurance loss example by specifying a smooth effect on the continuous covariate \texttt{MVR\_PTS}.
<<eval = FALSE>>=
P2 <- cpglmm(CLM_AMT5 ~ CAR_USE + MARRIED + AREA + tp(MVR_PTS), data = da)
@
\begin{figure}[t]
\centering
\includegraphics[angle = 0, scale = 0.75]{SmoothPlot}
\caption{Plot of the smooth term on the linear predictor scale. }
\label{fig:smooth}
\end{figure}
We can extract and plot the smooth terms using the \texttt{getF} and \texttt{plotF} function from \texttt{amer}. For example, \texttt{plotF} generates the graph of the smooth term on the linear predictor scale as in Figure \ref{fig:smooth}.
<<eval = FALSE>>=
plotF(P2, rug = FALSE)
@
% generate plots
%pdf("SmoothPlot.pdf", 2.5, 2.5)
%parold <- mypar()
%plotF(P2, rug = FALSE, mgp = c(1.3, 0.2, 0))
%par(parold)
%dev.off()
\subsection{Bayesian methods}
The above models can also be formulated in the Bayesian setting, in which parameter inference is carried out via Markov chain Monte Carlo [MCMC] methods.
In the Bayesian model, prior distributions have to be specified for all parameters in the model. The prior distributions we use are
\begin{align}
\bs{\beta} &\sim N(\bs{0}, 10000 \cdot \bs{I}); \\
\phi &\sim U(0, 100); \\
p &\sim U(1.01, 1.99),
\end{align}
where the prior mean and variance of $\bs{\beta}$ and the bounds for $\phi$ and $p$ can be modified using the argument \texttt{prior.beta.mean}, \texttt{prior.beta.var}, \texttt{bound.phi} and \texttt{bound.p}, respectively.
If a mixed model is specified, we specify the prior distribution of the variance component as follows. If there is one random effect in a group, the inverse Gamma distribution is used; and if there is more than one random effects in a group, the inverse Wishart distribution is used:
\begin{align}
\sigma_b^{-2} &\sim Gamma(0.001, 1000); \\
\bs{\Sigma}^{-1} &\sim Wishart_d(\bs{I}),
\end{align}
where $d$ is the dimension of $\bs{\Sigma}$. Currently, the parameter values in these prior distributions cannot be modified by the user.
In implementing the MCMC, we construct a Gibbs sampler in which parameters are updated one at a time given the current values of all the other parameters. Specifically, we use the random-walk Metropolis-Hastings algorithm in updating each parameter except for the variance components, which can be simulated directly due to conjugacy.
There is a tuning process in the MCMC algorithm, where the variances of the proposal (truncated) normal distributions are updated according to the sample variances computed from the simulations. The goal is to make the acceptance rate roughly between 40\% and 60\% for univariate M-H updates. The argument \texttt{tune.iter} determines how many iterations are used for the tuning process, and \texttt{n.tune} determines how many loops the tuning iterations should be divided into. These iterations will not be used in the final output.
The \texttt{bcplm} function is used to fit both GLMs and mixed models using MCMC. We illustrate the Bayesian GLM using the claim triangle data from \cite{Peters;Shevchenko;Wuthrich:2009}, in which the Tweedie compound Poisson distribution is specified for a strictly positive but highly skewed response variable \texttt{increLoss}.
<<eval = FALSE>>=
set.seed(10)
B1 <- bcplm(increLoss ~ factor(year) + factor(lag), data = ClaimTriangle,
n.chains = 2, n.iter = 7000, tune.iter = 4000,
n.burnin = 2000, n.thin = 5, bound.p = c(1.1, 1.95))
@
In the above, we specify the lower bound of $p$ to be greater than $1.1$. This is often a safe guard to regions where the compound Poisson distribution is multi-modal \citep{Dunn;Smyth:2005}. By default, the function prints out intermediate acceptance information of the M-H algorithms, which is often useful to determine whether good proposal variances have been obtained in the tuning phase. For example, a run of the above function prints out the following:
\begin{Verbatim}
Tuning phase...
Acceptance rate: min(34.00%), mean(51.62%), max(63.00%)
-----------------------------------------
Start Markov chain 1
Iteration: 3500
Acceptance rate: min(41.63%), mean(48.60%), max(55.14%)
Iteration: 7000
Acceptance rate: min(42.64%), mean(48.63%), max(55.27%)
-----------------------------------------
Start Markov chain 2
Iteration: 3500
Acceptance rate: min(42.60%), mean(49.41%), max(57.60%)
Iteration: 7000
Acceptance rate: min(42.59%), mean(49.33%), max(57.31%)
-----------------------------------------
Markov Chain Monte Carlo ends!
\end{Verbatim}
The simulation results are saved in the \texttt{sims.list} slot, which is an \texttt{"mcmc.list"} object that works with many methods defined in the \texttt{coda} package \citep{coda}. For example, we can use the \texttt{gelman.diag} function to compute the `potential scale reduction factor' \citep{Gelman;Rubin:1992} in order to determine approximate convergence:
<<eval = FALSE>>=
summary(gelman.diag(B1$sims.list)[[1]][, 1])
@
\begin{Verbatim}
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.9992 1.0000 1.0030 1.0040 1.0080 1.0220
\end{Verbatim}
In addition, we can also use visual diagnostic tools such as the trace and density plots- Figure \ref{fig:trace} and \ref{fig:density} are generated by the following commands:
<<eval = FALSE>>=
xyplot(B1$sims.list[, c(1:2, 20, 21)], xlab = NULL)
densityplot(B1$sims.list[, c(1:2, 20, 21)], ylab = NULL)
@
% pre-generate plot
%pdf("trace.pdf", 2.5, 3)
%print(xyplot(B1$sims.list[, c(1:2, 20, 21)], scales = list(cex = 0.4),
% par.strip.text = list(cex = 0.4), par.settings = %theme.novpadding,
% xlab = NULL))
%dev.off()
%pdf("density.pdf", 2.5, 1.6)
%print(densityplot(B1$sims.list[, c(1:2, 20, 21)], scales = list(cex = 0.4),
% par.strip.text = list(cex = 0.4), par.settings = theme.novpadding,
% ylab = NULL))
%dev.off()
\begin{figure}[p]
\centering
\includegraphics [angle = 0, scale = 0.75]{trace.pdf}
\caption{Trace plot of the two chains in the claim triangle example. }
\label{fig:trace}
\includegraphics{density}
\caption{Density plot of the two chains in the claim triangle example. }
\label{fig:density}
\end{figure}
We see that despite that all the potential scale reduction factors are very close to 1, these diagnostic plots suggest that better mixing can be achieved if the chains are run longer.
We conclude this section with a Bayesian example on compound Poisson mixed models. We now use MCMC methods to draw inference of the fine root mixed model (\texttt{f0}). The estimates of the variance components in Bayesian models are generally less biased than those based on maximum likelihood procedures when there are few levels in the grouping factor \citep{Zhang:2012}.
<<eval = FALSE>>=
set.seed(10)
B2 <- bcplm(RLD ~ Stock * Zone + (1|Plant),
data = FineRoot, n.iter = 11000,
n.burnin = 1000, n.thin = 10)
@
The inference results can be summarized by
<<eval = FALSE>>=
summary(B2)
@
\begin{Verbatim}
Compound Poisson linear models via MCMC
3 chains, each with 11000 iterations (first 1000 discarded), n.thin = 10
n.sims = 3000 iterations saved
Formula: RLD ~ Stock * Zone + (1 | Plant)
Data: FineRoot
Random and dynamic variance components:
Groups Name Variance Std.Dev.
Plant (Intercept) 0.024784 0.15743
Residual 0.337777 0.58119
Number of obs: 511 , groups: Plant, 8
Fixed effects:
Estimate Std. Error Lower (2.5%) Upper (97.5%)
(Intercept) -2.08925 0.21928 -2.52060 -1.626
StockMM106 -0.07313 0.29603 -0.71670 0.505
StockMark -0.47743 0.27231 -1.04190 0.041
ZoneOuter -0.45524 0.25598 -0.96740 0.036
StockMM106:ZoneOuter 0.03198 0.30952 -0.56080 0.646
StockMark:ZoneOuter -1.15131 0.32381 -1.76360 -0.505
---
Estimated dispersion parameter: 0.33778
Estimated index parameter: 1.4178
\end{Verbatim}
These results are in line with those in \cite{Zhang:2012}. We see that the estimate of the variance component is about twice as large as that based on maximum likelihood estimation.
\subsection{Zero-inflated compound Poisson models}
There has been a rich literature on zero-inflated count models to account for excess zeros that cannot be modeled by a regular count distribution \citep{Lambert:1992, Hall:2000, Ghosh:2006}.
These models supplement a standard count distribution by allowing the observation to come from a degenerate distribution at zero with a positive probability. For example, the zero-inflated Poisson \citep{Lambert:1992} assumes that, for the $i_{th}$ observation of the count data, $T_i$,
\begin{align}
T_i \sim
\begin{cases}
0 & \; \mbox{with probability} \; q_i,\\
Pois(\lambda_i) & \; \mbox{with probability} \; 1 - q_i.
\end{cases}
\end{align}
This has the effect of inflating the probability of zeros and the variance for a regular Poisson distribution.
In this regard, we can replace the inherent Poisson distribution in \eqref{cpois} by the above zero-inflated Poisson distribution. This essentially yields a model with two parts: the zero-inflation part that generates the excess zeros with probability $q_i$ and the compound Poisson part that generates
the random outcome from the compound Poisson process. That is, the response variable $Y_i$ in the zero-inflated compound Poisson [ZICP] model is generated as follows:
\begin{align}
\label{zicp}
Y_i \sim
\begin{cases}
0 & \; \mbox{with probability} \; q_i,\\
CPois(\mu_i, \phi, p) & \; \mbox{with probability} \; 1 - q_i.
\end{cases}
\end{align}
Under this assumption, the probability of observing a zero is
\begin{align}
Pr(Y_i = 0) = q_i + (1 - q_i) \cdot \exp \left(-\frac{ \mu_i^{2 - p}}{\phi (2 - p)}\right),
\end{align}
where the exponent term is the probability of zero from the compound Poisson distribution (that is, $\exp(-\lambda)$ using the Poisson-Gamma representation in \eqref{cpois}).
Further, we allow covariates to be incorporated in both parts such that the parameters $\bs{q} = (q_1, \cdots, q_n)'$ and $\bs{\mu} = (\mu_1, \cdots, \mu_n)'$ satisfy
\begin{align}
\varphi(\bs{q}) = \bs{G\gamma}, \quad \eta(\bs{\mu}) = \bs{X\beta},
\end{align}
where $\varphi$ and $\eta$ are the known link functions, $\bs{G}$ and $\bs{X}$ the design matrices, and $\bs{\gamma}$ and $\bs{\beta}$ the vectors of regression parameters.
We denote all the parameters in the model as $\bs{\Theta} = (\bs{\gamma}', \bs{\beta}', \phi, p)'$, and $g$ as the density function for the compound Poisson variable. Then, the loglikelihood, written as a function of the parameters, can be derived as
\begin{align}
\ell(\bs{\Theta} | \bs{y}) = \sum_{i = 1}^n \log \left[ q_i \cdot \ind{(y_i = 0)} + (1 - q_i) \cdot g(y_i|\mu_i, \phi, p) \right].
\end{align}
This loglikelihood is maximized, subject to the constraints $\phi > 0$ and $p \in (1, 2)$, to find estimates of $\bs{\Theta}$.
Unlike existing models, ZICP is a model that works on zero-inflated continuous data. One application of the ZICP model is to investigate insurance claim underreporting, a phenomenon known as the ``hunger for bonus''. It means that in the no claims discount system many insurance companies adopt, the insureds are motivated not to report small claims in exchange for bonuses or discounts on future premiums, thus often resulting in structural zeros in the data. The zero-inflation part of the model enables one to explore the determinants of the claim underreporting behavior by examining the relevant factors. For example, we can fit a zero-inflated model to the auto insurance data as follows:
<<eval = FALSE>>=
P3 <- zcpglm(CLM_AMT5 ~ CAR_USE + MARRIED + AREA + MVR_PTS||
MVR_PTS + INCOME, data = da)
summary(P3)
@
\begin{Verbatim}
Call:
zcpglm(formula = CLM_AMT5 ~ CAR_USE + MARRIED + AREA + MVR_PTS ||
MVR_PTS + INCOME, data = da)
Zero-inflation model coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.58817 0.11787 4.99 6.04e-07 ***
MVR_PTS -0.77592 0.05341 -14.53 < 2e-16 ***
INCOME 0.04602 0.01420 3.24 0.00119 **
Compound Poisson model coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.33670 0.12949 10.323 < 2e-16 ***
CAR_USECommercial 0.03984 0.06764 0.589 0.556
MARRIEDYes -0.09394 0.06649 -1.413 0.158
AREAUrban 0.62814 0.11583 5.423 5.87e-08 ***
MVR_PTS 0.05418 0.01379 3.929 8.52e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Estimated dispersion parameter: 4.5116
Estimated index parameter: 1.4445
\end{Verbatim}
In the formula specification, we use \texttt{"||"} to separate the two models - regressors in the compound Poisson part are before \texttt{"||"} and covariates in the zero-inflation part are after \texttt{"||"}. In addition, offsets can be specified in both components of the model, e.g., \\ \verb=y ~ x1 + offset(x2) | z1 + z2 + offset(z3)=.
The zero-inflation part of the result indicates that policyholders with higher income or lower MVR points are more likely to underreport insurance claims. Indeed, the zero-inflation part allows one to more adequately model the patterns in the observed frequency of zeros than the GLM-type model. For example, Figure \ref{fig:zcpglm} shows the observed frequency of zeros and the predicted probability of zeros from the \texttt{cpglm} and \texttt{zcpglm} for each value of \texttt{MVR\_PTS} .
<<eval = FALSE>>=
# p(y = 0) in cpglm
da$prob0 <- exp(-fitted(P1)^(2 - P1$p)/(P1$phi * (2 - P1$p)))
# p(y = 0) in zcpglm
q <- predict(P3, da, type = "zero")
mut <- predict(P3, da, type = "tweedie")
da$prob1 <- q + (1 - q) * exp(-mut^(2 - P3$p) / (P3$phi * (2- P3$p)))
# plot
da$MVR_BIN <- Hmisc::cut2(da$MVR_PTS, g = 8, levels.mean = TRUE)
tmp <- plyr::ddply(da, c("MVR_BIN"), summarize,
freq0 = sum(CLM_AMT5 == 0)/length(CLM_AMT5),
prob0 = mean(prob0),
prob1 = mean(prob1))
tmp$MVR_BIN <- as.numeric(as.character(tmp$MVR_BIN))
with(tmp, {plot(MVR_BIN, freq0, cex = 0.5, mgp = c(1.3, 0.2, 0),
ylab = "probability of zero")
lines(MVR_BIN, prob0, col = 2)
lines(MVR_BIN, prob1, col = 3)})
legend("topright", c("OBS", "CPGLM", "ZCPGLM"),
pch = c(1, NA, NA), lty = c(NA, 1, 1), col = 1:3)
@
\begin{figure}[t]
\centering
\includegraphics [angle = 0, scale = 0.8]{zcpglm.pdf}
\caption{Trace plot of the two chains in the claim triangle example. }
\label{fig:zcpglm}
\end{figure}
%pdf("zcpglm.pdf", 2.5, 2.5)
%parold <- mypar()
%with(tmp, {plot(MVR_BIN, freq0, cex = 0.5, mgp = c(1.3, 0.2, 0),
% ylab = "probability of zero")
% lines(MVR_BIN, prob0, col = 2)
% lines(MVR_BIN, prob1, col = 3)})
%legend("topright", c("OBS", "CPGLM", "ZCPGLM"),
% pch = c(1, NA, NA), lty = c(NA, 1, 1), col = 1:3, cex = 0.4)
%par(parold)
%dev.off()
In the above, the \texttt{predict} method computes the fitted/predicted values depending on the \texttt{type} argument. The returned value is $(1 - q_i)\cdot \mu_i$, $q_i$ or $\mu_i$ for \texttt{type = "response"} (the default), \texttt{type = "zero"} or \texttt{type = "tweedie"}, respectively.
\subsection{The Gini index}
For model comparison involving the compound Poisson distribution, the usual mean squared loss function is not quite informative for capturing the differences between predictions and observations, due to the high proportions of zeros and the skewed heavy-tailed distribution of the positive outcomes. For this reason, \cite{Frees;Meyers;Cummings:2011} develop an ordered version of the Lorenz curve and the associated Gini index as a statistical measure of the association between distributions, through which different predictive models can be compared. The idea is that a score (model) with a greater Gini index produces a greater separation among the observations. In the insurance context, a higher Gini index indicates greater ability to distinguish good risks from bad risks. Therefore, the model with the highest Gini index is preferred.
Continuing using insurance as an example, we let $y_i$ be the observed loss payment, $P_i$ be the baseline premium, $S_i$ be the insurance score (predictions from the model), $R_i = S_i / P_i$ be the relativity, for the $i_{th}$ observation. We sort all the policies by the relativities in an increasing order, and compute the empirical cumulative premium and loss distributions as
\begin{align}
\hat{F}_P(s) = \frac{\sum_{i=1}^n P_i \cdot \ind{(R_i \le s)}}{\sum_{i=1}^n P_i}, \; \hat{F}_L(s) = \frac{\sum_{i=1}^n y_i \cdot \ind{(R_i \le s)}}{\sum_{i=1}^n y_i}.
\end{align}
The graph $\left(\hat{F}_P(s), \hat{F}_L(s)\right)$ is an ordered Lorenz curve. The area between the Lorenz curve and the line of equality (a 45 degree line) is measure of the discrepancy between the premium and loss distributions, and twice this area is defined as the Gini index. Intuitively, the Gini index can be regarded as the average expected profit that can be gained under the new pricing scheme defined by the score.
The function \texttt{gini} computes the Gini indices and their asymptotic standard errors based on the ordered Lorenz curve. These metrics are mainly used for model comparison. Depending on the problem, there are generally two ways to do this. Take insurance predictive modeling as an example. First, when there is a baseline premium, we can compute the Gini index for each score (predictions from the model), and select the model with the highest Gini index. Second, when there is no baseline premium (\texttt{base = NULL}), we successively specify the prediction from each model as the baseline premium and use the remaining models as the scores. This results in a matrix of Gini indices, and we select the model that is least vulnerable to alternative models using a ``mini-max'' argument - we select the score that provides the smallest of the maximal Gini indices, taken over competing scores.
For example, we use the Gini index to compare the three models \texttt{P1}, \texttt{P2} and \texttt{P3} in the auto insurance example. This can be done as follows:
<<eval = FALSE>>=
da <- transform(da, P1 = fitted(P1), P2 = fitted(P2), P3 = fitted(P3))
gg <- gini(loss = "CLM_AMT5", score = paste("P", 1:3, sep = ""),
data = da)
gg
@
\begin{Verbatim}
Call:
gini(loss = "CLM_AMT5", score = paste("P", 1:3, sep = ""), data = da)
Gini indices:
P1 P2 P3
P1 0.000 12.028 11.880
P2 -1.085 0.000 3.345
P3 3.875 5.977 0.000
Standard errors:
P1 P2 P3
P1 0.000 2.148 2.098
P2 2.191 0.000 2.295
P3 2.106 2.275 0.000
\end{Verbatim}
We see that the GLM-type model is the worst, while according to the ``mini-max'' argument, the selected best model is the additive model \texttt{P2}. The \texttt{"gini"} object has a slot named \texttt{lorenz} that stores the graph for the underlying ordered Lorenz curve. A \texttt{plot} method is defined to extract and plot the graphs of the Lorenz curve using \texttt{ggplot2}:
<<eval = FALSE>>=
theme_set(theme_bw())
plot(gg, overlay = FALSE)
@
\begin{figure}[t]
\centering
\includegraphics{gini}
\caption{Plot of the ordered Lorenz curves for the auto insurance example.}
\end{figure}
\section{Acknowledgement}
\begin{itemize}
\item The \texttt{cpglmm} function is based on the source code of the \texttt{lme4.0} package, with changes made on updating of the mean, the variance function and the marginal loglikelihood. Major credits go to the authors of the package.
\item The functionality on penalized splines is due to Fabian Scheipl \email{fabian.scheipl@googlemail.com}, who wrote the original \texttt{amer} package.
\end{itemize}
\newpage
\bibliography{literatur}
\end{document}