%
%  Copyright (C) 2008 Bettina Gruen and Friedrich Leisch
%  $Id: mixture-regressions.Rnw $
%
\documentclass[nojss]{jss} 
\usepackage{amsfonts}
\title{FlexMix Version 2: Finite Mixtures with Concomitant
  Variables and Varying and Constant Parameters}
\Plaintitle{FlexMix Version 2: Finite Mixtures with Concomitant
  Variables and Varying and Constant Parameters}
\Shorttitle{FlexMix Version 2}

\author{Bettina Gr{\"u}n\\
  Wirtschaftsuniversit{\"a}t Wien \And
  Friedrich Leisch\\
  Universit\"at f\"ur Bodenkultur Wien} 

\Plainauthor{Bettina Gr{\"u}n, Friedrich Leisch}

\Address{
  Bettina Gr\"un\\
  Institute for Statistics and Mathematics\\
  Wirtschaftsuniversit{\"a}t Wien\\
  Welthandelsplatz 1\\
  1020 Wien, Austria\\
  E-mail: \email{Bettina.Gruen@R-project.org}\\

  Friedrich Leisch\\
  Institut f\"ur Angewandte Statistik und EDV\\
  Universit\"at f\"ur Bodenkultur Wien\\
  Peter Jordan Stra\ss{}e 82\\
  1190 Wien, Austria\\
  E-mail: \email{Friedrich.Leisch@boku.ac.at}
}

\Abstract{ 

  This article is a (slightly) modified version of
  \cite{mixtures:Gruen+Leisch:2008a}, published in the \emph{Journal
    of Statistical Software}.
  
  \pkg{flexmix} provides infrastructure for flexible fitting of finite
  mixture models in \proglang{R} using the expectation-maximization
  (EM) algorithm or one of its variants. The functionality of the
  package has been enhanced: concomitant variable models as well as
  varying and constant parameters for the component specific
  generalized linear regression models can now be fitted. The
  application of the package is demonstrated on several examples, the
  implementation is described, and examples illustrate how new
  drivers for the component specific models and the concomitant
  variable models can be defined.

}

\Keywords{\proglang{R}, finite mixture models, generalized linear models, concomitant variables}
\Plainkeywords{R, finite mixture models, generalized linear models, concomitant variables}

\usepackage{amsmath, listings}
\def\argmax{\mathop{\rm arg\,max}}
%% \usepackage{Sweave} prevent automatic inclusion
\SweaveOpts{width=9, height=4.5, eps=FALSE, keep.source=TRUE}
<<echo=false,results=hide>>=
options(width=60, prompt = "R> ", continue = "+  ", useFancyQuotes = FALSE)
library("graphics")
library("stats")
library("flexmix")
library("lattice")
ltheme <- canonical.theme("postscript", FALSE)
lattice.options(default.theme=ltheme)
data("NPreg", package = "flexmix")
data("dmft", package = "flexmix")
source("myConcomitant.R")
@ 

%%\VignetteIndexEntry{FlexMix Version 2: Finite Mixtures with Concomitant Variables and Varying and Constant Parameters}
%%\VignetteDepends{flexmix}
%%\VignetteKeywords{R, finite mixture models, model based clustering, latent class regression}
%%\VignettePackage{flexmix}

\begin{document}
%%-----------------------------------------------------------------------
%%-----------------------------------------------------------------------  
\section{Introduction}\label{sec:introduction}
Finite mixture models are a popular technique for modelling unobserved
heterogeneity or to approximate general distribution functions in a
semi-parametric way. They are used in many different areas such as
astronomy, biology, economics, marketing and medicine. Overviews of
mixture models are given in \cite{mixtures:Everitt+Hand:1981},
\cite{mixtures:Titterington+Smith+Makov:1985},
\cite{mixtures:McLachlan+Basford:1988}, \cite{mixtures:Boehning:1999},
\cite{mixtures:McLachlan+Peel:2000} and
\cite{mixtures:Fruehwirth-Schnatter:2006}.

Version 1 of \proglang{R} package \pkg{flexmix} was introduced in
\cite{mixtures:Leisch:2004}. The main design principles of the package
are extensibility and fast prototyping for new types of mixture
models. It uses \proglang{S}4 classes and methods
\citep{mixtures:Chambers:1998} as implemented in the \proglang{R}
package \pkg{methods} and exploits advanced features of \proglang{R} such as
lexical scoping \citep{mixtures:Gentleman+Ihaka:2000}. The package
implements a framework for maximum likelihood estimation with the
expectation-maximization (EM) algorithm
\citep{mixtures:Dempster+Laird+Rubin:1977}. The main focus is on
finite mixtures of regression models and it allows for multiple
independent responses and repeated measurements.  The EM algorithm can
be controlled through arguments such as the maximum number of
iterations or a minimum improvement in the likelihood to continue.

Newly introduced features in the current package version are
concomitant variable models \citep{mixtures:Dayton+Macready:1988} and
varying and constant parameters in the component specific regressions.
Varying parameters follow a finite mixture, i.e., several groups exist
in the population which have different parameters. Constant parameters
are fixed for the whole population. This model is similar to
mixed-effects models \citep{mixtures:Pinheiro+Bates:2000}.  The main
difference is that in this application the distribution of the varying
parameters is unknown and has to be estimated.  Thus the model is
actually closer to the varying-coefficients modelling framework
\citep{mixtures:Hastie+Tibshirani:1993}, using convex combinations of
discrete points as functional form for the varying coefficients.

The extension to constant and varying parameters allows, for example,
to fit varying intercept models as given in
\cite{mixtures:Follmann+Lambert:1989} and \cite{mixtures:Aitkin:1999}.
These models are frequently applied to account for overdispersion in
the data where the components follow either a binomial or Poisson
distribution. The model was also extended to include nested varying
parameters, i.e.~groups of components may share the same
parameters \citep{mixtures:Gruen+Leisch:2006,
  mixtures:Gruen:2006}.

In Section~\ref{sec:model-spec-estim} the extended model class is
presented together with the parameter estimation using the EM
algorithm. In Section~\ref{sec:using-new-funct} examples are given to
demonstrate how the new functionality can be used. An overview of the
implementation details is given in Section~\ref{sec:implementation}.
The new model drivers are presented and changes made to improve the
flexibility of the software and to enable the implementation of the
new features are discussed. Examples for writing new drivers for the
component specific models and the concomitant variable models are
given in Section~\ref{sec:writing-your-own}. This paper gives a short
overview of finite mixtures and the package in order to be
self-contained.  A more detailed introduction to finite mixtures and
the package \pkg{flexmix} can be found in \cite{mixtures:Leisch:2004}.

All computations and graphics in this paper have been done with
\pkg{flexmix} version
\Sexpr{packageDescription("flexmix",fields="Version")} and
\proglang{R} version \Sexpr{getRversion()} using Sweave
\citep{mixtures:Leisch:2002}. The newest release version of \pkg{flexmix} is
always available from the Comprehensive \proglang{R} Archive Network at
\url{http://CRAN.R-project.org/package=flexmix}. An up-to-date version of this paper
is contained in the package as a vignette, giving full access to the \proglang{R} code
behind all examples shown below. See \code{help("vignette")} or
  \cite{mixtures:Leisch:2003} for details on handling package vignettes.

%%-----------------------------------------------------------------------
%%-----------------------------------------------------------------------  
\section{Model specification and estimation}\label{sec:model-spec-estim}

A general model class of finite mixtures of regression models is
considered in the following. The mixture is assumed to consist of $K$
components where each component follows a parametric distribution.
Each component has a weight assigned which indicates the a-priori
probability for an observation to come from this component and the
mixture distribution is given by the weighted sum over the $K$
components. If the weights depend on further variables, these are
referred to as concomitant variables.

In marketing, choice behaviour is often modelled as depending on
marketing mix variables such as price, promotion and display. Under
the assumption that groups of respondents with different price,
promotion and display elasticities exist, mixtures of regressions are
fitted to model consumer heterogeneity and segment the market.
Socio-demographic variables such as age and gender have often been
shown to be related to the different market segments even though they
generally do not perform well when used to segment the market
a-priori. The relationship between the behavioural and the
socio-demographic variables is then modelled through concomitant
variable models where the group sizes (i.e.~the weights of the
mixture) depend on the socio-demographic variables.

The model class is given by
\begin{align*}
  h(y|x, w, \psi) &= \sum_{k = 1}^K
  \pi_k(w, \alpha) f_k(y|x,\theta_{k})\\
  &= \sum_{k = 1}^K \pi_k(w, \alpha) \prod_{d=1}^D
  f_{kd}(y_d|x_d,\theta_{kd}),
\end{align*}
where $\psi$ denotes the vector of all parameters for the mixture
density $h()$ and is given by $(\alpha, (\theta_k)_{k=1,\ldots,K})$.
$y$ denotes the response, $x$ the predictor and $w$ the concomitant
variables. $f_k$ is the component specific density function.
Multivariate variables $y$ are assumed to be divisible into $D$
subsets where the component densities are independent between the
subsets, i.e.~the component density $f_k$ is given by a product over
$D$ densities which are defined for the subset variables $y_d$ and
$x_d$ for $d=1,\ldots,D$. The component specific parameters are given
by $\theta_k = (\theta_{kd})_{d=1,\ldots,D}$. Under the assumption
that $N$ observations are available the dimensions of the variables
are given by $y = (y_d)_{d=1,\ldots,D} \in \mathbb{R}^{N \times
  \sum_{d=1}^D k_{yd}}$, $x = (x_d)_{d=1,\ldots,D} \in \mathbb{R}^{N
  \times \sum_{d=1}^D k_{xd}}$ for all $d = 1,\ldots,D$ and $w \in
\mathbb{R}^{N \times k_w}$. In this notation $k_{yd}$ denotes the
dimension of the $d^{\textrm{th}}$ response, $k_{xd}$ the dimension of
the $d^{\textrm{th}}$ predictors and $k_w$ the dimension of the
concomitant variables. For mixtures of GLMs each of the $d$ responses
will in general be univariate, i.e.~multivariate responses will be
conditionally independent given the segment memberships. 

For the component weights $\pi_k$ it holds for all $w$ that
\begin{equation}\label{eq:prior}
  \sum_{k=1}^K \pi_k(w,\alpha) = 1 \quad \textrm{and} \quad
  \pi_k(w, \alpha) > 0, \, \forall k,
\end{equation}
where $\alpha$ are the parameters of the concomitant variable model. 

For the moment focus is given to finite mixtures where the component
specific densities are from the same parametric family, i.e.~$f_{kd}
\equiv f_d$ for notational simplicity. If $f_d$ is from the
exponential family of distributions and for each component a
generalized linear model is fitted
\citep[GLMs;][]{mixtures:McCullagh+Nelder:1989} these models are also
called GLIMMIX models \citep{mixtures:Wedel+DeSarbo:1995}. In this
case the component specific parameters are given by $\theta_{kd} =
(\beta'_{kd}, \phi_{kd})$ where $\beta_{kd}$ are the regression
coefficients and $\phi_{kd}$ is the dispersion parameter.

The component specific parameters $\theta_{kd}$ are either restricted
to be equal over all components, to vary between groups of components
or to vary between all components. The varying between groups is
referred to as varying parameters with one level of nesting. A disjoint
partition $K_c$, $c = 1,\ldots,C$ of the set $\tilde{K} :=
\{1,\ldots,K\}$ is defined for the regression coefficients. $C$ is the
number of groups of the regression coefficients at the nesting level.
The regression coefficients are accordingly split into three groups:
\begin{align*}
  \beta_{kd} &= (\beta'_{1d}, \beta'_{2,c(k)d},
  \beta'_{3,kd})',
\end{align*}
where $c(k) = \{c = 1,\ldots, C: k \in K_c\}$. 

Similarly, a disjoint partition $K_v$, $v = 1,\ldots,V$, of $\tilde{K}$
can be defined for the dispersion parameters if nested varying
parameters are present. $V$ denotes the number of groups of the
dispersion parameters at the nesting level.  This gives:
\begin{align*}
 \phi_{kd} &= \left\{\begin{array}{ll}
     \phi_{d} & \textrm{for constant parameters}\\
     \phi_{kd} & \textrm{for varying parameters}\\
     \phi_{v(k)d} & \textrm{for nested varying parameters}
     \end{array}\right.
\end{align*}
where $v(k) = \{v = 1,\ldots,V: k \in K_v\}$. The nesting structure of
the component specific parameters is also described in
\cite{mixtures:Gruen+Leisch:2006}.

Different concomitant variable models can be used to determine the
component weights \citep{mixtures:Dayton+Macready:1988}. The mapping
function only has to fulfill condition \eqref{eq:prior}. In the
following a multinomial logit model is assumed for the $\pi_k$ given
by
\begin{equation*}
  \pi_k(w,\alpha) =
  \frac{e^{w'\alpha_k}}{\sum_{u = 1}^K
    e^{w'\alpha_u}} \quad \forall k,
\end{equation*}
with $\alpha = (\alpha'_k)'_{k=1,\ldots,K}$ and $\alpha_1 \equiv 0$.
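As an illustration, the multinomial logit weights can be computed in a
few lines of base \proglang{R}. The following sketch (function and
variable names are purely illustrative and not part of \pkg{flexmix})
imposes the identifiability constraint $\alpha_1 \equiv 0$ by fixing
the first column of the coefficient matrix at zero:

```r
## Multinomial logit component weights pi_k(w, alpha) with alpha_1 = 0.
## 'w' is the N x k_w matrix of concomitant variables, 'alpha' a
## k_w x K coefficient matrix whose first column is zero.
concomitant_weights <- function(w, alpha) {
  eta <- w %*% alpha              # N x K linear predictors w'alpha_k
  eta <- eta - apply(eta, 1, max) # subtract row maximum for stability
  expeta <- exp(eta)
  expeta / rowSums(expeta)        # rows sum to one by construction
}

## Example: two components, concomitants = intercept and one covariate
w <- cbind(1, c(-1, 0, 2))
alpha <- cbind(c(0, 0), c(0.5, -1))  # alpha_1 = 0 for identifiability
p <- concomitant_weights(w, alpha)
```

By construction the resulting weights are strictly positive and sum to
one for every observation, so condition \eqref{eq:prior} is fulfilled.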
%%-------------------------------------------------------------------------
\subsection{Parameter estimation}\label{sec:estimation}
The EM algorithm \citep{mixtures:Dempster+Laird+Rubin:1977} is the
most common method for maximum likelihood estimation of finite mixture
models where the number of components $K$ is fixed. The EM algorithm
applies a missing data augmentation scheme. It is assumed that a
latent variable $z_n \in \{0,1\}^K$ exists for each observation $n$
which indicates the component membership, i.e.~$z_{nk}$ equals 1 if
observation $n$ comes from component $k$ and 0 otherwise. Furthermore
it holds that $\sum_{k=1}^K z_{nk}=1$ for all $n$. In the EM algorithm
these unobserved component memberships $z_{nk}$ of the observations
are treated as missing values and the data is augmented by estimates
of the component membership, i.e.~the estimated a-posteriori
probabilities $\hat{p}_{nk}$. For a sample of $N$ observations
$\{(y_1, x_1, w_1), \ldots, (y_N, x_N, w_N)\}$ the EM algorithm is
given by:
\begin{description}
\item[E-step:] Given the current parameter estimates $\psi^{(i)}$ in
  the $i$-th iteration, replace the missing data $z_{nk}$ by the
  estimated a-posteriori probabilities
\begin{align*}
  \hat{p}_{nk} & = \frac{\displaystyle
    \pi_k(w_n, \alpha^{(i)}) f(y_n| x_n,
    \theta_k^{(i)}) 
  }{\displaystyle
    \sum_{u = 1}^K \pi_u(w_n, \alpha^{(i)}) f(y_n |x_n,
    \theta_u^{(i)}) 
  }.
\end{align*}
\item[M-step:] Given the estimates for the a-posteriori probabilities
  $\hat{p}_{nk}$ (which are functions of $\psi^{(i)}$), obtain new
  estimates $\psi^{(i+1)}$ of the parameters by maximizing
\begin{align*}
  Q(\psi^{(i+1)}|\psi^{(i)}) &= Q_1(\theta^{(i+1)} | \psi^{(i)}) +
  Q_2(\alpha^{(i+1)} | \psi^{(i)}),
\end{align*}
where
\begin{align*}
  Q_1(\theta^{(i+1)} | \psi^{(i)}) &= \sum_{n = 1}^N \sum_{k = 1}^K
  \hat{p}_{nk} \log(f(y_n | x_n, \theta_k^{(i+1)}))
\end{align*}
and
\begin{align*}
  Q_2(\alpha^{(i+1)}| \psi^{(i)}) &= \sum_{n = 1}^N
  \sum_{k =  1}^K \hat{p}_{nk}
  \log(\pi_k(w_n, \alpha^{(i+1)})). 
\end{align*}
$Q_1$ and $Q_2$ can be maximized separately. The maximization of $Q_1$
gives new estimates $\theta^{(i+1)}$ and the maximization of $Q_2$
gives $\alpha^{(i+1)}$. $Q_1$ is maximized separately for each
$d=1,\ldots,D$ using weighted ML estimation of GLMs and $Q_2$ using
weighted ML estimation of multinomial logit models.
\end{description}
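The E- and M-steps above can be sketched in base \proglang{R} for the
special case of component-independent weights and a single Gaussian
regression per component ($D = 1$, no concomitant variables). The
following is a simplified illustration under these assumptions, with
hypothetical names, and not the \pkg{flexmix} implementation:

```r
## Simplified EM for a K-component mixture of Gaussian regressions with
## component-independent weights; an illustrative sketch only.
em_mixreg <- function(y, X, K = 2, iter = 50) {
  N <- length(y)
  p <- matrix(runif(N * K), N, K)
  p <- p / rowSums(p)                  # random initial posteriors
  for (i in seq_len(iter)) {
    ## M-step: weighted ML estimation per component (here weighted
    ## least squares), using the posteriors as weights
    fits <- lapply(seq_len(K), function(k) lm.wfit(X, y, w = p[, k]))
    sigma <- vapply(seq_len(K), function(k)
      sqrt(sum(p[, k] * fits[[k]]$residuals^2) / sum(p[, k])), 0)
    pi_k <- colMeans(p)                # component weights
    ## E-step: recompute the a-posteriori probabilities
    dens <- vapply(seq_len(K), function(k)
      pi_k[k] * dnorm(y, drop(X %*% fits[[k]]$coefficients), sigma[k]),
      numeric(N))
    p <- dens / rowSums(dens)
  }
  list(posteriors = p, loglik = sum(log(rowSums(dens))))
}

## Simulated data with two regression lines
set.seed(42)
x <- runif(200, 0, 10)
y <- ifelse(runif(200) < 0.5, 5 * x, 15 + 10 * x - x^2) +
  rnorm(200, sd = 3)
fit <- em_mixreg(y, cbind(1, x, x^2), K = 2)
```

In \pkg{flexmix} the weighted maximization in the M-step is delegated
to the model drivers, which makes the same EM loop reusable for
arbitrary component distributions.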

Different variants of the EM algorithm exist such as the stochastic EM
\citep[SEM;][]{mixtures:Diebolt+Ip:1996} or the classification EM
\citep[CEM;][]{mixtures:Celeux+Govaert:1992}. These two variants are
also implemented in package \pkg{flexmix}. For both variants an
additional step is made between the expectation and maximization
steps. This step uses the estimated a-posteriori probabilities and
assigns each observation to only one component, i.e.~classifies it
into one component. For SEM this assignment is determined in a
stochastic way while it is a deterministic assignment for CEM. For the
SEM algorithm the additional step is given by:
\begin{description}
\item[S-step:] Given the a-posteriori probabilities draw 
\begin{align*}
  \hat{z}_n &\sim \textrm{Mult}((\hat{p}_{nk})_{k=1,\ldots,K}, 1)
\end{align*}
where $\textrm{Mult}(\theta, T)$ denotes the multinomial distribution
with success probabilities $\theta$ and number of trials $T$.
\end{description}
Afterwards, the $\hat{z}_{nk}$ are used instead of the
$\hat{p}_{nk}$ in the M-step. For the CEM the additional step is given
by:
\begin{description}
\item[C-step:] Given the a-posteriori probabilities define
\begin{align*}
  \hat{z}_{nk} &= \left\{\begin{array}{ll}
      1&\textrm{if } k = \min\{ l : \hat{p}_{nl} \geq \hat{p}_{nu}\, \forall u=1,\ldots,K\}\\
      0&\textrm{otherwise}.
      \end{array}\right.
\end{align*}
\end{description}
Please note that in this step the observation is assigned to the
component with the smallest index if the same maximum a-posteriori
probability is observed for several components.
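Both classification steps can be sketched in a few lines of base
\proglang{R} (illustrative only; \pkg{flexmix} implements them
internally):

```r
## Hard assignments from an N x K matrix 'p' of a-posteriori
## probabilities.

## S-step: draw the component membership of each observation from a
## multinomial distribution with one trial and probabilities p[n, ].
s_step <- function(p)
  t(apply(p, 1, function(pr) rmultinom(1, size = 1, prob = pr)))

## C-step: assign each observation to the component with the maximum
## posterior; ties go to the smallest component index, as noted above.
c_step <- function(p) {
  k <- max.col(p, ties.method = "first")
  z <- matrix(0, nrow(p), ncol(p))
  z[cbind(seq_len(nrow(p)), k)] <- 1
  z
}

p <- matrix(c(0.5, 0.5, 0.2, 0.8), 2, 2, byrow = TRUE)
```

For the first row of \code{p} the posteriors are tied, so the C-step
deterministically picks component 1, while the S-step picks either
component at random.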

Both of these variants have been proposed to improve the performance
of the EM algorithm, because the ordinary EM algorithm tends to
converge rather slowly and only to a local optimum. The convergence
behavior can be expected to be better for the CEM than ordinary EM
algorithm, while SEM can escape convergence to a local optimum.
However, the CEM algorithm does not give ML estimates because it
maximizes the complete likelihood.  For SEM, good approximations of the
ML estimator are obtained by using as estimates the parameter values
at which the maximum likelihood was encountered.  Another possibility
for determining parameter estimates from the SEM algorithm is to take
the mean after discarding a suitable number of burn-in iterations. An
implementational advantage of both variants is that no weighted
maximization is necessary in the M-step.

It has been shown that the likelihood values increase monotonically
during the EM algorithm. On the one hand this ensures the
convergence of the EM algorithm if the likelihood is bounded, but on
the other hand only the detection of a local maximum can be
guaranteed.  Therefore, it is recommended to repeat the EM algorithm
with different initializations and choose as final solution the one
with the maximum likelihood. Different initialization strategies for
the EM algorithm have been proposed, as its convergence to the optimal
solution depends on the initialization
\citep{mixtures:Biernacki+Celeux+Govaert:2003,mixtures:Karlis+Xekalaki:2003}.
Proposed strategies are for example to first make several runs of the
SEM or CEM algorithm with different random initializations and then
start the EM at the best solution encountered.

The component specific parameter estimates can be determined
separately for each $d=1,\ldots,D$. For simplicity of presentation the
following description assumes $D=1$. If all parameter estimates vary
between the component distributions they can be determined separately
for each component in the M-step.  However, if also constant or nested
varying parameters are specified, the component specific estimation
problems are no longer independent of each other. Parameters have
to be estimated which occur in several or all components and hence,
the parameters of the different components have to be determined
simultaneously for all components.  The estimation problem for all
component specific parameters is then obtained by replicating the
vector of observations $y = (y_n)_{n=1,\ldots,N}$ $K$ times and
defining the covariate matrix $X = (X_{\textrm{constant}},
X_{\textrm{nested}}, X_{\textrm{varying}})$ by
\begin{align*}
  &X_{\textrm{constant}} = \mathbf{1}_K \otimes (x'_{1,n})_{n=1,\ldots,N}\\
  &X_{\textrm{nested}} = \mathbf{J} \odot (x'_{2,n})_{n=1,\ldots,N}\\
  &X_{\textrm{varying}} = \mathbf{I}_K \otimes(x'_{3,n})_{n=1,\ldots,N},
\end{align*}
where $\mathbf{1}_K$ is a vector of 1s of length $K$, $\mathbf{J}$ is
the incidence matrix for each component $k=1,\ldots,K$ and each
nesting group $c = 1,\ldots,C$ and hence is of dimension $K \times C$, and
$\mathbf{I}_K$ is the identity matrix of dimension $K \times K$.
$\otimes$ denotes the Kronecker product and $\odot$ the Khatri-Rao
product (i.e., the column-wise Kronecker product). $x_{m,n}$ are the
covariates of the corresponding coefficients $\beta_{m,.}$ for
$m=1,2,3$. Please note that the weights used for the estimation are
the a-posteriori probabilities which are stacked for all components,
i.e.~a vector of length $N K$ is obtained.
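For illustration, assume $K=3$ components with a constant intercept,
one nested coefficient shared by components 1 and 2, and one varying
coefficient. The stacked covariate matrix can then be constructed as
follows (a sketch; the helper function and its names are hypothetical):

```r
## Stacked covariate matrix X = (X_constant, X_nested, X_varying).
## x1, x2, x3: N x k_x1 / k_x2 / k_x3 covariate matrices of the
## constant, nested and varying coefficients; J: K x C incidence
## matrix of the nesting structure.
stack_design <- function(x1, x2, x3, K, J) {
  X_constant <- kronecker(matrix(1, K, 1), x1)   # 1_K (x) x1
  ## Khatri-Rao-type product of J and x2: one column block per group c
  X_nested <- do.call(cbind, lapply(seq_len(ncol(J)),
    function(c) kronecker(J[, c, drop = FALSE], x2)))
  X_varying <- kronecker(diag(K), x3)            # I_K (x) x3
  cbind(X_constant, X_nested, X_varying)
}

## K = 3 components; components 1 and 2 share the nested coefficient
N <- 4; K <- 3
J <- rbind(c(1, 0), c(1, 0), c(0, 1))
x1 <- matrix(1, N, 1)        # intercept, constant over components
x2 <- matrix(1:N, N, 1)      # covariate of the nested coefficient
x3 <- matrix(rnorm(N), N, 1) # covariate of the varying coefficient
X <- stack_design(x1, x2, x3, K, J)
```

The resulting matrix has $NK$ rows and one column block per parameter
group; rows belonging to components outside a nesting group contain
structural zeros in the corresponding columns.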

Due to the replication of data in the case of constant or nested
varying parameters, the amount of memory needed for fitting the mixture
model to large datasets is substantially increased and it might be
easier to fit only varying coefficients to these datasets.  To
overcome this problem, special data structures could be implemented
in order to avoid storing the same data multiple times for
large datasets.

Before each M-step the average component sizes (over the given data
points) are checked and components which are smaller than a given
(relative) minimum size are omitted in order to avoid too small
components where fitting problems might arise. This strategy has
already been recommended for the SEM algorithm
\citep{mixtures:Celeux+Diebolt:1988} because it allows the suitable
number of components to be determined in an automatic way given that the
a-priori specified number of components is large enough. This
recommendation is based on the assumption that redundant
components will be omitted during the estimation process if the
algorithm is started with too many components. If omission of small
components is not desired, the minimum size required can be set to
zero. All components will then be retained throughout the EM algorithm
and a mixture with the number of components specified in the
initialization will be returned. The algorithm is stopped if the
relative change in the log-likelihood is smaller than a pre-specified
$\epsilon$ or the maximum number of iterations is reached.

For model selection different information criteria are available: AIC,
BIC and ICL \citep[Integrated Complete
Likelihood;][]{mixtures:Biernacki+Celeux+Govaert:2000}. They are of
the form twice the negative log-likelihood plus number of parameters
times $k$, where $k=2$ for the AIC and $k$ equals the logarithm of the
number of observations for the BIC. The ICL is the same as the BIC
except that the complete likelihood (where the missing class
memberships are replaced by the assignments induced by the maximum
a-posteriori probabilities) is used instead of the likelihood.
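Written out, the criteria take the following form, sketched here with
\code{loglik} the maximized log-likelihood, \code{cloglik} the
complete log-likelihood evaluated at the maximum a-posteriori
assignments, \code{df} the number of parameters and \code{N} the
number of observations (illustrative helper functions, not the
\pkg{flexmix} methods):

```r
## Information criteria of the form -2 * loglik + penalty * df.
aic <- function(loglik, df) -2 * loglik + 2 * df
bic <- function(loglik, df, N) -2 * loglik + log(N) * df
## ICL: as BIC, but based on the complete log-likelihood, i.e., the
## log-likelihood with class memberships fixed at the MAP assignments.
icl <- function(cloglik, df, N) -2 * cloglik + log(N) * df
```

Since the complete log-likelihood is never larger than the
log-likelihood, the ICL is always at least as large as the BIC for the
same model.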
%%-----------------------------------------------------------------------
%%-----------------------------------------------------------------------  
\section{Using the new functionality}
\label{sec:using-new-funct}
In the following, model fitting and model selection in \proglang{R} are
illustrated on several examples including mixtures of Gaussian,
binomial and Poisson regression models, see also
\cite{mixtures:Gruen:2006} and \cite{mixtures:Gruen+Leisch:2007a}.

More examples for mixtures of GLMs are provided as part of the
software package through a collection of artificial and real world
datasets, most of which have been previously used in the literature
(see references in the online help pages). Each dataset can be loaded
into \proglang{R} with \code{data("}\textit{name}\code{")} and the fitting of the proposed
models can be replayed using \code{example("}\textit{name}\code{")}.  Further details on
these examples are given in a user guide which can be accessed using
\code{vignette("regression-examples", package="flexmix")} from within
\proglang{R}. 
%%-----------------------------------------------------------------------  
\subsection{Artificial example}\label{sec:artificial-example}
In the following, the artificial dataset \code{NPreg} is used, which
was already used in \cite{mixtures:Leisch:2004} to illustrate the
application of package \pkg{flexmix}. The data comes from two latent
classes of size \Sexpr{nrow(NPreg)/2} each, and for each of the classes
the data is drawn according to the following structure:
\begin{center}
  \begin{tabular}{ll}
    Class~1: & $ \mathit{yn} = 5x+\epsilon$\\
    Class~2: & $ \mathit{yn} = 15+10x-x^2+\epsilon$
  \end{tabular}
\end{center}
with $\epsilon\sim N(0,9)$, see the left panel of
Figure~\ref{fig:npreg}. The dataset \code{NPreg} also includes a
response $\mathit{yp}$ which is given by a generalized linear model following a
Poisson distribution and using the logarithm as link function. The
parameters of the mean are given for the two classes by:
\begin{center}
  \begin{tabular}{ll}
    Class~1: & $ \mu_1 = 2 - 0.2x$\\
    Class~2: & $ \mu_2 = 1 + 0.1x$.
  \end{tabular}
\end{center}
This signifies that given $x$ the response $\mathit{yp}$ in group $k$ follows a
Poisson distribution with mean $e^{\mu_k}$, see the right panel of
Figure~\ref{fig:npreg}.
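The data-generating process described above can be sketched as follows
(an illustrative simulation with hypothetical names, not the code used
to create \code{NPreg}):

```r
set.seed(123)
n <- 100                              # observations per latent class
x <- runif(2 * n, 0, 10)
class <- rep(1:2, each = n)
## Gaussian responses: yn = 5x + e (class 1),
##                     yn = 15 + 10x - x^2 + e (class 2), e ~ N(0, 9)
yn <- ifelse(class == 1, 5 * x, 15 + 10 * x - x^2) +
  rnorm(2 * n, sd = 3)
## Poisson responses with log link: mu_1 = 2 - 0.2x, mu_2 = 1 + 0.1x
mu <- ifelse(class == 1, 2 - 0.2 * x, 1 + 0.1 * x)
yp <- rpois(2 * n, lambda = exp(mu))
sim <- data.frame(x, yn, yp, class)
```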

\setkeys{Gin}{width=\textwidth}
\begin{figure}
  \centering
<<fig=true, echo=false, results=hide>>=
par(mfrow=c(1,2))
plot(yn~x, col=class, pch=class, data=NPreg)
plot(yp~x, col=class, pch=class, data=NPreg)
@ 
  \caption{Standard regression example (left) and Poisson regression (right).}
  \label{fig:npreg}
\end{figure}

This model can be fitted in \proglang{R} using the commands:
<<>>= 
suppressWarnings(RNGversion("3.5.0"))
set.seed(1802)
library("flexmix")
data("NPreg", package = "flexmix")
Model_n <- FLXMRglm(yn ~ . + I(x^2))
Model_p <- FLXMRglm(yp ~ ., family = "poisson")
m1 <- flexmix(. ~ x, data = NPreg, k = 2, model = list(Model_n, Model_p),
  control = list(verbose = 10))
@ 

If the dimensions are independent the component specific model for
multivariate observations can be specified as a list of models for
each dimension.

The estimation can be controlled with the \code{control} argument
which is specified with an object of class \code{"FLXcontrol"}. For
convenience also a named list can be provided which is used to
construct and set the respective slots of the \code{"FLXcontrol"}
object.  Elements of the control object are \code{classify} to select
ordinary EM, CEM or SEM, \code{minprior} for the minimum relative size
of components, \code{iter.max} for the maximum number of iterations
and \code{verbose} for monitoring. If \code{verbose} is a positive
integer the log-likelihood is reported every \code{verbose} iterations
and at convergence together with the number of iterations made. The
default is to not report any log-likelihood information during the
fitting process.

The estimated model \code{m1} is of class \code{"flexmix"} and the
result of the default plot method for this class is given in
Figure~\ref{fig:root1}.  This plot method uses package \pkg{lattice}
\citep{mixtures:Sarkar:2008} and the usual parameters can be specified
to alter the plot, e.g.~the argument \code{layout} determines the
arrangement of the panels. The returned object is of class
\code{"trellis"} and the plotting can also be influenced by the
arguments of its show method.

The default plot prints rootograms (i.e., a histogram of the square
root of counts) of the a-posteriori probabilities of each observation
separately for each component. For each component the observations
with a-posteriori probabilities less than a pre-specified $\epsilon$
(default is $10^{-4}$) for this component are omitted in order to
prevent the bar at zero from dominating the plot
\citep{mixtures:Leisch:2004a}. Please note that the labels of the
y-axis report the number of observations in each bar, i.e.~the squared
values used for the rootograms.

\begin{figure}
  \centering
<<fig=true, echo=false, results=hide>>=
print(plot(m1))
@ 
\caption{The plot method for \code{"flexmix"} objects, here obtained
  by \code{plot(m1)}, shows rootograms of the posterior class
  probabilities.}
  \label{fig:root1}
\end{figure}

More detailed information on the estimated parameters with respect to
standard deviations and significance tests can be obtained with
function \code{refit()}. This function determines the
variance-covariance matrix of the estimated parameters by using the
inverse of the negative Hessian matrix as computed by the general purpose
optimizer \code{optim()} on the full likelihood of the
model. \code{optim()} is initialized at the solution obtained with the
EM algorithm. For mixtures of GLMs we also implemented the gradient,
which speeds up convergence and gives more precise estimates of the
Hessian.


Naturally, function \code{refit()} will also work for models which
have been determined by applying some model selection strategy
depending on the data (AIC, BIC, \ldots). The same caution is
necessary as when using \code{summary()} on standard linear models
selected using \code{step()}: The p-values shown are not correct
because they have not been adjusted for the fact that the same data
are used to select the model and compute the p-values. So use them
only in an exploratory manner in this context, see also
\cite{mixtures:Harrell:2001} for more details on the general problem.

The returned object can be inspected using \code{summary()} with
arguments \code{which} to specify if information for the component
model or the concomitant variable model should be shown and
\code{model} to indicate for which dimension of the component models
this should be done. Selecting \code{model=1} gives the parameter
estimates for the dimension where the response variable follows a
Gaussian distribution.


<<>>=
m1.refit <- refit(m1)
summary(m1.refit, which = "model", model = 1)
@ 

\begin{figure}
  \centering
<<fig=true, echo=false, results=hide, height=5>>=
print(plot(m1.refit, layout = c(1,3), bycluster = FALSE,
      main = expression(paste(yn *tilde(" ")* x + x^2))),
      split= c(1,1,2,1), more = TRUE)
print(plot(m1.refit, model = 2, 
           main = expression(paste(yp *tilde(" ")* x)), 
           layout = c(1,2), bycluster = FALSE), 
      split = c(2,1,2,1))
@ 
\caption{The default plot for refitted \code{"flexmix"} objects, here
  obtained by \code{plot(refit(m1), model = 1)} and
  \code{plot(refit(m1), model = 2)}, shows the coefficient estimates
  and their confidence intervals.}
  \label{fig:refit}
\end{figure}

The default plot method for the refitted \code{"flexmix"} object
depicts the estimated coefficients with corresponding confidence
intervals and is given in Figure~\ref{fig:refit}. It can be seen that
for the first model the confidence intervals of the coefficients of
the intercept and the quadratic term of \code{x} overlap with zero.

A model where these coefficients are set to zero can be estimated with
the model driver function \code{FLXMRglmfix()} and the following
commands for specifying the nesting structure. The argument
\code{nested} needs input for the number of components in each group
(given by \code{k}) and the formula which determines the model matrix
for the nesting (given by \code{formula}).  This information can be
provided in a named list.

For the restricted model the element \code{k} is a vector with two 1s
because each of the components has different parameters. The formulas
specifying the model matrices of these coefficients are
\verb/~ 1 + I(x^2)/ for an intercept and a quadratic term of $x$
for component 1 and \code{~ 0} for no additional coefficients for
component 2. The EM algorithm is initialized at the previously fitted
model by passing the posterior probabilities in the argument
\code{cluster}.
<<>>= 
Model_n2 <- FLXMRglmfix(yn ~ . + 0, nested = list(k = c(1, 1), 
  formula = c(~ 1 + I(x^2), ~ 0)))
m2 <- flexmix(. ~ x, data = NPreg, cluster = posterior(m1), 
  model = list(Model_n2, Model_p))
m2
@ 

Model selection based on the BIC would suggest the smaller model which
also corresponds to the true underlying model.
<<>>=
c(BIC(m1), BIC(m2))
@ 
%%-----------------------------------------------------------------------  
\subsection{Beta-blockers dataset}
\label{sec:beta-blockers}
The dataset is analyzed in \cite{mixtures:Aitkin:1999,
  mixtures:Aitkin:1999a} using a finite mixture of binomial regression
models. Furthermore, it is described in
\citet[p.~165]{mixtures:McLachlan+Peel:2000}. The dataset is from a
22-center clinical trial of beta-blockers for reducing mortality after
myocardial infarction. A two-level model is assumed to represent the
data, where centers are at the upper level and patients at the lower
level. The data is illustrated in Figure~\ref{fig:beta}.

First, the center information is ignored and a binomial logit
regression model with treatment as covariate is fitted using
\code{glm}, i.e.~$K=1$ and it is assumed that the different centers
are comparable:
<<>>=
data("betablocker", package = "flexmix")
betaGlm <- glm(cbind(Deaths, Total - Deaths) ~ Treatment, 
  family = "binomial", data = betablocker)
betaGlm
@ 

The residual deviance suggests that overdispersion is present in the
data. In the next step the intercept is allowed to follow a mixture
distribution given the centers. This signifies that the component
membership is fixed for each center. This grouping is specified in
\proglang{R} by adding \code{| Center} to the formula similar to the
notation used in \pkg{nlme} \citep{mixtures:Pinheiro+Bates:2000}.
Under the assumption of homogeneity within centers, identifiability of
the model class can be ensured based on the sufficient conditions for
identifiability given in \cite{mixtures:Follmann+Lambert:1991} for
binomial logit models with varying intercepts and in
\cite{mixtures:Gruen+Leisch:2008} for multinomial logit models with
varying and constant parameters. In order to determine a suitable
number of components, the mixture is fitted with different numbers of
components.
<<>>=
betaMixFix <- stepFlexmix(cbind(Deaths, Total - Deaths) ~ 1 | Center,
  model = FLXMRglmfix(family = "binomial", fixed = ~ Treatment), 
  k = 2:4, nrep = 5, data = betablocker)
@ 

The returned object is of class \code{"stepFlexmix"} and printing the
object gives the information on the number of iterations until
termination of the EM algorithm, a logical indicating if the EM
algorithm has converged, the log-likelihood and some model information
criteria. The plot method compares the fitted models using the
different model information criteria.
<<>>=
betaMixFix
@ 

A specific \code{"flexmix"} model contained in the
\code{"stepFlexmix"} object can be selected using \code{getModel()}
with argument \code{which} to specify the selection criterion. The
best model with respect to the BIC is selected with:
<<>>=
betaMixFix_3 <- getModel(betaMixFix, which = "BIC")
betaMixFix_3 <- relabel(betaMixFix_3, "model", "Intercept")
@ 

The components of the selected model are ordered with respect to the
estimated intercept values.

In this case a model with three components is selected with respect to
the BIC. The fitted values for the model with three components are
given in Figure~\ref{fig:beta} separately for each component and the
treatment and control groups.

The fitted parameters of the component specific models can be accessed
with:
<<>>=
parameters(betaMixFix_3)
@ 

Please note that the coefficients of variable \code{Treatment} are the
same for all three components.

\begin{figure}
  \centering
<<fig=true, echo=false, results=hide>>=
library("grid")
betablocker$Center <- with(betablocker, factor(Center, levels = Center[order((Deaths/Total)[1:22])]))
clusters <- factor(clusters(betaMixFix_3), labels = paste("Cluster", 1:3))
print(dotplot(Deaths/Total ~ Center | clusters, groups  = Treatment, as.table = TRUE,
              data  = betablocker, xlab = "Center", layout = c(3, 1), 
              scales  = list(x = list(cex = 0.7, tck = c(1, 0))),
              key = simpleKey(levels(betablocker$Treatment), lines = TRUE, corner = c(1,0))))
betaMixFix.fitted <- fitted(betaMixFix_3)
for (i in 1:3) {
  seekViewport(trellis.vpname("panel", i, 1))
  grid.lines(unit(1:22, "native"), unit(betaMixFix.fitted[1:22, i], "native"), gp = gpar(lty = 1))
  grid.lines(unit(1:22, "native"), unit(betaMixFix.fitted[23:44, i], "native"), gp = gpar(lty = 2))
}
@
\setkeys{Gin}{width=0.8\textwidth}
\caption{Relative number of deaths for the treatment and the control
  group for each center in the beta-blocker dataset. The centers are
  sorted by the relative number of deaths in the control group. The
  lines indicate the fitted values for each component of the
  3-component mixture model with varying intercept and
  constant parameters for treatment.}
  \label{fig:beta}
\end{figure}

The variable \code{Treatment} can also be included in the varying part
of the model. This signifies that a mixture distribution is assumed
where for each component different values are allowed for the
intercept and the treatment coefficient. This mixture distribution can
be specified using function \code{FLXMRglm()}. Again it is assumed
that the heterogeneity is only between centers and therefore the
aggregated data for each center can be used.
<<>>=
betaMix <- stepFlexmix(cbind(Deaths, Total - Deaths) ~ Treatment | Center,
  model = FLXMRglm(family = "binomial"), k = 3, nrep = 5, 
  data = betablocker)
betaMix <- relabel(betaMix, "model", "Treatment")
parameters(betaMix)
c(BIC(betaMixFix_3), BIC(betaMix))
@

The difference between model \code{betaMix} and \code{betaMixFix\_3} is
that the treatment coefficients are the same for all three components
for \code{betaMixFix\_3} while they have different values for
\code{betaMix} which can easily be seen when comparing the fitted
component specific parameters. The larger model \code{betaMix} which
also allows varying parameters for treatment has a higher BIC and
therefore the smaller model \code{betaMixFix\_3} would be preferred.

The default plot for \code{"flexmix"} objects gives a rootogram of the
posterior probabilities for each component. The argument \code{mark}
can be used to inspect the overlap of a specified component with the
remaining components: all observations assigned to the specified
component based on the maximum a-posteriori probabilities are coloured
in each of the panels.

\begin{figure}
  \centering
<<fig=true, echo=false, results=hide>>=
print(plot(betaMixFix_3, nint = 10, mark = 1, col = "grey", layout = c(3, 1)))
@ 
\caption{Default plot of \code{"flexmix"} objects where the
  observations assigned to the first component are
  marked.}\label{fig:default}
\end{figure}

\begin{figure}
  \centering
<<fig=true, echo=false, results=hide>>=
print(plot(betaMixFix_3, nint = 10, mark = 2, col = "grey", layout = c(3, 1)))
@ 
\caption{Default plot of \code{"flexmix"} objects where the observations
  assigned to the second component are marked.}\label{fig:default-2}
\end{figure}

The rootogram indicates that the components are well separated. In
Figure~\ref{fig:default} it can be seen that component 1 is completely
separated from the other two components, while
Figure~\ref{fig:default-2} shows that component 2 has a slight overlap
with both other components.

The cluster assignments using the maximum a-posteriori probabilities
are obtained with:
<<>>=
table(clusters(betaMix))
@ 

The estimated probabilities of death for each component for the
treated patients and those in the control group can be obtained with:
<<>>=
predict(betaMix, 
  newdata = data.frame(Treatment = c("Control", "Treated")))
@ 

or by obtaining the fitted values for two observations (e.g.~rows 1
and 23) with the desired levels of the predictor \code{Treatment}:

<<>>=
betablocker[c(1, 23), ]
fitted(betaMix)[c(1, 23), ]
@ 

A further analysis of the model is possible with function
\code{refit()} which returns the estimated coefficients together with
their standard errors, z-values and corresponding p-values. Please
note that the p-values are only approximate in the sense that they
have not been corrected for the fact that the data has already been
used to determine the specific fitted model.

<<>>=
summary(refit(betaMix))
@ 

Given the estimated treatment coefficients we now also compare this
model to a model where the treatment coefficient is assumed to be the
same for components 1 and 2. Such a model is specified using the model
driver \code{FLXMRglmfix()}. As the first two components are assumed
to have the same coefficient for treatment, while for the third
component the coefficient for treatment is set to zero, the argument
\code{nested} has \code{k = c(2, 1)} and \code{formula =
  c(\~{}Treatment, \~{}0)}.
<<>>=
ModelNested <- FLXMRglmfix(family = "binomial", nested = list(k = c(2, 1),
  formula = c(~ Treatment, ~ 0)))
betaMixNested <- flexmix(cbind(Deaths, Total - Deaths) ~ 1 | Center,
  model = ModelNested, k = 3, data = betablocker, 
  cluster = posterior(betaMix))
parameters(betaMixNested)
c(BIC(betaMix), BIC(betaMixNested), BIC(betaMixFix_3))
@ 

The comparison of the BIC values suggests that the nested model with
the same treatment effect for two components and no treatment effect
for the third component is the best.

%%-----------------------------------------------------------------------  
\subsection[Productivity of Ph.D. students in biochemistry]{Productivity of Ph.D.~students in biochemistry}
\label{sec:bioChemists}

<<hide=true, echo=false>>=
data("bioChemists", package = "flexmix")
@ 

This dataset is taken from \cite{mixtures:Long:1990}. It contains
\Sexpr{nrow(bioChemists)} observations from academics who obtained
their Ph.D.~degree in biochemistry in the 1950s and 60s. It includes
\Sexpr{sum(bioChemists$fem=="Women")} women and
\Sexpr{sum(bioChemists$fem=="Men")} men. The productivity was measured
by counting the number of publications in scientific journals during
the three-year period ending the year after the Ph.D.~was received.
In addition, data on the productivity and the prestige of the mentor
and the Ph.D.~department was collected. Two measures of family
characteristics were recorded: marriage status and the number of
children aged 5 or younger in the year of the Ph.D.

First, mixtures with one, two and three components and only varying
parameters are fitted, and the model minimizing the BIC is selected.
This is based on the assumption that unobserved heterogeneity is
present in the data due to latent differences among the students in
their ability to be productive and publish. Starting with the most
general model to determine the number of components using information
criteria and checking for possible model restrictions once the number
of components is fixed is a common strategy in finite mixture
modelling
\citep[see][]{mixtures:Wang+Puterman+Cockburn:1996}. Function
\code{refit()} is used to determine confidence intervals for the
parameters in order to choose suitable alternative models. However, it
has to be noted that in the course of the procedure these confidence
intervals will not be correct any more because the specific fitted
models have already been determined using the same data.

<<>>=
data("bioChemists", package = "flexmix")
Model1 <- FLXMRglm(family = "poisson")
ff_1 <- stepFlexmix(art ~ ., data = bioChemists, k = 1:3, model = Model1)
ff_1 <- getModel(ff_1, "BIC")
@ 

The selected model has \Sexpr{ff_1@k} components. The estimated
coefficients of the components are given in
Figure~\ref{fig:coefficients-1} together with the corresponding 95\%
confidence intervals using the plot method for objects returned by
\code{refit()}. The plot shows that the confidence intervals of the
parameters for \code{kid5}, \code{mar}, \code{ment} and \code{phd}
overlap for the two components. In the next step a mixture with two
components is therefore fitted where only a varying intercept and a
varying coefficient for \code{fem} are specified and all other
coefficients are constant. The EM algorithm is initialized with the
fitted mixture model using \code{posterior()}.

\begin{figure}
  \centering
<<fig=true, echo=false, results=hide, height=5.4, width=8.1>>=
print(plot(refit(ff_1), bycluster = FALSE, 
           scales = list(x = list(relation = "free"))))
@  
\caption{Coefficient estimates and confidence intervals for the model
  with only varying parameters.}\label{fig:coefficients-1}
\end{figure}


<<>>=
Model2 <- FLXMRglmfix(family = "poisson", fixed = ~ kid5 + mar + ment)
ff_2 <- flexmix(art ~ fem + phd, data = bioChemists, 
  cluster = posterior(ff_1), model = Model2)
c(BIC(ff_1), BIC(ff_2))
@ 

If the BIC is used for model comparison the smaller model including
only varying coefficients for the intercept and \code{fem} is
preferred. The coefficients of the fitted model can be obtained using
\code{refit()}:

<<>>=
summary(refit(ff_2))
@ 

It can be seen that the coefficient of \code{phd} does not differ
significantly from zero for either component and might therefore be
omitted. This again improves the BIC.

<<>>=
Model3 <- FLXMRglmfix(family = "poisson", fixed = ~ kid5 + mar + ment)
ff_3 <- flexmix(art ~ fem, data = bioChemists, cluster = posterior(ff_2),
  model = Model3)
c(BIC(ff_2), BIC(ff_3))
@ 

The coefficients of the restricted model without \code{phd} are given
in Figure~\ref{fig:coefficients-2}. 

\begin{figure}[t]
  \centering
<<fig=true, echo=false, results=hide, height=5.4, width=8.1>>=
print(plot(refit(ff_3), bycluster = FALSE, scales = list(x = list(relation = "free"))))
@ 
\caption{Coefficient estimates and confidence intervals for the model
  with varying and constant parameters where the variable \code{phd}
  is not used in the regression.}\label{fig:coefficients-2}
\end{figure}

An alternative model would be to assume that gender does not directly
influence the number of articles but has an impact on the segment
sizes.
<<>>=
Model4 <- FLXMRglmfix(family = "poisson", fixed = ~ kid5 + mar + ment)
ff_4 <- flexmix(art ~ 1, data = bioChemists, cluster = posterior(ff_2),
  concomitant = FLXPmultinom(~ fem), model = Model4)
parameters(ff_4)
summary(refit(ff_4), which = "concomitant")
BIC(ff_4)
@ 
This suggests that the proportion of women is lower in the second
component, which is the more productive segment.

The alternative modelling strategy where homogeneity is assumed at the
beginning and a varying intercept is added if overdispersion is
observed leads to the following model, which is the best with respect
to the BIC.
<<>>=
Model5 <- FLXMRglmfix(family = "poisson", fixed = ~ kid5 + ment + fem)
ff_5 <- flexmix(art ~ 1, data = bioChemists, cluster = posterior(ff_2),
  model = Model5)
BIC(ff_5)
@ 

\begin{figure}
  \centering
\setkeys{Gin}{width=0.8\textwidth}
<<fig=true, echo=false, results=hide, width=7>>=
pp <- predict(ff_5, newdata = data.frame(kid5 = 0, 
  mar = factor("Married", levels = c("Single", "Married")),
  fem = c("Men", "Women"),  ment = mean(bioChemists$ment)))
matplot(0:12, sapply(unlist(pp), function(x) dpois(0:12, x)), 
        type = "b", lty = 1, xlab = "Number of articles", ylab = "Probability")
legend("topright", paste("Comp.", rep(1:2, each = 2), ":",
  c("Men", "Women")), lty = 1, col = 1:4, pch = paste(1:4), bty = "n")
@ 
\caption{The estimated productivity for each component for men and women.}
\label{fig:estimated}
\end{figure}
\setkeys{Gin}{width=0.98\textwidth}

In Figure~\ref{fig:estimated} the estimated distribution of
productivity for model \code{ff\_5} is given separately for men and
women as well as for each component, where the mean value is used for
the other numeric variables and the most frequent category for the
categorical variables. The two components differ in that component 1
contains the students who publish no articles or only a single
article, while the students in component 2 write several articles on
average. With a constant coefficient for gender, women publish fewer
articles than men in both components.

This example shows that different optimal models are chosen for
different modelling procedures. However, the distributions induced by
the different variants of the model class may be similar and it is
therefore not surprising that they will then have similar BIC values.

%%-----------------------------------------------------------------------
%%-----------------------------------------------------------------------  
\section{Implementation}\label{sec:implementation}
The new features extend the available model class described in
\cite{mixtures:Leisch:2004} by providing infrastructure for
concomitant variable models and for fitting mixtures of GLMs with
varying and constant parameters for the component specific parameters.
The implementation of the extensions of the model class made it
necessary to define a better class structure for the component
specific models and to modify the fit functions \code{flexmix()} and
\code{FLXfit()}.

An overview on the \proglang{S}4 class structure of the package is
given in Figure~\ref{fig:class structure}. There is a class for
unfitted finite mixture distributions given by \code{"FLXdist"} which
contains a list of \code{"FLXM"} objects which determine the component
specific models, a list of \code{"FLXcomponent"} objects which specify
functions to determine the component specific log-likelihoods and
predictions and which contain the component specific parameters, and
an object of class \code{"FLXP"} which specifies the concomitant
variable model. Class \code{"flexmix"} extends \code{"FLXdist"}. It
represents a fitted finite mixture distribution and it contains the
information about the fitting with the EM algorithm in the object of
class \code{"FLXcontrol"}. Repeated fitting with the EM algorithm
with different numbers of components is provided by function
\code{stepFlexmix()} which returns an object of class
\code{"stepFlexmix"}. Objects of class \code{"stepFlexmix"} contain
the list of the fitted mixture models for each number of components in
the slot \code{"models"}.
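
The class relations described above can also be inspected
interactively using standard functionality from the \pkg{methods}
package, e.g.:
<<eval=false>>=
isVirtualClass("FLXM")
extends("flexmix", "FLXdist")
slotNames("FLXdist")
@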

\setkeys{Gin}{width=.9\textwidth}
\begin{figure}[t]
  \centering
\includegraphics{flexmix}
\caption{UML class diagram \citep[see][]{mixtures:Fowler:2004} of the
  \pkg{flexmix} package.}
  \label{fig:class structure}
\end{figure}
\setkeys{Gin}{width=\textwidth}

For the component specific model a virtual class \code{"FLXM"} is
introduced which (currently) has two subclasses: \code{"FLXMC"} for
model-based clustering and \code{"FLXMR"} for clusterwise regression,
where predictor variables are given. Additional slots have been
introduced to allow for data preprocessing and the construction of the
components was separated from the fit and is implemented using lexical
scoping \citep{mixtures:Gentleman+Ihaka:2000} in the slot
\code{defineComponent}. \code{"FLXMC"} has an additional slot
\code{dist} to specify the name of the distribution of the variable.
In the future functionality shall be provided for sampling from a
fitted or unfitted finite mixture. Using this slot observations can be
generated with the function which results from adding an \code{r} at
the beginning of the distribution name. This makes it possible to
implement only the (missing) random number generator functions and
otherwise use the same method for sampling from mixtures with
component specific models of class \code{"FLXMC"}.
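
This naming convention follows base \proglang{R}; e.g., for a
hypothetical component distribution with \code{dist} equal to
\code{"pois"} the corresponding sampling function could be obtained as
follows (an illustrative sketch only, the actual value of the
\code{dist} slot depends on the driver used):
<<eval=false>>=
dist <- "pois"
rdist <- get(paste("r", dist, sep = ""))
rdist(5, lambda = 1.5)
@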

For \code{flexmix()} and \code{FLXfit()} code blocks which are model
dependent have been identified and different methods implemented.
Finite mixtures of regressions with varying, nested and constant
parameters were a suitable model class for this identification task as
they are different from models previously implemented. The main
differences are:
\begin{itemize}
\item The number of components is related to the component specific
  model and the omission of small components during the EM algorithm
  impacts the model.
\item The parameters of the component specific models cannot be
  determined separately in the M-step and a joint model matrix is
  needed.
\end{itemize}
This also makes it necessary to have different model dependent methods
for \code{fitted()} which extracts the fitted values from a
\code{"flexmix"} object, \code{predict()} which predicts new values
for a \code{"flexmix"} object and \code{refit()} which refits an
estimated model to obtain additional information for a
\code{"flexmix"} object.
%%-----------------------------------------------------------------------  
\subsection{Component specific models with varying and constant
  parameters}\label{sec:comp-models-with}
A new M-step driver is provided which fits finite mixtures of GLMs
with constant and nested varying parameters for the coefficients and
the dispersion parameters. The class \code{"FLXMRglmfix"} returned by
the driver \code{FLXMRglmfix()} has the following additional slots
with respect to \code{"FLXMRglm"}:
\begin{description}
\item[\code{design}:] An incidence matrix indicating which columns of
  the model matrix are used for which component,
  i.e.~$\mathbf{D}=(\mathbf{1}_K,\mathbf{J}, \mathbf{I}_K)$.
\item[\code{nestedformula}:] An object of class \code{"FLXnested"}
  containing the formula for the nested regression coefficients and
  the number of components $K_c$ in each group $c \in C$.
\item[\code{fixed}:] The formula for the constant regression
  coefficients.
\item[\code{variance}:] A logical indicating if different variances
  shall be estimated for the components following a Gaussian
  distribution or a vector specifying the nested structure for
  estimating these variances.
\end{description}
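
These slots can be inspected for a given driver object; e.g., for a
driver similar to the one used for the beta-blocker data in
Section~\ref{sec:beta-blockers}:
<<eval=false>>=
Model <- FLXMRglmfix(family = "binomial", fixed = ~ Treatment)
Model@fixed
Model@variance
@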
The difference between estimating finite mixtures including only
varying parameters using models specified with \code{FLXMRglm()} and
those with varying and constant parameters using function
\code{FLXMRglmfix()} is hidden from the user, as only the specified
model is different. The fitted model is also of class \code{"flexmix"}
and can be analyzed using the same functions as for any model fitted
using package \pkg{flexmix}. The methods used are the same except if
the slot containing the model is accessed and method dispatching is
made via the model class. New methods are provided for models of class
\code{"FLXMRglmfix"} for functions \code{refit()}, \code{fitted()} and
\code{predict()} which can be used for analyzing the fitted model.

The implementation allows repeated measurements by specifying a
grouping variable in the formula argument of \code{flexmix()}.
Furthermore, it has to be noted that the model matrix is determined
by updating the formula of the varying parameters successively with
the formula of the constant and then of the nested varying parameters.
This ensures that if a mixture distribution is fitted for the
intercept, the model matrix of a categorical variable includes only
the remaining columns for the constant parameters to have full column
rank.  However, this updating scheme makes it impossible to estimate a
constant intercept while allowing varying parameters for a categorical
variable.
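
The successive updating can be mimicked with \code{update()} from
package \pkg{stats}; the following line only illustrates the
mechanism, not the actual internal code, combining a varying intercept
with a constant coefficient for treatment as in
Section~\ref{sec:beta-blockers}:
<<eval=false>>=
update(cbind(Deaths, Total - Deaths) ~ 1, . ~ . + Treatment)
@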

For this model one large model matrix is constructed where the
observations are repeated $K$ times and suitable columns of zeros are
added. The coefficients of all $K$ components are determined
simultaneously in the M-step, whereas if only varying parameters are
specified the likelihood is maximized separately for each component.
For large datasets the estimation of a combination of constant and
varying parameters might therefore be more challenging than the
estimation of only varying parameters.
%% -----------------------------------------------------------------------  
\subsection{Concomitant variable models}\label{sec:conc-vari-models}
For representing concomitant variable models the class \code{"FLXP"}
is defined. It specifies how the concomitant variable model is fitted
using the concomitant variable model matrix as predictor variables and
the current a-posteriori probability estimates as response variables.
The object has the following slots:
\begin{description}
\item[\code{fit}:] A \code{function (x, y, ...)} returning the fitted
  values for the component weights during the EM algorithm.
\item[\code{refit}:] A \code{function (x, y, ...)} used for refitting
  the model.
\item[\code{df}:] A \code{function (x, k, ...)} returning the degrees
  of freedom used for estimating the concomitant variable model given
  the model matrix \code{x} and the number of components \code{k}.
\item[\code{x}:] A matrix containing the model matrix of the
  concomitant variables.
\item[\code{formula}:] The formula for determining the model matrix
  \code{x}.
\item[\code{name}:] A character string describing the model, which is
  only used for print output.
\end{description}
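
To illustrate these slots, the following sketch outlines how a
hypothetical driver for constant component weights (similar in spirit
to \code{FLXPconstant()}, but not the actual implementation; further
slots such as \code{df} are omitted) could fill the \code{fit} slot:
<<eval=false>>=
myFLXPconstant <- function() {
  z <- new("FLXP", name = "myFLXPconstant", formula = ~ 1)
  ## average the current posterior probabilities to obtain
  ## component weights which are constant over all observations
  z@fit <- function(x, y, ...)
    matrix(colMeans(y), nrow = nrow(y), ncol = ncol(y), byrow = TRUE)
  z
}
@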
Two constructor functions for concomitant variable models are provided
at the moment. \code{FLXPconstant()} is for constant component weights
without concomitant variables and for multinomial logit models
\code{FLXPmultinom()} can be used. \code{FLXPmultinom()} has its own
class \code{"FLXPmultinom"} which extends \code{"FLXP"} and has an
additional slot \code{coef} for the fitted coefficients. The
multinomial logit models are fitted using package \pkg{nnet}
\citep{mixtures:Venables+Ripley:2002}.
%%-----------------------------------------------------------------------
\subsection{Further changes}
The estimation of the model with the EM algorithm was improved by
adapting the variants to correspond to the CEM and SEM variants as
outlined in the literature. To make this more explicit it is now also
possible to use \code{"CEM"} or \code{"SEM"} to specify an EM variant
in the \code{classify} argument of the \code{"FLXcontrol"}
object. Even though the SEM algorithm can in general not be expected
to converge, the fitting procedure is also terminated for the SEM
algorithm if the change in the relative log-likelihood is smaller than
the pre-specified threshold. This is motivated by the fact that for
well separated clusters the posteriors might converge to an indicator
function with all weight concentrated in one component.  The fitted
model with the maximum likelihood encountered during the SEM algorithm
is returned.

For discrete data, multiple observations with the same values
typically occur in a dataset. A \code{weights} argument was added to
the fitting function \code{flexmix()} in order to avoid repeating
these observations in the provided dataset. The weights are specified
through a \code{formula} in order to allow selecting a column of the
data frame given in the \code{data} argument. Avoiding the replication
of identical observations enables more efficient memory use in these
applications. This possibility is especially useful in the context of
model-based clustering for mixtures of Poisson distributions or latent
class analysis with multivariate binary observations.
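
As an illustration, assuming a hypothetical data frame
\code{aggregated} which contains each distinct observation once
together with its frequency in column \code{freq}, the call might look
like:
<<eval=false>>=
flexmix(count ~ 1, weights = ~ freq, data = aggregated, k = 2,
        model = FLXMRglm(family = "poisson"))
@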

In order to be able to apply different initialization strategies, such
as first running several different random initializations with CEM and
then switching to ordinary EM using the best solution found by CEM for
initialization, a \code{posterior()} function was
implemented. \code{posterior()} also takes a \code{newdata} argument
and hence, it is possible to apply subset strategies for large
datasets as suggested in
\cite{mixtures:Wehrens+Buydens+Fraley:2004}. The returned matrix of
the posterior probabilities can be used to specify the \code{cluster}
argument for \code{flexmix()} and the posteriors are then used as
weights in the first M-step.
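
Such a strategy might be sketched as follows for hypothetical data
\code{d} (variable names are illustrative only):
<<eval=false>>=
## several short runs of CEM from random initializations
best_cem <- stepFlexmix(y ~ x, data = d, k = 3, nrep = 5,
  control = list(classify = "CEM", iter.max = 10))
## ordinary EM initialized in the best CEM solution
m <- flexmix(y ~ x, data = d, k = 3, cluster = posterior(best_cem))
@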

The default plot methods now use trellis graphics as implemented in
package \pkg{lattice} \citep{mixtures:Sarkar:2008}. Users familiar
with the syntax of these graphics and with the plotting and printing
arguments will find the application intuitive, as many plotting
arguments are passed to functions from \pkg{lattice} such as
\code{xyplot()} and \code{histogram()}. In fact only new panel,
pre-panel and group-panel functions were implemented.  The returned
object is of class \code{"trellis"} and the show method for this class
is used to create the plot.

Function \code{refit()} was modified and has now two different
estimation methods: \code{"optim"} and \code{"mstep"}. The default
method \code{"optim"} determines the variance-covariance matrix of the
parameters from the inverse Hessian of the full log-likelihood.  The
general purpose optimizer \code{optim()} is used to maximize the
log-likelihood and initialized in the solution obtained with the EM
algorithm. For mixtures of GLMs there are also functions implemented
to determine the gradient which can be used to speed up
convergence. 

The second method \code{"mstep"} is only a raw approximation. It
performs an M-step where the a-posteriori probabilities are treated as
given instead of estimated and returns for the component specific
models nearly complete \code{"glm"} objects which can be further
analyzed. The advantage of this method is that the return value is
basically a list of standard \code{"glm"} objects, such that the
regular methods for this class can be used.
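
Assuming the argument for selecting the estimation method is called
\code{method}, the approximate refit for the model from
Section~\ref{sec:beta-blockers} might be obtained with:
<<eval=false>>=
refit(betaMix, method = "mstep")
@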

%%-----------------------------------------------------------------------
%%-----------------------------------------------------------------------  
\section{Writing your own drivers}\label{sec:writing-your-own}
Two examples are given in the following to demonstrate how new drivers
can be provided for concomitant variable models and for component
specific models. Easy extensibility is one of the main implementation
aims of the package and it can be seen that writing new drivers
requires only a few lines of code for providing the constructor
functions which include the fit functions.
%%-----------------------------------------------------------------------  
\subsection{Component specific models: Zero-inflated
  models}\label{sec:component-models}
\lstset{frame=trbl,basicstyle=\small\tt,stepnumber=5,numbers=left}

In Poisson or binomial regression models it is often observed that
the number of zeros is higher than expected under the model. A mixture
with two components where one has mean zero can be used to model such
data. These models are also referred to as zero-inflated models
\citep[see for example][]{mixtures:Boehning+Dietz+Schlattmann:1999}.
A generalization of this model class would be to fit mixtures with
more than two components where one component has a mean fixed at
zero. So this model class is a special case of a mixture of
generalized linear models where (a) the family is restricted to
Poisson and binomial and (b) the parameters of one component are
fixed. For simplicity the implementation assumes that the component
with mean zero is the first component. In addition we assume that the
model matrix contains an intercept; to have the first component absorb
the excess zeros the coefficient of the intercept is set to $-\infty$
and all other coefficients are set to zero.

Hence, to implement this model using package \pkg{flexmix} an
appropriate model class is needed with a corresponding convenience
function for construction. During the fitting of the EM algorithm
using \code{flexmix()} different methods for this model class are
needed when determining the model matrix (to check the presence of an
intercept), to check the model after a component is removed and for
the M-step to account for the fact that the coefficients of the first
component are fixed. For all other methods those available for
\code{"FLXMRglm"} can be re-used. The code is given in
Figure~\ref{fig:ziglm.R}.

\begin{figure}
  \centering
  \begin{minipage}{0.98\textwidth}
    \lstinputlisting{ziglm.R}
  \end{minipage}
  \caption{Driver for a zero-inflated component specific model.}
  \label{fig:ziglm.R}
\end{figure}

The model class \code{"FLXMRziglm"} is defined as extending
\code{"FLXMRglm"} in order to be able to inherit methods from this
model class. For construction of a \code{"FLXMRziglm"} object the
convenience function \code{FLXMRziglm()} is used, which calls
\code{FLXMRglm()}. The only differences are that the family is
restricted to binomial or Poisson, that a different name is assigned
and that an object of the correct class is returned.

The presence of the intercept in the model matrix is checked in
\code{FLXgetModelmatrix()} after using the method available for
\code{"FLXMRglm"} models as indicated by the call to
\code{callNextMethod()}. During the EM algorithm
\code{FLXremoveComponent()} is called if one component is removed. For
this model class it checks if the first component has been removed and
if this is the case the model class is changed to \code{"FLXMRglm"}.

In the M-step the coefficients of the first component are fixed and
not estimated, while for the remaining components the M-step of
\code{"FLXMRglm"} objects can be used. During the EM algorithm
\code{FLXmstep()} is called to perform the M-step and returns a list
of \code{"FLXcomponent"} objects with the fitted parameters. A new
method for this function is needed for \code{"FLXMRziglm"} objects in
order to account for the fixed coefficients in the first component,
i.e.~for the first component the \code{"FLXcomponent"} object is
constructed and concatenated with the list of \code{"FLXcomponent"}
objects returned by using the \code{FLXmstep()} method for
\code{"FLXMRglm"} models for the remaining components.

Similar modifications are necessary in order to be able to use
\code{refit()} for this model class. The code for implementing the
\code{refit()} method using \code{optim()} for \code{"FLXMRziglm"} is
not shown, but can be inspected in the source code of the package.
                                                                           
\subsubsection{Example: Using the driver}

This new M-step driver can be used to fit a zero-inflated Poisson
model to the data given in
\cite{mixtures:Boehning+Dietz+Schlattmann:1999}. The dataset
\code{dmft} consists of count data from a dental epidemiological study
for evaluation of various programs for reducing caries collected among
school children from an urban area of Belo Horizonte (Brazil). The
variables included are the number of decayed, missing or filled teeth
(DMFT index) at the beginning and at the end of the observation
period, the gender, the ethnic background and the specific treatment
for \Sexpr{nrow(dmft)} children.

The model can be fitted with the new driver function using the
following commands:
<<>>=
data("dmft", package = "flexmix")
Model <- FLXMRziglm(family = "poisson")
Fitted <- flexmix(End ~ log(Begin + 0.5) + Gender + Ethnic + Treatment, 
  model = Model, k = 2, data = dmft, control = list(minprior = 0.01))
summary(refit(Fitted))
@

Please note that \cite{mixtures:Boehning+Dietz+Schlattmann:1999} added
the predictor \code{log(Begin + 0.5)} to serve as an offset in order
to be able to analyse the improvement in the DMFT index from the
beginning to the end of the study. The linear predictor with the
offset subtracted is intended to be an estimate for
$\log(\mathbb{E}(\textrm{End})) - \log(\mathbb{E}(\textrm{Begin}))$.
This is justified by the fact that for a Poisson distributed variable
$Y$ with mean between 1 and 10 it holds that $\mathbb{E}(\log(Y +
0.5))$ is approximately equal to $\log(\mathbb{E}(Y))$.
$\log(\textrm{Begin} + 0.5)$ can therefore be seen as an estimate for
$\log(\mathbb{E}(\textrm{Begin}))$.
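The reasoning can be summarized as follows (a sketch assuming the
coefficient of $\log(\textrm{Begin} + 0.5)$ is close to one, so that
this predictor in fact acts as an offset; $x^\top\beta$ collects the
remaining terms of the linear predictor):
\[
  \log(\mathbb{E}(\textrm{End})) = \log(\textrm{Begin} + 0.5) +
  x^\top\beta \approx \log(\mathbb{E}(\textrm{Begin})) + x^\top\beta,
\]
so that $x^\top\beta$ approximates $\log(\mathbb{E}(\textrm{End})) -
\log(\mathbb{E}(\textrm{Begin}))$.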

The estimated coefficients with corresponding confidence intervals are
given in Figure~\ref{fig:dmft}. As the coefficients of the first
component are restricted a-priori to minus infinity for the intercept
and to zero for the other variables, they are of no interest and only
the second component is plotted.  The box ratio can be modified as for
\code{barchart()} in package \pkg{lattice}. The code to produce this
plot is given by:
<<refit, eval=false>>=
print(plot(refit(Fitted), components = 2, box.ratio = 3))
@ 

\begin{figure}
  \centering
\setkeys{Gin}{width=0.9\textwidth}
<<fig=true, echo=false, results=hide, width=7>>=
<<refit>>
@ 
\caption{The estimated coefficients of the zero-inflated model for the
  \code{dmft} dataset. The first component is not plotted as this
  component captures the inflated zeros and its coefficients are fixed
  a-priori.}
  \label{fig:dmft}
\end{figure}
%%-----------------------------------------------------------------------  
\subsection{Concomitant variable models}\label{sec:concomitant-models}
If the concomitant variable is a categorical variable, the multinomial
logit model is equivalent to a model where the component weights for
each level of the concomitant variable are determined by the mean
values of the a-posteriori probabilities. The driver which implements
this \code{"FLXP"} model is given in Figure~\ref{fig:myConcomitant.R}.
For the driver a name and a \code{fit()} function have to be
specified. In the \code{fit()} function the mean posterior probability
for all observations with the same covariate points is determined,
assigned to the corresponding observations and the full new
a-posteriori probability matrix returned. By contrast \code{refit()}
only returns the new a-posteriori probability matrix for the number of
unique covariate points.
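Formally, for each level $v$ of the categorical concomitant variable
the fitted component weights are obtained by averaging the
a-posteriori probabilities over all observations with this level (a
sketch of the notation; $\hat{p}_{ik}$ denotes the a-posteriori
probability of observation $i$ for component $k$):
\[
  \hat{\pi}_k(v) = \frac{1}{\#\{i : v_i = v\}}
  \sum_{i : v_i = v} \hat{p}_{ik}.
\]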

\lstset{frame=trbl,basicstyle=\small\tt,stepnumber=5,numbers=left}
                                                                            
\begin{figure}
  \centering
  \begin{minipage}{0.98\textwidth}
    \lstinputlisting{myConcomitant.R}
  \end{minipage}
  \caption{Driver for a concomitant variable model where the component
    weights are determined by averaging over the a-posteriori
    probabilities for each level of the concomitant variable.}
  \label{fig:myConcomitant.R}
\end{figure}

\subsubsection{Example: Using the driver}

If the concomitant variable model returned by \code{myConcomitant()}
is used for the artificial example in
Section~\ref{sec:using-new-funct} the same fitted model is returned as
if a multinomial logit model is specified. An advantage is that in
this case no problems occur if the fitted probabilities are close to
zero or one.

<<results=hide>>=
Concomitant <- FLXPmultinom(~ yb)
MyConcomitant <- myConcomitant(~ yb)
set.seed(1234)
m2 <- flexmix(. ~ x, data = NPreg, k = 2, model = list(Model_n, Model_p), 
  concomitant = Concomitant)
m3 <- flexmix(. ~ x, data = NPreg, k = 2, model = list(Model_n, Model_p), 
  cluster = posterior(m2), concomitant = MyConcomitant)
@ 
<<>>=
summary(m2)
summary(m3)
@

For comparing the estimated component weights for each value of $\mathit{yb}$
the following function can be used:
<<>>=
determinePrior <- function(object) {
  object@concomitant@fit(object@concomitant@x, 
    posterior(object))[!duplicated(object@concomitant@x), ]
}
@ 

<<>>=
determinePrior(m2)
determinePrior(m3)
@
The fitted values of the two models obviously coincide.
%%-----------------------------------------------------------------------
%%-----------------------------------------------------------------------  
\section{Summary and outlook}\label{sec:summary-outlook}
Package \pkg{flexmix} was extended to cover finite mixtures of GLMs
with (nested) varying and constant parameters. This allows for example the
estimation of varying intercept models. In order to be
able to characterize the components given some variables concomitant
variable models can be estimated for the component weights.

The implementation of these extensions has triggered some
modifications in the class structure and in the fit functions
\code{flexmix()} and \code{FLXfit()}. For certain steps, as e.g.~the
M-step, methods which depend on the component specific models are
defined in order to enable the estimation of finite mixtures of GLMs
with only varying parameters and those with (nested) varying and
constant parameters with the same fit function. The flexibility of
this modified implementation is demonstrated by illustrating how a
driver for zero-inflated models can be defined.

In the future diagnostic tools based on resampling methods shall be
implemented as bootstrap results can give valuable insights into the
model fit \citep{mixtures:Gruen+Leisch:2004}. A function which
conveniently allows testing linear hypotheses about the parameters
using the variance-covariance matrix returned by \code{refit()} would
be a further valuable diagnostic tool.

The implementation of zero-inflated Poisson and binomial regression
models is a first step towards relaxing the assumption that all
component specific distributions are from the same parametric family.
As mixtures with components which follow distributions from different
parametric families can be useful for example to model outliers
\citep{mixtures:Dasgupta+Raftery:1998,mixtures:Leisch:2008}, it is intended to also make
this functionality available in \pkg{flexmix} in the future.


%%-----------------------------------------------------------------------
%%-----------------------------------------------------------------------  
\section*{Computational details}
<<echo=false,results=hide>>=
SI <- sessionInfo()
pkgs <- paste(sapply(c(SI$otherPkgs, SI$loadedOnly), function(x) 
                     paste("\\\\pkg{", x$Package, "} ", 
                           x$Version, sep = "")), collapse = ", ")
@ 

All computations and graphics in this paper have been done using
\proglang{R} version \Sexpr{getRversion()} with the packages
\Sexpr{pkgs}. 
%%-----------------------------------------------------------------------
%%-----------------------------------------------------------------------  
\section*{Acknowledgments}

This research was supported by the Austrian Science Foundation
(FWF) under grants P17382 and T351. Thanks also to Achim Zeileis for
helpful discussions on implementation details and an anonymous
referee for asking a good question about parameter significance which
initiated the new version of function \code{refit()}.
%%-----------------------------------------------------------------------
%%-----------------------------------------------------------------------  
\bibliography{mixture}
%%-----------------------------------------------------------------------
%%-----------------------------------------------------------------------  
\end{document}