SNS.Rnw

```
\documentclass[nojss]{jss}
%\documentclass[codesnippet]{jss}
%\documentclass{jss}
\usepackage[latin1]{inputenc}
%\pdfoutput=1
\usepackage{amsmath}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{graphicx}
\usepackage{color}
%\usepackage[toc]{appendix}
\usepackage{amsthm}
%\VignetteIndexEntry{Stochastic Newton Sampler: R Package sns}
%\VignetteKeyword{negative definiteness, regression, optimization, sampling}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Definitions
% Vectors, Tensors
\def\vect#1{{\vec{#1}}} % Fancy vector
\def\tensor#1{{\mathbf{#1}}} % 2nd rank tensor
\def\mat#1{{\mathbf{#1}}} % matrix
\def\dotP#1#2{\vect{#1}\cdot\vect{#2}} % Dot product
% Derivatives
\def\deriv#1#2{\frac{d{}#1}{d{}#2}} % derivtive
\def\partderiv#1#2{\frac{\partial{}#1}{\partial{}#2}} % partial derivtive
% Math functions
\def\log#1{\text{log}\left(#1\right)}
% Statistics
\def\prob#1{\Prob\left(#1\right)} % probability
\newcommand{\bbeta}{\boldsymbol\beta}
\newcommand{\ggamma}{\boldsymbol\gamma}
\newcommand{\xx}{\mathbf{x}}
\newcommand{\yy}{\mathbf{y}}
\newcommand{\XX}{\mathbf{X}}
\newcommand{\llog}{\mathrm{log}}
\newcommand{\sigmamax}{\sigma_{\mathrm{max}}}
\newcommand{\dd}{\mathrm{d}}
\newtheorem{lemma}{Lemma}
\newcommand{\g}{\mathbf{g}}
\newcommand{\G}{\mathbf{G}}
\newcommand{\R}{\mathbb{R}}
\newcommand{\HH}{\mathbf{H}}
\newcommand{\hh}{\mathbf{h}}
\newcommand{\ttheta}{\boldsymbol\theta}
\newcommand{\eeta}{\boldsymbol\eta}
\newcommand{\TT}{\mathbf{T}}
\newcommand{\pp}{\mathbf{p}}
\newcommand{\qq}{\mathbf{q}}
\newcommand{\PP}{\mathrm{P}}
\newcommand{\zz}{\mathbf{z}}
\newcommand{\SSigma}{\boldsymbol\Sigma}
\newcommand{\mmu}{\boldsymbol\mu}
\newcommand{\N}{\mathcal{N}}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% declarations for jss.cls %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% almost as usual
\author{Alireza S. Mahani\\ Scientific Computing \\ Sentrana Inc. \And
Asad Hasan\\ Scientific Computing \\ Sentrana Inc. \AND
Marshall Jiang\\ Department of Mathematics \\ Cornell University \And
Mansour T.A. Sharabiani\\ National Heart and Lung Institute \\ Imperial College London}
\title{Stochastic Newton Sampler: \proglang{R} Package \pkg{sns}}
%% for pretty printing and a nice hypersummary also set:
\Plainauthor{Alireza S. Mahani, Mansour T.A. Sharabiani} %% comma-separated
\Plaintitle{Stochastic Newton Sampler: R Package sns} %% without formatting
\Shorttitle{Stochastic Newton Sampler: \proglang{R} Package \pkg{sns}} %% a short title (if necessary)
%% an abstract and keywords
\Abstract{
The \proglang{R} package \pkg{sns} implements Stochastic Newton Sampler (SNS), a Metropolis-Hastings Monte Carlo Markov Chain algorithm where the proposal density function is a multivariate Gaussian based on a local, second-order Taylor series expansion of log-density. The mean of the proposal function is the full Newton step in Newton-Raphson optimization algorithm. Taking advantage of the local, multivariate geometry captured in log-density Hessian allows SNS to be more efficient than univariate samplers, approaching independent sampling as the density function increasingly resembles a multivariate Gaussian. SNS requires the log-density Hessian to be negative-definite everywhere in order to construct a valid proposal function. This property holds, or can be easily checked, for many GLM-like models. When initial point is far from density peak, running SNS in non-stochastic mode, i.e., taking the Newton step with line search, allows the MCMC chain to converge to high-density areas faster. For high-dimensional problems, partitioning of state space into lower-dimensional subsets, and applying SNS to the subsets within a Gibbs sampling framework can significantly improve the mixing of SNS chains. In addition to the above strategies for improving convergence and mixing, \pkg{sns} offers a rich set of diagnostics and visualization capabilities, as well as a function for sample-based calculation of Bayesian predictive posterior distributions.
}
\Keywords{monte carlo markov chain, metropolis-hastings, newton-raphson optimization, negative-definite hessian, log-concavity}
\Plainkeywords{monte carlo markov chain, metropolis-hastings, newton-raphson optimization, negative-definite hessian, log-concavity} %% without formatting
%% at least one keyword must be supplied
%% publication information
%% NOTE: Typically, this can be left commented and will be filled out by the technical editor
%% \Volume{50}
%% \Issue{9}
%% \Month{June}
%% \Year{2012}
%% \Submitdate{2012-06-04}
%% \Acceptdate{2012-06-04}
%% The address of (at least) one author should be given
%% in the following format:
\Address{
Alireza S. Mahani\\
Scientific Computing Group\\
Sentrana Inc.\\
1725 I St NW\\
Washington, DC 20006\\
E-mail: \email{alireza.mahani@sentrana.com}\\
}
\begin{document}
\SweaveOpts{concordance=TRUE}
%%\SweaveOpts{concordance=TRUE}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
<<echo=FALSE, results=hide>>=
options(prompt = "R> ", continue = "+ ", width = 70, useFancyQuotes = FALSE)
@
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Introduction}\label{section-introduction}
In most real-world applications of Monte Carlo Markov Chain (MCMC) sampling, the probability density function (PDF) being sampled is multidimensional. Univariate samplers such as Slice Sampler~\citep{neal2003slice} and Adaptive Rejection Sampler~\citep{gilks1992adaptive} can be embedded in the Gibbs sampling framework~\citep{geman1984stochastic} to sample from multivariate PDFs~\citep{mahani2014mfusampler}. Univariate samplers generally have few tuning parameters, making them ideal candidates for black-box MCMC software such as JAGS~\citep{plummer-jags} and OpenBUGS~\citep{thomas2006making}. However, they become less effective as PDF dimensionality rises and dimensions become more correlated~\citep{girolami2011riemann}. Therefore, development - and software implementation - of efficient, black-box multivariate MCMC algorithms is of great importance to widespread application of probabilistic models in statistics and machine learning.
The \proglang{R} package \pkg{sns} implements Stochastic Newton Sampler (SNS), a Metropolis-Hastings MCMC algorithm~\citep{hastings1970monte}, where the proposal distribution is a locally-fitted multivariate Gaussian resulting from second-order Taylor series expansion of the log-density. In its current implementation, SNS requires the log-density to be twice-diffenrentiable and globally concave, or equivalently that its Hessian matrix be negative-definite everywhere. For many Generalized Linear Models (GLMs) these conditions are satisfied~\citep{gilks1992adaptive}, and the invariance theorem of \cite{mahani2015expander} allows Hessian negative-definiteness to be studied and proven in the much lower-dimensional space of base distributions, rather than the high-dimensional space of regression coefficients.
SNS has appeared in the literature under several variations and labels. \cite{gamerman1997sampling} extend Iterative Reweighted Least Squares (IRLS) - the primary estimation technique for GLM models - to MCMC sampling by adding a Metropolis-Hastings step to it. Given that IRLS is a close cousin of Newton-Raphson optimization, their method can be considered a specialization of SNS for GLM models. \cite{qi2002hessian} present what they call `Hessian-based Metropolis-Hastings' (HMH), which is nearly identical to SNS, but they do not address the high-dimensional mixing problem, nor do they provide an open-source software implementation. More recently, the simplified manifold Metropolis adjusted Langevin Algorithm (MMALA) of \cite{girolami2011riemann} is very similar to SNS with the addition a tunable step size, or learning rate. The software accompanying their paper is written in \proglang{MATLAB}~\citep{MATLAB:2014}. The \proglang{R} package \pkg{sns}, to our knowledge, is the first open-source implemnentation of the SNS algorithm, including extensions for improving convergence (\code{rnd} and\code{nnr} arguments) and mixing (\code{part} argument), diagnostic and visualization methods (\code{summary.sns} and \code{plot.sns}), and sample-based prediction (\code{predict.sns}).
The paper is organized as follows. In Section~\ref{section-theory}, we review the theoretical background for SNS, including an overview of Metropolis-Hastings algorithms, followed by the multivariate Gaussian proposal PDF used in SNS. In Section~\ref{section-implementation}, we discuss the implementation of SNS algorithm in the \pkg{sns} package. Section~\ref{section-using} offers several examples to illustrate the usage of \pkg{sns}. Finally, Section~\ref{section-summary} provides a summary and pointers for future extensions to the software as well as potential research directions.
\section{Theory}\label{section-theory}
We begin with a brief overview of the Metropolis-Hastings algorithm.
\subsection{Metropolis-Hastings algorithm}\label{subsection-mh}
In Metropolis-Hastings (MH) MCMC sampling of the PDF, $p(\zz)$, we generate a sample $\zz^*$ from the proposal density function $q(\zz | \zz^\tau)$, where $\zz^\tau$ is the current state. We then accept the proposed state $\zz^*$ with probability $A(\zz^*, \zz^\tau)$, where:
\begin{equation}
A(\zz^*, \zz^\tau) = \mathrm{min} \left( 1 \,,\, \frac{p(\zz^*) q(\zz^\tau | \zz^*)}{p(\zz^\tau) q(\zz^* | \zz^\tau)} \right)
\end{equation}
The MH transitions satisfy detailed balance:
\begin{equation}
\begin{array}{lcl}
p(\zz)q(\zz|\zz')A(\zz',\zz) &=& \mathrm{min}(p(\zz)q(\zz|\zz'),p(\zz')q(\zz'|\zz)) \\
&=& \mathrm{min}(p(\zz')q(\zz'|\zz),p(\zz)q(\zz|\zz')) \\
&=& p(\zz')q(\zz'|\zz)A(\zz,\zz')
\end{array}
\end{equation}
The detailed balance property ensures that $p(\zz)$ is invariant under MH transitions. For a discussion of ergodicity of MH algorithm, see~\cite{roberts1999convergence}.
\subsection{SNS proposal density}\label{subsection-proposal}
SNS proposal density is a multivariate Gaussian fitted locally to the density being sampled, using the second-order Taylor-series expansion of the log-density:
\begin{equation}\label{equation-taylor}
f(\xx) \approx f(\xx_0) + \g(\xx_0)^\top \, (\xx-\xx_0) + \frac{1}{2} (\xx-\xx_0)^\top \, \HH(\xx_0) \, (\xx-\xx_0)
\end{equation}
where $f:\R^K \rightarrow \R$, and $\g$ and $\HH$ are the gradient vector and Hessian matrix for $f$, respectively, of dimensions $K$ and $K \times K$. Assuming that $f$ is globally concave, the above approximation is equivalent to fitting the following multivariate Gaussian (which we refer to as $F(\xx)$) to the PDF:
\begin{equation}\label{equation-gauss}
F(\xx) = \frac{1}{(2\pi)^{K/2}|\SSigma|^{1/2}} \exp\left\{ -\frac{1}{2}(\xx-\mmu)^T \SSigma^{-1}(\xx-\mmu) \right\}
\end{equation}
By comparing Equations~\ref{equation-taylor} and \ref{equation-gauss}, we see that the precision matrix is the same as negative Hessian: $\SSigma^{-1}=-\HH(\xx_0)$. The mean of the fitted Gaussian maximizes its log, and therefore:
\begin{equation} \label{equation-newton-step}
\mmu = \xx_0 - \HH^{-1}(\xx_0) \, \g(\xx_0)
\end{equation}
We can now formally define the (multivariate Gaussian) proposal density $q(. \,|\, \xx)$ as:
\begin{equation} \label{equation-proposal}
q(. \,|\, \xx) = \N(\xx - \HH^{-1}(\xx) \, \g(\xx) \,,\, -\HH^{-1}(\xx))
\end{equation}
Note that Equation~\ref{equation-newton-step} is simply the full Newton step~\citep{nocedal2006book}. We can therefore think of SNS as the stochastic counterpart of Newton-Raphson (NR) optimization. In NR optimization, we select the mean of the fitted Gaussian as the next step, while in SNS we draw a sample from the fitted Gaussian and apply MH test to accept or reject it. Also, note that in the special case where the sampled PDF is Gaussian, $f(\xx)$ is quadratic and therefore the proposal function is identical to the sampled PDF. In this case $A(\zz',\zz)$ is always equal to $1$, implying an acceptance rate of $100\%$.
\section{Software implementation and features}\label{section-implementation}
\subsection{Overview}\label{subsection-sns-algorithm}
The workhorse of \pkg{sns} package is the \code{sns} function, responsible for implementation of MH algorithm using the multivariate Gaussian proposal density described in Section~\ref{subsection-proposal}. \code{sns} implements the following steps:
\begin{enumerate}
\item
Evaluate the log-density function and its gradient and Hessian at $\xx_\text{old}$: $f_\text{old},\g_\text{old},\HH_\text{old}$.
\item\label{step-fit-1}
Construct the multivariate Gaussian proposal function at $q(.|\xx_{old})$ using Equation~\ref{equation-proposal} and $\xx=\xx_{old}$.
\item
Draw a sample $\xx_{prop}$ from $q(.|\xx_{old})$, and evaluate $logq_{prop}=\llog(q(\xx_{prop}|\xx_{old}))$.
\item
Evaluate the log-density function and its gradient and Hessian at $\xx_{prop}$: $f_{prop},\g_{prop},\HH_{prop}$.
\item\label{step-fit-2}
Construct the multivariate Gaussian proposal function at $q(.|\xx_{prop})$ using Equation~\ref{equation-proposal} and $\xx=\xx_{prop}$, and evaluate $logq_{old}=\llog(q(\xx_{old}|\xx_{prop}))$.
\item
Calculate the ratio $r=\exp((f_{prop}-f_{old})+(logq_{old}-logq_{prop}))$.
\item
If $r \geq 1$ accept $\xx_{prop}$: $\xx_{new} \leftarrow \xx_{prop}$. Else, draw a random deviate $s$ from a uniform distribution over $[0,1)$. If $s<r$, then accept $\xx_{prop}$: $\xx_{new} \leftarrow \xx_{prop}$, else reject $\xx_{prop}$: $\xx_{new} \leftarrow \xx_{old}$.
\end{enumerate}
Fitting the multivariate Gaussian in steps~\ref{step-fit-1} and ~\ref{step-fit-2} is done via calls to the private function \code{fitGaussian}. We use the functions \code{dmvnorm} and \code{rmvnorm} from package \pkg{mvtnorm} to calculate the log-density of, and draw samples from, multivariate Gaussian proposal functions.
There are two important arguments in \code{sns}, namely \code{rnd} and \code{part}. The first argument, \code{rnd}, controls whether the algorithm should run in stochastic or MCMC mode (which is the default choice), or in non-stochastic or Newton-Raphson (NR) mode. The second argument, \code{part}, controls the state space partitioning strategy. These arguments and their roles are described in Section~\ref{subsection-convergence-mixing}.
\code{sns.run} is a wrapper around \code{sns}, and offers the following functionalities:
\begin{enumerate}
\item Convenience of generating multiple samples via repeated calls to \code{sns}. After the first call, the Gaussian fit object (attached as attribute \code{gfit} in the returned value from \code{sns}) is fed by \code{sns.run} back to \code{sns} via argument \code{gfit}, in order to avoid unnecessary fitting of the proposal function at current value.
\item Collecting diagnostic information such as log-probability (time series), acceptance rate, relative deviation from quadratic approximation (time series), and components of MH test. These diagnostic measures are discussed in Section~\ref{subsection-diagnostics}, and their use is illustrated via examples in Section~\ref{section-using}.
\end{enumerate}
The generic methods \code{summary.sns}, \code{plot.sns} and \code{predict.sns} provide diagnostic, visualization, and prediction capabilities, discussed in Sections~\ref{subsection-diagnostics} and \ref{subsection-prediction}, and illustrated via examples in Section~\ref{section-using}.
\subsection{Improving convergence and mixing}\label{subsection-convergence-mixing}
\textbf{NR mode:} Far from the distribution mode, the local multivariate Gaussian fit can be severely different from the PDF, leading to small overlap between the two, low acceptance rate and hence bad convergence. This can be overcome by spending the first few iterations in non-stochastic or NR mode, where instead of drawing from the proposal function we simply accept its mean as the next step. Rather than taking a full Newton step, we have implemented line search~\citep{nocedal2006optim} to ensure convergence to the PDF maximum. To use \code{sns} in NR mode, users can set the argument \code{rnd} to \code{FALSE}. In NR mode, each iteration is guaranteed to increase the log-density. Using the NR mode during the initial burn-in phase is illustrated in Section~\ref{section-using}. In \code{sns.run}, the argument \code{nnr} controls how many initial iterations to be performend in NR mode.
\textbf{State space partitioning:} Even when near the PDF maximum, the fitted Gaussian can be severely different from the PDF. This can happen if the PDF has a significant third derivative, a phenomenon that we have observed for high-dimensional problems, especially when the number of observations is small. To improve bad mixing in high dimensions, we use a strategy which we refer to as `state space partitioning', where state space is partitioned into disjoint subsets and SNS is applied within each subset, wrapped in Gibbs sampling. This functionality is available via the \code{part} argument, which is a list containing the set of state space dimensions belonging to each subset. Convenience functions \code{sns.make.part} and \code{sns.check.part} allow users to easily create partition lists and check their validity, respectively.
\subsection{Diagnostics}\label{subsection-diagnostics}
\pkg{sns} includes a rich set of diagnostics which can be accessed via functions \code{summary.sns} and \code{plot.sns}. Some of these are generic measures applicable to all MCMC chains, some are specific to MH-based MCMC algorithms, and some are even further specialized for SNS as a particular flavor of MH. Where possible, we have used the library \pkg{coda}, but opted to create and maintain an independent set of functions due to their specialized and extended nature.
\textbf{MCMC diagnostics:} In \code{summary.sns}, we calculate the usual MCMC chain summaries including mean, standard deviation, quantiles, and effective sample size. We also calculate a sample-based p-value for each coordinate. In \code{plot.sns} we have log-density trace plot, state vector trace plots, effective sample size by coordinate, state vector histograms, and state vector autocorrelation plots.
\textbf{MH diagnostics:} In \code{summary.sns}, we calculate the acceptance rate of MH transition proposals. If \code{mh.diag} flag is set to \code{TRUE}, all 4 components of the MH test (\code{log.p}, \code{log.p.prop}, \code{log.q} and \code{log.q.prop}) are returned as well.
\textbf{SNS diagnostics:} In \code{summary.sns}, we return \code{reldev.mean} (if \code{sns.run} was called with \code{mh.diag} set to \code{TRUE}), defined as the average relative deviation of log-density change (with respect to PDF maximum) from quadratic approximation (also constructed at PDF maximum). The location of PDF maximum is extracted from the Gaussian fit in the last iteration under NR mode. The higher this value, the more likely it is for the SNS to exhibit bad mixing. This is illustrated in Section~\ref{section-using}. For \code{reldev.mean} to be valid, the user must ensure that the value of the argument \code{nnr} supplied to \code{sns.run} is sufficiently high to ensure convergence at the end of NR phase.
\subsection{Full Bayesian Prediction}\label{subsection-prediction}
The function \code{predict.sns} allows for full Bayesian prediction, using a sample-based representation of predictive posterior distribution. It accepts an arbitrary function of state vector as argument \code{fpred}, and applies the function across all samples of state vector, supplied in the first argument, which must be an output of \code{sns.run}. The core philosophy in full Bayesian prediction is to postpone summarization of samples until the last step. For example, rather than supplying the expected values of coefficients into a function, we supply the samples and take the expected value after applying the function. Following this proper approach is important for several reasons:
\begin{enumerate}
\item Mathematically, an arbitrary function is not commutable with the expected value operator. Therefore, applying expected value early produces incorrect results.
\item For a similar reason, confidence intervals cannot be propagated through arbitrary functions. Therefore, correct uncertainty measurement also requires a full Bayesian approach.
\end{enumerate}
\section[]{Using \pkg{sns}}\label{section-using}
In this section, we illustrate how \pkg{sns} can be used via several examples. First, we launch an R session and load the \pkg{sns} package as well \pkg{mvtnorm} (used for evaluating the multivariate Gaussia log-density in example 1):
<<eval=TRUE>>=
library(sns)
library(mvtnorm)
@
\subsection{Example 1: Multivariate Gaussian}\label{subsection-example-gaussian}
Using \pkg{sns} to sample from a multivariate Gaussian is a contrived, but pedagogical, example. Since log-density for a multivariate Gaussian is quadratic, its second-order Taylor series expansion is not approximate but exact. In other words, the proposal function becomes location-independent, and equal to the sampled distribution. This means that 1) the MH test is always accepted, and 2) consecutive samples are completely independent, and hence the resulting chain is no longer Markovian. Of course, since we know how to sample from multivariate Gaussian proposal functions, we might as well directly sample from the multivariate Gaussian distribution. Hence, the pedagogical nature of this example. To utilize \pkg{sns}, we must first implement the log-density and its gradient and Hessian:
<<eval=TRUE>>=
logdensity.mvg <- function(x, mu, isigsq) {
f <- dmvnorm(x = as.numeric(x)
, mean = mu, sigma = solve(isigsq), log = TRUE)
g <- - isigsq %*% (x - mu)
h <- -isigsq
return (list(f = f, g = g, h = h))
}
@
We now draw 500 samples from this log-desity, using pre-specified values for \code{mu} (mean vector) and \code{isigsq} (inverse of the covariance matrix, or precision matrix) in a 3-dimensional state space:
<<>>=
K <- 3
mu <- runif(K, min = -0.5, max = +0.5)
isigsq <- matrix(runif(K*K, min = 0.1, max = 0.2), ncol = K)
isigsq <- 0.5*(isigsq + t(isigsq))
diag(isigsq) <- rep(0.5, K)
x.init <- rep(0.0, K)
x.smp <- sns.run(x.init, logdensity.mvg, niter = 500
, mh.diag = TRUE, mu = mu, isigsq = isigsq)
@
Next, we use the \code{summary.sns} function to view some of the diagnostics:
<<>>=
summary(x.smp)
@
As expected, the acceptance rate is 100\%, and there is no deviation from quadratic approximation, for SNS sampling of a multivariate Gaussian.
In real-world applications, the Gaussian proposal function is only an approximation for the sampled distribution (since log-density is not quadratic), creating the Markovian dependency and less-than-perfect acceptance rate. We study one such example next.
\subsection{Example 2: Bayesian Poisson regression}\label{subsection-example-poisson}
Generalized Linear Models (GLMs)~\citep{nelder1972generalized} are an important family of statistical models with applications such as risk analysis~\citep{sobehart2000moody}, public health~\citep{azar2011immunologic} and political science~\citep{gelman2007data}. GLMs can be extended to incorporate data sparseness and heterogeneity via the Hierarchical Bayesian framework~\citep{peter2005bayesian} or to account for repeated measurements and longitudinal data via Generalized Linear Mixed Model~\citep{mcculloch2006generalized}. With properly-chosen link functions, many GLMs are known - or can be easily proven - to have globally-concave log-densities with negative-definite Hessian matrices~\citep{gilks1992adaptive,mahani2015expander}. As such, GLMs are excellent candidates for SNS. Embedded in Bayesian frameworks, they continue to enjoy log-concavity assuming the same property holds for prior terms, according to the Bayes' rule and the invariance of concavity for adding functions.
In our second example, we illustrate how to apply \pkg{sns} to the log-likelihood of Poisson regression. As before, we start with constructing the log-density, using the expander framework of \pkg{RegressionFactory} package:
<<>>=
library(RegressionFactory)
loglike.poisson <- function(beta, X, y) {
regfac.expand.1par(beta, X = X, y = y
, fbase1 = fbase1.poisson.log)
}
@
Now we simulate data from the generative model:
<<>>=
K <- 5
N <- 1000
X <- matrix(runif(N * K, -0.5, +0.5), ncol = K)
beta <- runif(K, -0.5, +0.5)
y <- rpois(N, exp(X %*% beta))
@
For reference, we do a maximum-likelihood (ML) estimation of the coefficients using \code{glm} command:
<<>>=
beta.init <- rep(0.0, K)
beta.glm <- glm(y ~ X - 1, family = "poisson", start = beta.init)$coefficients
@
As mentioned before, \pkg{sns} can be run in non-stochastic mode, which is equivalent to Newton-Raphson optimization with line search. Results should be identical, or very close, to \code{glm} results:
<<>>=
beta.sns <- sns.run(beta.init, fghEval = loglike.poisson
, niter = 20, nnr = 20, X = X, y = y)
beta.nr <- beta.sns[20, ]
cbind(beta.glm, beta.nr)
@
The primary use-case for \pkg{sns} is not ML estimation, but rather MCMC sampling of the distribution. To do this, we perform the first few iterations in non-stochastic mode (20 iterations here), and then switch to stochastic mode for the remaining 180 iterations:
<<>>=
beta.smp <- sns.run(beta.init, loglike.poisson
, niter = 200, nnr = 20, mh.diag = TRUE, X = X, y = y)
@
Examining the log-probability trace plot (Figure~\ref{fig-poisson-lp}) shows an expected pattern: During non-stochastic phase (left of vertical line) log-probability rises steadily while approaching the peak. During MCMC sampling, on the other hand, PDF maximum forms an upper bound for the MCMC movements, and the chain occasionally visits low-probability areas. The plot is created using the following line:
<<label=lp_plot,include=FALSE,echo=TRUE>>=
plot(beta.smp, select = 1)
@
\begin{figure}
%\begin{center}
<<label=fig1,fig=TRUE,echo=FALSE>>=
<<lp_plot>>
@
%\end{center}
\caption{Log-probability trace plot for Poisson regression problem with $K=5$ and $N=1000$. Vertical line separates non-stochastic mode (left, 20 iterations) from stochastic mode (right, 180 iterations).}
\label{fig-poisson-lp}
\end{figure}
The \code{plot.sns} function offers 4 other types of plots, besides the log-probability trace plot. We refer the reader to the package documentation for details. Further diagnostic information can be accessed via the \code{summary.sns} function:
<<>>=
summary(beta.smp)
@
The \code{summary.sns} function discards the first half of the samples as burn-in by default, before calculating sample statistics and acceptance. This behavior can be controlled via the argument \code{nburnin}. Arguments \code{end} and \code{thin} have behavior similar to their counterparts in the \code{as.mcmc} function of \pkg{coda} package. We observe two numbers in the summary print-out: Firstly, acceptance rate is less than 100\%, contrary to the case with a multivariate Gaussian PDF (example 1). Secondly, the mean relative deviation from quadratic approximation (\code{reldev.mean}) is now non-zero, again reflecting non-Gaussianity of the poisson likelihood PDF. The number (<1\%) is still small, however, leading to high acceptance rate and good mixing of the chain.
Next, we want to predict the response variable in Poisson regression model, given new values of the explanatory variables. We distinguish between two types of prediction: 1) predicting mean response, 2) generating samples from posterior predictive distribution. We illustrate how to do both using the \code{predict.sns} function. We begin by implementing the mean prediction function, which simply applies the inverse link function (exponential here) to the linear predictor. (For better comparison between the two prediction modes, we increase number of samples to 1000)
<<eval=FALSE>>=
beta.smp <- sns.run(beta.init, loglike.poisson
, niter = 1000, nnr = 20, mh.diag = TRUE, X = X, y = y)
predmean.poisson <- function(beta, Xnew) exp(Xnew %*% beta)
@
The following single line performs sample-based prediction of mean response (using \code{X} in lieu of \code{Xnew} for code brevity):
<<eval=FALSE>>=
ymean.new <- predict(beta.smp, predmean.poisson
, nburnin = 100, Xnew = X)
@
\code{ynew} is a matrix of \code{N} (1000) rows and \code{niter - nburnin} (900) columns. Each row corresponds to an observation (one row of \code{Xnew}), and each column corresponds to a prediction sample (one row of \code{beta.smp} after burn-in).
We can also generate samples from posterior predictive distribution as follows:
<<eval=FALSE>>=
predsmp.poisson <- function(beta, Xnew)
rpois(nrow(Xnew), exp(Xnew %*% beta))
ysmp.new <- predict(beta.smp, predsmp.poisson
, nburnin = 100, Xnew = X)
@
Comparing prediction summaries is illuminating:
<<eval=FALSE>>=
summary(ymean.new)
summary(ysmp.new)
@
<<echo=FALSE>>=
load("summs.pred")
print(summ.ymean)
print(summ.ysmp)
@
In the limit of infinite samples, the mean predictions from the two methods will be equal, and they are quite close based on 900 samples above. However, standard deviation of predictions is much larger for \code{predsmp.poisson} compared to \code{predmean.poisson}, as the former combines the uncertainty of coefficient values (represented in the \code{sd} values for \code{beta}'s) with the uncertainty of samples from Poisson distribution around the mean, i.e. the \code{sd} of Poisson distribution. Also note that, as expected, quantiles for \code{predsmp.poisson} are discrete since the predictions are discrete, while the quantiles for \code{predmean.poisson} are continuous as the predictions are continuous in this case.
\subsection{Example 3: High-dimensional Bayesian Poisson regression}\label{subsection-example-highD}
Contrary to standard Metropolis variants with multivariate Gaussians centered on current point, SNS is an aggressive, non-local MCMC algorithm as it seeks to construct a global, Gaussian approximation of the PDF. Under favorable conditions, this can lead to uncorrelated chains and efficient sampling, with the extreme case of perfectly uncorrelated samples for a multivariate Gaussian distribution. An important pathological case arises when the state space dimensionality is high. Continuing with the Poisson regression example, we increase $K$ from 5 to 100, while holding $N=1000$. To illustrate that the problem is not covergence but mixing, we explicitly use the \code{glm} estimate (mode of PDF) as the initial value for the MCMC chain:
<<eval=FALSE>>=
K <- 100
X <- matrix(runif(N * K, -0.5, +0.5), ncol = K)
beta <- runif(K, -0.5, +0.5)
y <- rpois(N, exp(X %*% beta))
beta.init <- glm(y ~ X - 1, family = "poisson")$coefficients
beta.smp <- sns.run(beta.init, loglike.poisson
, niter = 100, nnr = 10, mh.diag = TRUE, X = X, y = y)
summary(beta.smp)
@
<<echo=FALSE>>=
load("summs")
print(summ.nopart)
@
We see a significant drop in acceptance rate as well as effective sample sizes for the coefficients. Also note that mean relative deviation from quadratic approximation is now nearly 10x larger than the value for $K=5$. To improve mixing, we use the `state space partitioning' strategy of \pkg{sns}, available via the \code{part} argument of \code{sns} and \code{sns.run}. This leads to SNS sampling of subsets of state space wrapped in Gibbs cycles, with each subset being potentially much lower-dimensional than the original, full space. This strategy can significantly improve mixing:
<<eval=FALSE>>=
beta.smp.part <- sns.run(beta.init, loglike.poisson
, niter = 100, nnr = 10, mh.diag = TRUE
, part = sns.make.part(K, 10), X = X, y = y)
summary(beta.smp.part)
@
<<echo=FALSE>>=
print(summ.part)
@
Notice the improved acceptance rate as well as effective sample sizes. A comparison of log-probability trace plots confirms better mixing after convergence to PDF mode (see Figure~\ref{fig-poisson-lp-sbs}).
<<eval=FALSE>>=
par(mfrow = c(1,2))
plot(beta.smp, select = 1)
plot(beta.smp.part, select = 1)
@
\begin{figure}[H]
\vspace{6pc}
\includegraphics[scale=0.75]{figs.pdf}
\caption[]{Comparison of log-probability trace plots for $N=1000$ and $K=100$, without (left) and with (right) state space partitioning using 10 subsets.}
\label{fig-poisson-lp-sbs}
\end{figure}
%\subsection{Example 4: Embedding SNS in Gibbs framework}\label{subsection-example-embed}
%For many posterior distributions, log-concavity does not hold, or cannot be proven, over the entire state space. Instead, this property may exist within a subset of the state space. In such cases, one can use SNS in conjunction with other, more generic samplers such as slice sampler, embedded in Gibbs cycles. We illustrate this use-case in our last example.
\section{Summary}\label{section-summary}
In this paper we presented \pkg{sns}, an \proglang{R} package for Stochastic Newton Sampling of twice-differentiable, log-concave PDFs, where a multivariate Gaussian resulting from second-order Taylor series expansion of log-density is used as proposal function in a Metropolis-Hastings framework. Using an initial non-stochastic mode, equivalent to Newton-Raphson optimization with line search, allows the chain to rapidly converge to high-density areas, while `state space partitioning', Gibbs sampling of full state space in lower-dimensional blocks, allows SNS to overcome mixing problems while sampling from high-dimensional PDFs. There are several opportunities for further research and development.
\textbf{Beyond twice-differentiability and log-concavity:} Current version of SNS requires the log-density to be twice-differentiable and concave. In many real-world application, the posterior PDF does not have a negative-definite Hessian, or it cannot be proven to have such a property. In such cases, SNS would not be applicable to the entire PDF. However, if one can identify blocks within the full Hessian that does enjoy such property, then SNS can be combined with other, more generic sampling algorithms such as slice sampler or HMC, all embedded in a Gibbs cycle. The implementation would be similar to that of state space partitioning approach in \pkg{sns}, but replacing SNS with alternative samplers for some subsets. An alternative approach is discussed in~\cite{geweke2001bayesian} in the context of state-space models, where non-concave cases are handled by utilizing exponential and uniform distributions. Convergence and mixing properties of such extensions to general cases must be carefully studied. An important situation where twice-differentiability is violated is in the presence of boundary conditions. The current SNS algorithm assumes unconstrained state space. One way to deal with constrained subspaces is, again, to mix and match SNS and other samplers within Gibbs framework. For example, the slice sampler algorithm implemented in \pkg{MfUSampler} is capable of dealing with boxed constraints. It can therefore be assigned to subspaces with constraints, and the unconstrained subspaces can be handled by SNS. Further research is needed in order to relax twice-differentiability and log-concavity requirements for SNS.
\textbf{Optimized state space partitioning:}
\textbf{Performance benchmarking:}
There are several opportunities for further research and development:
\begin{enumerate}
\item Mixing properties of SNS must be better studied. Preliminary, one-dimensional analysis has provided sufficient conditions for the chain not to diverge too much from the distribution mode, but extension to the real-world, high-dimensional case remains to be done. This understanding can in turn be used to better select the size of state space blocks in the partitioning strategy.
\item Currently, SNS requires strict global log-concavity for the PDF. Extending the algorithm to more general cases can improve its usefulness.
\item Currently, SNS cannot handle boundary conditions. Instead, they must be implemented via variable transformations, such as link functions in GLM models.
\item Impact of data centering and scaling on convergence and mixing of chains, not just for SNS but for MCMC algorithms in general, including Gibbs sampling.
\end{enumerate}
\bibliography{SNS}
\end{document}
```