##### https://github.com/cran/rstpm2
Tip revision: a7579fe
multistate.Rnw
% -*- mode: noweb; ess-noweb-default-code-mode: R-mode; -*-
%\VignetteIndexEntry{Predictions for Markov multi-state models}
%\VignetteDepends{rstpm2}
%\VignetteKeyword{survival, spline}
%\VignettePackage{rstpm2}
%!\SweaveUTF8

\documentclass[nojss]{jss}

\usepackage{amsmath,amsfonts,enumitem,fancyvrb,hyperref}
\usepackage[utf8]{inputenc}
\VerbatimFootnotes
\usepackage[margin=2.6cm]{geometry} % wide margins
\usepackage{wasysym,bm}
\usepackage{tablefootnote}
\usepackage{algorithm2e}
\usepackage{graphicx, tikz}
\usetikzlibrary{arrows,decorations.pathmorphing,backgrounds,positioning,fit,petri,matrix}
\usepackage{xcolor}

\title{Predictions for parametric and penalised multi-state Markov models}

% Transforming parametric and penalized hazard estimates

\author{Mark~Clements\\Karolinska Institutet}

\Plainauthor{Mark~Clements}

\Plaintitle{Markov multi-state models}

\Abstract{

We describe an efficient algorithm and implementation for
predictions for parametic and penalized Markov multi-state
models. Predictions include state occupancy probabilities,
transition probabilities, prevalence, length of stay, relative
survival, screening sensitivity, and costs. The algorithm uses a system of ordinary differential
equations to calculate the predictions and their gradients, with
standard errors calculated using the delta method. These methods
have applications to a range of disciplines, including descriptive
epidemiology, causal inference and economic evaluations.

}

\Keywords{survival, multi-state models, Markov models}

\Plainkeywords{multi-state models, Markov models}

Department of Medical Epidemiology and Biostatistics\\
Karolinska Institutet\\
Email: \email{mark.clements@ki.se}
}

\begin{document}

\section{Introduction}

% Model fitting is challenging for interval-censored, where the likelihood requires integration of the hazards.

\emph{Multi-state models} form a very broad class of models that includes standard survival models with an initial and final state, competing risks with multiple final states, and illness-death models, with an initial healthy state, an illness state and a death state. This model class is useful for representing movement through a discrete set of states. As a motivating example, we are interested in the clinical pathways for cancer screening, diagnosis and treatment, taking account of utilities and costs. Using observational data, screening or treatment assignment may be confounded with patient characteristics, so we are also interested in using standardisation or the parametric g-formula to adjust for differences between groups.
% Should we use the negative biopsies as an example?

Predictions from multi-state models could include: transition hazards between states; transition probabilities for being in a particular state conditional on an initial state; state occupation probabilities for being in a particular state given an observed distribution of initial states; length of stay in a given state; prevalence for the live health states; contrasts of these predictions, including standardisation; and ratios of these predictions \citep{Touraine_Helmer_Joly_2016,Crowther_Lambert_2017}.

Much of the literature on multi-state models has focused on counting processes  \citep[e.g.][]{Andersen_Borgan_Gill_Keiding_1993,Andersen_Keiding_2002,Putter_Fiocco_Geskus_2007}. For implementations, \cite{Allignol_Schumacher_Beyersmann_2011} use the Aalen-Johansen estimator to estimate transition probabilities, and \cite{deWreede_Fiocco_Putter_2011} predict transition probabilities and their standard errors for transitions modelled using Cox regression. These implementations can also estimate length of stay and contrasts, however variance estimates would then require the non-parametric bootstrap. \cite{Ryalen_Stensrud_Roysland_2018} use stochastic differential equations to transform non-parametric cumulative hazard estimates into a variety of predictions. This approach is shown to work for Aalen's additive hazard model, but it is unclear whether the approach extend to Cox regression models. Their methods are a non-parametric analogue to our development.

Parametric and penalised survival models have potential advantages for multi-state models, including the ready incorporation of time-varying effects and for predictions outside of observed data \citep{Touraine_Helmer_Joly_2016,Crowther_Lambert_2017}.
\cite{Iacobelli_Carstensen_2013} simulate from Poisson regression models on different time scales to predict for multi-state models. \cite{Blaser_Vizcaya_Estill_Zahnd_Kalesan_Egger_Gsponer_Keiser_2015} provide a simulation framework for multi-state models with random times based on piece-wise constant hazards. \cite{Crowther_Lambert_2017} combine parametric time-to-event models to predict transition probabilities and length of stay. The authors use simulations, with variance estimation using the parametric bootstrap. None of these approaches scales well to allow for standardisation across moderate sized datasets.
% heemod - discrete time Markov models
% hesim

Predictions for Markov models can use ordinary differential equation solvers to solve Kolmogorov's forward differential equation. Recently, \cite{Jackson_2016} predicted length of stay using differential equations. \cite{Titman_2011} also demonstrated that the gradients for transition probabilities (or sensitivity equations) can be estimated by augmenting the system of differential equations.  Variance estimates for predictions can be calculated using the delta method and gradients for the predictors \citep{Gentleman_Lawless_Lindsey_Yan_1994}. This provides an opportunity to efficiently calculate a range of predictions from multi-state models.

We have two objectives: first, to develop efficient methods for predictions for \emph{smooth, non-homogeneous multi-state Markov models} with variance estimation using the \emph{multivariate delta method}; and, second, to demonstrate that these methods support the use of multi-state models across descriptive statistics, causal inference, and economics.
Importantly, our restriction to Markov models is not a heavy constraint: \cite{Datta_Satten_2001} have shown that the state occupation probabilities are consistently estimated under moderate conditions even when the time scale is mis-specified. Length of stay, prevalence, utilities and costs, when they are integrated functions of the state occupation probabilities, are also expected to be consistent under similar conditions.

In outline, we provide a theoretical development of a set of ordinary differential equations, simulate to assess the small sample properties, provide some examples and conclude with a brief discussion.

\section{Methods}

% Structure?

\subsection*{Assumptions}

For the predictions, the main inputs are (i) a multi-state model specification, (ii) the models for the transition intensities, and (iii) and the initial values for the states. We assume that the transition intensities are estimated using maximum (penalised) likelihood estimation, with stacked estimated parameters $\hat{\boldsymbol{\beta}}$ and stacked estimated variance-covariance matrix $\hat{\boldsymbol\Sigma}$. The estimated parameters could be from either model fit for all transitions, one model for each transition or a combination of models for different transitions.

For a formal development, the asymptotic properties for parametric survival models were developed by \cite{Borgan_1984} and \cite{Andersen_Borgan_Gill_Keiding_1993}. Under CramÃ©r-like conditions, the authors show that that the maximum likelihood parameters $\hat{\boldsymbol{\beta}}$ are asymptotically normal. Sufficient conditions include uniform convergence, that the hazard is thrice differentiable with respect to the parameters, boundedness of the hazard function, and that the hazard is bounded away from zero \citep{Borgan_1984}. We can then estimate the asymptotic variance for the predictions using the multivariate delta method, which will be asymptotically normal if the parameters are asymptotically normal and the gradients of the predictions exist (see Theorem 3.4.6 in \cite{Sen_Singer_1993}). \cite{Ryalen_Stensrud_Roysland_2018} developed asymptotic properties for non-parametric stochastic differential equations under Hadamard continuity. We will assess these asymptotic properties using simulations.

\subsection*{Notation}

For vectors $\boldsymbol{v}, \boldsymbol{v}_1$ and $\boldsymbol{v}_2$ and matrices $\boldsymbol{M}, \boldsymbol{M}_1$ and $\boldsymbol{M}_2$, let $\boldsymbol{v}_1\circ\boldsymbol{v}_2$ and $\boldsymbol{M}_1\circ\boldsymbol{M}_2$ be the Hadamard element-wise products, and let $\boldsymbol{v}_1\oslash\boldsymbol{v}_2$ and $\boldsymbol{M}_1\oslash\boldsymbol{M}_2$ be Hadamard element-wise division. We also define $\boldsymbol{M}\circ\boldsymbol{v}=\boldsymbol{M}\circ(\boldsymbol{v}\boldsymbol{1}^T)$ and $\boldsymbol{M}\oslash\boldsymbol{v}=\boldsymbol{M}\oslash(\boldsymbol{v}\boldsymbol{1}^T)$. We assume that $\circ$  and $\oslash$ have higher operator precedence than addition and subtraction.  Let $\text{diag}(\boldsymbol{v})$ represent a square matrix with zeros in the off-diagonal and $\boldsymbol{v}$ on the diagonal. We represent a vector of ones using $\boldsymbol{1}$ and an identity matrix by $\boldsymbol{I}$.

We define the gradient for a prediction $\boldsymbol\phi(t_0,t)$ with respect to an estimated model parameter $\hat\beta_m$  by $\boldsymbol\phi'_m(t_0,t)\equiv\left.\frac{\displaystyle\partial\boldsymbol\phi(t_0,t)}{\displaystyle\partial\beta_m}\right|_{\beta_m=\hat\beta_m}$.

\subsection*{General approach applied to transition and state occupancy probabilities}

Let the set of states be indexed by $i$ and $j$. We define the matrix of \emph{transition probabilities} $\boldsymbol{P}(t_0,t)=(P_{ij}(t_0,t))$ as the probabilities of being in state $j$ at time $t$ given being in an initial state $i$ at entry time $t_0$. For smooth hazards, the Markov property is expressed through Kolmogorov's forward differential equation, such that
\begin{align}
\frac{d\boldsymbol{P}(t_0,t)}{dt} &= \boldsymbol{P}(t_0,t) \boldsymbol{Q}(t) \label{eq:kolmogorov} \\
\boldsymbol{P}(t_0,t_0) &= \boldsymbol{I} \label{eq:Pinit}
\end{align}
where $\boldsymbol{Q}(t)=(Q_{ij}(t))$ is a matrix of transition intensities at time $t$\footnote{Technically, this is $t-$.} from state $i$ to state $j$ when $i\neq j$ (conceptualised as the rates from state $i$ to state $j$), and where $Q_{ii}(t)=-\sum_{j\neq i}Q_{ij}(t)$ (conceptualised as the rates out of state $i$). % The Markov property is based on a single clock $t$ (e.g. age of a subject or calendar time), whereas semi-Markov models are based on a reset clock (e.g. time in state for each new state).

% (toggle-debug-on-quit)

Following \cite{Titman_2011}, Kolmogorov's forward differential equation can be augmented to calculate the gradient for $\boldsymbol{P}(t_0,t)$ with respect to the model coefficients. Titman showed that
\begin{align}
\frac{d\boldsymbol{P}'_m(t_0,t)}{dt} &=\boldsymbol{P}'_m(t_0,t)\boldsymbol{Q}(t) + \boldsymbol{P}(t_0,t)\boldsymbol{Q}'_m(t) \label{eq:dPudt} \\
\boldsymbol{P}'_m(t_0,t_0) &= \boldsymbol{0} \label{eq:Puinit}
\end{align}
Note that this requires the evaluation of the gradients for the transition intensities with respect to $\hat\beta_m$. \emph{Algorithm 1} is defined as:

\begin{algorithm}[H]
\SetKwInOut{Input}{input}\SetKwInOut{Output}{output}
\Input{$\boldsymbol{P}(t_0,t_0),\boldsymbol{Q}(t), \left\{\boldsymbol{Q}'_m(t)\right\}\forall m, t_0, t_\text{max}$} \Output{$\boldsymbol{P}(t_0,t),\left\{\boldsymbol{P}'_m(t_0,t)\right\}\forall m\ \text{for}\ t\in\left[t_0,t_\text{max}\right)$}
\Begin{
define the ODE based on Equations~(\ref{eq:kolmogorov})--(\ref{eq:Puinit})\;
run an ODE solver from time $t_0$ to time $t_\text{max}$\;
}
\end{algorithm}

This algorithm could be done separately for each covariate pattern or the ODEs can be extended to multiple covariates by stacking the equations.
Using the multivariate delta method, the estimated covariance matrix for $\boldsymbol{P}(t_0,t)$ is
\begin{align*}
\text{var}(\boldsymbol{P}(t_0,t)) &= \boldsymbol{P}'_m(t_0,t)^T\, \hat{\boldsymbol\Sigma}\, \boldsymbol{P}'_m(t_0,t)
\end{align*}

Let the vector $\boldsymbol{\pi}(t_0,t_0)=(\pi_j(t_0,t_0))$ represent the initial \emph{state occupation probabilities} of being in state $j$ at time $t_0$. The state occupation probabilities of being in state $j$ at time $t>t_0$ can be calculated by $\boldsymbol{\pi}(t_0,t)^T=(\pi_j(t_0,t))=\boldsymbol{\pi}(t_0,t_0)^T\boldsymbol{P}(t_0,t)$. Differential equations can be readily calculated for the state occupation probabilities.

This general approach can be applied to a variety of predictions. The other predictions may require that a set of differential equations be defined for several related predictions, including their gradients, in combination with the delta method for variance estimation.

\subsection*{Example 1: Length of stay and restricted mean survival time}

The ordinary differential equations in Equations~(\ref{eq:kolmogorov})--(\ref{eq:Puinit}) can be further augmented to calculate the integral $L_{ij}(t_0,t)=\int_{t_0}^t P_{ij}(t_0,v)dv$. The additional differential equations are then
\begin{align}
\frac{d\boldsymbol{L}(t_0,t)}{dt} &= \boldsymbol{P}(t_0,t) \label{eq:los} \\
\frac{d\boldsymbol{L}'_m(t_0,t)}{dt} &= \boldsymbol{P}'_m(t_0,t) \label{eq:dLudt} \\
\boldsymbol{L}(t_0,t_0) &= \boldsymbol{L}'_m(t_0,t_0) = \boldsymbol{0} \label{eq:Linit}
\end{align}
where the matrix $\boldsymbol{L}(t_0,t)$ is the the \emph{length of stay} or sojourn time for state $j$ given an initial state $i$ at time $t_0$. Algorithm 1 would be augmented to include Equations~(\ref{eq:los})--(\ref{eq:Linit}) to calculate transition and state occupation probabilities and the lengths of stay.

For restricted mean survival times and life expectancy, we can weight the length of stay by a vector $\boldsymbol{w}=(w_j)$, where $w_j=1$ for a live state $j$, and $w_j=0$ for a death state $j$. We then have that the restricted mean survival is $\bm{L}(t_0,t)\bm{w}$. Moreover, as $t\rightarrow\infty$, $\bm{L}(t_0,t)\bm{w}$ will measure life expectancy.

\subsection*{Example 2: Prevalence}

Let the prevalence for the live states be defined by
\begin{align*}
\tilde{\boldsymbol{P}}(t_0,t) &= \left(\boldsymbol{P}(t_0,t)\text{diag}(\boldsymbol{w})\right)\oslash\left(\boldsymbol{P}(t_0,t)\boldsymbol{w}\right)
\end{align*}
where the weight vector $\bm{w}$ again has elements that are 1 for a live state and 0 for a death state. Generalising a result for illness-death models by \cite{Brinks_Hoyer_2018} to multi-state models, we have that
\begin{align*}
\frac{d\tilde{\boldsymbol{P}}(t_0,t)}{dt}
&= \left(\left(\frac{d\boldsymbol{P}(t_0,t)}{dt}\text{diag}(\boldsymbol{w})\right)
\circ \left(\boldsymbol{P}(t_0,t)\boldsymbol{w}\right) - \left(\boldsymbol{P}(t_0,t)\text{diag}(\boldsymbol{w})\right) \circ \left(\frac{d\boldsymbol{P}(t_0,t)}{dt}\boldsymbol{w}\right)\right)\\
\end{align*}
with an initial value that $\tilde{\boldsymbol{P}}(t_0,t_0)=\text{diag}(\boldsymbol{w})$. Alternatively, we can calculate $\text{logit}(\tilde{\boldsymbol{P}}(t_0,t))=\log(\tilde{\boldsymbol{P}}(t_0,t)) - \log(\boldsymbol{1}\boldsymbol{1}^T-\tilde{\boldsymbol{P}}(t_0,t))$, which has a gradient of
\begin{align*}
\left.\frac{\partial}{\partial\beta_m} \text{logit}(\tilde{\boldsymbol{P}}(t_0,t))\right|_{\beta_m=\hat\beta_m}
&= \left(
\left(\boldsymbol{P}'_m(t_0,t)\text{diag}(\boldsymbol{w})\right)\circ \left(\boldsymbol{P}(t_0,t)\boldsymbol{w}\right) - \left(\boldsymbol{P}(t_0,t)\text{diag}(\boldsymbol{w})\right)\circ \left(\boldsymbol{P}'_m(t_0,t)\boldsymbol{w}\right)
\right) \\
\left(\boldsymbol{1}\boldsymbol{1}^T - \tilde{\boldsymbol{P}}(t_0,t)\right) \oslash
\left(\boldsymbol{P}(t_0,t)\boldsymbol{w}\right)  \oslash
\left(\boldsymbol{P}(t_0,t)\boldsymbol{w}\right)
\end{align*}

Similarly, we could calculate the proportion of person-time in the live health states, which is calculated by $\tilde{\boldsymbol{L}}(t_0,t)=\left(\boldsymbol{L}(t_0,t)\text{diag}(\boldsymbol{w})\right)\oslash\left(\boldsymbol{L}(t_0,t)\boldsymbol{w}\right)$. The development for these predictions follows closely that for prevalence.

\subsection*{Example 3: Linear combinations, differences and standardisation}

Linearity combinations of these estimators are straightforward to calculate, as the gradients are then also linear. Let $\boldsymbol\phi(t_0,t|\boldsymbol{x}_k)$ represent an estimator, such as transition probabilities, state occupation probabilities or length of stay, conditional on covariate vector $\boldsymbol{x}_k$. Given weights $w_k$ for $k=1,\ldots,K$, we can calculate the weighted sums
\begin{align*}
\bar{\boldsymbol{\phi}}(t_0,t) &= \sum_{k=1}^K w_k \boldsymbol{\phi}(t_0,t|\boldsymbol{x}_k)\\
\bar{\boldsymbol{\phi}}'_m(t_0,t) &= \sum_{k=1}^K w_k \boldsymbol{\phi}'_m(t_0,t|\boldsymbol{x}_k)
\end{align*}
For differences, we can define $K=2$, $w_1=1$ and $w_2=-1$.

\emph{Standardisation} can be defined in terms of the mean prediction under a counterfactual exposure. For example, consider a binary exposure $X$ with confounders $\boldsymbol{Z}_i$ across a sample or population indexed by $i$. The standardised estimator for everyone been unexposed or exposed to $X$ is then
\begin{align*}
\bar{\boldsymbol\phi}_0(t_0,t) &= E_{\boldsymbol{Z}}(\boldsymbol\phi(t_0,t|X=0,\boldsymbol{Z})) =
\frac{1}{n}\sum_{i=1}^n \boldsymbol\phi(t_0,t|X=0,\boldsymbol{Z}_i)\\
\bar{\boldsymbol\phi}_1(t_0,t) &= E_{\boldsymbol{Z}}(\boldsymbol\phi(t_0,t|X=1,\boldsymbol{Z}))
\end{align*}
Under no unmeasured confounding and positivity, we could give a causal interpretation to  contrasts such as $\bar{\boldsymbol\phi}_1(t_0,t)-\bar{\boldsymbol\phi}_0(t_0,t)$, $\bar{\boldsymbol\phi}_1(t_0,t)\oslash\bar{\boldsymbol\phi}_1(t_0,t)$ or $\boldsymbol{1}\boldsymbol{1}^T-\bar{\boldsymbol\phi}_1(t_0,t)\oslash E_{\boldsymbol{Z},X}(\boldsymbol\phi(t_0,t|X\boldsymbol{Z}))$.
These marginal estimators can be interpreted as applications of the parametric g-formula .

\begin{figure}
\begin{center}
\begin{tikzpicture}[->,bend angle=20,semithick,>=stealth']
\matrix [matrix of nodes,row sep=6mm,column sep=7mm,nodes={align=center},
ampersand replacement=\&]
{
\& |(Z)| $\boldsymbol{Z}$  \\
|(X)| $X$ \& \& |(Y)| $\boldsymbol{\phi}(t_0,t|X,\boldsymbol{Z})$ \\
};
\begin{scope}[every node/.style={midway,auto}]
\draw (X) to node {} (Y);
\draw (Z) to node {} (X);
\draw (Z) to node {} (Y);
\end{scope}
\end{tikzpicture}
\caption{Directed acyclic graph showing predictions $\boldsymbol{\phi}(t_0,t|X,\boldsymbol{Z})$ as a function of the target exposure $X$ and confounders $\boldsymbol{Z}$.}
\end{center}
\end{figure}
% Do notation

\subsection*{Example 4: Ratios}

% We can also consider differences and ratios. Let $\boldsymbol\phi(t_0,t|\boldsymbol{x})$ represent one of the estimators from the previous sections.
% The gradient for the difference $\boldsymbol\phi(t_0,t|\boldsymbol{x}_1)-\boldsymbol\phi(t_0,t|\boldsymbol{x}_2)$ is
% \begin{align*}
% \left.\frac{\partial}{\partial \beta_m} \left(\boldsymbol\phi(t_0,t|\boldsymbol{x}_1) - \boldsymbol\phi(t_0,t|\boldsymbol{x}_2)\right)\right|_{\beta_m=\hat\beta_m} &= \boldsymbol{\phi}'_m(t_0,t|\boldsymbol{x}_1) - \boldsymbol\phi'_m(t_0,t|\boldsymbol{x}_2)
% \end{align*}

We can augment the system of differential equations to calculate the ratios for specific estimators \citep{Ryalen_Stensrud_Roysland_2018}. Let a matrix of ratios of state occupation probabilities be $\text{\bf R}(t_0,t|\boldsymbol{x}_1,\boldsymbol{x}_2)=\boldsymbol{\pi}(t_0,t|\boldsymbol{x}_1)\oslash\boldsymbol{\pi}(t_0,t|\boldsymbol{x}_2)$. Then
\begin{align*}
\frac{d\text{\bf R}(t_0,t|\boldsymbol{x}_1,\boldsymbol{x}_2)}{dt}^T &= \left(\left(\boldsymbol{\pi}(t_0,t|\boldsymbol{x}_1)^T\boldsymbol{P}(t_0,t|\boldsymbol{x}_1)\boldsymbol{Q}(t|\boldsymbol{x}_1)\right) - \left(\boldsymbol{\pi}(t_0,t|\boldsymbol{x}_2)^T\boldsymbol{P}(t_0,t|\boldsymbol{x}_2)\boldsymbol{Q}(t|\boldsymbol{x}_2)\right) \circ \text{\bf R}(t_0,t|\boldsymbol{x}_1,\boldsymbol{x}_2)\right) \\
\text{\bf R}(t_0,t_0) &= \boldsymbol{1}
\end{align*}
where $\boldsymbol{P}(t_0,t|\boldsymbol{x}_2)>0$. Alternatively, we can calculate the gradient for $\log(\boldsymbol\phi(t_0,t|\boldsymbol{x}_1)\oslash\boldsymbol\phi(t_0,t|\boldsymbol{x}_2))$ (that is, the log ratio), which is
\begin{align*}
\left.\frac{\partial}{\partial \beta_m} \log\left(\boldsymbol\phi(t_0,t|\boldsymbol{x}_1)\oslash\boldsymbol\phi(t_0,t|\boldsymbol{x}_2)\right)\right|_{\beta_m=\hat\beta_m} &= \boldsymbol{\phi}'_m(t_0,t|\boldsymbol{x}_1)\oslash\boldsymbol\phi(t_0,t|\boldsymbol{x}_1)
- \boldsymbol{\phi}'_m(t_0,t|\boldsymbol{x}_2)\oslash\boldsymbol\phi(t_0,t|\boldsymbol{x}_2)
\end{align*}
This evaluation depends on $\boldsymbol\phi(t_0,t|\boldsymbol{x}_2)$ being non-zero.

% Open question: how does this compare with log(ratio)?

% Representation of different interventions - how?

\subsection*{Example 5: Utilities and costs}

The approach readily incorporates utilities and costs. For utilities, we have the cumulative discounted utilities $U_{i}(t_0,t)=\sum_j \int_{t_0}^t P_{ij}(t_0,v)u_j(v)e^{-\lambda v}dv$, where $u_j(v)$ is the utility for state $j$ at time $v$ and $\lambda=\log(1+\delta)$ is the rate of decline for a discount rate $\delta$. The augmented differential equations are then
\begin{align}
\frac{d\boldsymbol{U}(t_0,t)}{dt} &= \boldsymbol{P}(t_0,t)\boldsymbol{u}(t)e^{-\lambda t} \label{eq:utilities} \\
\frac{d\boldsymbol{U}'_m(t_0,t)}{dt} &= \left(\boldsymbol{P}'_m(t_0,t)\boldsymbol{u}(t)+\boldsymbol{P}(t_0,t)\boldsymbol{u}'_m(t)\right)e^{-\lambda t} \label{eq:dUudt} \\
\boldsymbol{U}(t_0,t_0) &= \boldsymbol{U}'_m(t_0,t_0) = \boldsymbol{0}
\end{align}
In health economics, the discounted, quality-adjusted life-years are calculated by the product of the initial state probabilities and the utilities, such that $\text{QALY}(t_0,t)=\boldsymbol{\pi}(t_0,t_0)^T\boldsymbol{U}(t_0,t)$, with gradient $\text{QALY}'_m(t_0,t)=\boldsymbol{\pi}(t_0,t_0)^T\boldsymbol{U}'_m(t_0,t)$.

Costs may be represented as accumulated costs or costs at a point in time. Let costs per unit time for being in state $i$ at time $t$ be represented by the vector function $\boldsymbol{c}(t)=(c_{i}(t))$ and model for costs for transitions from state $i$ to state $j$ at time $t$, represented by the matrix $\boldsymbol{\mathcal{C}}(t)=(\mathcal{C}_{ij}(t))$, with $\mathcal{C}_{ii}(t)=0$. Then the cumulative discounted costs $\boldsymbol{C}(t_0,t)$ can be represented by the equations
\begin{align}
\frac{d\boldsymbol{C}(t_0,t)}{dt}
&= \boldsymbol{P}(t_0,t)
\left(\boldsymbol{c}(t)+\left(\boldsymbol{Q}(t)\circ\boldsymbol{\mathcal{C}}(t)\right)\bm{1}\right)
e^{-\lambda t} \label{eq:costs} \\
\frac{d\boldsymbol{C}'_m(t_0,t)}{dt}
&= \Big(\boldsymbol{P}'_m(t_0,t)
\left(\boldsymbol{c}(t)+\left(\boldsymbol{Q}(t)\circ\boldsymbol{\mathcal{C}}(t)\right)\bm{1}\right) +\nonumber \\
\left(\boldsymbol{c}'_m(t)+\left(\boldsymbol{Q}'_m(t)\circ\boldsymbol{\mathcal{C}}(t)+\boldsymbol{Q}(t)\circ\boldsymbol{\mathcal{C}}'_m(t)\right)\bm{1}\right)\Big)e^{-\lambda t}  \\
\boldsymbol{C}(t_0,t_0) &= \boldsymbol{C}'_m(t_0,t_0) = \boldsymbol{0}
\end{align}

Note that these require the evaluation of the gradients for the utility and cost functions.
The total costs are calculated by weighting by the initial state probabilities, such that
$\text{Costs}(t_0,t)=\boldsymbol{\pi}(t_0,t_0)^T\boldsymbol{C}(t_0,t)$, with gradient $\text{Costs}'_m(t_0,t)=\boldsymbol{\pi}(t_0,t_0)^T\boldsymbol{C}'_m(t_0,t)$.
We can also consider incremental cost-effectiveness ratios, estimated by
\begin{align*}
\text{ICER}(t_0,t|\boldsymbol{x}_1,\boldsymbol{x}_2) &= \frac{\text{Costs}(t_0,t|\boldsymbol{x}_1)-\text{Costs}(t_0,t|\boldsymbol{x}_2)}{\text{QALY}(t_0,t|\boldsymbol{x}_1)-\text{QALY}(t_0,t|\boldsymbol{x}_2)}
\end{align*}
The gradient of the log of the ICER is
\begin{align*}
\left.\frac{\partial}{\partial\beta_m}\log(\text{ICER}(t_0,t|\boldsymbol{x}_1,\boldsymbol{x}_2))\right|_{\beta_m=\hat\beta_m}
&= \frac{\text{Costs}'_m(t_0,t|\boldsymbol{x}_1)-\text{Costs}'_m(t_0,t|\boldsymbol{x}_2)}{\text{Costs}(t_0,t|\boldsymbol{x}_2)-\text{Costs}(t_0,t|\boldsymbol{x}_1)} - \\
\end{align*}

Finally, we can expand the equations to record the utilities or costs for different states. We can extend Equations~\eqref{eq:utilities} and \eqref{eq:costs} to
\begin{align*}
\frac{d\tilde{\boldsymbol{U}}(t_0,t)}{dt} &= \left(\boldsymbol{P}(t_0,t)\circ\left(\bm{1}\,\boldsymbol{u}(t)^T\right)\right)e^{-\lambda t} \\
\frac{d\tilde{\boldsymbol{C}}(t_0,t)}{dt}
&= \left(\boldsymbol{P}(t_0,t) \circ
\left(\bm{1}\,\boldsymbol{c}(t)^T+\bm{1}\left(\left(\boldsymbol{Q}(t)\circ\boldsymbol{\mathcal{C}}(t)\right)\bm{1}\right)^T\right)\right)
e^{-\lambda t}
\end{align*}
for matrices $\tilde{\bm{U}}$ and $\tilde{\bm{C}}$, where $\tilde{U}_{ij}$ is the cumulative utility for state $j$ given an initial state $i$, and $\tilde{C}_{ij}$ is the cumulative cost for being in state $j$ or a transition from state $j$ given an initial state $i$.

\subsection*{Example 6: Transformations}

For causal inference and health economic evaluations, it is often useful to combine evidence from different data sources to transform the baseline transitions. We introduce five such transformations. {First}, we can turn off a transition, where the transformed hazard $h^*(t|\bm{x})=0\times h(t|\bm{x})$. One advantage of this zero-model formulation is that we can keep the same variance-covariance matrix for the baseline transitions, which supports a straightforward comparison between the interventions.
{Second}, we may have a hazard ratio $\exp(\phi)$ with 95\% confidence interval $(\exp(\phi_l),\exp(\phi_u))$, such as from a meta-analysis. The variance for the log hazard ratio is $\left((\phi_u-\phi_l)/2/1.96\right)^2$. The transformed hazard is $h^*(t|\bm{x},\phi)=\exp(\phi) h(t|\bm{x})$, with the gradient of the hazard $h^*$ with respect to the parameters of $h(t|\bm{x})$ being $\exp(\phi)\nabla h(t|\bm{x})$, and the gradient with respect to $\phi$ being $\exp(\phi)h(t|\bm{x})$.
{Third}, we may have an acceleration factor $\exp(\phi)$. Then the transformed hazard is $h^*(t|\bm{x},\phi)=\exp(\phi) h(\exp(\phi)t|\bm{x})$, with the gradient of the hazard $h^*$ with respect to the parameters of $h(t\exp(\phi)|\bm{x})$ being $\exp(\phi)\nabla h(t\exp(\phi)|\bm{x})$, and the gradient with respect to $\phi$ being $t\exp(2\phi)h'(t\exp(\phi)|\bm{x})+\exp(\phi)h(t\exp(\phi)|\bm{x})$.
{Fourth}, we can add transition hazards together, such that the transformed hazard is $h^*(t|\bm{x})=h_1(t|\bm{x})+h_2(t|\bm{x})$. The gradient of the sum is equal to the sum of the gradients. {Fifth}, we can include smooth mathematical functions with no uncertainty. A useful example of this would be to use a spline interpolation function for the log of background mortality rates from vital statistics for use in relative survival (or excess hazards) modelling. The spline interpolation has the advantage of being smooth for the ordinary differential equation solver.

\subsection*{Variance and interval estimation}

For an estimator $\boldsymbol\phi(t_0,t)$ with support on the real line, a confidence interval can be calculated from the variance-covariance matrix $\boldsymbol{V} = \boldsymbol{\phi}'_m(t_0,t)\, \hat{\boldsymbol\Sigma}\, \boldsymbol{\phi}'_m(t_0,t)$ and normal quantiles $z$ at the $\alpha$ level with bounds $\boldsymbol\phi(t_0,t)\pm z_{(1-\frac{\alpha}{2})}\sqrt{\text{diag}(\boldsymbol{V})}$.
For estimators that are on the open interval $(0,1)$, the gradient can be calculated using an identity or logit transformations. Similarly, for estimators that are on the open interval $(0,\infty)$, the gradient can be calculated using an identity of log transformations.
% We could present results for the different transformations...

% \begin{align*}
%   \frac{\partial \log(-\log(\boldsymbol\phi(t_0,t)))}{\partial\hat\beta_m}  &= \frac{\partial\boldsymbol\phi(t_0,t)}{\partial\hat\beta_m}\circ\frac{\boldsymbol{1}}{\log(\boldsymbol\phi(t_0,t))\circ\boldsymbol\phi(t_0,t)} \\
%   \boldsymbol{V} &=\text{var}(\log(-\log(\boldsymbol{\phi}(t_0,t)))) \\ &= \log(-\log(\boldsymbol{\phi}'_m(t_0,t)))\, \hat{\boldsymbol\Sigma}\, \log(-\log(\boldsymbol{\phi}'_m(t_0,t)))
% \end{align*}
% with bounds $\exp\left(-\exp(\log(-\log(\boldsymbol\phi(t_0,t)))\pm z_{(1-\frac{\alpha}{2})}\text{diag}(\boldsymbol{V}))\right)$.

% \begin{align*}
%   \frac{\partial \log(\boldsymbol\phi(t_0,t))}{\partial\hat\beta_m}  &= \frac{\partial\boldsymbol\phi(t_0,t)}{\partial\hat\beta_m}\circ\frac{\boldsymbol{1}}{\boldsymbol\phi(t_0,t)} \\
%   \boldsymbol{V} &=\text{var}(\log(\boldsymbol{\phi}(t_0,t))) \\ &= \log(\boldsymbol{\phi}'_m(t_0,t))\, \hat{\boldsymbol\Sigma}\, \log(\boldsymbol{\phi}'_m(t_0,t))
% \end{align*}
% with bounds $\exp\left(\log(\boldsymbol\phi(t_0,t))\pm z_{(1-\frac{\alpha}{2})}\text{diag}(\boldsymbol{V})\right)$.

\subsection*{Integration of hazards that are functions of log(time)}
Many parametric survival models are implemented in terms of log(time), including flexible parametric survival models and accelerated failure time models. Integration of the ordinary differential equation solvers from the origin for such models can lead to numerical issues. We offer two approaches to address these issues. First, we can truncate small values for time, such that $t^*=\text{max}(t,\epsilon)$, e.g. using $\epsilon=$1e-5. Defining cumulative hazards as $H_{ij}(t_0,t)=\int_{t_0}^t Q_{ij}(u)du$, the value for $H_{ij}(0,\epsilon)$ may be poorly estimated by $\epsilon Q_{ij}(\epsilon)$. Moreover, the hazards will generally not be smooth at $t=\epsilon$.

Second, we could directly calculate initial values at $t=\epsilon$ using cumulative hazard estimates. Let the matrix of cumulative hazard intensities from 0 to $\epsilon$ be $\bm{H}(t_0,t)=(H_{ij}(t_0,t))$ and let $\bm{H}'_m(t_0,t)=\int_{t_0}^t \bm{Q}_m'(u)du$. The transition probabilities at $\epsilon$ can be calculated approximately by
\begin{align*}
\bm{P}(0,\epsilon) &\approx \text{mexp}(\bm{H}(0,\epsilon)) \\
\bm{P}'_m(0,\epsilon) &\approx \bm{H}'_m(0,\epsilon) \\
\bm{L}(0,\epsilon) &\approx \epsilon(\bm{I}+\bm{P}(0,\epsilon))/2 \\
\bm{L}'_m(0,\epsilon) &\approx \epsilon\bm{P}'_m(0,\epsilon)/2 \\
\bm{U}(0,\epsilon) &\approx \epsilon\bm{P}(0,\epsilon)\bm{u}(\epsilon)e^{-\lambda\epsilon}/2 \\
\bm{U}'_m(0,\epsilon) &\approx \epsilon(\bm{P}'_m(0,\epsilon)\bm{u}(\epsilon) + \bm{P}(0,\epsilon)\bm{u}'_m(\epsilon))e^{-\lambda\epsilon}/2 \\
\end{align*}
where $\text{mexp}(\bm{M})$ is the matrix exponential for a matrix $\bm{M}$.

\section{Implementation}

We have implemented the algorithm in R as the \code{rstpm2::markov_msm} function. The current implementation allows for: independent models for each of the transitions; a rich set of models for each transitions (see Appendix B); predictions for state occupation and transition intensities, length of stay, utilities and costs; and useful post-processing facilities, including weighted standardisation, differences and ratios.

% A basic call to \code{markov_msm()} to estimate the state occupancy probabilities and length of stay would be \code{markov_msm(x, trans, t, newdata)}, where \code{x} is a list of regression models for the transition intensities, \code{trans} is a matrix that defines the multistate transitions as per the \code{mstate} package, \code{t} is a vector of times to report (including the initial time), and \code{newdata} is a data-frame with the covariates for predictions.

% The current implementation assumes independently estimated transition intensities. The models for the transition intensities need to provide methods for: (i) the covariance matrix (\code{vcov}); (ii) predictions for hazards (\code{predict(object, newdata, type="haz", ...)}); and (iii) predictions for gradients of the hazards (\code{predict(object, newdata, type="gradh", ...)}). These methods have been provided for \code{rstpm2::stpm2}, \code{rtspm2::pstpm2}, \code{stats::glm} for a log link, and \code{survPen::survPen}. Use of \code{stats::glm} will also require that the argument \code{tvar} to specify a character for the name (or vector of names) of the time variables in the regression models for predictions. A transition can be turned off by wrapping a regression object in \code{zeroRate()}.
% % Should I use a common S3 or S4 class?

% The object returned from \code{markov_msm()} is of class \code{"markov_msm"}. The object includes predictions and gradients for the times specified for the covariates in the input \code{newdata} data-frame. Methods are provided for: \code{print}; \code{as.data.frame}; \code{subset(x, subset, ...)} based on a restriction of \code{newdata}; \code{standardise}, which averages across covariate patterns and returns another \code{markov_msm} object. Other functions include \code{diff(x,y)} which takes a difference between two \code{markov_msm} objects, and \code{ratio_markov_msm}, which gives the ratio of two \code{markov_msm} objects; both of these functions return objects that inherit from \code{"markov_msm"}.

% Differences from the same models are just different weights...

\section{Simulations}

% The research questions included assessing the statistical properties (including bias and coverage) for: (i) different transformations of state occupation probabilities and length of stay; (ii) different regression models; and (iii) sample size.
% We used exponentially distributed data from an illness-death model to estimate the proportion that are healthy, ill or dead using (a) different sample sizes (500, 1000, 10000), (b) different regression models (Poisson regression, Weibull regression), and (c) different transformations for state occupation probabilities (identity, log, logit, and arcsin) and length of stay (identity, log).

To assess the small sample properties compared with the asymptotics properties, we undertook several simulations.
We fitted a Markov illness-death model with Weibull transition rates with a shape parameter of 1.5 and a scale parameter of 10 for the transitions Healthy $\rightarrow$ Illness, Healthy $\rightarrow$ Death, and Illness $\rightarrow$ Death. We simulated for 1000 individuals with censoring min(20,Uniform(0,30)). We fitted the transitions using Weibull regression using the \code{aftreg} model from the \pkg{eha} package. The expected state occupation probabilities were predicted using the ODE solver with the true parameters. For each simulation, we predicted the state occupation probabilities, bias, confidence intervals using five transformations (plain confidence intervals, and confidence intervals based on log-log, log, logit and arcsin transformations), and coverage. We then calculated the mean bias, mean squared error and coverage across 1000 simulations.

\begin{figure}[!ht]
\centering
\includegraphics{aft-weibull-20190816.pdf}
\caption{State occupation coverage for a Markov illness-death model with Weibull-distributed times, fitted using Weibull regression models. The panels represent confidence intervals based on different transformation methods.}
\label{fig:coverage}
\end{figure}

Across the 1000 simulations, the mean bias for the state occupation probabilities for any of the three states across time varied between -0.0006 and 0.0008, while the mean squared error varied between 0 and 0.0002. Coverage was generally close to 0.95 for each of the transformation methods (Figure~\ref{fig:coverage}), particularly given that the binomial variability at 950/1000 has a 95\% confidence interval of (0.945,0.963). The plain'' untransformed approach performs poorly close to start for the initial state.

\begin{figure}[!ht]
\centering
\includegraphics[width=0.7\linewidth]{aft-weibull-los-20190816.pdf}
\caption{Length of stay coverage for a Markov illness-death model with Weibull-distributed times, fitted using Weibull regression models. The panels represent confidence intervals based on different transformation methods.}
\label{fig:coverage-los}
\end{figure}

Across the 1000 simulations, the mean bias for the length of stay for any of the three states across time varied between -0.006 and 0.008 , while the mean squared error varied between 0 and 0.004. Coverage was generally close to 0.95 for both of the transformation methods (Figure~\ref{fig:coverage-los}).

\section{Example}

We will extend an analysis by \cite{Crowther_Lambert_2017} of the Rotterdam Breast Cancer Data \citep{sauerbrei2007,foekens2000}. This dataset includes patients who were treated for primary breast cancer in Rotterdam during 1978--1993. Treatment included primary surgery, with either mastectomy or breast conserving therapy, with possible referral for radiation treatment within three months of surgery. Study exclusion criteria included: (i) patient tissue based on biopsy only; (ii) metastatic disease at primary surgery or within one month of surgery; (iii) relapse or residual disease within one month of surgery; (iv) relapse prior to referral to radiation therapy; or (v) a previous primary cancer. For the analysis, patients were also excluded if they had adjuvant treatment but were node-negative, or if they had missing information on the number of positive nodes. After these exclusions, data were available for 2982 patients.

Following the analysis by \cite{Crowther_Lambert_2017}, we model for three states, including (1) the initial \emph{post-surgery} state, (2) \emph{relapse} and (3) \emph{death}, with transitions (1)$\rightarrow$(2) for death from the post-surgery state with hazard $h_1(t)$, (1)$\rightarrow$(3) for relapse with hazard $h_2(t)$, and (2)$\rightarrow$(3) for death from the relapse state with hazard $h_3(t)$. All three transitions are modelled using time since surgery as the primary time scale. Crowther and Lambert used flexible parametric survival models on the \emph{log cumulative hazard scale}, where $H_j(t|\bm{x})=\int_0^th_j(u|\bm{x})du=\exp(s_j(t)+\eta_j(t,\bm{x}))$ for baseline $s_j(t)$ and linear predictors $\eta_j(t,\bm{x})$ defined by
\begin{align*}
\eta_1(t,\bm{x}) &= \beta_1\texttt{age}+\beta_2\texttt{nodes}+\beta_3\texttt{hormon} + I(20<\texttt{size}\leq 50)s_{4}(t) + \\
\eta_2(t,\bm{x}) &= \beta_1\texttt{age}+\beta_2\texttt{nodes}+\beta_3\texttt{hormon} + \\
\eta_3(t,\bm{x}) &= \beta_1\texttt{age}+\beta_2\texttt{nodes}+\beta_3\texttt{hormon} + \\
\end{align*}
where \texttt{age} is the age in years at cancer treatment, \texttt{nodes} is the number of positive lymph nodes, \texttt{hormon} is an indicator for whether the patient was on hormonal therapy, \texttt{size} is the tumour size (mm), \texttt{pr\_1} is the log of one plus the progesterone level (fmol/L), $I(p)$ is an indicator function with value 1 when the predicate $p$ is true and value 0 otherwise, $s_0(t)=\sum_{j=0}^3 B_j(\log(t))\beta_{0j}$ for a natural splines basis function $B_j()$ with knots at the quantiles for log of the events times, and $s_k(t)=\beta_{k0}+\beta_{k1}\log(t)$ for $k>1$. The parameters are assumed to be distinct between the linear predictors.
The hazards are calculated from the cumulative hazard using $h_j(t|\bm{x})=H_j'(t|\bm{x})$. Note that ratios of these hazards may not have a simple interpretation, particularly with continuous or multiple time-varying effects (see Appendix). Moreover, age is expected to be closely related to the mortality rates and modelling using a single parameter may not capture this important variation. For ease of comparison, we do not further investigate changes in the model formulations and focus on predictions from these fitted models.

For predictions, Crowther and Lambert used \emph{conditional} predictions for a patient aged 54 years with a transformed progesterone level of 3 (that is, progesterone $=\exp(3)-1=19.1$ fmol/L), with the number of positive lymph nodes varying between 0, 10 and 20, and with each of the three levels for tumour size. We are able to model and predict for conditional relationships in less than 20 seconds (code included in the documentation for the \texttt{rstpm2} package on CRAN). We now focus on \emph{marginal} or \emph{standardised} predictions for patients aged 50--59 years. In particular, we will compare state occupation probabilities and length of stay under counterfactual tumour sizes.

Let the subjects aged 50--59 years be indexed by $k=1\ldots K$ and let $X$ represent tumour size and $\bm{Z}$ represent the other covariates. Then the predictions under the counterfactual $\hat{X}=x$ are modelled by
\begin{align*}
P_{1j}(0,t|\hat{X}=x) &= E_{\bm{Z}}(P_{1j}(0,t|\bm{Z},\hat{X}=x))\\
&= \frac{1}{K} \sum_{k=1}^K P_{1j}(0,t|\bm{Z}=\bm{z}_k,\hat{X}=x) \\
L_{1j}(0,t|\hat{X}=x) &= E_{\bm{Z}}(L_{1j}(0,t|\bm{Z},\hat{X}=x))
\end{align*}
The specific counterfactuals are that all patients have a tumour size that is either (i) $\leq$ 20 mm, (ii) $>$ 20 mm and $\leq$ 50 mm, or (iii) $>$ 50 mm. The age restriction is due to the strong age dependence for each of the transitions. Differences in state occupation probabilities and length of stay by the counterfactuals are show in Figures~\ref{fig:1} and \ref{fig:2}, respectively. There is clear evidence that smaller tumour sizes are associated with fewer early relapses and a lower risk of death for fifteen years after treatment.

\begin{figure}[!ht]
\centering
\includegraphics{multistate-fig-1.pdf}
\caption{Differences in standardised state occupation probabilities by state and by contrasts for counterfactual tumour size, for female breast cancer patients aged 50--59 years and diagnosed 1978--1993, Rotterdam}
\label{fig:1}
\end{figure}

\begin{figure}[!ht]
\centering
\includegraphics{multistate-fig-2.pdf}
\caption{Differences in standardised length of stay by state and by contrasts for counterfactual tumour size, for female breast cancer patients aged 50--59 years and diagnosed 1978--1993, Rotterdam}
\label{fig:2}
\end{figure}

We also explored other model formulations, including penalised log-hazard models ($h_j(t|\bm{x})=\exp(s_j(t)+\eta_j(t,\bm{x}))$) and accelerated failure time models ($S_j(t|\bm{x})=S_{0j}(\int_0^t\exp(\eta_j(t,\bm{x}))du)$, where $S_{0j}(t)=\exp(-\exp(s_j(t)))$).

\section{Discussion}

In summary, we describe how to predict from Markov multi-state models with smooth transition intensities using ordinary differential equations. A variety of predictions can be estimated, together with interval estimation based on the multivariate delta method.
The method is suitable for a range of models, including Poisson regression, parametric and flexible accelerated failure time models, and parametric and penalised generalised survival models. Applications of these methods could range from descriptive epidemiology, to causal inference, through to health economic evaluations of cost-effectiveness.

The recent article by \citet{Ryalen_Stensrud_Roysland_2018} provides a non-parametric analogue to our development. Those non-parametric methods can readily be extended to many of the estimators described herein, including standardisation, quality-adjusted life-years and costs. The asymptotic theory for the non-parametric approach has been shown to hold for Aalen's additive hazards model; it is unclear whether non-parametric estimators for accelerated failure and proportional hazards models will satisfy the assumptions in Theorems 1 and 2 of \citet{Ryalen_Stensrud_Roysland_2018}. Moreover, it is arguable when an additive hazards scale is suitable for covariate adjustment, for example, for modelling all-cause or cause-specific survival with age at cancer diagnosis as a covariate. We suggest that (smooth) accelerated failure time models may provide a useful alternative regression model framework which is also collapsible.

Under the conditions described by \citet{Datta_Satten_2001}, predictions that are functions of the state occupation probability are, under suitable regularity assumptions, expected to be consistently estimated irrespective of whether the Markov assumption holds. However other predictions (e.g. transition probabilities, or the proportion to ever pass through a state) may not be consistently estimated. Moreover, the efficiency of the predictions when the time scale has been mis-specified may be low. Alternative approaches include individual-based simulations, with variance estimation using the bootstrap \cite{Crowther_Lambert_2017}, which are expected to be computationally expensive.

\bibliography{lib}

\section*{Appendix A: Time-varying effects for log cumulative hazards models}

For modelling on the log cumulative hazard scale with a linear predictor $\eta(t,\bm{x}_i)$ given time $t$ and covariates $\bm{x}_i$, the cumulative hazard $H(t|\bm{x}_i)$ can be represented by
\begin{align*}
H(t|\bm{x}_i) &= \exp(\eta(t,\bm{x}_i)) \\
\implies h(t|\bm{x}_i) &= \exp(\eta(t,\bm{x}_i)) \frac{\partial}{\partial t} \eta\left(t,\bm{x}_i\right)
\end{align*}
where $h(t|\bm{x}_i)$ is the hazard. The \emph{hazard ratio} comparing covariates $\bm{x}_1$ and $\bm{x}_0$ is then
\begin{align}
\frac{h(t|\bm{x}_1) }{h(t|\bm{x}_0) } &= \exp(\eta(t,\bm{x}_1)-\eta(t,\bm{x}_0))
\left(1+\frac{\frac{\partial}{\partial t} \left(\eta\left(t,\bm{x}_1\right)-\eta\left(t,\bm{x}_0\right)\right)}{\frac{\partial}{\partial t} \eta\left(t,\bm{x}_0\right)}\right)\label{eq:hrgen}
\end{align}
We have the following cases for the interpretation of the hazard ratios:
\begin{itemize}
\item[Case 1] \emph{Time-independent effects}: if $\frac{\partial}{\partial t} \left(\eta\left(t,\bm{x}_1\right)-\eta\left(t,\bm{x}_0\right)\right)=0$, then Equation~\eqref{eq:hrgen} gives the ratio $\exp(\eta(t,\bm{x}_1)-\eta(t,\bm{x}_0))$.  This case will hold when the difference in the linear predictors is independent of $t$.
Usefully, the log hazard ratio is then the difference in the linear predictors, which has a straightforward interpretation.

A sufficient condition is that the covariates that change between $\bm{x}_1$ and $\bm{x}_0$ are independent of $t$. % Is this also a necessary condition?
As a specific example, for a linear predictor $\eta(t,\bm{x}_i)=s_0(t)+\eta_0(\bm{x}_i)$ for baseline $s_0(t)$ and linear predictor $\eta_0(\bm{x}_i)$, then the log hazard ratio equals $\eta_0(\bm{x}_1)-\eta_0(\bm{x}_0)$, which is a \emph{proportional hazards} model.
\item[Case 2] \emph{Small time-varying effects}: if $\frac{\partial}{\partial t} \left(\eta\left(t,\bm{x}_1\right)-\eta\left(t,\bm{x}_0\right)\right)\ll \frac{\partial}{\partial t} \eta\left(t,\bm{x}_0\right)$, then $\frac{h(t|\bm{x}_1) }{h(t|\bm{x}_0) }\approx\exp(\eta(t,\bm{x}_1)-\eta(t,\bm{x}_0))$. For this case, the partial derivative with respect to time for the change in the linear predictor is small compared with partial derivative for the linear predictor for $\bm{x}_0$. This suggests that the interpretation for moderately small time-varying effects will be straightforward (up to an approximation).
\item[Case 3] \emph{Stratified model}: if we have a stratified linear predictor $\eta(t,\bm{x}_i,j)=s_j(t)+\eta_j(\bm{x}_i)$ for stratum $j$ with baseline $s_j(t)$ and linear predictor $\eta_j(\bm{x}_i)$, then
\begin{align}
\frac{h(t|\bm{x}_1,j) }{h(t|\bm{x}_0,k) }
&= \exp\left(
s_j(t)+\eta_j(\bm{x}_1)+\log(s_j'(t))-
\left(s_k(t)+\eta_k(\bm{x}_0)+\log(s_k'(t))\right)
\right) \label{eq:stratified}
\end{align}
If $j=k$ then $\frac{h(t|\bm{x}_1,j) }{h(t|\bm{x}_0,k)}=\exp(\eta_j(\bm{x}_1)-\eta_j(\bm{x}_0))$, which is proportional hazards. If $j\neq k$, then Equation~\eqref{eq:stratified} can be interpreted as having separable effects (on the log hazard scale) for the time effects and for the other covariates.
\item[Case 4] \emph{Linear time-varying effect}: if we have a linear predictor $\eta(t,x_i,\bm{u}_i)=s_0(t)+x_i s_1(t)+\eta_0(\bm{v}_i)$ for a scalar $x_i$ and a vector of other covariates $\bm{v}_i$, then
\begin{align*}
\frac{h(t|x_1,\bm{v}_1) }{h(t|x_0,\bm{v}_0) }
&= \exp\left((x_1-x_0)s_1(t)+\eta_0(\bm{v}_1)-\eta_0(\bm{v}_0)\right)\left(1+\frac{(x_1-x_0)s_1'(t)}{s_0'(t)+x_0s_1'(t)}\right)
\end{align*}
From the ratio $(x_1-x_0)s_1'(t)/(s_0'(t)+x_0s_1'(t))$, we see that the time-varying hazard ratio depends on both the baseline value $x_0$ and the difference $x_1-x_0$. The interpretation of the effect will be straightforward when $x_i$ is for a binary indicator (e.g. $x_0=0$ and $x_1=1$).
\item[Case 5] \emph{Multiple time-varying effects}: if we have a linear predictor $\eta(t,x_i,\bm{u}_i)=s_0(t)+x_i s_1(t)+u_i s_2(t)+\eta_0(\bm{v}_i)$ for scalars $x_i$ and $u_i$ and a vector of other covariates $\bm{v}_i$, then
\begin{align*}
\frac{h(t|x_1,u_1,\bm{v}_1) }{h(t|x_0,u_0,\bm{v}_0) }
&= \exp\left((x_1-x_0)s_1(t)+(u_1-u_0)s_2(t)+\eta_0(\bm{v}_1)-\eta_0(\bm{v}_0)\right)\times\\
\end{align*}
From the ratio in the last line, we see that the hazard ratio depends on both the baseline values $x_0$ and $u_0$ and the differences $x_1-x_0$ and $u_1-u_0$. The multiple effects will be straightforward to interpret if $x_i$ and $u_i$ are binary indicators for strata (see Case 3).
% \item[Case 6] Otherwise, the hazard ratio for time-varying effects may depend on the combinationof covariate values of .
\end{itemize}

In summary, time-independent effects and stratified models have a straightforward interpretation, however continuous time-varying effects and multiple time-varying effects on a log cumulative hazard scale are more difficult to interpret in terms of hazard ratios.

\section*{Appendix B: Hazard specifications}

\begin{table}[!ht]
\centering
\begin{minipage}{\textwidth}
\begin{tabular}{llll}
Class & Model & R function            & Specification\footnote{Notes: $\eta(t,\bm{x})$ and $\eta(\bm{x})$ are linear predictors for time $t$ and covariates $\bm{x}$, and $S_0(t)$ is a baseline survival model.} \\ \hline
Parametric & Poisson regression & \texttt{stats::glm} & $\log(h(t|\bm{x})) = \eta(t,\bm{x})$ \\
& Accelerated failure time & \texttt{flexsurv::flexsurvreg} & $S(t|\bm{x})=S_0(\exp(\eta(\bm{x}))t)$ \\
&  & \texttt{eha::aftreg} & $S(t|\bm{x})=S_0(\exp(\eta(\bm{x}))t)$ \\
&& \texttt{rstpm2::aft} & $S(t|\bm{x})=S_0(\int_0^t \exp(\eta(u,\bm{x}))du)$\\
& Generalized survival & \texttt{flexsurv::flexsurvspline} & $\log(H(t|\bm{x})) = \eta(t,\bm{x})$ \\
&& \texttt{rstpm2::stpm2} & $\log(H(t|\bm{x})) = \eta(t,\bm{x})$ \\
Penalised & Poisson regression & \texttt{mgcv::gam}  & $\log(h(t|\bm{x})) = \eta(t,\bm{x})$ \\

& Log-hazard & \texttt{survPen::survPen} & $\log(h(t|\bm{x})) = \eta(t,\bm{x})$ \\
& Generalised survival & \texttt{rstpm2::pstpm2} & $\log(H(t|\bm{x})) = \eta(t,\bm{x})$ \\
Transformation & Zero & \texttt{rstpm2::zeroModel} & $h^*(t|\bm{x})=0$ \\
& Hazard ratio & \texttt{rstpm2::hrModel} & $h^*(t|\bm{x},\phi)=h(t|\bm{x})\phi$ \\
& Accelerated failure & \texttt{rstpm2::aftModel} & $h^*(t|\bm{x},\phi)=h(\phi t|\bm{x})\phi$ \\
& Additive models  & \texttt{rstpm2::addModel} & $h^*(t|\bm{x})=h_1(t|\bm{x})+h_2(t|\bm{x})$ \\
& Hazard function & \texttt{rstpm2::hazFun} & $h(t|\bm{x})$ \\
& Spline interpolation & \texttt{rstpm2::splineFun} & $h(t|\bm{x})=\exp(s(t))$
\end{tabular}
\end{minipage}
\caption{Hazard specifications}
\label{tab:models}
\end{table}

For Table~\ref{tab:models}, the generality of the linear predictor may be constrained by the model class; For example, \texttt{flexsurv::flexsurvspline} only allows for spline interactions between time and covariates. The baseline survival models for \texttt{flexsurv::flexsurvreg} and \texttt{eha::aftreg} are from a parametric family, while the baseline survival for \texttt{rstpm2::aft} is where the log cumulative hazards are based on splines. The generalised survival models also allow for other transformations, including proportional odds, probit and additive hazards models. Arbitrary smooth hazards can be defined using \texttt{hazFun}; as an example, a spline interpolation for log-hazards is defined using \texttt{splineFun}. The transformation functions at the end of the table are meant to take one or more models as inputs and predict rates that are functions of the given models. Formally, Poisson regression models are models for rates rather than for hazards.

\section*{Appendix C: Smooth accelerated failure time models}

The smooth accelerated failure time models implemented by $\texttt{rstpm2::aft}$ and the Stata command \texttt{staft} have not been described elsewhere. Survival for these models at time $t$ with covariates $\bm{x}$ is defined by
\begin{align*}
S(t|\bm{x}) &= S_0\left(\int_0^t \exp(\eta(u,\bm{x})) \text{d}u\right)
\end{align*}
where $S_0(t) = \exp\left(-\exp\left(s(\log(t)) \right)\right)$ is baseline survival, $s(u)$ is a smooth function with support on the real line (e.g. natural splines), and $\exp(\eta(u,\bm{x}))$ is a time-varying acceleration factor. We can replace the integration for the acceleration factor by a cumulative function $\exp(\eta_1(\log(t),\bm{x}))$, such that
\begin{align*}
\int_0^t\exp(\eta(u,\bm{x}))\text{d}u &= \exp(\eta_1(\log(t),\bm{x})) \\
\implies \eta(t,\bm{x}) &= \eta_1(\log(t),\bm{x})+\log\left(\frac{\partial}{\partial t} \eta_1(\log(t),\bm{x})\right)
\end{align*}
such that we can calculate the time-varying acceleration factor by differentiation. The cumulative linear predictor $\eta_1(\log(t),\bm{x})$ can be defined using a smooth function of log time (e.g. using natural splines). We can calculate the hazards by
\begin{align*}
h(t|\bm{x}) &= \frac{\text{d}}{\text{d}t} \exp\left(s(\eta_1(\log(t),\bm{x}))\right) \\
&= \exp\left(s(\eta_1(\log(t),\bm{x}))\right) s'(\eta_1(\log(t),\bm{x})) \frac{\partial}{\partial t} \eta_1(\log(t),\bm{x})
\end{align*}
This model has been implemented in R and Stata for left truncated and right censored data.

% axiom: eta := operator 'eta; integrate(exp(eta(u)),u=0..t)

\end{document}