https://github.com/cran/survival
Raw File
Tip revision: 38a8ccc0fb014c2b2bfbb2b5153b419e615b01ba authored by Terry M Therneau on 16 April 2016, 19:26:46 UTC
version 2.39-2
Tip revision: 38a8ccc
splines.Rnw
\documentclass{article}
\usepackage{amsmath}
\usepackage{Sweave}
\addtolength{\textwidth}{1in}
\addtolength{\oddsidemargin}{-.5in}
\setlength{\evensidemargin}{\oddsidemargin}

\SweaveOpts{keep.source=TRUE, fig=FALSE}
%\VignetteIndexEntry{Splines, plots, and interactions}
% Ross Ihaka suggestions
\DefineVerbatimEnvironment{Sinput}{Verbatim} {xleftmargin=2em}
\DefineVerbatimEnvironment{Soutput}{Verbatim}{xleftmargin=2em}
\DefineVerbatimEnvironment{Scode}{Verbatim}{xleftmargin=2em}
\fvset{listparameters={\setlength{\topsep}{0pt}}}
\renewenvironment{Schunk}{\vspace{\topsep}}{\vspace{\topsep}}

\SweaveOpts{prefix.string=splines,width=6,height=4}
\setkeys{Gin}{width=\textwidth}
\newcommand{\code}[1]{\texttt{#1}}

<<echo=FALSE>>=
options(continue="  ", width=60)
options(SweaveHooks=list(fig=function() par(mar=c(4.1, 4.1, .3, 1.1))))
pdf.options(pointsize=8) #text in graph about the same as regular text
options(contrasts=c("contr.treatment", "contr.poly")) #reset default
@ 

\title{Spline terms in a Cox model}
\author{Terry Therneau}

\begin{document}
\maketitle
This is a pair of topics that comes up just often enough in my work 
that I end up re-discovering how to do it correctly about once a year.
A note showing how may be useful to others, it is certainly a useful
reference for me.

\section{Plotting smooth terms}
Here is a simple example using the MGUS data.
I prefer a simpler color palette than the default found in termplot.
<<mplot, fig=TRUE>>=
require(survival)
mfit <- coxph(Surv(futime, death) ~ sex + pspline(age), data=mgus)
termplot(mfit, term=2, se=TRUE, col.term=1, col.se=1)
@ 

Two questions of the plot are ``how was it centered'' and whether we
can easily plot it on the hazard as opposed to the log hazard scale.
The solution to both is to use the plot=FALSE option of termplot,
which returns the data points that would be plotted back to the user.

<<mplot2>>=
ptemp <- termplot(mfit, se=TRUE, plot=FALSE)
attributes(ptemp)
ptemp$age[1:4,]
@ 

The termplot function depends on a call to predict with type='terms',
which returns a centered set of predictions.
Like a simple linear model fit, the intercept is a separate term, which is
found in the ``constant'' attribute above,
and each column of the result is centered at zero.  
Since any given $x$ value may appear multiple times in the data and thus in
the result of predict,
and the termplot function removes duplicates, the plot may not
be exactly centered at zero however.

Now suppose we want to redraw this on log scale with age 50 as the reference,
i.e., the  risk is 1 for a 50 year old.  
Since the Cox model is a relative hazards model we can choose whatever center
we like.
(If there were no one of exactly age 50 in the data set the first line 
below would need to do an interpolation, e.g. by using the approx function.)

<<mplot3, fig=TRUE>>=
ageterm <- ptemp$age  # this will be a data frame
center <- with(ageterm, y[x==50])
ytemp <- ageterm$y + outer(ageterm$se, c(0, -1.96, 1.96), '*')
matplot(ageterm$x, exp(ytemp - center), log='y',
        type='l', lty=c(1,2,2), col=1, 
        xlab="Age at diagnosis", ylab="Relative death rate")
@ 

Voila! We now have a plot that is more interpretable.  
The approach is appropriate for any term, not just psplines.
The above plot uses log scale for the y axis which is appropriate
for the question of whether a non-linear age effect was even necessary
for this data set (it is not),
one could remove the log argument to emphasize the Gomperzian effect of
age on mortality.

\section{Splines in an interaction}
As an example we will use the effect of age on survival in the 
\texttt{flchain} data set, a population based sample of subjects from
Olmsted County, Minnesota.
If we look at a simple model using age and sex we see that both 
are very significant.
<<fit1, fig=TRUE>>=
options(show.signif.stars=FALSE) # display intelligence
fit1 <- coxph(Surv(futime, death) ~ sex + pspline(age, 3), data=flchain)
fit1
termplot(fit1, term=2, se=TRUE, col.term=1, col.se=1,
         ylab="log hazard")
@ 

We used a \code{pspline} term rather than \code{ns}, say, 
because the printout for a pspline nicely
segregates the linear and non-linear age effects.
The non-linearity is not very large, as compared to the linear portion,
but still may be important.

We would like to go forward and fit separate age curves for the males and
the females, since the above fit makes an unwarranted 
assumption that the male/female
ratio of death rates will be the same at all ages. 
The primary problem is that a formula of 
\texttt{sex * pspline(age)} does not work;
the coxph routine is not clever enough to do the right thing automatically.
(Perhaps some future version will be sufficiently intelligent, but don't hold
your breath).
If we were using regression splines instead, e.g. \texttt{ns(age, df=4)},
a simple call coxph routine using the interaction term would succeed,
but then termplot would fall short.  The solution below suffices for all
cases.

First, we need to create our own dummy variables to handle the interaction.

<<fit2>>=
agem <- with(flchain, ifelse(sex=="M", age, 60))
agef <- with(flchain, ifelse(sex=="F", age, 60))
fit2 <- coxph(Surv(futime, death) ~ sex + pspline(agef, df=3)
              + pspline(agem, df=3), data=flchain)
anova(fit2, fit1)
@ 
 
The gain in this particular problem is not great, but we will forge ahead.
You might well ask why we used 60 as a dummy value of \texttt{agem} for
the females instead of 0?
There is nothing special about the choice, and any value within
the range of ages would do as well, though I try to pick one where the
standard errors of the curves are not outrageous.  
If a value of 0 is used it forces the pspline function to create a basis
set that includes all the empty space between 0 and 50, and do predictions
at 0; these last can become numerically unstable leading to errors or
incorrect values.

The Cox model deals with relative hazards, when doing a plot we will usually
want to specify who our reference is.  
By default the termplot function uses an average reference, that is, any
plot will be centered to have an average log hazard of 0.
In this case, we decided to use 65 year old females as our reference, with
all of the hazards relative to them. 

<<plot2, fig=TRUE>>=
# predictions
pterm <- termplot(fit2, term=2:3, se=TRUE, plot=FALSE)
# reference
refdata <- data.frame(sex=c('F', 'M'), agef=c(65, 60), agem=c(60,65))
pred.ref <- predict(fit2, newdata=refdata, type="lp")
# females
tempf <- pterm$agef$y + outer(pterm$agef$se, c(0, -1.96, 1.96))
frow <- which(pterm$agef$x == 65)
tempf <- tempf  - tempf[frow,1]  # shift curves
# males
tempm <- pterm$agem$y + outer(pterm$agem$se, c(0, -1.96, 1.96))
mrow  <- which(pterm$agem$x == 65)
tempm <- tempm + diff(pred.ref) - tempm[mrow,1]
# plot
matplot(pterm$agef$x, exp(tempf), log='y', col=1, 
        lty=c(1,2,2), type='l', lwd=c(2,1,1),
        xlab="Age", ylab="Relative risk of death")
matlines(pterm$agem$x, exp(tempm), log='y', 
         col=2, lwd=c(2,1,1), lty=c(1,2,2))
legend(80, 1, c("Female", "Male"), lty=1, lwd=2, col=1:2, bty='n')
@ 

\begin{enumerate}
  \item The termplot routine is used to get the data points for the plot,
    without executing a plot, by use of the \texttt{plot=FALSE} argument.
    The result is a list with one element per term; each element of the list
    contains x, y, and se components.
  \item We had decided to center the female curve at age 65, risk =1.  The
    relative offset for the male curve can be derived directly from
    \texttt{fit2} by adding up the right coefficients, and I used to do it that
    way but would get it wrong one time out of two.  
    So instead use the \texttt{predict} routine to get predicted log hazards
    for males and females at a particular age.  This tells me how far apart
    the curves should be at that point.  We force the females to go through
    0, which is exp(0) =1 on the hazard scale.
  \item Get the predicted curve and confidence bands for the females
    as a matrix \texttt{tempf}, and then shift them by subtracting the value
    for a 65 year old female.  Do the same for males, plus adding in
    the curve separation at age 65 from \texttt{pred.ref}.
  \item The male and female portions don't have quite the same set of age
    values, there are no 95 year old males in the data set for example,
    so the plot needs to be done in two steps.
\end{enumerate}

The final curves for males and female are not quite parallel.
One thing the plot does not display is that the spacing between the
male and female points also has a standard error.
This moves the entire bundle of three red curves up and down.  It is not
clear how best to add this information into the plot.
For questions of parallelism and shape, as here, it seemed best to
ignore it, which is what the termplot function also does.
If someone were reading individual male/female differences off the plot
a different choice would be appropriate.
\end{document}

back to top