1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221 | ---
title: "cutpointr: Bootstrapping"
author: "Christian Thiele, Lorenz A. Kapsner"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Bootstrapping with cutpointr}
%\VignetteEncoding{UTF-8}
%\VignetteEngine{knitr::rmarkdown}
editor_options:
chunk_output_type: console
markdown:
wrap: 80
---
```{r, include = FALSE}
knitr::opts_chunk$set(fig.width = 6, fig.height = 5, fig.align = "center")
options(rmarkdown.html_vignette.check_title = FALSE)
```
Bootstrapping is implemented in **cutpointr** with two goals:
1. Determine optimal cutpoints with bootstrapping (as an alternative to
determining them without bootstrapping)
2. Validate (any) cutpoint optimization with bootstrapping
This vignette will briefly go through some examples for both approaches.
# Determine optimal cutpoints
## Without bootstrapping: `maximize_metric`
As a first basic example, the cutpoint optimization will be demonstrated without
any bootstrapping by maximizing the Youden-Index. Using the method
`maximize_metric`, this is performed on the full data set:
```{r}
library(cutpointr)
data(suicide)
opt_cut <- cutpointr(
data = suicide,
x = dsi,
class = suicide,
method = maximize_metric,
metric = youden,
pos_class = "yes",
direction = ">="
)
summary(opt_cut)
```
The fields in the resulting R object `opt_cut` are to be interpreted as follows:
- `$optimal_cutpoint`: The optimal cutpoint determined by
maximizing the Youden-Index on the full `suicide` dataset.
- `$sensitivity`: The sensitivity when applying the cutpoint to the full
dataset.
- `$specificity`: The specificity when applying the cutpoint to the full
dataset.
- `$youden`: The maximal Youden-Index (= sensitivity + specificity - 1),
determined by the optimization.
## Bootstrap cutpoints: `maximize_boot_metric`
The determination of the optimal cutpoint can also be performed using
bootstrapping. Therefore, the methods
`maximize_boot_metric`/`minimize_boot_metric` need to be chosen. These functions
provide further arguments that can be used to configure the bootstrapping. These
arguments can be viewed with `help("maximize_boot_metric", "cutpointr")`. The
most important arguments are:
- `boot_cut`: The number of bootstrapping repetitions.
- `boot_stratify`: If the bootstrap samples are drawn in both classes
separately before combining them, keep the number of positives/negatives
constant in every sample.
- `summary_func`: The summary function to aggregate the optimal cutpoints from
the bootstrapping to arrive at one final optimal cutpoint.
The cutpoint is optimized in n=`boot_cut` bootstrap samples by maximizing/
minimizing the respective metric (e.g., the Youden-index in this example) in each
of these bootstrap samples. Finally, the summary function is applied to
aggregate the optimal cutpoints from the n=`boot_cut` bootstrap samples into one
final 'optimal' cutpoint.
```{r}
set.seed(123)
opt_cut <- cutpointr(
data = suicide,
x = dsi,
class = suicide,
method = maximize_boot_metric,
boot_cut = 200,
summary_func = mean,
metric = youden,
pos_class = "yes",
direction = ">="
)
summary(opt_cut)
```
The fields in the resulting R object `opt_cut` are to be interpreted as follows:
- `$optimal_cutpoint`: The optimal cutpoint, which is the aggregated value (as
defined with `summary_func`) over all n=`boot_cut` bootstrap samples. Please
note that no uncertainty measure (standard deviation, 95%-CI, etc.) is
available here (a bootstrap distribution of these cutpoints can be generated
using outer bootstrapping with `boot_runs > 0` and `maximize_metric`,
as explained below).
- `$sensitivity`: The sensitivity when applying the optimal cutpoint to the
full dataset.
- `$specificity`: The specificity when applying the optimal cutpoint to the
full dataset.
- `$youden`: The Youden-Index when applying the optimal cutpoint to the full
dataset.
# Validate cutpoint optimization with bootstrapping
Any chosen methods to find the optimal cutpoints can be subsequently validated
with bootstrapping. This can easily be activated by setting the
argument `boot_runs` \> 0. Please be aware that the first steps to calculate the
optimal cutpoints with the specified method (as described above) will be
performed in the very same manner as above, resulting in the same outputs as
above (depending on the seed when bootstrapping cutpoints).
However, the method to calculate the optimal cutpoints will then additionally be
performed on n=`boot_runs` bootstrap samples. For each of these bootstrap
samples, several metrics and performance measures are available from the
resulting `$boot` object, both for the *in-bag* (suffix: '\_b') and the
*out-of-bag* (suffix: '\_oob') bootstrap samples. Please note that the optimal
cutpoint is determined on the in-bag samples only and then just applied to the
out-of-bag samples for validation purposes, so its value is available only once
in the `$boot` object without a suffix.
## `maximize_metric`
```{r}
opt_cut <- cutpointr(
data = suicide,
x = dsi,
class = suicide,
method = maximize_metric,
metric = youden,
pos_class = "yes",
direction = ">=",
boot_runs = 100
)
```
The interpretation of fields in the resulting R object `opt_cut` is the same as
above. The results from the bootstrapping are available from `$boot`.
```{r}
summary(opt_cut)
opt_cut$boot[[1]] |>
head()
```
## `maximize_boot_metric`
When bootstrapping cutpoints and also using the validation with bootstrapping,
the optimal cutpoint will again first be determined as above in n=`boot_cut`
bootstrap samples by maximizing/ minimizing the respective metric in each of
these bootstrap samples and then by applying the summary function to aggregate
the optimal cutpoints from the n=`boot_cut` bootstrap samples into one final
'optimal' cutpoint. Hence, using the same seeds here results in the same outputs
as above, where no outer bootstrapping is applied.
In the validation routine, the chosen cutpoint optimization is then repeated in
each of the n=`boot_runs` (outer) bootstrap samples: the optimal cutpoint is
determined in each bootstrap sample by optimizing the `metric` on n=`boot_cut`
(inner) bootstrap samples and applying the `summary_func` to aggregate them into
one value.
Since the (inner) bootstrapping of optimal cutpoints is performed in each of the
(outer) validation bootstrap samples, this can be computational very expensive
and take some time to finish. Therefore, parallelization is implemented in
`cutpointr` by just setting its argument `allowParallel = TRUE` and initializing
a parallel environment.
```{r}
library(doParallel)
library(doRNG)
cl <- makeCluster(2) # 2 cores
registerDoParallel(cl)
registerDoRNG(12)
set.seed(123)
opt_cut <- cutpointr(
data = suicide,
x = dsi,
class = suicide,
method = maximize_boot_metric,
boot_cut = 200,
summary_func = mean,
metric = youden,
pos_class = "yes",
direction = ">=",
boot_runs = 100,
allowParallel = TRUE
)
stopCluster(cl)
```
Again, the interpretation of fields in the resulting R object `opt_cut` is the
same as above. The results from the bootstrapping are available from `$boot`.
```{r}
summary(opt_cut)
opt_cut$boot[[1]] |>
head()
```
Some visualizations of the bootstrapping results are available with the `plot`
function:
```{r}
plot(opt_cut)
```
The two plots in the lower half can be generated separately with
`plot_cut_boot(opt_cut)` and `plot_metric_boot(opt_cut)`.
|