Revision

**fe07bfa906d7e155439160caee538a3449cd3877**authored by Dominique Makowski on**08 April 2019, 08:42:41 UTC**, committed by cran-robot on**08 April 2019, 08:42:41 UTC****0 parent**

indicesExistenceComparison.Rmd

```
---
title: "In-Depth 2: Comparison of Indices of Effect Existence"
output:
github_document:
toc: true
fig_width: 10.08
fig_height: 6
rmarkdown::html_vignette:
toc: true
fig_width: 10.08
fig_height: 6
tags: [r, bayesian, posterior, test]
vignette: >
\usepackage[utf8]{inputenc}
%\VignetteIndexEntry{In-Depth 2: Comparison of Indices of Effect Existence}
%\VignetteEngine{knitr::rmarkdown}
editor_options:
chunk_output_type: console
bibliography: bibliography.bib
---
This vignette can be referred to by citing the package:
- Makowski, D. \& Lüdecke, D. (2019). *Understand and Describe Bayesian Models and Posterior Distributions using BayestestR*. Available from https://github.com/easystats/bayestestR. DOI: [10.5281/zenodo.2556486](https://zenodo.org/record/2556486).
# Indices of Effect *Existence* in the Bayesian Framework
There is now a general agreement that the Bayesian statistical framework is the right way to go for psychological science. Nevertheless, its flexible nature is its power and weakness, for there is no agreement about what indices should be computed or reported. Moreover, the lack of a consensual index of effect existence, such as the frequentist *p*-value, possibly contributes to the unnecessary murkiness that many non-familiar readers perceive in Bayesian statistics. Thus, this study describes and compares several indices of effect existence, provide intuitive visual representation of the "behaviour" of such indices in relationship with traditional metrics such as sample size and frequentist significance. The results contribute to develop the intuitive understanding of the values that researchers report and allow to draw recommendations for Bayesian statistics description, critical for the standardization of scientific reporting.
## Introduction
The Bayesian framework is quickly gaining popularity among psychologists and neuroscientists [@andrews2013prior]. Reasons to prefer this approach are reliability, better accuracy in noisy data, better estimation for small samples, less proneness to type I error, the possibility of introducing prior knowledge into the analysis and, critically, results intuitiveness and their straightforward interpretation [@dienes2018four;@etz2016bayesian;@kruschke2010believe;@kruschke2012time;@wagenmakers2018bayesian;@wagenmakers2016bayesian]. The frequentist approach has been associated with the focus on null hypothesis testing, and the misuse of *p*-values has been shown to critically contribute to the reproducibility crisis of psychological science [@chambers2014instead; @szucs2016empirical]. There is a general agreement that the generalization of the Bayesian approach is a way of overcoming this issue [@benjamin2018redefine; @etz2016bayesian; @maxwell2015psychology; @wagenmakers2017need].
While the probabilistic reasoning promoted by the Bayesian framework is pervading most of data science aspects, it is already well established for statistical modelling. This facet, on which psychology massively rely, could roughly be grouped into two soft-edged categories; predictive and structural modelling. Although a statistical model can (often) serve both purposes, predictive modelling is devoted to build and find the best model that accurately predicts a given outcome. It is centred around the concepts such as fitting metrics, predictive accuracy and model comparison. At the extrema of this dimension lie machine and deep learning models, used for their strong predictive power, often at the expense of Human readability [these models has been often refer to as "black-boxes", emphasising the difficulty to appraise their internal functioning; @burrell2016machine; @castelvecchi2016can; @snoek2012practical]. On the other side, psychologists are often using more simple models (for instance related to the general linear framework) to explore their data. Within this framework, the goal switches from building the best model to understanding the parameters inside the model. In reality, the methodological pipeline often starts with predictive modelling involving model comparison ("what is the best model of the world (*i.e.*, the observed variable)") and then seemingly transit to structural modelling: "given this model of the world, how the effects (*i.e.*, model parameters) are influencing the outcome". For this last part, they often rely on an index of effect "existence".
Indeed, while one of the strengths of the Bayesian framework is its probabilistic parameter estimation, allowing to quantify the inherent uncertainty associated with each estimation, psychologists are also interested in parameter *existence*: A decision criterion that allows them to conclude if an effect is "different from 0" (statistically corresponding to either "not negligible" or "not of the opposite direction"). In other words, to know, before taking interest in the importance, relevance or strength of the effect, whether it is related to the outcome in a given direction. This need has led to the wide adoption of the frequentist *p*-value, used as an index of effect existence, and its acceptance was accompanied with the creation of arbitrary clusters for its classification (.05, .01 and .001). Unfortunately, these heuristics have severely rigidify, becoming a goal and threshold to reach rather than a tool for understanding the data [@cohen2016earth; @kirk1996practical].
Thus, the ability of the Bayesian framework to answer psychological questions without the need of such null-hypothesis testing indices is often promoted as the promise of a "a new world without *p*-value" as opposed to the old and flawed frequentist one. Nonetheless, it seems that "effect existence" indices and criteria are useful for Humans to gain an intuitive understanding of the interactions and structure of their data. It is thus unsurprising that the development of Bayesian user-friendly implementations was accompanied with the promotion of the Bayes Factor (BF), an index reflecting the predictive performance of a model against another [*e.g.*, the null vs. the alternative hypothesis; @jeffreys1998theory; @ly2016harold]. It provides many advantages over the *p* value, having a straightforward interpretation ("the data were 3 times (*BF* = 3) more likely to occur under the alternative than the null hypothesis") and allowing to make statements about the alternative, rather than just the null hypothesis [@dienes2014using; @jarosz2014odds]. Moreover, recent mathematical developments allow its computation for complex models [@gronau2017bayesian; @gronau2017simple]. Although the BF lives up to the expectations of a solid, valid, intuitive and better index compared to the p value, its use for model selection is still a matter of debate [@piironen2017comparison]. Indeed, as the predictions used for its computation are generated from the prior distributions on the model parameters, it is highly dependent on priors specification [@etz2018bayesian; @kruschke2018bayesian]. Importantly for the aim of this paper, its use for estimating effect existence of parameters *within* a larger model remains limited, its computation being technically difficult and its interpretation not as straightforward as for "simple" tests such as t-tests or correlations.
Nevertheless, reflecting the need for such information, researchers have developed other indices based on characteristics of the posterior distribution, which represents the probability distribution of different parameter values given the observed data. This uncertainty can be summarized, for example, by presenting point-estimates of centrality (mean, median, ...) and of dispersion (standard deviation, median absolute deviance, ...), often accompanied with a percentage (90\% or 95\%) of the Highest Density Interval (HDI; referred to as the *Credible Interval* - CI). Although the Bayesian framework gives the possibility of computing many effect existence indices, no consensus has yet emerged on the ones to use, as no comparison has ever been done. This might be a rebuttal for scientists interested in adopting the Bayesian framework. Moreover, this grey area can increase the difficulty of readers or reviewers unfamiliar with the Bayesian framework to follow the assumptions and conclusions, which could in turn generate unnecessary doubt upon the entire study. While we think that such indices and their interpretation guidelines (in the form of rules of thumb) are useful in practice, we also strongly believe that such indices should be accompanied with the knowledge of the "behaviour" in relationship with sample size and effect size. This knowledge is important for people to implicitly and intuitively appraise the meaning and implication of the mathematical values they report. This could, in turn, prevent the crystallization of the possible heuristics and categories derived from such indices.
Thus, based on the simulation of multiple linear regressions (one of the most widely used models), the present work aims at comparing several indices of effect existence solely derived from the posterior distribution, provide visual representations of the "behaviour" of such indices in relationship with sample size, noise, priors and also the frequentist *p*-value (an index which, beyond its many flaws, is well known and could be used as a reference for Bayesian neophytes), and draw recommendations for Bayesian statistics reporting.
## Methods
### Procedure
We simulated 36000 Bayesian and frequentist linear regression models with systematic variation of "true" effect (1 or 0), sample size (20, 40, 60), levels of noise (0.1, 2.5, 5), and priors (correct - same as true effect or incorrect).
```{r message=FALSE, warning=FALSE}
library(ggplot2)
library(dplyr)
library(tidyr)
df <- read.csv("https://raw.github.com/easystats/circus/master/data/bayes_indices.csv")
```
For the sake of time and computational space, we downloaded the data from github. However, you can find the code to generate it [here](https://easystats.github.io/circus/articles/bayesian_indices.html).
### Indices
For all of the simulated models, we computed the following indices:
- ***p*-value**: Based on the frequentist regression, this index represents the probability for a given statistical model that, when the null hypothesis is true, the statistical summary (such as the sample mean difference between two compared groups) would be greater than or equal to the actual observed results [@wasserstein2016asa].
- ***p*-direction**: the "Probability of Direction" corresponds to the probability (expressed in percentage) that the effect is stricly positive or negative (consistently with the median's sign).
- ***p*-MAP**: the MAP-based *p*-value is related to the odds that a parameter has against the null hypothesis [@mills2014bayesian; @mills2017objective].
- ***p*-ROPE**: the ROPE-based *p*-value that represents the maximum percentage of HDI that does not contain (positive values) or is entirely contained (negative values) in the negligible values space defined by the ROPE.
- **ROPE (90\%)**: this index refers to the percentage of the 90\% HDI that lies within the Region of Practical Equivalence (ROPE) defined as ranging from -0.1 to 0.1.
- **ROPE (full)**: this index refers to the percentage of the whole posterior distribution that lies within the Region of Practical Equivalence (ROPE) defined as ranging from -0.1 to 0.1.
## Results
### Effect Detection
#### Sensitivity to Noise
```{r, message=FALSE, warning=FALSE}
df %>%
select(noise, effect, sample_size, p_frequentist, p_direction, p_map, p_rope, rope, rope_full) %>%
gather(index, value, -noise, -sample_size, -effect) %>%
mutate(noise = as.factor(noise),
effect = as.factor(effect),
sample_size = as.factor(sample_size),
index = factor(index, levels=c("p_frequentist", "p_direction", "p_map", "p_rope", "rope", "rope_full"))) %>%
ggplot(aes(x = noise, y=value, color = effect, fill=index)) +
geom_jitter(shape=16, alpha=0.02) +
geom_boxplot(outlier.shape = NA) +
facet_wrap(~index, scales = "free") +
theme_classic() +
theme(strip.background = element_blank()) +
scale_color_manual(values = c(`0` = "#f44336", `1` = "#8BC34A", name="Effect")) +
scale_fill_manual(values = c("p_frequentist"="#607D8B", "p_map" = "#E91E63", "p_direction" = "#2196F3",
"rope" = "#FF9800", "rope_full" = "#FF5722", "p_rope"="#FFC107"), guide=FALSE) +
ylab("Index Value\n") +
xlab("\nNoise")
```
#### Sensitivity to Sample Size
```{r, message=FALSE, warning=FALSE}
df %>%
select(noise, effect, sample_size, p_frequentist, p_direction, p_map, p_rope, rope, rope_full) %>%
gather(index, value, -noise, -sample_size, -effect) %>%
mutate(sample_size = as.factor(sample_size),
effect = as.factor(effect),
index = factor(index, levels=c("p_frequentist", "p_direction", "p_map", "p_rope", "rope", "rope_full"))) %>%
ggplot(aes(x = sample_size, y=value, color = effect, fill=index)) +
geom_jitter(shape=16, alpha=0.02) +
geom_boxplot(outlier.shape = NA) +
facet_wrap(~index, scales = "free") +
theme_classic() +
theme(strip.background = element_blank()) +
scale_color_manual(values = c(`0` = "#f44336", `1` = "#8BC34A", name="Effect")) +
scale_fill_manual(values = c("p_frequentist"="#607D8B", "p_map" = "#E91E63", "p_direction" = "#2196F3",
"rope" = "#FF9800", "rope_full" = "#FF5722", "p_rope"="#FFC107"), guide=FALSE) +
ylab("Index Value\n") +
xlab("\nSample Size")
```
#### Sensitivity to Priors
```{r, message=FALSE, warning=FALSE}
df %>%
select(noise, effect, sample_size, p_frequentist, p_direction, p_map, p_rope, rope, rope_full, prior_correct) %>%
gather(index, value, -noise, -sample_size, -effect, -prior_correct) %>%
mutate(sample_size = as.factor(sample_size),
effect = as.factor(effect),
prior_correct = as.factor(prior_correct),
index = factor(index, levels=c("p_frequentist", "p_direction", "p_map", "p_rope", "rope", "rope_full"))) %>%
ggplot(aes(x = effect, y=value, color = prior_correct, fill=index)) +
geom_jitter(shape=16, alpha=0.02) +
geom_boxplot(outlier.shape = NA) +
facet_wrap(~index, scales = "free") +
theme_classic() +
theme(strip.background = element_blank()) +
scale_color_manual(values = c(`0` = "#f44336", `1` = "#8BC34A", name="Correct Prior")) +
scale_fill_manual(values = c("p_frequentist"="#607D8B", "p_map" = "#E91E63", "p_direction" = "#2196F3",
"rope" = "#FF9800", "rope_full" = "#FF5722", "p_rope"="#FFC107"), guide=FALSE) +
ylab("Index Value\n") +
xlab("\nEffect")
```
#### Combined sensitivity
```{r, message=FALSE, warning=FALSE}
df %>%
mutate(p_frequentist = scale(p_frequentist),
p_direction = scale(p_direction),
p_map = scale(p_map),
p_rope = scale(p_rope),
rope = scale(rope),
rope_full = scale(rope_full)) %>%
select(noise, effect, sample_size, p_frequentist, p_direction, p_map, p_rope, rope, rope_full, prior_correct) %>%
gather(index, value, -noise, -sample_size, -effect, -prior_correct) %>%
mutate(sample_size = scale(sample_size),
noise=scale(noise),
effect = as.factor(effect),
prior_correct = as.factor(prior_correct),
index = factor(index, levels=c("p_frequentist", "p_direction", "p_map", "p_rope", "rope", "rope_full"))) %>%
glm(effect ~ index/value * noise * sample_size * prior_correct, data=., family="binomial") %>%
broom::tidy() %>%
select(term, estimate, p=p.value) %>%
filter(stringr::str_detect(term, 'value')) %>%
mutate(term = stringr::str_remove(term, ":value"),
term = stringr::str_remove(term, "index"),
p = ifelse(p < .001, "< .001***", ifelse(p < .01, "< .01**", ifelse(p < .05, "< .05*", "> .05")))) %>%
knitr::kable(digits=2)
```
### Relationship with the frequentist *p* value
```{r, message=FALSE, warning=FALSE}
df %>%
select(noise, sample_size, p_frequentist, p_direction, p_map, p_rope, rope, rope_full, effect) %>%
gather(index, value, -noise, -p_frequentist, -sample_size, -effect) %>%
mutate(effect = as.factor(effect),
sample_size = as.factor(sample_size),
index = factor(index, levels=c("p_frequentist", "p_direction", "p_map", "p_rope", "rope", "rope_full"))) %>%
ggplot(aes(x = p_frequentist, y = value, color = effect, shape=sample_size)) +
geom_point(alpha=0.025) +
facet_wrap(~index, scales = "free") +
theme_classic() +
theme(strip.background = element_blank()) +
scale_color_manual(values = c(`0` = "#f44336", `1` = "#8BC34A"), name="Effect") +
guides(colour = guide_legend(override.aes = list(alpha = 1)),
shape = guide_legend(override.aes = list(alpha = 1), title="Sample Size"))
```
### Relationship with frequentist *p*-based arbitrary clusters
```{r, message=FALSE, warning=FALSE, fig.height=15, fig.width=10}
df$sig_1 <- factor(ifelse(df$p_frequentist >= .1, "n.s.", "-"), levels=c("n.s.", "-"))
df$sig_05 <- factor(ifelse(df$p_frequentist >= .05, "n.s.", "*"), levels=c("n.s.", "*"))
df$sig_01 <- factor(ifelse(df$p_frequentist >= .01, "n.s.", "**"), levels=c("n.s.", "**"))
df$sig_001 <- factor(ifelse(df$p_frequentist >= .001, "n.s.", "***"), levels=c("n.s.", "***"))
get_data <- function(predictor, outcome, lbound=0, ubound=0.3){
fit <- glm(paste(outcome, "~", predictor), data=df, family = "binomial")
data <- data.frame(x=1:100)
data[predictor] <- seq(lbound, ubound, length.out = 100)
data$index <- predictor
predict_fit <- predict(fit, newdata=data, type="response", se.fit = TRUE)
data[outcome] <- predict_fit$fit
data$CI_lower <- predict_fit$fit - (qnorm(0.99) * predict_fit$se.fit)
data$CI_upper <- predict_fit$fit + (qnorm(0.99) * predict_fit$se.fit)
data <- select_(data, "value"=predictor, outcome, "index", "CI_lower", "CI_upper")
return(data)
}
rbind(
rbind(
get_data(predictor="p_map", outcome="sig_001", lbound=0, ubound=0.02),
get_data(predictor="p_direction", outcome="sig_001", lbound=99.5, ubound=100),
get_data(predictor="rope", outcome="sig_001", lbound=0, ubound=0.5),
get_data(predictor="rope_full", outcome="sig_001", lbound=0, ubound=0.5),
get_data(predictor="p_rope", outcome="sig_001", lbound=95, ubound=100)
) %>%
rename("sig"=sig_001) %>%
mutate(threshold="p < .001"),
rbind(
get_data(predictor="p_map", outcome="sig_01", lbound=0, ubound=0.1),
get_data(predictor="p_direction", outcome="sig_01", lbound=98, ubound=100),
get_data(predictor="rope", outcome="sig_01", lbound=0, ubound=2),
get_data(predictor="rope_full", outcome="sig_01", lbound=0, ubound=2),
get_data(predictor="p_rope", outcome="sig_01", lbound=90, ubound=100)
) %>%
rename("sig"=sig_01) %>%
mutate(threshold="p < .01"),
rbind(
get_data(predictor="p_map", outcome="sig_05", lbound=0, ubound=0.3),
get_data(predictor="p_direction", outcome="sig_05", lbound=95, ubound=100),
get_data(predictor="rope", outcome="sig_05", lbound=0, ubound=10),
get_data(predictor="rope_full", outcome="sig_05", lbound=0, ubound=10),
get_data(predictor="p_rope", outcome="sig_05", lbound=70, ubound=100)
) %>%
rename("sig"=sig_05) %>%
mutate(threshold="p < .05"),
rbind(
get_data(predictor="p_map", outcome="sig_1", lbound=0, ubound=0.5),
get_data(predictor="p_direction", outcome="sig_1", lbound=90, ubound=100),
get_data(predictor="rope", outcome="sig_1", lbound=0, ubound=20),
get_data(predictor="rope_full", outcome="sig_1", lbound=0, ubound=20),
get_data(predictor="p_rope", outcome="sig_1", lbound=0, ubound=100)
) %>%
rename("sig"=sig_1) %>%
mutate(threshold="p < .1")
) %>%
mutate(index = as.factor(index)) %>%
ggplot(aes(x=value, y=sig)) +
geom_ribbon(aes(ymin=CI_lower, ymax=CI_upper), alpha=0.1) +
geom_line(aes(color=index), size=1) +
facet_wrap(~ index * threshold, scales = "free", nrow=5) +
theme_classic() +
theme(strip.background = element_blank()) +
scale_color_manual(values = c("p_frequentist"="#607D8B", "p_map" = "#E91E63", "p_direction" = "#2196F3",
"rope" = "#FF9800", "rope_full" = "#FF5722", "p_rope"="#FFC107"), guide=FALSE) +
ylab("Probability of being significant\n") +
xlab("\nIndex Value")
```
### Relationship with Equivalence test
```{r, message=FALSE, warning=FALSE, fig.height=15, fig.width=10}
df$equivalence <- factor(ifelse(df$rope == 0, "significant", "n.s."), levels=c("n.s.", "significant"))
rbind(
get_data(predictor="p_map", outcome="equivalence", lbound=0, ubound=0.4),
get_data(predictor="p_direction", outcome="equivalence", lbound=92.5, ubound=100),
get_data(predictor="rope_full", outcome="equivalence", lbound=0, ubound=7),
get_data(predictor="p_frequentist", outcome="equivalence", lbound=0, ubound=0.15)
) %>%
ggplot(aes(x=value, y=equivalence)) +
geom_ribbon(aes(ymin=CI_lower, ymax=CI_upper), alpha=0.1) +
geom_line(aes(color=index), size=1) +
facet_wrap(~ index, scales = "free", nrow=5) +
theme_classic() +
theme(strip.background = element_blank()) +
scale_color_manual(values = c("p_frequentist"="#607D8B", "p_map" = "#E91E63", "p_direction" = "#2196F3",
"rope" = "#FF9800", "rope_full" = "#FF5722", "p_rope"="#FFC107"), guide=FALSE) +
ylab("Probability of rejecting H0 with the equivalence test\n") +
xlab("\nIndex Value")
```
### Relationship between ROPE (full) and p (direction)
```{r, message=FALSE, warning=FALSE, fig.height=6, fig.width=10.08}
df %>%
mutate(effect = as.factor(effect)) %>%
ggplot(aes(x=p_direction, y=rope_full, color=effect)) +
geom_point(alpha=0.025) +
theme_classic() +
theme(strip.background = element_blank()) +
scale_color_manual(values = c(`0` = "#f44336", `1` = "#8BC34A"), name="Effect") +
ylab("ROPE (full)\n") +
xlab("\nIndex Value")
```
## Discussion
Based on the simulation of multiple linear regressions, the present work aimed at comparing several indices of effect existence solely derived from the posterior distribution, provide visual representations of the "behaviour" of such indices in relationship with sample size, noise, priors and the frequentist *p*-value.
While this comparison with a frequentist index may seem counterintuitive or wrong (as the Bayesian thinking is intrinsically different from the frequentist framework), we believe that this comparison is interesting for didactic reasons. The frequentist *p*-value “speaks” to many and can thus be seen as a reference and a way to facilitate the shift toward the Bayesian framework. This does not preclude, however, that a change in the general paradigm of effect existence seeking in necessary, and that Bayesian indices are fundamentally different from the frequentist *p*, rather than mere approximations or equivalents. Critically, we strongly agree on the distinction and possible dissociation between an effect’s existence and meaningfulness [@lakens2018equivalence]. Nevertheless, we believe that assessing whether an effect is "meaningful" is highly dependent on the literature, priors, novelty, context or field, and that it cannot be assessed based solely on a statistical index (even though some of the indices, such as the ROPE-related ones, attempt at bridging existence with meaningfulness). Thus, researchers should rely on statistics to assess effect existence (as well as size and direction estimation), and systematically, but contextually, discuss its meaning and importance within a larger perspective.
Conclusions can be found in the [guidelines section](https://easystats.github.io/bayestestR/articles/guidelines.html).
# References
```

Computing file changes ...