https://github.com/jona-sassenhagen/statfail
Raw File
Tip revision: 8f6cb47adf51a24b3b762c110421c3c5efb80603 authored by Phillip Alday on 22 October 2019, 01:37:52 UTC
Merge pull request #2 from jona-sassenhagen/dependabot/pip/nltk-3.4.5
Tip revision: 8f6cb47
snl2016.md
---
title: Confound and control in language experiments
author:
    - Phillip M. Alday
    - Jona Sassenhagen
date: April 2016
---
Experimental research on the neurobiology of language depends on stimulus or participant selection where not all characteristics can be fully controlled, even when attempting strict matching.
In linguistic stimuli, factors such as word frequency are often correlated with primary factors of interest, such as animacy, or other co-confounds such as word length.
In clinical studies, factors such as intelligence or socioeconomic status are often correlated with pathology.
Inferential statistics are often used to demonstrate that these confounding covariates do not differ systemantically between groups.
For example, if word length is not significantly different for two conditions, they are considered as matched for word length.
Such a test has high error rates and is conceptually misguided, both statistically and experimentally, yet is commonly used (67% of papers in a survey of recent Brain and Language issues).

Statistically, it reflects a common misunderstanding of statistical tests: interpreting significance not to refer to inference about a particular population parameter, but about 1. the sample in question, 2. the practical relevance of a sample difference (so that a nonsignificant test is taken to indicate evidence for the absence of relevant differences).

The correct way to control for this type of confound is to match descriptive statistics (measures of central location such as mean and median, as well as variance) and then include a 'nuisance' covariate in an appropriate statistical model, such as a mixed-effects model.
The impact of the confound can be examined both by its associated model parameters and by comparison between models with and without the nuisance covariate.

Experimentally, this form of narrow control fails to capture the reality of complex interactions and thus can not adequately discern between different, potentially interacting effects.
For example, marked constructions are generally less frequent than unmarked ones, leading to the claim that observed differences in processing are frequency effects.
In the traditional method of inferential-test based control, there is still some (hopefully, minimal) variation between experimental conditions that is not modelled.
By modelling this frequency effect directly (e.g. with the help of mixed-effects models),
we can  examine the piecewise contributions from the experimental manipulation and the 'confound'.
The (perhaps looser) control based on descriptive statistics is still nonetheless necessary here to minimise collinearity. 
An interation between an experimental manipulation and frequency thus provides evidence that the manipulation is not merely an artefact of frequency.

Beyond these conceptual issues, we provide simulation results that demonstrate the high failure rate of inferential-test based confound control, both failing to detect spurious results arising from confounds and falsely rejecting real effects.

Inferential testing is thus inappropriate both pragmatically and philosophically, yet widely applied. 
Incorporating potential confounds into our statistical models provides an easy alternative, especially in light of the recent trend away from ANOVA and towards mixed-effects and structural-equation models.
Finally, the modelling of confounds allows to better examine the complex interactions present in real language use and to better distinguish between competing confounded explanations.
back to top