Content - 30ae484e7bf494e90ac6758c6fc410d894a0bf7c - 363d594/vignettes/categorizing_fits.Rmd

visit type:
Tip revision: 08bf9c45d2b32554168fe21e673015122e3c37e9 authored by Claus Wilke on 08 May 2021, 00:57:47 UTC
fix badge
Tip revision: 08bf9c4
categorizing_fits.Rmd
---
title: "Identifying the best-fitting model category"
author: "Umut Caglar, Claus O. Wilke"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Identifying the best-fitting model category}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
  
```{r setup, include=FALSE}
  knitr::opts_chunk$set(echo = TRUE)
```

In high-throughput scenarios, we often don't know a priori whether a given dataset represents a sigmoidal curve, a double-sigmoidal curve, or neither. In these cases, we need to fit multiple different models to the data and then determine which model fits the best. This process is normally done automatically by the function `sicegar::fitAndCategorize()`. In this vignette, we describe how this process works and how it can be done manually.

```{r install packages, echo=FALSE, warning=FALSE, results='hide', message=FALSE}

###*****************************
# INITIAL COMMANDS TO RESET THE SYSTEM
rm(list = ls())
if (is.integer(dev.list())){dev.off()}
cat("\014")
seedNo=14159
set.seed(seedNo)
###*****************************

###*****************************
require(sicegar)
require(dplyr)
require(cowplot)
###*****************************
```

We will demonstrate this process for an artificial dataset representing a double-sigmoidal function. We first generate the data:

```{r generate data for double - sigmoidal}
time <- seq(3, 24, 0.5)
noise_parameter <- 0.2
intensity_noise <- runif(n = length(time), min = 0, max = 1) * noise_parameter
intensity <- doublesigmoidalFitFormula(time,
                                       finalAsymptoteIntensityRatio = .3,
                                       maximum = 4,
                                       slope1Param = 1,
                                       midPoint1Param = 7,
                                       slope2Param = 1,
                                       midPointDistanceParam = 8)
intensity <- intensity + intensity_noise
dataInput <- data.frame(time, intensity)
```

We need to fit the sigmoidal and double-sigmoidal models to this dataset, using `multipleFitFunction()`. This requires first normalizing the data:
```{r}
normalizedInput <- normalizeData(dataInput = dataInput, 
                                                dataInputName = "doubleSigmoidalSample")

# Fit sigmoidal model
sigmoidalModel <- multipleFitFunction(dataInput = normalizedInput,
                                          model = "sigmoidal",
                                          n_runs_min = 20,
                                          n_runs_max = 500,
                                          showDetails = FALSE)

# Fit double-sigmoidal model
doubleSigmoidalModel <- multipleFitFunction(dataInput = normalizedInput,
                                                model = "doublesigmoidal",
                                                n_runs_min = 20,
                                                n_runs_max = 500,
                                                showDetails = FALSE)
```

We also need to perform the additional parameter calculations, as these are required by the `categorize()` function we use below.

```{r linear sigmoidal amd double-sigmoidal fits to double-sigmoidal data}
# Calculate additional parameters
sigmoidalModel <- parameterCalculation(sigmoidalModel)

# Calculate additional parameters
doubleSigmoidalModel <- parameterCalculation(doubleSigmoidalModel)
```

This is what the two fits look like:
```{r echo=FALSE, warning=FALSE, message=FALSE, fig.width=7}
f1 <- figureModelCurves(dataInput = normalizedInput,
                        sigmoidalFitVector = sigmoidalModel,
                        showParameterRelatedLines = TRUE)

f2 <- figureModelCurves(dataInput = normalizedInput,
                        doubleSigmoidalFitVector = doubleSigmoidalModel,
                        showParameterRelatedLines = TRUE)

plot_grid(f1, f2)
```

Clearly the sigmoidal fit is not appropriate but the double-sigmoidal one is. Next we demonstrate how to arrive at this conclusion computationally, using the function `categorize()`. It takes as input the two fitted models as well as a number of parameters that are used in the decision process (explained below under "The decision process").


```{r}
# now we can categorize the fits
decisionProcess <- categorize(threshold_minimum_for_intensity_maximum = 0.3,
                                      threshold_intensity_range = 0.1,
                                      threshold_t0_max_int = 0.05,
                                      parameterVectorSigmoidal = sigmoidalModel,
                                      parameterVectorDoubleSigmoidal = doubleSigmoidalModel)
```

The object returned by `categorize()` contains extensive information about the decision process, but the key component is the `decision` variable. Here, it states that the data fits the double-sigmoidal model:
```{r}
print(decisionProcess$decision)
```
(The possible values here are "no_signal", "sigmoidal", "double_sigmoidal", and "ambiguous".)

## The decision process

The decision process consists of two parts. First, the `categorize()` function checks whether all provided input data are valid. The steps of this verification are as follows:

* Pre-test 0: Was the `categorize()` function provided with sigmoidal and double_sigmoidal models as input?
* Pre-test 1: Do the provided sigmoidal and double sigmoidal models come from the same source and have the same data name?
* Pre-test 2a: Was the provided sigmoidal model generated by `sicegar::sigmoidalFitFunctions`?
* Pre-test 2b: Was the provided double_sigmoidal model generated by `sicegar::doublesigmoidalFitFunctions`?
* Pre-test 3: Do both models have same scaling parameters obtained from the data normalization process?
* Pre-test 4a: Have the additional parameters for sigmoidal fit been calculated by `sicegar::parameterCalculation()`?
* Pre-test 4b: Have the additional parameters for double-sigmoidal fit been calculated by `sicegar::parameterCalculation()`?

After these steps, the primary decision process begins. It takes a list of four possible outcomes ("no_signal", "sigmoidal", "double_sigmoidal", "ambiguous") and systematically removes options until only one remains.

First, the algorithm checks if the provided data includes a signal or not.

* Test 1a: The observed intensity maximum must be bigger than `threshold_minimum_for_intensity_maximum`; otherwise, the data is labeled with `"no_signal"`.
* Test 1b: The intensity range, i.e., the absolute difference between the biggest and smallest observed intensity, must be greater than `threshold_intensity_range`; otherwise, the data is labeled with `"no_signal"`.
* Test 1c: If at this point the data is not labeled with `"no_signal"`, then the data can not be labeled with `"no signal"` anymore.

Next the algorithm checks if the sigmoidal and double sigmoidal models make sense.

* Test 2a: The provided sigmoidal fit must be a successful fit; otherwise, the data cannot be labeled with `"sigmoidal"`.
* Test 2b: The provided double-sigmoidal fit must be a successful fit; otherwise, the data cannot labelled with `"double_sigmoidal"`.
* Test 3a: The sigmoidal fit must have an AIC score smaller than `threshold_AIC`; otherwise, the data can not be labeled with `"sigmoidal"`.
* Test 3b: The double-sigmoidal fit must have an AIC score smaller than `threshold_AIC`; otherwise, the data cannot be labeled with `"double_sigmoidal"`.
* Test 4a: The value `startPoint_x` for the sigmoidal model must be a positive number; otherwise, the data cannot be labeled with `"sigmoidal"`.
* Test 4b: The value `startPoint_x` for the double-sigmoidal model must be a positive number; otherwise, the data cannot be labeled with `"double_sigmoidal"`.
* Test 5a: The value `start_intensity` for the sigmoidal model must be smaller than `threshold_t0_max_int`; otherwise, the data cannot be labeled with `"sigmoidal"`.
* Test 5b: The value `start_intensity` for the double-sigmoidal model must be smaller than `threshold_t0_max_int`; otherwise, the data cannot be labeled with `"double_sigmoidal"`.
* Test 6: For the double-sigmoidal model, the ratio of _/the model's intensity prediction at the last observation time/_ to _/the model's maximum intensity prediction/_ must be smaller than `threshold_dsm_tmax_IntensityRatio`; otherwise, the data cannot be labeled with `"double_sigmoidal"`.
* Test 7: For the sigmoidal model; the ratio of _/the model's intensity prediction at the last observation time/_ to _/the model's maximum intensity prediction/_ must be larger than `threshold_sm_tmax_IntensityRatio`; otherwise, the data cannot be labeled with `"sigmoidal"`.

In step eight, the algorithm checks whether the data should be labelled as `"ambiguous"` or not.

* Test 8: If at this point we still have at least one of the two options `"sigmoidal"` or `"double_sigmoidal"`, then the data cannot be labeled with `"ambiguous"`.

In the last step; the algorithm checks whether the data should be labeled as `"sigmoidal"` or `"double_sigmoidal"`.

* Test 9: If at this point we still have both the `"sigmoidal"` and `"double_sigmoidal"` options, then the choice will be made based on the AIC scores of those models and value of `threshold_bonus_sigmoidal_AIC`. If `sigmoidalAIC + threshold_bonus_sigmoidal_AIC  <  doublesigmoidalAIC`, then the data cannot be labeled with `"double_sigmoidal"`. If `sigmoidalAIC + threshold_bonus_sigmoidal_AIC  >  doublesigmoidalAIC`, then the data cannot be labeled with `"sigmoidal"`.

The only option that is left at this point will be the label of the data and thus the final decision.
Browse the archive

https://github.com/wilkelab/sicegar