Content - 97c0f39e2e6b7375b6cd4f65458ccf833ee7c626 - a24b54a/Supereffectors_PART1_1_generating_sample_lists.Rmd

visit type:
https://github.com/Lab-of-Adaptive-Immunity/Supereffectors_scRNAseq

02 February 2023, 13:36:01 UTC
Tip revision: 7b0dec9507dd45ab4bb0619912b240f512c7798f authored by JurMich on 26 November 2022, 12:08:33 UTC
Update README.md
Tip revision: 7b0dec9
Supereffectors_PART1_1_generating_sample_lists.Rmd
---
title: "exp07 W3 Treg Cell Separation by Hashtags"
author: "Juraj Michalik and Veronika Niederlova"
date: "02/01/2021"
output: html_document
---

**WARNING: You have to configure this script before using it, at the very least you have to set file paths! See Configuration section.**

This script separates cells by available hashtags.

The input and output of this module are as follows:

* **Input**:
  + **raw feature matrix** as well as **list of hastags you want to identify** (see Configuration).
  
* **Output**:
  + A list of barcodes for each strain. These lists will be stored as **.csv** files. The lists are then fed to the algorithm that extracts reads with corresponding barcodes, and maps them on reference to obtain **raw feature matrices** for each strain separately or collectively for strains we want to obtain.

## Prerequisites
Having mapped reads previously - cellranger run on raw data. Regarding this analysis, the following packages are needed:

```{r packages}
suppressMessages(require(Seurat))
suppressMessages(require(ggplot2))
suppressMessages(require(cowplot))
suppressMessages(require(data.table))
suppressMessages(require(pracma))
suppressMessages(require(dplyr))
suppressMessages(require(RColorBrewer))
suppressMessages(require(grDevices))
suppressMessages(require(gplots))
suppressMessages(require(kableExtra))
suppressMessages(require(knitr))
```

Following function is used to detect the end of initial slope for histogram:

* **get.auto.minimum** - gets automatically minimum for marker **marker.val** from assay **antibody.assay**. The function makes three assumptions about the histogram of antibody counts **histogram.vector**:
  + there are exactly two local maxima (one for cells not having the marker, one for cells having it);
  + one of those local maxima is at counts for 0 antibodies (cells not having the marker);
  + the local minimum we search is between them.
  
  Note that the first hypothesis is false, as usually the histogram will be rugged, however we can somewhat fix it by smoothing it first. This is done by using sliding window averaging *n* consecutive values on this window (default value is 5). On resulting graph we perform the simple gradient walk from counts at 0-1 (first bin merges counts for cells with 0 and 1 marker molecule together). The value where the gradient walk stops is the desired limit by half of the window (we offset it by one as it seems to produce better results and also because we count here only excluded values).

```{r functions}
get.auto.minimum <- function(histogram.vector, n.smooth=5){
  # smoothing
  smooth.hist <- stats::filter(histogram.vector, rep(1/n.smooth, n.smooth))
  smooth.hist <- smooth.hist[!is.na(smooth.hist)]
  # gradient walk
  i = 1
  gr.walk.val <- smooth.hist[i]
  while(gr.walk.val > smooth.hist[i+1] && i<length(smooth.hist)){
    gr.walk.val <- smooth.hist[i+1]
    i = i + 1
  }
  return(i-1+ceiling(n.smooth/2)) # also count in the cells just before the minimum (-1)
}
```

## Configuration

This part defines the configuration variables of entire module. As it is, **it should be necessary to modify only this section of the file to adjust it for new experiment**.

This experiment is defined by an unique identifier that is described as follows:

* 1<sup>st</sup> part of identfier, **experiment.ID**, defines the experiment;
* 2<sup>nd</sup> part of identifier, **sample.ID**,  designs the sample;

In this experiment, the identifier is defined as follows:

```{r identifier}
experiment.ID <- 'exp07'
sample.ID <- 'W3'
sep01 <- paste(experiment.ID, sample.ID, sep='_')
```

This module also requires the precision of following parameters. The description of the parameters is as follows:

* **Files:**
  + **raw.matrix.path** - path to **raw matrix** of the experiment defined by the ID above, *must contain feature barcoding*;
* **Directories:**
  + **target.path** - path where files should be stored;
* **Pre-filtering:**
  + **min.GE** - minimum molecules of GE per cell to keep given cell; 
* **Strains: **
  + **sample.strains** - list of strains present in analyzed sample.
  + **strain.markers** - name of strain markers, identical to those contained in raw matrix and **in the same order as strains in strain.markers** (ie. first value in *strain.markers* is the marker for first strain in *sample strains*);
* **Splitting:**
  + **split.auto:** - attempts to automatically estimate the values in **cutoff.lim** from local minima in ordered histogram of counts and determine threshold. FALSE for disabled, TRUE for enabled. 
  + **cutoff.values:** - is computed manually. Can be adjusted after computation.

**Note: Don't forget to redefine the paths!**
 
```{r parameters}
raw.matrix.path <- '/home/michalik/PRIMUS/data/48_lab/Project scRNAseq/DATA_scRNAseq/Experiments/Exp07/04_Raw data/Raw/NovaSeq/E07_W3_GEX/raw_feature_bc_matrix/'
target.path <- 'exp07_W3_Treg'
min.GE <- 200
sample.strains <- c("Mouse_Hashtag_1", "Mouse_Hashtag_2", "Mouse_Hashtag_3",  "Mouse_Hashtag_4",  "Mouse_Hashtag_5",  "Mouse_Hashtag_6",  "Mouse_Hashtag_7",  "Mouse_Hashtag_8", "Mouse_Hashtag_9", "Mouse_Hashtag_10")
strain.markers <- sample.strains
split.auto <- TRUE
cutoff.values <- c()
```

## Summary - Set Variables

```{r param summary, echo=FALSE}
sample.strains.txt <- paste(sample.strains, collapse= ", ")
strain.markers.txt <- paste(strain.markers, collapse= ", ")
cutoff.values.txt <- paste(cutoff.values, collapse= ", ")

sum.up <- rbind(raw.matrix.path,
min.GE,
sample.strains.txt,
strain.markers.txt,
split.auto,
cutoff.values.txt)

rownames(sum.up) <- gsub('.txt', '',  rownames(sum.up))
kable(sum.up, "html", escape = FALSE) %>%
kable_styling(bootstrap_options = c("hover", "condensed")) 
```

## Preparing data for extraction

We load data into sparse matrix object using Seurat. We then remove all rows (= cells) that do not have at least one molecule of strain marker, and at least **min.GE** reads. 

```{r load, message=FALSE}
GE.dat <- Read10X(data.dir = raw.matrix.path)
antibodies <- GE.dat$`Antibody Capture`

# remove anything not having at least min.GE reads
filtered.cells <- colnames(GE.dat$`Gene Expression`[,Matrix::colSums(GE.dat$`Gene Expression`)>min.GE])
antibodies <- antibodies[,colnames(antibodies) %in% filtered.cells]

# remove anything that does not have at least 1 marker molecule
antibodies <- antibodies[,apply(antibodies[strain.markers,],2,sum)>0]
```

The cutoff values are computed manually by auto-detection of end of slope.

```{r autodetect mins}
if(split.auto){
  n.smooth = 5
  cutoff.values <- rep(0, length(cutoff.values))
  for(i in 1:length(strain.markers)){
    if( max(antibodies[strain.markers[i],])>0){
      hist.data <- hist(antibodies[strain.markers[i],], breaks = max(antibodies[strain.markers[i],]), plot=FALSE)
      marker.counts <- hist.data$counts
      if(length(marker.counts) > n.smooth){
          cutoff.values[i] <- get.auto.minimum(marker.counts, n.smooth)
      }else{
          cutoff.values[i] <- get.auto.minimum(marker.counts, n.smooth = length(marker.counts))
      }
    }else{
       cutoff.values[i] <- 0
    }
  } 
}
print(cutoff.values)
```

Now that we have offset values we can rework the plots to observe sectors to which the cells will be split. Specifically, if we consider *lim.A* and *lim.B* the cutoffs for antibody A and B on axis *x* and *y* respectively:

* **A < lim.A, B < lim.B** are uninteresting, lowly expressed cells, possibly empty droplets. These will be rejected altogether.
* **A > lim.A, B < lim.B** should be cells of strain marked by antibody A.
* **A < lim.A, B > lim.B** should be cells of strain marked by antibody B.
* **A > lim.A, B > lim.B** are probably doublets. They will be extracted separately and may be interesting to study for properties of multiplets, but are not interesting with regards to multiple analysis.

If there are more than two markers, these rules are changed slightly. In that case at most one marker should be expressed significantly for given cell (ie. only single marker should be over limit from *cutoff.values*), henceforth we suppose all cells with more than one marker as doublets.

## Create and output lists

We can now proceed to creating and outputting lists. These lists are then used to extract reads with given barcodes.

```{r create and extract, echo=FALSE}
colnames(antibodies) <- gsub('-1','', colnames(antibodies))
doublets <- c()

for(x in 1:(length(strain.markers)-1)){
  for(y in (x+1):length(strain.markers)){
    doublet.it <- colnames(antibodies[,antibodies[strain.markers[x],] > cutoff.values[x] &
                                      antibodies[strain.markers[y],] > cutoff.values[y], drop = FALSE
                                      ])
    doublets <- c(doublets, doublet.it)
  }
}
doublets <- unique(doublets)

fine.cells <- list()

antibodies.nodb <- antibodies[,!(colnames(antibodies) %in% doublets)]
remaining.cells <- gsub('-1', '', colnames(antibodies))
remaining.cells <- remaining.cells[!(remaining.cells %in% doublets)]
write.table(doublets, paste0(target.path, sep01, "_doublets.csv"), col.names=FALSE, row.names=FALSE)
for(i in 1:length(strain.markers)){
  cells.out <- antibodies.nodb[,antibodies.nodb[strain.markers[i],] > cutoff.values[i], drop=F]
  cells.out <- colnames(cells.out)
  remaining.cells <- remaining.cells[!(remaining.cells %in% cells.out)]
  print(paste0('Unique cells for ', strain.markers[i], ': ', length(cells.out)))
  fine.cells[[strain.markers[i]]] <- cells.out
  write.table(cells.out, paste0(target.path, sep01, '_', sample.strains[i],'.csv'), col.names=FALSE, row.names=FALSE)
}
print(paste0('Number of doublets: ', length(doublets)))
```