Revision d64c7115b852e3a269ed9e3c069a3485b34bbea5 authored by Eric Sanford on 10 October 2019, 18:04:36 UTC, committed by Eric Sanford on 10 October 2019, 18:04:36 UTC
added location of hg19 reference files as comments to setEnvironmentVariables.sh script
1 parent c5b88eb
normalizeMeltedCountsAndAddGeneSymbol.R
library(tidyverse)
library(GenomicFeatures)
library(here)
library(refGenome)
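# note: if you get errors from any of the above "library" lines, e.g. "there is no package called ‘here’",
#       enter the following command in the RStudio console: install.packages("<name_of_library_that_failed_to_load>")
#       (GenomicFeatures is distributed through Bioconductor, so install it with BiocManager::install("GenomicFeatures") instead)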

######### here beginneth user-defined parameters #########

meltedDataInputFile   <- '/Users/emsanford/Dropbox (RajLab)/Shared_Eric/Signal_Integration/Analysis_SI2-SI4/extractedData/SI3_5sample_rerun_meltedData.tsv'
gtfFileUsedByPipeline <- '/Users/emsanford/Dropbox (RajLab)/Shared_Eric/Signal_Integration/Analysis_SI2-SI4/refs/hg38.gtf'
gtfFileDirectory      <- '/Users/emsanford/Dropbox (RajLab)/Shared_Eric/Signal_Integration/Analysis_SI2-SI4/refs'
ENSGtoGeneSymbolTable <- '/Users/emsanford/Dropbox (Personal)/Eric/Penn/raj_lab/code/rajlabseqtools/default/LocalComputerScripts/geneSymbolConversionTables/hg38_EnsgHgncSymbolMapping.tsv'
outputFile            <- '/Users/emsanford/Dropbox (RajLab)/Shared_Eric/Signal_Integration/Analysis_SI2-SI4/extractedData/SI3_5sample_rerun_RNA-seq-pipeline-output.tsv'

######### here endeth user-defined parameters ############

# parse GTF file into a first R object, used to retrieve gene names
ens <- ensemblGenome()
setwd(gtfFileDirectory)
read.gtf(ens, "hg38.gtf")
my_gene <- getGenePositions(ens)

# parse GTF file into a second R object. Use this one to make a union model from which to calculate gene lengths
txdb <- makeTxDbFromGFF(file = gtfFileUsedByPipeline, format="gtf")
lengthsPergeneid <- sum(width(IRanges::reduce(exonsBy(txdb, by = "gene"))))
lengthtbl <- tibble(gene_id = names(lengthsPergeneid), length = lengthsPergeneid)

# read in the "melted" HTSeq count table and add the following normalized values for each sample: RPM, RPKM, and TPM
htseq.table <- read_tsv(meltedDataInputFile, col_names = TRUE)
htseq.table.withLength <- inner_join(htseq.table, lengthtbl, by = 'gene_id') # this step also removes non_gene features from the table, e.g. "__no_feature"
htseq.table.withRPM <- htseq.table.withLength %>% 
  group_by(experiment, sampleID) %>%
  mutate(totalMappedReads = sum(counts), rpm = 1000000*counts / totalMappedReads)
htseq.table.withRPKM <- htseq.table.withRPM %>%
  mutate(rpkm = 1000*rpm/length)
# note this TPM calc method doesn't take average fragment size into account (some papers use it, some papers don't)
htseq.table.withTPM <- htseq.table.withRPKM %>%
  mutate(ctsOverLength = counts/length, denominator = sum(ctsOverLength), tpm = ctsOverLength/denominator * 1000000) %>%
  dplyr::select(-c(ctsOverLength, denominator))

# now add the HGNC gene symbol to the table (the output file is written at the end of the script)
HGNC.symbol.table <- read_tsv(ENSGtoGeneSymbolTable, col_names = T)
HGNC.symbol.table <- HGNC.symbol.table %>% mutate(gene_id = ensg) %>% dplyr::select(-ensg)
htseq.table.with.HGNCsymbol <- plyr::join(data.frame(htseq.table.withTPM), data.frame(HGNC.symbol.table), by = 'gene_id', type="left", match="first")  #uses plyr join on a data frame instead of dplyr left_join on a tibble due to first match option
htseq.table.almostfinal <- as_tibble(htseq.table.with.HGNCsymbol)

# add hg38.gtf names to the genes that are missing hgnc symbols
indices.genes.missing.hgnc.symbols <- which(is.na(htseq.table.almostfinal$hgnc_symbol))
gene.ids.missing.symbols <- htseq.table.almostfinal$gene_id[indices.genes.missing.hgnc.symbols]
temp.table1 <- data.frame(tibble(gene_id = gene.ids.missing.symbols))
temp.table2 <- data.frame(tibble(gene_id = my_gene$gene_id, name = my_gene$gene_name))
temp.table <- plyr::join(temp.table1, temp.table2, by = 'gene_id', type="left", match="first") 
htseq.table.almostfinal$hgnc_symbol[indices.genes.missing.hgnc.symbols] <- temp.table$name
stopifnot(sum(is.na(htseq.table.almostfinal$hgnc_symbol)) == 0)
# rename HGNC column to "gene_name" column. Gene names are the HGNC symbol if one exists for the ensembl ID and the gene_name in the gtf file otherwise.
htseq.table.almostfinal[["gene_name"]] <- htseq.table.almostfinal$hgnc_symbol
htseq.table.almostfinal <- dplyr::select(htseq.table.almostfinal, -hgnc_symbol)

htseq.table.final <- htseq.table.almostfinal %>% dplyr::select(-length, -totalMappedReads)
write_tsv(htseq.table.final, outputFile, col_names = T)


####### if you need to make a new ENSGtoGeneSymbolTable, uncomment, edit, and run this block of code ###############
# library(biomaRt)
# library(tidyverse)
# library(GenomicFeatures)
# library(here)
# gtfFileUsedByPipeline <- '/Users/emsanford/Downloads/Homo_sapiens.GRCh38.97.gtf'
# txdb <- makeTxDbFromGFF(file = gtfFileUsedByPipeline, format="gtf")
# vectorOfENSG_IDs_to_find_HGNC_symbols_for <- names(sum(width(IRanges::reduce(exonsBy(txdb, by = "gene")))))  ## this is just a vector of "ENSG0000XXXX" strings. This line of code gives you this vector from the parsed GTF file (txdb) that you make on the line above
# outputTableWithGeneNameSymbols <- here('newGeneNameSymbolTable.tsv') 
# ensembl <- useMart("ensembl", dataset="hsapiens_gene_ensembl", host = "uswest.ensembl.org", ensemblRedirect = FALSE)
# hgnc.name.table <- getBM(attributes=c('ensembl_gene_id', 'hgnc_symbol'), 
#                          filters = 'ensembl_gene_id', 
#                          values = vectorOfENSG_IDs_to_find_HGNC_symbols_for, 
#                          mart = ensembl)
# 
# hgnc.name.tibble <- tibble(ensg = hgnc.name.table$ensembl_gene_id, hgnc_symbol = hgnc.name.table$hgnc_symbol)
# write_tsv(hgnc.name.tibble, outputTableWithGeneNameSymbols, col_names = TRUE)
####################################################################################################################
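
The RPM, RPKM, and TPM arithmetic used in the script can be checked by hand on a small table. The block below is only a sketch in base R: the `toy` data frame, its gene IDs, counts, and lengths are all made up for illustration, and `ave()` stands in for the script's `group_by() %>% mutate()` step.

```r
# Toy "melted" count table: two samples, three genes. All values are invented.
toy <- data.frame(
  sampleID = rep(c("s1", "s2"), each = 3),
  gene_id  = rep(c("geneA", "geneB", "geneC"), times = 2),
  counts   = c(100, 300, 600, 50, 50, 100),
  length   = rep(c(1000, 2000, 4000), times = 2)  # union exon length in bp
)

# Per-sample total of gene-assigned reads (the script groups by experiment and sampleID).
toy$totalMappedReads <- ave(toy$counts, toy$sampleID, FUN = sum)

# RPM: reads per million gene-assigned reads.
toy$rpm <- 1e6 * toy$counts / toy$totalMappedReads

# RPKM: RPM per kilobase of gene length.
toy$rpkm <- 1e3 * toy$rpm / toy$length

# TPM: length-normalized counts rescaled so each sample sums to one million.
ctsOverLength <- toy$counts / toy$length
toy$tpm <- 1e6 * ctsOverLength / ave(ctsOverLength, toy$sampleID, FUN = sum)

# Per-sample totals: TPM and RPM each sum to 1e6 within a sample, RPKM does not.
tpmTotals  <- tapply(toy$tpm,  toy$sampleID, sum)
rpkmTotals <- tapply(toy$rpkm, toy$sampleID, sum)

print(toy)
```

In this toy table, sample s1 gets TPM values of 250000, 375000, and 375000. The same numbers result from dividing each RPKM by the per-sample RPKM sum and multiplying by one million, which is why the TPM step does not need the library size.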