Skip to main content
  • Home
  • Development
  • Documentation
  • Donate
  • Operational login
  • Browse the archive

swh logo
SoftwareHeritage
Software
Heritage
Archive
Features
  • Search

  • Downloads

  • Save code now

  • Add forge now

  • Help

https://doi.org/10.5281/zenodo.15302315
14 July 2025, 12:08:15 UTC
  • Code
  • Branches (0)
  • Releases (2)
  • Visits
    • Branches
    • Releases
      • 2
      • 2
      • 1
    • 8dcf251
    • /
    • monahton-GencoDymo2-f7b52ac
    • /
    • vignettes
    • /
    • GencoDymo2_vignette.Rmd
    Raw File Download

    To reference or cite the objects present in the Software Heritage archive, permalinks based on SoftWare Hash IDentifiers (SWHIDs) must be used.
    Select below a type of object currently browsed in order to display its associated SWHID and permalink.

    • content
    • directory
    • snapshot
    • release
    origin badgecontent badge
    swh:1:cnt:f19e3f791ebca1e10d9e2498a78cd353e1d905f7
    origin badgedirectory badge
    swh:1:dir:b65ff098a7f85f944d89e9e8c4ab0e741507199e
    origin badgesnapshot badge
    swh:1:snp:0fc2e64fecfe96a27b5d76066020941775289584
    origin badgerelease badge
    swh:1:rel:32a4a7bfb9160d3aebd8d3e9b3890f5f3af88001

    This interface enables to generate software citations, provided that the root directory of browsed objects contains a citation.cff or codemeta.json file.
    Select below a type of object currently browsed in order to generate citations for them.

    • content
    • directory
    • snapshot
    • release
    Generate software citation in BibTex format (requires biblatex-software package)
    Generating citation ...
    Generate software citation in BibTex format (requires biblatex-software package)
    Generating citation ...
    Generate software citation in BibTex format (requires biblatex-software package)
    Generating citation ...
    Generate software citation in BibTex format (requires biblatex-software package)
    Generating citation ...
    GencoDymo2_vignette.Rmd
    ---
    title: "GencoDymo2: Comprehensive Analysis of GENCODE Annotations and Splice Site Motifs"
    date: "`r Sys.Date()`"
    author: 
    - name: Monah Abou Alezz<sup>1</sup>, Lorenzo Salviati<sup>2</sup>, Roberta Alfieri<sup>3</sup>, Silvia Bione<sup>4</sup>
      affiliation: "<sup>1</sup>San Raffaele Telethon Institute for Gene Therapy, Istituto di Ricovero e Cura a Carattere Scientifico (IRCCS) San Raffaele Scientific Institute, 20132 Milan, Italy<br>
      <sup>2</sup>Human Technopole, Viale Rita Levi-Montalcini 1, 20157 Milan, Italy<br>
      <sup>3</sup>National Research Council, Institute for Biomedical Technologies, 20054 Segrate, Italy<br>
      <sup>4</sup>Computational Biology Unit, Institute of Molecular Genetics Luigi Luca Cavalli-Sforza, National Research Council, 27100 Pavia, Italy"
    keywords: GENCODE, gene annotations, introns, splice sites annotations, splice sites scores
    output: 
      html_document:
        toc: true
        toc_depth: 5
        toc_float:  
         collapsed: true  
        number_sections: false
        theme: cerulean
    vignette: >
      %\VignetteIndexEntry{GencoDymo2_vignette}
      %\VignetteEngine{knitr::rmarkdown}
      %\VignetteEncoding{UTF-8}
    ---
    
    <style>
    a:link {
        color: red;
    }
    
    a:visited {
        color: green;
    }
    
    a:hover {
        color: hotpink;
    }
    p.caption {
      font-size: 0.8em;
    }
    
    </style>
    
    ***
    
    ```{r setup, include=FALSE}
    knitr::opts_chunk$set(
      echo = TRUE,
      warning = FALSE,
      message = FALSE,
      fig.width = 8,
      fig.height = 6
    )
    ```
    
    <br>
    
    
    # Introduction
    
    ---
    
    Gene annotations are provided by several databases that integrate data from systematic explorations of various experimental and computational resources. The **GENCODE** project [1], part of the ENCODE project [2], offers accurate annotations of the human and mouse genomes derived from literature, primary data repositories, computational predictions, manual annotation, and experimental validation of genes and transcripts, including noncoding transcripts that are continually discovered and annotated. GENCODE is one of the most comprehensive and standardized databases for gene annotations, widely used by the scientific community.​
    
    The gene sets provided by GENCODE are comprehensive and include protein-coding and non-coding loci, encompassing alternatively spliced isoforms and pseudogenes. GENCODE gene annotations are regularly updated and released as the Ensembl/GENCODE gene sets, accessible via the official website <https://www.gencodegenes.org>. As of October 2024, the latest human release is GENCODE 47, and the latest mouse release is GENCODE M36 [1]. GENCODE gene sets are released approximately four times a year for mouse and twice a year for human [1].​
    
    **GencoDymo2** is an R package designed to interrogate different releases of GENCODE annotations from the human and mouse genomes. It provides a streamlined interface for accessing and processing GENCODE annotations, providing easily accessible data regarding annotation statistics, release comparisons, and information on introns and splice sites. Moreover, GencoDymo2 can produce FASTA files of donor and acceptor splice site motifs that can be directly uploaded to the MaxEntScan tool [3] for the calculation of splice site scores.
    
    <br>
    
    GencoDymo2 can be used to import and process any gtf/gff3 formatted file. It automatizes and speeds up the following data manipulation steps:   
    
     * Dynamic detection of latest GENCODE releases from the official website
     * Unified interface for GTF/GFF3 file retrieval for the human and mouse genomes
     * Compare annotations from different GENCODE releases
     * Perform summary statistics on gene annotations
     * Extract introns information
     * Assign splice sites consensus sequence
     * Compile MaxEntScan splice sites motifs
    
    ***
    <br>
    
    # Installation
    
    ---
    
    GencoDymo2 runs in the R statistical computing environment. You will need R version 4.1.0 or higher. The most recent version of GencoDymo2 can be installed from GitHub. We recommend using the pak package for fast and reliable installation.
    
    ```{r install, eval=FALSE}
    # Install pak if not already installed
    if (!require("pak")) install.packages("pak")
    # Install from GitHub
    pak::pkg_install("github::monahton/GencoDymo2")
    
    # Load the package
    library(GencoDymo2)
    ```
    
    ***
    
    <br>
    
    # Usage
    
    ---
    
    ## Latest Releases
    
    GENCODE releases annotations approximately every three months, coinciding with Ensembl releases. The latest release from GENCODE for human and mouse genomes can be obtained using the <span style="color:coral">`get_latest_release()`</span> function. This function queries the official GENCODE website to determine the most recent release for a given species (human or mouse).
    
    ```{r get_release, eval=FALSE}
    # Fetch the most recent human and mouse GENCODE release identifiers
    human_release <- get_latest_release("human", verbose = T)
    mouse_release <- get_latest_release("mouse", verbose = T)
    ```
    
    
    ```{r get_release_ex, echo=FALSE}
    # Get latest human and mouse release
    cat("Latest human GENCODE release: release_47")
    cat("Latest human GENCODE release: release_M36") 
    ```
    
    <br>
    
    > **_NOTE:_** Behind the scenes, get_latest_release() parses the Ensembl public FTP site, extracts the release version (e.g., "release_47"), and returns it for programmatic use.
    
    <br>
    
    ## Download GTF/GFF3 Files
    
    GencoDymo2 allows the download of `gtf` and `gff3` files of the latest release or any release specified by the user. These files can be obtained for human and mouse genome and can be saved in a path specified by the user. With the release identifiers in hand, we can download these annotation files:
    
    ```{r get-files, eval=FALSE}
    
    # Download latest human long noncoding RNAs GTF
    lnc_47_gtf <- get_gtf(
      species = "human",
      release_version = human_release,
      annotation_type = "long_noncoding_RNAs.gtf.gz",
      dest_folder = tempdir()
    )
    
    # Download previous human release (release_46) for comparison
    lnc_46_gtf <- get_gtf(
      species = "human",
      release_version = "release_46",
      annotation_type = "long_noncoding_RNAs.gtf.gz",
      dest_folder = tempdir()
    )
    
    # Download latest mouse primary assembly annotations (GFF3)
    mouse_36_gff3 <- get_gff3(
      species = "mouse",
      release_version = mouse_release,
      annotation_type = "primary_assembly.annotation.gff3.gz",
      dest_folder = tempdir()
    )
    ```
    
    ### Supported Annotation Types
    
    The valid annotation types for both the gtf and gff3 format can be one of the following:
    ```{r annotation-types, echo=FALSE}
    cat("Valid Annotation Types:\n")
    valid_annotation_types <- c(
        "annotation",
        "basic.annotation",
        "chr_patch_hapl_scaff.annotation",
        "chr_patch_hapl_scaff.basic.annotation",
        "long_noncoding_RNAs",
        "primary_assembly.annotation",
        "primary_assembly.basic.annotation",
        "tRNAs",
        "polyAs")
    valid_annotation_types
    
    ```
    
    <br>
    
    ## Load and Inspect
    
    A `gtf` or `gff3` file can be imported to R as a dataframe using the <span style="color:coral">`load_file()`</span> function. Once downloaded, load the files into R as data frames using load_file(). This function automatically detects GTF vs GFF3 and parses the attributes into columns.
    
    ```{r load-data, eval=FALSE}
    # Loading using the stored paths from previous steps
    lnc_47_df <- load_file(lnc_47_gtf)
    head(lnc_47_df)
    
    # Alternatively, specify the file path directly
    lnc_46_df <- load_file(file.path(tempdir(), "gencode.v46.long_noncoding_RNAs.gtf.gz"))
    head(lnc_46_df)
    
    # Load mouse GFF3
    mouse_pri_36 <- load_file(file.path(tempdir(),"gencode.vM36.primary_assembly.annotation.gff3.gz"))
    head(mouse_pri_36)
    ```
    
    <br>
    
    ## Compare Releases
    
    As GENCODE regularly updates its annotations, different releases from GENCODE contain differing number of annotations. Some genes are added in new releases while others are classified into different gene classes. In the majority of cases, the Ensembl gene ID of genes remains constant while the gene name might vary. \
    To quantify changes between two releases, use <span style="color:coral">`compare_release()`</span>. Specify the two annotation data frames and the type of feature to compare ("gene", "transcript", "exon", etc.). Additional filters like the gene_type and the baseline can allow to focus on subsets (e.g., only TEC or protein_coding genes) and different reference counts for the calculation of the difference percentage.
    
    ```{r compare-releases, eval=FALSE}
    # Compare gene counts between release 47 and 46
    gene_comparison <- compare_release(lnc_47_df, lnc_46_df, type = "gene")
    
    # Compare exon counts
    exon_comparison <- compare_release(lnc_47_df, lnc_46_df, type = "exon")
    
    # Compare a specific gene biotype (e.g., TEC) using a custom baseline
    comparison <- compare_release(
      lnc_47_df,
      lnc_46_df,
      type = "gene",
      gene_type = "TEC",
      baseline = "count1"
    )
    ```
    
    The function produces as an output a _**Delta**_ value, that corresponds to the difference in number between the specified elements of the two inputs, in addition to its **_Percentage_**.
    
    <br>
    
    > **_NOTE:_** The output is a data frame summarizing absolute and percentage differences in feature counts, enabling rapid assessment of annotation growth or refinement.
    
    <br>
    
    ## Extract Introns Coordinates
    
    The gtf files provided by GENCODE list genes, transcripts, and exons while intronic regions must be inferred. Given the loaded data frame from the gtf files of GENCODE, the <span style="color:coral">`extract_introns()`</span> function computes intron coordinates for each transcript by sorting exon ranges and identifying gaps. It returns a dataframe containing introns coordinates and position within each transcript.
    
    
    ```{r introns, eval=FALSE}
    # Human lncRNA introns for release 47
    introns_lnc_47 <- extract_introns(lnc_47_df, verbose = T)
    
    # Mouse introns (filtering to primary chromosomes first)
    mouse_pri_36 <- mouse_pri_36[grepl("^chr", mouse_pri_36$seqnames), ]
    mouse_introns_pri_36 <- extract_introns(mouse_pri_36, verbose = T)
     
    ```
    
    <br>
    
    ## Extract Donor and Acceptor Splice Sites
    
    Splice sites are defined by dinucleotide motifs at intron boundaries. At the 5' end, the donor site includes an almost invariant sequence GT (GU in the RNA). At the 3' end, the intron terminates with the acceptor site, an almost invariant AG sequence.
    
    <br>
    
    ![Fig1. Donor and acceptor splice sites consensus. Image taken from (https://commons.wikimedia.org/wiki/File:Consensus_intron.jpg). ](Consensus_intron.jpg)
    
    <br>
    
    The majority of introns belong to the U2-type spliceosome and are flanked by GT–
    AG splice site dinucleotides. The most frequent exception to this rule are the U2-type GC–AG splice sites, comprising ∼0.8% of human splice sites and about ∼0.09% of the human splice sites belong to the U12-type which are processed by the minor spliceosome
    and are flanked by AT–AC dinucleotides [4]. \
    The <span style="color:coral">`assign_splice_sites()`</span> function retrieves these flanking sequences from a BSgenome object. It adds two new columns to the data frame containing the introns coordinates, assigning both the 5' and 3' splice sites. Sequences should be retrieved from a loaded BSgenome object such as `BSgenome.Hsapiens.UCSC.hg38` for human and `BSgenome.Mmusculus.UCSC.mm39` for mouse.
    
    ```{r splice-sites, eval=FALSE}
    # Human
    library(BSgenome.Hsapiens.UCSC.hg38)
    lnc_47_ss <- assign_splice_sites(
      introns_lnc_47,
      genome = BSgenome.Hsapiens.UCSC.hg38,
      verbose = T
    )
    
    # Mouse
    library(BSgenome.Mmusculus.UCSC.mm39)
    mouse_pri_36_ss <- assign_splice_sites(
      mouse_introns_pri_36,
      genome = BSgenome.Mmusculus.UCSC.mm39,
      verbose = T
    )
    ```
    
    <br>
    
    ### Identify Cryptic splice sites
    
    Cryptic splice sites are those that do not match the canonical donor (GT) or acceptor motifs (AG). The <span style="color:coral">`find_cryptic_splice_sites()`</span> function identifies potential cryptic splice sites by comparing sequence motifs in introns to canonical splice site motifs (donor and acceptor) that can be also specified by the user. It compares the identified splice sites with the provided canonical motifs and flags the sites that differ from the canonical patterns, making it useful for studying aberrant splicing events.
    
    ```{r cryptic, eval=FALSE}
    # Identify cryptic (non-canonical) splice sites
    cryptic_ss <- find_cryptic_splice_sites(
      lnc_47_ss,
      genome = BSgenome.Hsapiens.UCSC.hg38,
      canonical_donor = "GT",
      canonical_acceptor = "AG",
      verbose = TRUE
    )
    ```
    
    <br>
    
    ### Extract splice sites motifs for MaxEntScan webtool
    
    **MaxEntScan** is a webtool based on the approach for modeling the sequences of short sequence motifs involved in RNA splicing which simultaneously accounts for non-adjacent as well as adjacent dependencies between positions. This method is based on the _'Maximum Entropy Principle'_ and generalizes most previous probabilistic models of sequence motifs such as weight matrix models and inhomogeneous Markov models [3]. \
    The **MaxEntScan::score5ss** assign scores for donor splice sites according to four models by using **9-mer SEQUENCES** motif as input in a `.fa file`. The 9-mers motif contain 3 bases in the exon and 6 bases in the intron. \
    The **MaxEntScan::score3ss** assign scores for donor splice sites according to three models. It takes as input a `.fa file` of **23-mer SEQUENCES** in which 20 bases are in the intron and 3 bases in the exon. \
    The <span style="color:coral">`extract_ss_motif()`</span> function is used along with BSgenome object of the studied species to retrieve _MaxEntScan::score5ss_ and _MaxEntScan::score3ss_ motif sequences. \
    The user can choose to generate a **fasta file** in the working directory or any specified path as an output, that contains either 9-mers or 23-mers, respectively for each 5' and 3' splice-sites. It generates also a **dataframe** for the coordinates and IDs of the corresponding motifs. \
    The generated fasta files can then be directly utilized by _MaxEntScan_ tools. \
    The first argument `input` is a dataframe containing the intron coordinates. The second argument `genome` is a BSgenome object (the human genome sequence hg38 is used by default). The third argument `type` specifies whether to extract motifs at the 5ss or the 3ss. Users can optionally set the argument `save_fasta` to TRUE to and the `output_file` argument to save a fasta file.
    
    
    ```{r motifs, eval=FALSE}
    # Donor motifs (5'ss)
    motifs_donor <- extract_ss_motif(
      input = lnc_47_ss,
      genome = BSgenome.Hsapiens.UCSC.hg38,
      type = "5ss",
      verbose = T,
      save_fasta = T,
      output_file = file.path(tempdir(), "lnc_47_5ss_motifs.fa")
    )
    
    # Acceptor motifs (3'ss)
    motifs_acc <- extract_ss_motif(
      input = lnc_47_ss,
      genome = BSgenome.Hsapiens.UCSC.hg38,
      type = "3ss",
      verbose = T,
      save_fasta = T,
      output_file = file.path(tempdir(), "lnc_47_3ss_motifs.fa")
    )
    ```
    
    ***
    
    <br>
    
    # Additional Utilities
    
    ---
    
    ### Single-Exon Features
    
    Some genes are unspliced: composed by one single exon and have no introns.   
    The <span style="color:coral">`extract_single_exon()`</span> function can extract all the **_single-exon_** genes or transcripts and returns in a dataframe with their ensembl gene IDs or transcript IDs.
    
    ```{r unspliced, eval=FALSE}
    ## identify single exon genes and transcripts
    single_exon_genes <- extract_single_exon(lnc_47_df, level = "gene")
    single_exon_trans <- extract_single_exon(lnc_47_df, level = "transcript")
    ```
    
    <br>
    
    ### Exon Classification
    
    The <span style="color:coral">`classify_exons()`</span> adds a new column `(EXON_CLASSIFICATION)` to the input dataframe, describing the position of each exon. It labels each exon as _(first_exons)_, _(inner_exons)_, _(last_exons)_ and _(single_exons)_ facilitating position-based analyses. This classification is informative considering that first and last exons often have peculiar regulative roles.  
    
    ```{r exon_class, eval=FALSE}
    # Assign the ordinal position of exons
    lnc_47_class_exons <- classify_exons(lnc_47_df, verbose = TRUE)
    ```
    
    <br>
    
    ### Spliced Transcript Length
    
    The <span style="color:coral">`spliced_trans_length()`</span> function computes the mature (post-splicing) transcript length by summing exon widths per transcript.
    
    ```{r eval=FALSE}
    # Length of spliced transcript  
    lnc_47_spliced_length <- spliced_trans_length(lnc_47_df)
    head(lnc_47_spliced_length)
    ```
    
    <br>
    
    ### Summary Statistics
    
    GencoDymo2 provides the <span style="color:coral">`stat_summary()`</span> function that gives a brief summary of data annotations in a particular GENCODE release regarding genes, transcripts, exons and introns.
    
    ```{r stat, eval=FALSE}
    # Exon length statistics
    lnc_47_exon_stats <- stat_summary(lnc_47_class_exons, type = "exon")
    
    # Intron length statistics
    lnc_47_intron_stats <- stat_summary(introns_lnc_47, type = "intron")
    ```
    
    <br>
    
    ### GC Content Calculation
    
    The <span style="color:coral">`calculate_gc_content()`</span> function  computes GC percentage for any feature set using a BSgenome reference.
    
    ```{r gc-content, eval=FALSE}
    # Human
    lnc_47_gc <- calculate_gc_content(
      lnc_47_df,
      genome = BSgenome.Hsapiens.UCSC.hg38,
      verbose = TRUE
    )
    # Mouse
    mouse_pri_36_gc <- calculate_gc_content(
      mouse_pri_36,
      genome = BSgenome.Mmusculus.UCSC.mm39,
      verbose = TRUE
    )
    ```
    
    <br>
    
    ### Extract CDS Sequences
    
    For coding annotations, <span style="color:coral">`extract_cds_sequences()`</span> writes CDS FASTA files from GRanges objects.
    
    ```{r cds, eval=FALSE}
    # Convert to GRanges and extract
    library(GenomicRanges)
    mouse_pri_36_granges <- GRanges(mouse_pri_36)
    mouse_cds_seqs <- extract_cds_sequences(
      mouse_pri_36_granges,
      BSgenome.Mmusculus.UCSC.mm39,
      save_fasta = TRUE,
      output_file = file.path(tempdir(), "mouse_pri_36_CDS.fa.gz")
      verbose = TRUE
    )
    ```
    
    ***
    
    <br>
    
    # Session Information
    
    ---
    
    The following provides the session information used when compiling this document.
    
    ```{r eval=TRUE, echo=FALSE}
       devtools::session_info()
    ```
    
    ***
    
    <br>
    
    # References
    
    ---
    
    1. Frankish, A., Diekhans, M., Ferreira, A.M., et al. (2023). _GENCODE: reference annotation for the human and mouse genomes in 2023_. Nucleic Acids Research, 51(D1), D942–D949. \
    2. ENCODE Project Consortium. (2012). _An integrated encyclopedia of DNA elements in the human genome_. Nature, 489(7414), 57–74. \
    3. Yeo, G., & Burge, C.B. (2004). _Maximum entropy modeling of short sequence motifs with applications to RNA splicing signals_. Journal of Computational Biology, 11(2-3), 377–394. \
    4. Parada GE, Munita R, Cerda CA and Gysling K. _“A comprehensive survey of non-canonical splice sites in the human transcriptome”_. Nucleic Acids Res. 2014. 42: 10564-10578. \
    5. Abou Alezz, M., Celli, L., Belotti, G., et al. (2020). _GC-AG Introns Features in Long Non-coding and Protein-Coding Genes Suggest Their Role in Gene Expression Regulation_. Front. Genet. 2020 11:488.
    

    back to top

    Software Heritage — Copyright (C) 2015–2026, The Software Heritage developers. License: GNU AGPLv3+.
    The source code of Software Heritage itself is available on our development forge.
    The source code files archived by Software Heritage are available under their own copyright and licenses.
    Terms of use: Archive access, API— Content policy— Contact— JavaScript license information— Web API