Documentation

Introduction

Express is a database of transcriptome profiles encompassing known and novel transcripts across multiple developmental stages in mouse eye tissues. Express contains transcript-level expression data obtained from 18 lens and 35 retina RNA-Seq mouse samples. We downloaded the raw datasets, aligned them to the reference genome, and quantified transcript-level expression for known and novel transcripts. We then downloaded the reference gene and transcript information and organized it along with the expression data in a MySQL database. Finally, we developed a PHP backend to interact with the database and a frontend to interact with the user and visualize the query results.

Datasets

We downloaded 21 mouse lens and 35 mouse retina samples spanning developmental stages from E15 to P90. Please see Table 1 and Table 2 in our publication for more details.

Preprocessing

The downloaded raw datasets (in FASTQ format) were aligned to the reference mouse genome (mm10) using HISAT. The alignment files (in SAM format) were converted to sorted BAM files and indexed using SAMtools. The sorted BAM files were then passed to StringTie for transcript quantification and discovery, along with reference mouse transcript annotations obtained from Ensembl. The GTF files produced by StringTie, which store the expression levels of known and novel transcripts, were then used to generate a reference annotation file that includes novel transcripts, using StringTie "merge" mode. Once this reference annotation file was obtained, we reran StringTie on the sorted BAM files with the new annotation to collect GTF files containing expression levels for both known and novel transcripts. We then applied quantile normalization to the lens and retina samples separately. The final tables with normalized expression levels were then organized into an SQL table.
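The quantile normalization step can be illustrated with a minimal sketch. This is the textbook procedure (sort each sample, average across samples at each rank, assign the rank averages back), written in plain Python; it is not the code used to build Express, and the function name and tie handling are assumptions made for illustration.

```python
from statistics import mean


def quantile_normalize(columns):
    """Quantile-normalize a list of samples.

    Each element of `columns` is one sample: a list of TPM values for the
    same transcripts in the same order. Ties are broken by position, which
    is a simplification of what dedicated packages do.
    """
    n = len(columns[0])
    if any(len(col) != n for col in columns):
        raise ValueError("all samples must cover the same transcripts")

    # Sort every sample independently, then average across samples per rank.
    sorted_cols = [sorted(col) for col in columns]
    rank_means = [mean(col[i] for col in sorted_cols) for i in range(n)]

    normalized = []
    for col in columns:
        # order[r] is the index of the value holding rank r in this sample.
        order = sorted(range(n), key=lambda i: col[i])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        normalized.append(out)
    return normalized


# Two toy samples with three transcripts each (made-up numbers).
print(quantile_normalize([[1, 2, 3], [30, 10, 20]]))
```

After normalization every sample shares the same set of values, so differences between samples reflect rank changes rather than overall scale. In Express this would be applied to the lens samples and the retina samples as two separate groups.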

We also downloaded gene information from Ensembl BioMart and HGNC to capture the relationships between gene alias, gene name, gene ID and transcript ID for all known transcripts. In addition, we downloaded transcript information from Ensembl, including gene ID and transcript ID. These two tables were then converted into SQL tables and, together with the expression data, stored in a MySQL database.

Download

A dump of the MySQL database can be downloaded using this link (express.sql.gz, 242 MB compressed). A complete guide to setting up a local Express server, along with the source code, is available on its GitHub page.

Usage

Go to the Home page, select a tissue type/subtype, enter a query (or pick one of the sample queries), and then click the search button. The results are shown as a heatmap by default, and the raw expression data is filtered with a cutoff of > 5 TPM. The heatmap has transcripts in its rows and developmental stage:cell subtype in its columns. When no cell subtype is given for a developmental stage, the column represents the whole tissue rather than a particular cell subtype (for lens, E: epithelium and F: fiber; for retina, C: cones and R: rods). You can later change the TPM cutoff setting to filter the expression data at a different threshold, and switch from the raw expression data to the expression data quantile-normalized across samples per tissue type.

The browser view can be toggled using the button on the right-hand side of the navigation. Similarly, you can toggle the heatmap view. Both views and the heatmap data can be exported using the Export button on the right-hand side of the navigation. The views are exported in SVG (scalable vector graphics) format and the data is exported as TSV (tab-separated values). The exported data includes gene name, transcript ID, developmental stage, NCBI BioProject ID, PubMed ID, study reference, novelty flag, averaged raw TPM value across samples and averaged normalized TPM value across samples.

Novelty flags can be 0, 1 or 2. 0 means a known (annotated) transcript (shown as ENSMUSTXXXXXXXXXXX); 1 means an unannotated transcript (shown as MSTRG.XXXX.XXXXX.X); 2 means a completely novel transcript (shown as MSTRG.XXXX.XXXXX.X).
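As an illustration of working with an exported TSV file, the sketch below applies a TPM cutoff and translates the novelty flag into a readable label. The column headers (gene_name, transcript, stage, novelty, value_raw) are assumed to match the API property names and may differ in the actual export; the sample rows are invented placeholders.

```python
import csv
import io

# Meaning of the novelty flag as described in this documentation.
NOVELTY_LABELS = {0: "known (annotated)", 1: "unannotated", 2: "completely novel"}

# Invented rows for illustration; the header names are an assumption.
SAMPLE_TSV = (
    "gene_name\ttranscript\tstage\tnovelty\tvalue_raw\n"
    "GeneA\tTX_EXAMPLE_1\tE15\t0\t12.5\n"
    "GeneA\tTX_EXAMPLE_2\tE15\t2\t3.0\n"
    "GeneA\tTX_EXAMPLE_3\tP0\t1\t8.0\n"
)


def filter_rows(tsv_text, cutoff=5.0, value_column="value_raw"):
    """Keep rows whose TPM value exceeds the cutoff and label their novelty."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    kept = []
    for row in reader:
        if float(row[value_column]) > cutoff:
            row["novelty_label"] = NOVELTY_LABELS[int(row["novelty"])]
            kept.append(row)
    return kept


for row in filter_rows(SAMPLE_TSV):
    print(row["transcript"], row["stage"], row["value_raw"], row["novelty_label"])
```

To read a real export, pass the file contents (for example from `open(path, encoding="utf-8").read()`) in place of `SAMPLE_TSV` and adjust the column names to whatever the file actually contains.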

API

The backend PHP API queries the MySQL database for expression levels of transcripts given a tissue type, a TPM cutoff and a query (e.g. gene synonym/name, Ensembl gene ID, MGI gene ID, Ensembl transcript ID or chromosomal location). The API URL is https://sysbio.sitehost.iu.edu/express/app/api.php and it accepts five GET parameters: expression, query, tissue, cutoff and value (e.g. https://sysbio.sitehost.iu.edu/express/app/api.php?expression=transcript&query=Cryb2&tissue=lens&cutoff=1&value=raw).
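A request URL of this form can be assembled programmatically. The following is a minimal sketch using only the Python standard library; the helper name `build_query_url` and its default arguments are illustrative choices taken from the example URL above, not part of the Express API.

```python
from urllib.parse import urlencode

API_URL = "https://sysbio.sitehost.iu.edu/express/app/api.php"


def build_query_url(query, tissue="lens", expression="transcript",
                    cutoff=1, value="raw"):
    """Build an Express API URL from the GET parameters described above."""
    params = {
        "expression": expression,  # "gene" or "transcript"
        "query": query,            # e.g. a gene name or Ensembl ID
        "tissue": tissue,          # "lens", "lens_subtype" or "retina"
        "cutoff": cutoff,          # TPM cutoff
        "value": value,            # "raw" or "normalized"
    }
    # urlencode also escapes characters that are not URL-safe in the query.
    return f"{API_URL}?{urlencode(params)}"


print(build_query_url("Cryb2"))
```

The resulting URL can be fetched with any HTTP client, for example `urllib.request.urlopen`, and the response body parsed as JSON.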

Input

The expression parameter can be one of the following:

gene
transcript

The query parameter can be one of the following:

Gene name
Ensembl gene ID
MGI gene ID
Ensembl transcript ID
Chromosomal location

The tissue parameter can be one of the following:

lens
lens_subtype
retina

The cutoff parameter can be one of the following:

The value parameter can be one of the following:

normalized, TPM value after quantile normalization per tissue type
raw, TPM value without quantile normalization

Output

The output will be a JSON array of objects with the following properties:

gene / transcript, Ensembl gene ID / Ensembl transcript ID
gene_name, gene name
stage, developmental stage
bioproject_id, NCBI BioProject ID
pubmed_id, PubMed ID
reference, study reference
novelty, novelty flag
value_raw, averaged raw TPM value across samples
value_normalized, averaged normalized TPM value across samples
location, chromosomal location
value, normalized/raw value for heatmap view
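To show how such a response might be consumed, the sketch below parses a JSON array and pivots it into a transcript-by-stage matrix similar to the heatmap layout. The sample payload is invented for illustration (placeholder IDs and values, and only a subset of the properties), and whether numeric fields arrive as JSON numbers or strings is an assumption handled with `float()`.

```python
import json
from collections import defaultdict

# Invented payload for illustration only; IDs and values are placeholders.
SAMPLE_RESPONSE = """
[
  {"transcript": "TX_EXAMPLE_1", "gene_name": "GeneA", "stage": "E15", "novelty": 0, "value": 12.5},
  {"transcript": "TX_EXAMPLE_1", "gene_name": "GeneA", "stage": "P0", "novelty": 0, "value": 30.0},
  {"transcript": "TX_EXAMPLE_2", "gene_name": "GeneA", "stage": "E15", "novelty": 2, "value": 7.25}
]
"""


def pivot_for_heatmap(records, id_key="transcript"):
    """Return (stages, matrix) where matrix[id][stage] is the heatmap value."""
    stages = []
    matrix = defaultdict(dict)
    for rec in records:
        stage = rec["stage"]
        if stage not in stages:
            stages.append(stage)  # keep first-seen column order
        # float() accepts both JSON numbers and numeric strings.
        matrix[rec[id_key]][stage] = float(rec["value"])
    return stages, dict(matrix)


stages, matrix = pivot_for_heatmap(json.loads(SAMPLE_RESPONSE))
for transcript_id, by_stage in matrix.items():
    # Stages with no record for this transcript are shown as None.
    print(transcript_id, [by_stage.get(s) for s in stages])
```

For gene-level queries (expression=gene), pass `id_key="gene"` so that rows are keyed by the Ensembl gene ID instead.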
