Methods for RNASeq (NGS) data analysis

The field of transcriptomics has recently received a major boost from the use of “second-generation” high-throughput sequencing technologies to sequence RNA samples. These technologies provide an unprecedented capacity for surveying the nucleic acid content of cells. In particular, since they began to be applied to transcriptome sequencing, we have become increasingly aware of the large number of human genes that show alternative splice forms, and of the wide variety of splice forms these genes can have, ranging from just two variants to hundreds. At the same time, the accelerating rate of data production with these technologies is moving the bottleneck in many studies from data generation to data analysis. It is therefore important to design methods that can analyze these data quickly and efficiently. Our aim is to use the data from these experiments to determine the exact transcript abundances within the cell: not only a qualitative list of the transcripts that are expressed, but also the expression level of each transcript and alternative variant. At the same time, we aim to develop a highly automated method that will allow us to take advantage of the huge amounts of data available.

To this end, and as part of the ENCODE projects led by Tom Gingeras (Transcriptome) and Tim Hubbard (GENCODE), our group has been developing a number of tools for RNASeq processing. These include the GEM read aligner, the Flux Capacitor for transcript quantification, and NextGeneid for “de novo” transcript modeling and discovery. GEM has a very powerful split-mapping module, making it particularly appropriate for RNASeq mapping. The Flux Capacitor produces transcript quantifications from RNASeq data, taking as input an annotation on a reference genome and a set of RNASeq reads mapped to that genome. We have incorporated these tools, as well as tools developed elsewhere, into GRAPE (General RNASeq Analysis Pipeline Environment), a robust, efficient and scalable software system for the storage, organization, access, and analysis of RNASeq data. The system has three main components: a structured repository hosting the raw and processed data, an RNASeq pipeline (D. Gonzalez) that transparently produces transcript models and quantifications from sequence reads, and a common interface to both the data and the analysis results (M. Röder & J. Lagarde). GRAPE has been used to process the RNASeq data produced by the ENCODE project and elsewhere. It is still under development, but pre-release versions have been deployed at a number of sites. GRAPE is Open Source, and a first version will be released to the public within the next few months.

The GRAPE dashboard, also under development, offers a point of entry to the data and results through a schematic of the project’s experimental design (for instance, http://genome.crg.cat/encode_RNA_dashboard/hg19/test.html is the dashboard for the ENCODE project). It offers easy access to data files and analysis results through both a clickable HTML interface and a well-structured, easily parsable index file designed for batch data processing. The dashboard is automatically kept synchronized with the data released by the ENCODE Data Coordination Center, but can also list extra file entries absent from that source. It is a widely used resource in the community (1,000 server hits per month on average).
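
As an illustration of how such an index file might be consumed in batch, the sketch below parses a hypothetical layout: one line per file, with the file path, a tab, and then semicolon-separated key=value metadata pairs. The layout, the metadata keys (cell, view, replicate) and the file paths are assumptions made for this example; the actual index format of the dashboard is not described here.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical index content: "<path>\t<key>=<value>; <key>=<value>; ..."
# (assumed layout and made-up entries, for illustration only)
SAMPLE_INDEX = (
    "data/K562_rep1.bam\tcell=K562; view=Alignments; replicate=1\n"
    "data/K562_rep1.gtf\tcell=K562; view=TranscriptQuant; replicate=1\n"
    "data/HeLa_rep1.bam\tcell=HeLa; view=Alignments; replicate=1\n"
)


def parse_index(lines: Iterable[str]) -> Iterator[Tuple[str, Dict[str, str]]]:
    """Yield (file path, metadata dict) for each non-empty, non-comment line."""
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Everything before the first tab is the path; the rest is metadata.
        path, _, meta_text = line.partition("\t")
        meta: Dict[str, str] = {}
        for field in meta_text.split(";"):
            key, sep, value = field.strip().partition("=")
            if sep:  # skip empty or malformed fields that lack "="
                meta[key] = value
        yield path, meta


def select(entries: Iterable[Tuple[str, Dict[str, str]]], **criteria: str) -> List[str]:
    """Return the paths whose metadata match every key=value criterion."""
    return [
        path
        for path, meta in entries
        if all(meta.get(key) == value for key, value in criteria.items())
    ]


if __name__ == "__main__":
    entries = list(parse_index(SAMPLE_INDEX.splitlines()))
    # List all alignment files for one cell line, e.g. to feed a batch job.
    for path in select(entries, cell="K562", view="Alignments"):
        print(path)
```

Keeping the metadata as a plain dictionary per file means any combination of attributes can be used as a filter without changing the parser.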

We have also recently developed statistical methods to assess the variability of splicing ratios, to identify genes with condition-specific splicing patterns, and to deconvolute the contributions of expression variation and splicing variation to the variation in the abundance of alternative splice forms (Gonzalez-Porta M, Calvo M, Sammeth M and Guigó R. 2011. “Estimation of alternative splicing variability in human populations”. Genome Research). Moreover, by analyzing the ENCODE data with these methods, we have found that a large fraction of the variability in transcript abundance can be explained simply by the variability in global gene expression (“Landscape of transcription in human cell lines”. The ENCODE RNA Group, including Djebali S, Lagarde J, Merkel A, Tanzer A, Röder M, Ferreira P, Tilgner H, Gonzalez D, Skancke J, Curado J, Derrien T, Ribeca P, Alioto T, Sammeth M, Kingswood C, Johnson R, Guigó R. In preparation).
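
To make the notion of splicing ratios concrete, the following minimal sketch computes, for a single gene, the relative abundance of each isoform in each sample, and then two simple summaries: the coefficient of variation of total gene expression, and the mean Hellinger distance of each sample's splicing ratios from their average. The function names, the toy abundance values and the choice of these two summaries are assumptions made for illustration; this is not the statistical method of the paper cited above.

```python
import math
from statistics import mean, pstdev
from typing import List, Sequence


def splicing_ratios(abundances: Sequence[float]) -> List[float]:
    """Relative abundance of each isoform of one gene in one sample."""
    total = sum(abundances)
    if total == 0:
        raise ValueError("gene is not expressed in this sample")
    return [a / total for a in abundances]


def hellinger(p: Sequence[float], q: Sequence[float]) -> float:
    """Hellinger distance between two discrete distributions (0 to 1)."""
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared / 2)


def expression_cv(samples: Sequence[Sequence[float]]) -> float:
    """Coefficient of variation of total gene expression across samples."""
    totals = [sum(sample) for sample in samples]
    return pstdev(totals) / mean(totals)


def splicing_variability(samples: Sequence[Sequence[float]]) -> float:
    """Mean Hellinger distance of per-sample splicing ratios to their average."""
    ratios = [splicing_ratios(sample) for sample in samples]
    centroid = [mean(column) for column in zip(*ratios)]
    return mean([hellinger(r, centroid) for r in ratios])


if __name__ == "__main__":
    # Toy data: rows are samples, columns are the two isoforms of one gene.
    expression_only = [[10, 30], [20, 60], [40, 120]]  # same ratios, varying totals
    splicing_only = [[90, 10], [50, 50], [10, 90]]     # same totals, varying ratios
    for name, gene in [("expression_only", expression_only),
                       ("splicing_only", splicing_only)]:
        print(name, expression_cv(gene), splicing_variability(gene))
```

The two toy genes are chosen as extremes: in the first, isoform abundances vary only because the gene as a whole is more or less expressed, while in the second the gene total is constant and all variation comes from a switch between isoforms.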

Finally, as part of the GENCODE project, we have developed a pipeline for the experimental verification of transcript annotations. The pipeline, which we termed RT-PCRSeq, is based on efficient multiplexing of RT-PCR reactions using high-throughput sequencing (“The combination of RT-PCR-seq and RNA-seq is essential to catalog all genic elements encoded in the human genome.” Howald C. et al., including Tanzer A, Derrien T, Guigó R. In preparation). We have a long-standing interest in the assessment of methods to delineate transcript structures and estimate transcript abundances, and within GENCODE we have organized the RGASP 1 and 2 community benchmark experiments (http://www.gencodegenes.org/rgasp/). In this past year we have again participated in and helped organize the Sequence Mapping and Assembly Assessment Project (SMAAP, comprising dnGASP and RGASP3), a collaborative effort among researchers to compare and evaluate cutting-edge methods and strategies for mapping and assembling second-generation sequencing data. Its results were presented at a workshop in Barcelona this last April. The project involved several groups from around the world and allowed us to put to the test the RNASeq analysis tools we have been developing, described above: the GEM mapper, the Flux Capacitor, and NextGeneid (ab initio geneid combined with RNASeq reads and split-mapped introns). Results will be made public and submitted for publication in the course of 2012.