Services and software
These are the different applications created and maintained by our group.
Grape is a pipeline for processing and analyzing RNA-Seq data
High throughput sequencing technologies generate vast amounts of data that require subsequent management, analysis and visualization.
The Grape RNAseq Analysis Pipeline Environment implements a set of workflows that allow for easy exploration of RNA-Seq data. Among other features, it enables the users to perform
- quality checks
- read mapping
- generation of expression and splicing statistics
The results are stored in a MySQL database and become immediately available through a RESTful back end server that is connected to a web application using the Google chart tools for display.
The documentation is hosted here: http://grapebuildout.readthedocs.org
Download the latest stable versions of Grape from here:
Follow the instructions in the README.txt
Knowles D, Röder M, Merkel A, Guigó R.
Bioinformatics. 2013 Mar 1;29(5):614-21. doi: 10.1093/bioinformatics/btt016. Epub 2013 Jan 17.
Check out the development version of Grape:
Then follow the steps in the README.txt
You need to have access to a
- MySQL database
Make sure to have the following standard programming languages installed:
The following Perl modules must be installed
You need to have the following module installed in Python:
To give you a preview of the statistical results produced by Grape, our lab has published a set of results for the following RNASeq projects:
New in Grape 1.9.5 (2013-10-21):
- update bootstrap.py from http://downloads.buildout.org/1/bootstrap.py
- Upgrade to grape.recipe.pipeline 1.1.16
- Better error message when the type parameter in the accession is neither fastq nor bam
New in Grape 1.9.4 (2013-02-13):
- Make TEMPLATE parameter optional, the right one is chosen by default now
- Detect wrong template for fastq and bam files.
New in Grape 1.9.3 (2013-02-12):
- Fix configuration checks for bam files
- Allow plus sign in parameter values
- Use a single bootstrap.py, and update the documentation
New in Grape 1.9.2 (2013-01-09):
- New MAXINTRONLENGTH option: Sets the maximum length of splits allowed during the postprocessing of the files generated by gem-2-sam removing the noise. The default is set to 50000, which is reasonable in mammals, however different species may require different settings. Setting it to 0 will remove this filter.
- Improve the documentation for grape.recipe.pipeline: https://graperecipepipeline.readthedocs.org/en/latest/
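The effect of MAXINTRONLENGTH can be sketched as a simple length filter on split mappings. This is only an illustration of the behaviour described in the entry above, not the code used by the pipeline: the function name, the coordinate convention and the inclusive comparison are assumptions.

```python
DEFAULT_MAX_INTRON_LENGTH = 50000  # default value quoted in the 1.9.2 entry


def keep_split(donor_end, acceptor_start,
               max_intron_length=DEFAULT_MAX_INTRON_LENGTH):
    """Return True if a split mapping passes the intron-length filter.

    donor_end and acceptor_start are hypothetical 1-based coordinates of the
    last mapped base before the split and the first mapped base after it.
    """
    if max_intron_length == 0:
        # A value of 0 disables the filter.
        return True
    intron_length = acceptor_start - donor_end - 1
    # Assumption: a split exactly as long as the limit is still kept.
    return intron_length <= max_intron_length


splits = [(1000, 1501), (1000, 61001), (200, 50201)]
kept = [s for s in splits if keep_split(*s)]
print(kept)  # [(1000, 1501), (200, 50201)]
```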
New in Grape 1.9.1 (2012-12-09):
- Fix bug that caused an uninitialized value in exon, junction and transcript tables when the initial fasta_file table was not present in the database.
New in Grape 1.9 (2012-11-11):
- The charts shown in the Raisin web application now have a better layout and a proper title.
- Fix Quick runs. Soft links were not made correctly, and some accession parameters are now filled in correctly as well.
- add FLUXMEM parameter to control the maximum number of GB of memory that the Flux can use
- add MIN_RECURSIVE_MAPPING_TRIM_LENGTH parameter that allows tuning the minimum length to which a read will be trimmed during the recursive mapping.
- check read labels for consistency
- Add the basic scripts to run the IDR. This is not incorporated yet to the pipeline itself, but the scripts can be used to run it.
- Servers now have a project_downloads and project_downloads_folder section that can be configured in servers/devel/buildout.cfg
New in Grape 1.8 (2012-10-09):
- Use new Cufflinks, version 2.0.2
- Upgrade to Grape pipeline 6.5
- Allow for the running of the start script with only species, genome, annotation and read length specified apart from a list of one or two files.
- Set the number of CPUs used by fastqc to one.
- new dependency on grape.recipe.pipeline to share validation code with the Raisin
- Fix a bug that prevented the correct running of the Flux when the read Ids came from HiSeq
- Add verbose to the mysqlimport statement in the build_exon_junctions.RNAseq.pl
- Install the pre version of GEM in grape.recipe.pipeline 1.1.9, with binaries prefixed with "next".
- Download packages from PyPI instead of from the SVN
- add .downloads and .eggs folders
- Raisin web server
- The download paths and the project folders are now configured in the buildout.cfg
- Remove pickle caching code
- Remove code previously used for dumping resources
- Move dumps folder to the top
- improve .gitignore
- pin MySQL-python = 1.2.3
New in Grape 1.7 (2012-07-25):
- Fix a bug that prevented the pipeline from building the inclusion exclusion table
- Speed up the recursive mapping part of the pipeline
- The output from the Flux capacitor is not deleted any more, making it available for further analysis
- In the Raisin web application, links are now shorter when pointing to pages with tabs. For example, it is not necessary to add /tab/experiments to URLs any more, if the experiments tab is the default tab.
New in Grape 1.6 (2012-07-10):
- Now creates the var/log folder needed when starting raisin with supervisord
- Raisin can now be installed even if some annotation information is missing
- Now correctly gets all the scores in qualities and ambiguous
- Parsing reads is fixed
- HiSeq read IDs are now handled
New in Grape 1.5 (2012-07-06):
- Installs and integrates FastQC
- Default parameters to make configuration easier. When no project parameters are given, use read_length. When no project user is given, use anonymous
New in Grape 1.4 (2012-06-27):
Install Cufflinks 2.0.1 binaries and use it for detecting novel transcripts
Use FastQC for quality control
Calculate gene and exon RPKM now from the Flux Capacitor results
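For reference, the usual definition of RPKM (reads per kilobase of feature per million mapped reads) can be written as a short function. This is the standard formula only; how Grape derives the read counts from the Flux Capacitor output is not shown here, and the function and argument names are illustrative.

```python
def rpkm(read_count, feature_length_bp, total_mapped_reads):
    """Reads per kilobase of feature per million mapped reads."""
    return read_count * 1e9 / (feature_length_bp * total_mapped_reads)


# 500 reads on a 2 kb gene in a library of 10 million mapped reads
value = rpkm(500, 2000, 10000000)
print(value)  # 25.0
```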
New in Grape 1.3 (2012-05-18):
Integrating version 6.0 of the pipeline that now depends on a new version of the Flux-Capacitor: 1.0 RC2.
Speed improvements: The pipeline now relies less on the overlap tool by using information already included in the BAM files.
New in Grape 1.2 (2012-04-20):
The automatic installation of dependencies has changed. Before, they were taken from the SVN using mr.developer. Now that releases are available for the packages needed by the pipeline, they are downloaded from our web server using the hexagonit.recipe.download recipe, or taken from PyPI (See grape.recipe.pipeline) using zc.recipe.egg.
New in Grape 1.1.2 (2012-03-30):
Moved the installation instructions to INSTALL.txt and added a README.txt to pipelines/Quick.
New in Grape 1.1.1 (2012-03-30):
Grape 1.1.1 includes a better README.txt that will guide you through the installation process.
New in Grape 1.1 (2012-03-29):
Grape 1.1 allows you to quickly analyse one set of RNAseq reads without a lot of configuration overhead.
We provide annotations and genomes that are known to work perfectly for the following species:
- Homo sapiens
- Caenorhabditis elegans (Coming soon)
- Mus musculus
- Drosophila melanogaster
All versions of Grape:
AcE is a program to aid gene prediction accuracy evaluation. It uses GFF format to make it easy to convert gene prediction results into an analyzable format. Novel features include isoform accuracy evaluation from either the annotated gene or gene prediction perspective or both at the same time. Masking of genomic sequence which has unknown features allows gene predictions in annotated regions to be analyzed in a genomic context. Test sets, such as an artificial sequence test set or genomic context test set, can be generated by selecting specified annotated sequences from a master set.
BioMoby Web Services and Workflows
The bioMoby project aims to provide bioinformatics resources through the web. In this regard, we have designed and implemented a set of Web services that are compliant with bioMoby specifications. We have been focusing on providing genome analysis resources. These resources are of various types:
- Sequence analysis
- Gene prediction (e.g. runGeneIDGFF, runSGP2GFF)
- Signal prediction (transcription regulatory elements or splicing elements) [e.g. runMatScanGFF, runMetaAlignmentGFF]
- ESTs assembly
- Sequence data retrieval (promoter sequences from Ensembl database)
- Data format conversion
A more complete list of Web resources can be found in the WEB SERVICES section. We have also set up some analysis pipelines to illustrate the use of these Web resources. Here are the main pipelines of computational analysis that were implemented:
- Promoter analysis
- ESTs assembly
More information about our pipelines of analysis can be found in the WORKFLOWS section.
compmerge is a program that tries to solve the same problem as cuffmerge. It is not limited to cufflinks models and transcripts, but can work with any .gtf file. It merges the spliced transcripts that have a compatible intron structure and merges the monoexonic transcripts based on simple stranded overlap. The output is a .gtf file of merged transcripts.
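The monoexonic case can be illustrated with a minimal sketch: single-exon transcripts on the same chromosome and strand are merged whenever they overlap. This is not the compmerge implementation; the tuple layout and the function name are assumptions, and the merging of spliced transcripts by intron structure is not covered.

```python
def merge_monoexonic(transcripts):
    """Merge single-exon transcripts that overlap on the same strand.

    Each transcript is a (chrom, strand, start, end) tuple with inclusive
    coordinates. Returns the merged intervals in sorted order.
    """
    merged = []
    # Sorting groups transcripts by chromosome and strand, then by start.
    for chrom, strand, start, end in sorted(transcripts):
        last = merged[-1] if merged else None
        if last and last[0] == chrom and last[1] == strand and start <= last[3]:
            # Overlaps the previous interval: extend it.
            merged[-1] = (chrom, strand, last[2], max(last[3], end))
        else:
            merged.append((chrom, strand, start, end))
    return merged


models = [
    ("chr1", "+", 100, 200),
    ("chr1", "+", 150, 300),
    ("chr1", "-", 180, 250),  # opposite strand, so it is kept separate
    ("chr1", "+", 400, 500),
]
result = merge_monoexonic(models)
print(result)
# [('chr1', '+', 100, 300), ('chr1', '+', 400, 500), ('chr1', '-', 180, 250)]
```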
After getting the compmerge executable, type:
without any argument to get the help.
This will give you all the possible options of compmerge.
Do not hesitate to contact sarahqd at gmail dot com for more information.
geneid is a program to predict genes along a DNA sequence in a large set of organisms. While its accuracy compares favorably to that of other existing tools, geneid is more efficient in terms of speed and memory usage and it offers some rudimentary support for integrating predictions from multiple sources.
You will also find whole genome annotations for different species obtained with geneid in our "Gene Predictions" web pages.
Visualizing pair-wise alignments with annotated axes from GFF files.
We are proud to announce the new version of gff2aplot, which has been re-implemented in Perl. Visit the program's web page downloading section to obtain v2.0. You can obtain the full distribution tarball from there.
Although the "gff2aplot User's Manual" is not finished yet, you can start using the program, as we have written several HTML tutorials that will introduce you to it. We hope you will enjoy them.
A snapshots web page is also available, listing a few examples of what can be done with gff2aplot.
Obtaining plots to compare genomic sequences and/or sources from GFF files. Last available version is 0.98. Get the PostScript version of "gff2ps Users Manual" (v0.96). A Web Server is also available at Institut Pasteur thanks to Catherine Letondal. A new section has been created: HTML HOWTOs for gff2ps. The first two HOWTOs were also added: "Comparing sources with gff2ps" and "Visualizing PostScript output from gff2ps". We hope you will find them useful.
gff2ps was used to obtain the six chromosome arm plots (X, 2L, 2R, 3L, 3R and 4) appearing in the "Coding content of the fly genome" genome map (figure 4), included as a poster in "The Genome Sequence of Drosophila melanogaster" [Adams et al. Science 287(5461):2185-2195(2000)].
We have produced the map of the Human Genome with gff2ps. The 22 autosomes and the X and Y chromosomes were displayed in a large poster appearing as figure 1 of "The Sequence of the Human Genome" [Venter et al. Science 291(5507):1304-1351 (2001)]. The single chromosome pictures can be accessed from here to visualize the web version of the "Annotation of the Celera Human Genome Assembly" poster.
gff2ps has achieved another genome landmark. The mosquito genome annotation for five chromosome arms (2L, 2R, 3L, 3R and X) has been summarized into a two-sided five-page foldout included as figure 1 of "The Genome Sequence of the Malaria Mosquito Anopheles gambiae" [Holt et al. Science 298(5591):129-149 (2002)], available from the "Annotation of the Anopheles gambiae genome sequence" web page.
meta is a program to produce and to align the TF-maps of two gene promoter regions. meta is very useful to characterize promoter regions from orthologous genes, or from co-regulated genes in microarrays, as it reduces the noise very significantly while still detecting the real functional sites.
mmeta is a program to produce and to align the TF-maps of multiple promoter regions. mmeta is very powerful to characterize promoter regions from multiple orthologous genes, or from co-regulated genes in microarrays, as it reduces the noise very significantly while still detecting the real functional sites.
overlap is a program that computes the overlap between two sets of genomic features. More precisely, it takes two gff files of genomic features as input and, for each feature of the first set, says whether it is overlapped by a feature of the second set (this is the basic mode; more detailed information can also be retrieved).
After getting the overlap executable, type:
without any argument to get the help.
This will give you all the possible options of overlap. Basically the output will be equal to the first input file (file1) with additional information about the overlap of its features with the features of the second file (file2). There are 4 basic modes:
- Mode 0 (option -m 0) is to report boolean overlap: 1 if the feature of file1 is overlapped by a feature of file2, 0 otherwise.
- Mode 1 (option -m 1) is to report quantitative overlap: the number of file2 features overlapping a file1 feature.
- Negative mode (option -m -1 for example) is to report the list of coordinates of file2 features overlapping a file1 feature.
- Value mode (option -m n where n>=10 and n is even) is to report the list of values located in field 10 in file2 associated to file2 features overlapping a file1 feature.
You can also ask for inclusion instead of general overlap with the option -i, or for a stranded overlap with the option -st.
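The first three modes and the stranded option can be pictured with the following sketch. It is not the overlap program itself: features are reduced to hypothetical (chrom, start, end, strand) tuples with inclusive coordinates, the function name and return format are assumptions, and the value mode and the inclusion option are left out.

```python
def overlap_report(file1, file2, mode=0, stranded=False):
    """Report, for each file1 feature, its overlap with file2 features.

    mode 0  -> 1 if at least one file2 feature overlaps, 0 otherwise
    mode 1  -> number of overlapping file2 features
    mode <0 -> list of (start, end) coordinates of the overlapping features
    """
    report = []
    for chrom, start, end, strand in file1:
        hits = [
            (s, e)
            for c, s, e, st in file2
            if c == chrom and s <= end and start <= e
            and (not stranded or st == strand)
        ]
        if mode == 0:
            report.append(1 if hits else 0)
        elif mode == 1:
            report.append(len(hits))
        else:
            report.append(hits)
    return report


file1 = [("chr1", 100, 200, "+"), ("chr1", 500, 600, "-")]
file2 = [("chr1", 150, 250, "+"), ("chr1", 190, 210, "-"), ("chr2", 100, 200, "+")]

print(overlap_report(file1, file2, mode=0))                 # [1, 0]
print(overlap_report(file1, file2, mode=1))                 # [2, 0]
print(overlap_report(file1, file2, mode=-1))                # [[(150, 250), (190, 210)], []]
print(overlap_report(file1, file2, mode=1, stranded=True))  # [1, 0]
```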
Do not hesitate to contact sarahqd at gmail dot com for more information.
PATRONUS (from "PATtern Recognition by Optimized Numerical Universal Scoring") is a program designed to compute in a very fast way the exact probability of observing a given number of occurrences of a simple motif (that is, a continuous word without gaps) in a sequence. Its intended scope is the analysis of very long biological sequences, like chromosomes or whole genomes of complex organisms. The probability is computed on the basis of the Markovian statistics of order m for the sequence, that is the recorded number of the occurrences of all the submotifs of length m + 1 in the sequence. Contrary to what many people believe, computing such a probability for a generic motif is a computationally demanding task, mainly because motifs can overlap in non-trivial ways.
A detailed description of both the PATRONUS algorithm and its excellent performance can be found here.
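To make the inputs concrete, the sketch below gathers the Markovian statistics of order m as described above (the counts of all words of length m + 1) and counts the occurrences of a motif, including overlapping ones. It does not reproduce the PATRONUS probability computation; the function names are illustrative.

```python
from collections import Counter


def markov_statistics(sequence, m):
    """Count every word of length m + 1 in the sequence (overlaps included)."""
    k = m + 1
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def count_occurrences(sequence, motif):
    """Count motif occurrences, allowing them to overlap each other."""
    last_start = len(sequence) - len(motif)
    return sum(1 for i in range(last_start + 1) if sequence.startswith(motif, i))


seq = "AAAAC"
stats = markov_statistics(seq, 1)
print(stats["AA"], stats["AC"])      # 3 1
print(count_occurrences(seq, "AA"))  # 3
print(seq.count("AA"))               # 2, because str.count skips overlapping hits
```

The difference between the last two numbers is the kind of self-overlap that makes the exact probability non-trivial to compute.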
SECISaln will predict a SECIS element in the query sequence, split it into its constituent parts and align these against a precompiled database of eukaryotic SECIS elements. The user can choose whether the database sequences are sorted by protein family or by species, thereby offering the possibility of comparing the submitted sequence to other, known SECISes. In addition, SECISaln returns a graphical image of the predicted structure of the user-submitted sequence as well as a multiple structural alignment of all SECIS elements of that type already present in the database.
SECISearch3 and Seblastian: prediction of SECIS elements and selenoproteins
Selenoproteins are proteins containing the uncommon amino acid selenocysteine (Sec). Sec is inserted by a specific translational machinery that recognizes a stem-loop structure, the SECIS element, at the 3′ UTR of selenoprotein genes and recodes a UGA codon within the coding sequence. As UGA is normally a translational stop signal, selenoproteins are generally misannotated and dedicated tools have to be developed for this class of proteins.
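The misannotation problem can be illustrated with a toy translation routine. With the standard genetic code an in-frame UGA truncates the protein, whereas recoding that codon as Sec (one-letter code U) yields the full product. This is a sketch only: the sec_codons argument is a hypothetical way of marking recoded codons, while in the cell the decision depends on the SECIS element.

```python
BASES = "UCAG"
# Standard genetic code, codons enumerated in UCAG order at each position.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(rna, sec_codons=()):
    """Translate an RNA coding sequence, stopping at the first stop codon.

    sec_codons holds the 0-based indices of UGA codons to read as Sec (U).
    """
    protein = []
    for index in range(len(rna) // 3):
        codon = rna[3 * index:3 * index + 3]
        if codon == "UGA" and index in sec_codons:
            protein.append("U")
            continue
        residue = CODON_TABLE[codon]
        if residue == "*":
            break
        protein.append(residue)
    return "".join(protein)


rna = "AUGUGAGGCUGA"
print(translate(rna))                  # M
print(translate(rna, sec_codons={1}))  # MUG
```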
In this webserver (go to http://seblastian.crg.es) we provide public access to two new computational methods for selenoprotein identification and analysis: SECISearch3 replaces its predecessor SECISearch as a tool for prediction of eukaryotic SECIS elements. Seblastian is a new method for selenoprotein gene detection that uses SECISearch3 and then predicts selenoprotein sequences encoded upstream of SECIS elements. Seblastian is able to both identify known selenoproteins and predict new selenoproteins.
An open-access paper describing these methods, including their validation and the prediction of new selenoproteins in many eukaryotic lineages, was published in Nucleic Acids Research: http://nar.oxfordjournals.org/content/early/2013/06/19/nar.gkt550.full
This project is the result of a collaboration with Vadim Gladyshev's lab at Harvard (http://gladyshevlab.bwh.harvard.edu/).
Selenoprofiles is a pipeline for profile-based protein finding in genomes.
Provided one or more protein alignments, it scans a target genome (or any other nucleotide database) and reports the gene structures of homologous genes. It can be used to easily characterize any protein family of interest across a large number of sequenced genomes, allowing a finely tuned filtering of results. Or it can be used with comprehensive sets of input profiles, in order to completely annotate by homology one or more genomes. The pipeline internally runs blast (psitblastn), exonerate (p2g mode) and genewise, combining them into a final set of non-overlapping predictions for all profiles.
Selenoprofiles is highly flexible. Even inexperienced users can edit the filtering procedures, adapting them for each profile. Advanced users can plug in their own code to customize the internal procedures of labelling, filtering, solving overlaps and outputting. It is also possible to write code to annotate genomic features of the predicted gene, such as protein or RNA motifs, which are stored and added to the native selenoprofiles output.
Although the program offers a variety of filtering methods, the default filter is quite effective. For each candidate, a measure of its similarity with the sequences in the profile is computed (AWSI score, see manual). The resulting score is compared to the distribution of this measure within the profile sequences. In this way, very conserved alignment profiles allow only highly similar sequences to pass the filter and be output.
Selenoprofiles can be used with any input protein family, but we initially developed it for selenoproteins. These peculiar proteins contain a selenocysteine, the 21st amino acid, which is inserted in correspondence to specific UGA codons, normally signalling translation termination. In selenoprotein transcripts we find specific secondary structures (SECIS elements), which target a specific UGA to be read as Sec instead of as a stop. Since selenoproteins possess this peculiar feature (recoding of specific stop codons), normal gene prediction programs fail to predict them. Selenoprofiles in contrast can correctly predict selenoprotein genes, by using technical expedients to align selenocysteine positions. Selenoprofiles includes built-in profiles for selenoproteins and other proteins related to selenocysteine, allowing out-of-the-box prediction of these families.
Mariotti M, Guigó R. Selenoprofiles: profile-based scanning of eukaryotic genome sequences for selenoprotein genes. Bioinformatics. 2010 Nov 1;26(21):2656-63. Epub 2010 Sep 21
(Note that the paper refers to outdated version 1)
All "slave" programs run by selenoprofiles must be installed by the user:
- Blast: ftp://ftp.ncbi.nlm.nih.gov/blast/executables/release/LATEST/ [blastall legacy suite required (2.2.2x). Blast+ programs are not supported yet]
- Exonerate: http://www.ebi.ac.uk/about/vertebrate-genomics/software/exonerate
- Genewise: ftp://ftp.ebi.ac.uk/pub/software/unix/wise2/
- Mafft: http://mafft.cbrc.jp/alignment/software/
If you experience problems in the installation of these programs, this page may help you. Selenoprofiles can be installed on any unix system with python 2.6 or newer. A python command line installer (install_selenoprofiles.py) is provided inside the installation package that you can find below. The tar package also contains the manual, which is linked below.
Alternatively you may download the repository from github. With git installed, type in a terminal:
git clone https://github.com/marco-mariotti/selenoprofiles
After obtaining the package, follow the instructions in the README file included in the package. To be used with the built-in profiles, Selenoprofiles requires a few accessory files. These include the large ncbi nr database. This is used with blastp to query with sequences of candidates, to assign a filtering score based on the titles of the blastp matches. If you plan to use selenoprofiles only with your own custom profiles, you can skip its download by performing a minimal installation with:
python install_selenoprofiles.py -min
Output and accessory programs
Several output formats are available, such as gff or fasta for nucleotide or protein sequences. The manual explains how to activate the built-in output types, and how to customize output by adding information of interest. By default, selenoprofiles produces only two types of output files: a fasta alignment of all predictions aligned to their corresponding profile, and a human-readable p2g file showing the alignment gene structure and sequence (an example p2g file can be found here). The selenoprofiles package also contains a few additional programs, suited for projects aimed at searching certain protein families in many target species. The program selenoprofiles_join_alignments retrieves and merges into a single alignment all the results in different species. Then, selenoprofiles_tree_drawer allows their visualization in the phylogenetic tree of the target species. This program requires the installation of the python tree environment ete2: http://ete.cgenomics.org/. Selenoprofiles_tree_drawer can generate images like those below.
Abstract mode (-a):
If you need help with selenoprofiles, do not hesitate to contact me by email: marco.mariotti at crg.eu
Selenoproteins are a group of proteins that contain selenocysteine (Sec), a rare amino acid inserted co-translationally into the protein chain. The Sec codon is UGA, which is normally a stop codon. In selenoproteins UGA is recoded to Sec in the presence of specific signals on selenoprotein gene transcripts. Due to the dual role of the UGA codon, gene prediction programs fail to predict selenoproteins correctly. Selenoprofiles is a homology-based in silico tool able to scan genomes for members of the known selenoprotein families, thus finding both selenoproteins and cysteine homologues. Selenoprofiles is built in python, and it internally runs psitblastn, exonerate, genewise and SECISearch.
Selenoprofiles is tuned to search for selenoprotein genes, and comes out of the box with profile alignments for each known selenoprotein and selenocysteine-related family (Note: profiles will be released soon. The current release contains only the program and a single example profile).
Selenoprofiles can be used to search for any protein family (also non-selenoprotein), given an input profile alignment. This pipeline combines standard gene prediction tools to provide a clean and fast way to scan genomes for protein families, and provides a wide repertoire of output formats which can also be extended by the user. The program allows for a deep level of customization, and provides many built-in methods to filter spurious hits.
NOTE: This page describes selenoprofiles version 2.2. A newer version of this program is available here. Version 1 is no longer maintained.
This version features major improvements on the previous ones, such as:
- improved workflow control
- prediction by blast can be output, allowing use of selenoprofiles in bacterial genomes (exonerate and genewise are eukaryote specific)
- lazy computing implemented
- pre-clustering of the profile alignment: multiple blast are run if the profile is highly variable
- an SQLite database is used to store results, making it possible to search for a high number of families without producing an enormous amount of files, since they can be deleted at the end of computation
- improved customization of the options used with the slave programs, which can potentially be different for each profile
- improved filtering of results: all filtering procedures are defined as pieces of python code which are run internally in selenoprofiles. Several methods useful for filtering are provided. Filtering can be customized for each family
- intra-family and inter-family redundancy of results is removed
- tag blast and gene ontology extensions implemented for filtering (see manual)
Tools for graphical representation of selenoprofiles results are under development and will be released in the next few months.
Download the latest version of the selenoprofiles manual here: http://genome.crg.es/~mmariotti/selenoprofiles_manual.2.2.pdf
For selenoprofiles to work, all the slave programs that it utilizes must already be installed on your machine (blastall, exonerate, genewise). You will also need some external python modules if you want to use all its functionalities. These additional modules are needed if you want to scan genomes for selenoproteins, but may be omitted if you want to scan for your protein family of interest. In this page you can find help to install the slave programs and the additional python modules.
To install selenoprofiles, download this tarball. Then, follow the instructions in the README.
Selenoprofiles was published in Bioinformatics. To read the article or access the online data, check this page. Please cite:
Mariotti M, Guigó R. Selenoprofiles: profile-based scanning of eukaryotic genome sequences for selenoprotein genes. Bioinformatics. 2010 Nov 1;26(21):2656-63. Epub 2010 Sep 21
sgp2 is a program to predict genes by comparing anonymous genomic sequences from different species. It combines tblastx, a sequence similarity search program, with geneid, an ab initio gene prediction program.
You will also find whole genome annotations for different species obtained with sgp2 in our "Gene Predictions" web pages.
SymCurv is a computational ab initio method for nucleosome positioning prediction. It is based on the structural property of natural nucleosome forming sequences, to be symmetrically curved around a local minimum of curvature. The method takes as input the primary DNA sequence, calculates the expected curvature from which it deduces possible centers of nucleosomal sequences, by imposing symmetry constraints. SymCurv's performance is comparable to existing tools but offers the additional advantages of predicting nucleosome positions under two assumed-states (stationary and dynamic) providing insight on the remodelling potential of nucleosomes of possible regulatory function.
The Flux Capacitor
The Flux Capacitor predicts abundances for transcript molecules and alternative splicing events from RNAseq experiments. Additionally, there is a simulation pipeline that is capable of simulating whole transcriptome sequencing experiments.
The GEM (GEnome Multi-tool) Library
The GEM (GEnome Multi-tool) Library is a set of very optimized tools for indexing/querying huge genomes/files. Provided so far are a very fast exhaustive mapper (the GEM mapper), an unconstrained split mapper (the GEM split mapper), and a very fast program to compute genome mappability (the GEM mappability).
You can access the latest version of the GEM Aligner Library by clicking on the link below: