Services and software

These are the different applications created and maintained by our group.

 

AcE

AcE is a program to aid gene prediction accuracy evaluation. It uses GFF format to make it easy to convert gene prediction results into an analyzable format. Novel features include isoform accuracy evaluation from either the annotated gene or gene prediction perspective or both at the same time. Masking of genomic sequence which has unknown features allows gene predictions in annotated regions to be analyzed in a genomic context. Test sets, such as an artificial sequence test set or genomic context test set, can be generated by selecting specified annotated sequences from a master set.

Our version is no longer maintained; the latest AcE version is now available at Bioinformatics.org. You can contact William.S.Hayes from this link.

AStalavista

AStalavista, the Alternative Splicing Transcriptional Landscape Visualization Tool and more, retrieves all alternative splicing events from generic transcript annotations.

BioMoby Web Services and Workflows

The bioMoby project aims to provide bioinformatics resources through the web. In this regard, we have designed and implemented a set of Web services that are compliant with bioMoby specifications. We have been focusing on providing genome analysis resources. These resources are of various types:

  • Sequence analysis
    • Gene prediction (e.g. runGeneIDGFF, runSGP2GFF)
    • Signal prediction (transcription regulatory elements or splicing elements) (e.g. runMatScanGFF, runMetaAlignmentGFF)
    • EST assembly
  • Sequence data retrieval (promoter sequences from Ensembl database)
  • Data format conversion

A more complete list of Web resources can be found in the WEB SERVICES section. We have also set up some analysis pipelines to illustrate the use of these Web resources. These are the main computational analysis pipelines that were implemented:

  • Promoter analysis
  • EST assembly

More information about our pipelines of analysis can be found in the WORKFLOWS section.

compmerge

compmerge is a program that tries to solve the same problem as cuffmerge. It is not limited to cufflinks models and transcripts, but can work with any .gtf file. It merges the spliced transcripts that have a compatible intron structure and merges the monoexonic transcripts based on simple stranded overlap. The output is a .gtf file of merged transcripts.
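As a rough illustration of the monoexonic case only, the following Python sketch merges single-exon transcripts that overlap on the same chromosome and strand. It is not compmerge's code: the function name, the tuple layout and the reading of "simple stranded overlap" as transitive interval merging are all assumptions made for this example.

```python
from collections import defaultdict


def merge_monoexonic(transcripts):
    """Merge single-exon transcripts that overlap on the same chromosome and strand.

    `transcripts` is an iterable of (chrom, start, end, strand) tuples using
    1-based inclusive coordinates, as in GTF. Hypothetical helper, not part
    of compmerge.
    """
    # Group intervals by (chromosome, strand) so that only same-strand
    # transcripts can be merged with each other.
    by_key = defaultdict(list)
    for chrom, start, end, strand in transcripts:
        by_key[(chrom, strand)].append((start, end))

    merged = []
    for (chrom, strand), intervals in sorted(by_key.items()):
        intervals.sort()
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start <= cur_end:
                # Overlaps the running cluster: extend it.
                cur_end = max(cur_end, end)
            else:
                # Gap found: close the cluster and start a new one.
                merged.append((chrom, cur_start, cur_end, strand))
                cur_start, cur_end = start, end
        merged.append((chrom, cur_start, cur_end, strand))
    return merged


if __name__ == "__main__":
    example = [
        ("chr1", 100, 200, "+"),
        ("chr1", 150, 300, "+"),
        ("chr1", 180, 250, "-"),
        ("chr1", 400, 500, "+"),
    ]
    for record in merge_monoexonic(example):
        print(record)
```

In this sketch the two overlapping "+" transcripts collapse into one, while the "-" transcript stays separate even though its coordinates overlap them.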

EXECUTE INSTRUCTIONS:

After getting the compmerge executable, type:

./compmerge

without any argument to get the help.

This will give you all the possible options of compmerge. 

Do not hesitate to contact sarahqd at gmail dot com for more information.

geneid

geneid is a program to predict genes along a DNA sequence in a large set of organisms. While its accuracy compares favorably to that of other existing tools, geneid is more efficient in terms of speed and memory usage, and it offers some rudimentary support for integrating predictions from multiple sources.

You will also find whole genome annotations for different species obtained with geneid in our "Gene Predictions" web pages.

gff2aplot

Visualizing pair-wise alignments with annotated axes from GFF files.

We are proud to announce the new version of gff2aplot, which has been re-implemented in Perl. Visit the downloading section of the program's web page to obtain v2.0. You can obtain the full distribution tarball from there.

Although the "gff2aplot User's Manual" is not finished yet, you can start using the program, as we have written several HTML tutorials that introduce you to its use. We hope you will enjoy them.

A snapshots web page is also available, listing a few examples of what can be done with gff2aplot.

gff2ps

Obtaining plots to compare genomic sequences and/or sources from GFF files. The latest available version is 0.98. Get the PostScript version of the "gff2ps Users Manual" (v0.96). A Web Server is also available at Institut Pasteur thanks to Catherine Letondal. A new section has been created: HTML HOWTOs for gff2ps. The first two HOWTOs have also been added: "Comparing sources with gff2ps" and "Visualizing PostScript output from gff2ps". We hope you will find them useful.

gff2ps was used to obtain the six chromosome arm plots (X, 2L, 2R, 3L, 3R and 4) appearing in the "Coding content of the fly genome" genome map (figure 4), included as a poster in "The Genome Sequence of Drosophila melanogaster" [Adams et al. Science 287(5461):2185-2195(2000)].

We have produced the map of the Human Genome with gff2ps. The 22 autosomes and the X and Y chromosomes were displayed in a large poster appearing as figure 1 of "The Sequence of the Human Genome" [Venter et al. Science 291(5507):1304-1351 (2001)]. The single chromosome pictures can be accessed from here to visualize the web version of the "Annotation of the Celera Human Genome Assembly" poster.

gff2ps has achieved another genome landmark. The mosquito genome annotation for five chromosome arms (2L, 2R, 3L, 3R and X) has been summarized in a two-sided, five-page foldout included as figure 1 of "The Genome Sequence of the Malaria Mosquito Anopheles gambiae" [Holt et al. Science 298(5591):129-149 (2002)], available from the "Annotation of the Anopheles gambiae genome sequence" web page.

Grape

High throughput sequencing technologies generate vast amounts of data that require subsequent management, analysis and visualization.

The Grape RNAseq Analysis Pipeline Environment implements a set of workflows that allow for easy exploration of RNAseq data. Among other features, it enables users to perform:

  • quality checks
  • read mapping
  • generation of expression and splicing statistics

The results are stored in a MySQL database and become immediately available through a RESTful back end server that is connected to a web application using the Google chart tools for display.

Download

Download the latest stable versions of Grape from here:

grape.buildout-1.3.tar.gz

grape.buildout-1.2.tar.gz

grape.buildout-1.1.2.tar.gz

grape.buildout-1.1.1.tar.gz

grape.buildout-1.1.tar.gz

grape.buildout-1.0.tar.gz

Follow the instructions in the README.txt.

Development

Check out the development version of Grape:

svn co --username rnaguest --password rnaguest svn://svn.crg.es/big/grape/grape.buildout/trunk grape.buildout

Then follow the steps in the README.txt

Dependencies

You need to have access to a

  • MySQL database

Make sure to have the following standard programming languages installed:

  • Perl
  • Java
  • R
  • Python

The following Perl modules must be installed:

  • DBI
  • DBD::mysql
  • Bio::DB::Fasta
  • Bundle::BioPerl
  • Bio::Seq
  • Bio::DB::Sam

You need to have the following module installed in Python:

  • virtualenv

Demo

To give you a preview of the statistical results produced by Grape, our lab has published a set of results for the following RNASeq projects:

New in Grape 1.3:

Integrating version 6.0 of the pipeline that now depends on a new version of the Flux-Capacitor: 1.0 RC2.

Speed improvements: The pipeline now relies less on the overlap tool by using information already included in the BAM files.

New in Grape 1.2:

The automatic installation of dependencies has changed. Before, they were taken from the SVN using mr.developer. Now that releases are available for the packages needed by the pipeline, they are downloaded from our web server using the hexagonit.recipe.download recipe, or taken from PyPI (See grape.recipe.pipeline) using zc.recipe.egg.

 

New in Grape 1.1.2:

Moved the installation instructions to INSTALL.txt and added a README.txt to pipelines/Quick.

New in Grape 1.1.1:

Grape 1.1.1 includes a better README.txt that will guide you through the installation process.

New in Grape 1.1:

Grape 1.1 allows you to quickly analyse one set of RNAseq reads without a lot of configuration overhead.

We provide annotations and genomes that are known to work perfectly for the following species:

  • Homo sapiens
  • Caenorhabditis elegans (Coming soon)
  • Mus musculus
  • Drosophila melanogaster

 

meta

meta is a program to produce and align the TF-maps of two gene promoter regions. meta is very useful for characterizing promoter regions from orthologous genes, or from co-regulated genes in microarrays, as it substantially reduces the noise while still detecting the real functional sites.

mmeta

mmeta is a program to produce and align the TF-maps of multiple promoter regions. mmeta is very powerful for characterizing promoter regions from multiple orthologous genes, or from co-regulated genes in microarrays, as it substantially reduces the noise while still detecting the real functional sites.

overlap

overlap is a program that computes the overlap between two sets of genomic features. More precisely, it takes two GFF files of genomic features as input and, for each feature of the first set, reports whether it is overlapped by a feature of the second set. This is the basic mode; other modes retrieve more detailed information.

EXECUTE INSTRUCTIONS:

After getting the overlap executable, type:

./overlap

without any argument to get the help.

This will give you all the possible options of overlap. Basically, the output is the first input file (file1) with additional information about the overlap of its features with the features of the second file (file2). There are 4 basic modes:

  • Mode 0 (option -m 0) reports boolean overlap: 1 if the feature of file1 is overlapped by a feature of file2, 0 otherwise.
  • Mode 1 (option -m 1) reports quantitative overlap: the number of file2 features overlapping a file1 feature.
  • Negative mode (option -m -1 for example) reports the list of coordinates of file2 features overlapping a file1 feature.
  • Value mode (option -m n where n>=10 and n is even) reports the list of values located in field 10 in file2 associated to file2 features overlapping a file1 feature.

You can also ask for inclusion instead of general overlap (option -i), or for a stranded overlap (option -st).
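To make the boolean and quantitative modes concrete, here is a minimal Python sketch of what they compute. It is not the overlap program itself: the function names and the in-memory tuple representation are hypothetical, and treating "inclusion" as "the file1 feature lies entirely within the file2 feature" is an assumption.

```python
def features_overlap(a, b, stranded=False, inclusion=False):
    """Return True if feature b overlaps feature a.

    Features are (chrom, start, end, strand) tuples with 1-based inclusive
    coordinates, as in GFF.
    """
    if a[0] != b[0]:
        return False
    if stranded and a[3] != b[3]:
        return False
    if inclusion:
        # Assumption: inclusion means a is entirely contained in b.
        return b[1] <= a[1] and a[2] <= b[2]
    return a[1] <= b[2] and b[1] <= a[2]


def annotate(file1, file2, mode=0, stranded=False, inclusion=False):
    """Pair every file1 feature with its overlap value against file2.

    mode 0 gives 1/0 (boolean overlap); mode 1 gives the number of
    overlapping file2 features. Other modes are not sketched here.
    """
    if mode not in (0, 1):
        raise ValueError("only modes 0 and 1 are sketched")
    annotated = []
    for a in file1:
        # Quadratic scan, kept simple for illustration.
        hits = [b for b in file2 if features_overlap(a, b, stranded, inclusion)]
        value = int(bool(hits)) if mode == 0 else len(hits)
        annotated.append((a, value))
    return annotated


if __name__ == "__main__":
    file1 = [("chr1", 100, 200, "+"), ("chr1", 500, 600, "-")]
    file2 = [("chr1", 150, 250, "+"), ("chr1", 190, 210, "-"), ("chr2", 100, 200, "+")]
    print(annotate(file1, file2, mode=0))
    print(annotate(file1, file2, mode=1, stranded=True))
```

The all-against-all comparison is only practical for small inputs; a tool meant for whole-genome feature sets would normally sort the features or index them first.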

Do not hesitate to contact sarahqd at gmail dot com for more information.

PATRONUS

PATRONUS (from "PATtern Recognition by Optimized Numerical Universal Scoring") is a program designed to compute in a very fast way the exact probability of observing a given number of occurrences of a simple motif (that is, a continuous word without gaps) in a sequence. Its intended scope is the analysis of very long biological sequences, like chromosomes or whole genomes of complex organisms. The probability is computed on the basis of the Markovian statistics of order m for the sequence, that is the recorded number of the occurrences of all the submotifs of length m + 1 in the sequence. Contrary to what many people believe, computing such a probability for a generic motif is a computationally demanding task, mainly because motifs can overlap in non-trivial ways.
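The order-m statistics mentioned above (the counts of all words of length m + 1) are already enough to estimate how many occurrences of a motif one would expect. The Python sketch below computes only that expected count, using the standard maximal-order Markov estimator; it does not compute the exact probability that PATRONUS provides, and the function names are hypothetical.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every (overlapping) word of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def expected_occurrences(seq, motif, m):
    """Expected number of occurrences of motif under an order-m Markov model.

    Uses the estimator
        E = prod N(w[i : i+m+1]) / prod N(w[i : i+m])
    where the numerator runs over all words of length m + 1 inside the motif
    and the denominator over its interior words of length m.
    """
    if m < 1 or len(motif) < m + 1:
        raise ValueError("need m >= 1 and a motif of length at least m + 1")
    long_counts = kmer_counts(seq, m + 1)
    short_counts = kmer_counts(seq, m)

    numerator = 1.0
    for i in range(len(motif) - m):
        numerator *= long_counts[motif[i:i + m + 1]]

    denominator = 1.0
    for i in range(1, len(motif) - m):
        denominator *= short_counts[motif[i:i + m]]

    # A zero denominator implies a zero numerator as well.
    return numerator / denominator if denominator else 0.0


if __name__ == "__main__":
    print(expected_occurrences("ACGTACGT", "ACG", 1))
```

Comparing the observed count with this expectation says whether a motif looks over- or under-represented, but turning that into a probability requires handling the self-overlaps of the motif, which is the harder part described above.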

A detailed description of both the PATRONUS algorithm and its excellent performance can be found here.

project

project is a program that projects genomic features onto their sequences. Please contact Sarah Djebali (sarah dot djebali at crg dot es) with any questions.

SECISaln

SECISaln will predict a SECIS element in the query sequence, split it into its constituent parts and align these against a precompiled database of eukaryotic SECIS elements. The user can choose whether the database sequences are sorted by protein family or by species, thereby offering the possibility of comparing the submitted sequence to other, known SECISes. In addition, SECISaln returns a graphical image of the predicted structure of the user-submitted sequence as well as a multiple structural alignment of all SECIS elements of that type already present in the database.

Selenoprofiles

Selenoproteins are a group of proteins that contain selenocysteine (Sec), a rare amino acid inserted co-translationally into the protein chain. The Sec codon is UGA, which is normally a stop codon. In selenoproteins, UGA is recoded to Sec in the presence of specific signals on selenoprotein gene transcripts. Due to the dual role of the UGA codon, gene prediction programs fail to predict selenoproteins correctly. Selenoprofiles is a homology-based in silico tool able to scan genomes for members of the known selenoprotein families, thus finding both selenoproteins and cysteine homologues. Selenoprofiles is written in Python, and it internally runs psitblastn, exonerate, genewise and SECISearch.
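The effect of the dual role of UGA can be seen in a small Python sketch. It is only an illustration and not part of selenoprofiles: the function name is hypothetical, and the recoding switch treats every in-frame UGA as Sec, whereas real recoding depends on the transcript signals mentioned above.

```python
from itertools import product

# Standard genetic code, with codons enumerated in UCAG order.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna, recode_uga=False):
    """Translate an mRNA from its first base up to the first stop codon.

    With recode_uga=True, every in-frame UGA is read as selenocysteine
    (one-letter code U) instead of a stop. This is a simplification.
    """
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        codon = mrna[i:i + 3]
        if codon == "UGA" and recode_uga:
            protein.append("U")
            continue
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    mrna = "AUGUGCUGAGGAUAA"
    print(translate(mrna))                   # conventional reading
    print(translate(mrna, recode_uga=True))  # UGA read as Sec
```

A conventional reading truncates the protein at the UGA, which is why a standard gene predictor tends to report a shortened or missing gene for a selenoprotein.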

Selenoprofiles is tuned to search for selenoprotein genes, and comes out of the box with profile alignments for each known selenoprotein and selenocysteine-related family (Note: profiles will be released soon. The current release contains only the program and a single example profile).

Selenoprofiles can be used to search for any protein family (also non-selenoprotein), given an input profile alignment. This pipeline combines standard gene prediction tools to provide a clean and fast way to scan genomes for protein families, and provides a wide repertoire of output formats which can also be extended by the user. The program allows for a deep level of customization, and provides many built-in methods to filter spurious hits.

Selenoprofiles version 2.2 is now available. Version 1 is no longer maintained -- if you need help with version 1, contact us by mail. This version features major improvements on the previous ones, such as:

  • improved workflow control
  • prediction by blast can be output, allowing use of selenoprofiles in bacterial genomes (exonerate and genewise are eukaryote specific)
  • lazy computing implemented
  • pre-clustering of the profile alignment: multiple blast searches are run if the profile is highly variable
  • an SQLite database is used to store results, which makes it possible to search for a large number of families without producing an enormous number of files, since they can be deleted at the end of the computation
  • improved customization of the options used with the slave programs, which can potentially be different for each profile
  • improved filtering of results: all filtering procedures are defined as pieces of python code which are run internally in selenoprofiles. Several methods useful for filtering are provided. Filtering can be customized for each family
  • intra-family and inter-family redundancy of results is removed
  • tag blast and gene ontology extensions implemented for filtering (see manual)

Tools for graphical representation of selenoprofiles results are under development and will be released in the next few months.

MANUAL

Download the latest version of the selenoprofiles manual here: http://genome.crg.es/~mmariotti/selenoprofiles_manual.pdf

INSTALLATION:

For selenoprofiles to work, all the slave programs that it uses (blastall, exonerate, genewise) must already be installed on your machine. You will also need some external Python modules if you want to use all its functionality. These additional modules are needed if you want to scan genomes for selenoproteins, but may be omitted if you only want to scan for your protein family of interest. On this page you can find help with installing the slave programs and the additional Python modules.

You can either use svn to fetch the latest selenoprofiles package:

  svn co --username guest --password guest svn://svn.crg.es/big/selenoprofiles/trunk selenoprofiles

or download this tarball. Then, follow the instructions in the README.

CITATION:

Selenoprofiles was published in Bioinformatics. To read the article or access the online data, check this page. Please cite:

Mariotti M, Guigo R - Selenoprofiles: profile-based scanning of eukaryotic genome sequences for selenoprotein genes. Bioinformatics. 2010 Nov 1;26(21):2656-63. Epub 2010 Sep 21

 

Contact: marco.mariotti@crg.eu

sgp2

sgp2 is a program to predict genes by comparing anonymous genomic sequences from different species. It combines tblastx, a sequence similarity search program, with geneid, an ab initio gene prediction program.

You will also find whole genome annotations for different species obtained with sgp2 in our "Gene Predictions" web pages.

SymCurv

SymCurv is a computational ab initio method for nucleosome positioning prediction. It is based on the structural property of natural nucleosome-forming sequences to be symmetrically curved around a local minimum of curvature. The method takes the primary DNA sequence as input and calculates the expected curvature, from which it deduces possible centers of nucleosomal sequences by imposing symmetry constraints. SymCurv's performance is comparable to that of existing tools, but it offers the additional advantage of predicting nucleosome positions under two assumed states (stationary and dynamic), providing insight into the remodelling potential of nucleosomes with a possible regulatory function.

The Flux Capacitor

The Flux Capacitor predicts abundances for transcript molecules and alternative splicing events from RNAseq experiments. Additionally, there is a simulation pipeline that is capable of simulating whole transcriptome sequencing experiments.

The GEM (GEnome Multi-tool) Library

The GEM (GEnome Multi-tool) Library is a set of highly optimized tools for indexing and querying huge genomes and files. Provided so far are a very fast exhaustive mapper (the GEM mapper), an unconstrained split mapper (the GEM split mapper), and a very fast program to compute genome mappability (the GEM mappability).
