Services and software
Here is a description of the different services available.
AcE
AcE is a program to aid gene prediction accuracy evaluation. It uses GFF format to make it easy to convert gene prediction results into an analyzable format. Novel features include isoform accuracy evaluation from either the annotated gene or gene prediction perspective or both at the same time. Masking of genomic sequence which has unknown features allows gene predictions in annotated regions to be analyzed in a genomic context. Test sets, such as an artificial sequence test set or genomic context test set, can be generated by selecting specified annotated sequences from a master set.
Our version is no longer maintained; the latest AcE version is now available at Bioinformatics.org. You can contact William S. Hayes from this link.
AStalavista
AStalavista, the Alternative Splicing Transcriptional Landscape Visualization Tool and more, retrieves all alternative splicing events from generic transcript annotations.
BioMoby Web Services and Workflows
The bioMoby project aims to provide bioinformatics resources through the web. In this regard, we have designed and implemented a set of Web services that are compliant with bioMoby specifications. We have been focusing on providing genome analysis resources. These resources are of various types:
- Sequence analysis
- Gene prediction (e.g. runGeneIDGFF, runSGP2GFF)
- Signal prediction (transcription regulatory elements or splicing elements) (e.g. runMatScanGFF, runMetaAlignmentGFF)
- EST assembly
- Sequence data retrieval (promoter sequences from Ensembl database)
- Data format conversion
A more complete list of Web resources can be found in the WEB SERVICES section. We have also set up some analysis pipelines to illustrate the use of these Web resources. These are the main computational analysis pipelines that were implemented:
- Promoter analysis
- EST assembly
More information about our pipelines of analysis can be found in the WORKFLOWS section.
DeathBase
DeathBase is a database of proteins involved in cell death. It compiles relevant data on the function, structure and evolution of proteins involved in apoptosis and other forms of cell death in several organisms. The information contained in this database is manually curated. You can help maintain DeathBase by editing the wiki page for any protein.
Environment for Tree Exploration (ETE)
ETE is a Python programming toolkit that assists in the automated manipulation, analysis and visualization of hierarchical trees. Besides a broad set of tree handling options, ETE provides specific methods to analyze phylogenetic and clustering trees. It also supports large tree data structures, node annotation, independent editing and analysis of tree partitions, and the association of trees with external data such as multiple sequence alignments or numerical matrices.
ETE is available at http://ete.cgenomics.org
geneid
geneid is a program to predict genes along a DNA sequence in a large set of organisms. While its accuracy compares favorably to that of other existing tools, geneid is more efficient in terms of speed and memory usage, and it offers some rudimentary support for integrating predictions from multiple sources.
You will also find whole genome annotations for different species obtained with geneid in our "Gene Predictions" web pages.
gff2aplot
Visualizing pair-wise alignments with annotated axes from GFF files.
We are proud to announce the new version of gff2aplot, which has been re-implemented in Perl. Visit the download section of the program's web page to obtain v2.0. You can obtain the full distribution tarball from there.
Although the "gff2aplot User's Manual" is not finished yet, you can start using the program, as we have written several HTML tutorials that introduce how to use it. We hope you will enjoy them.
A snapshots web page is also available, listing a few examples of what can be done with gff2aplot.
gff2ps
Obtaining plots to compare genomic sequences and/or sources from GFF files. The latest available version is 0.98. Get the PostScript version of the "gff2ps Users Manual" (v0.96). A Web Server is also available at Institut Pasteur thanks to Catherine Letondal. A new section has been created: HTML HOWTOs for gff2ps. The first two HOWTOs have been added: "Comparing sources with gff2ps" and "Visualizing PostScript output from gff2ps". We hope you will find them useful.
gff2ps was used to obtain the six chromosome arm plots (X, 2L, 2R, 3L, 3R and 4) appearing in the "Coding content of the fly genome" genome map (figure 4), included as a poster in "The Genome Sequence of Drosophila melanogaster" [Adams et al. Science 287(5461):2185-2195(2000)].
We have produced the map of the Human Genome with gff2ps. The 22 autosomes and the X and Y chromosomes were displayed in a large poster appearing as figure 1 of "The Sequence of the Human Genome" [Venter et al. Science 291(5507):1304-1351 (2001)]. The single chromosome pictures can be accessed from here to visualize the web version of the "Annotation of the Celera Human Genome Assembly" poster.
gff2ps has achieved another genome landmark. The mosquito genome annotation for five chromosome arms (2L, 2R, 3L, 3R and X) has been summarized in a two-sided, five-page foldout included as figure 1 of "The Genome Sequence of the Malaria Mosquito Anopheles gambiae" [Holt et al. Science 298(5591):129-149 (2002)], available from the "Annotation of the Anopheles gambiae genome sequence" web page.
meta
meta is a program to produce and to align the TF-maps of two gene promoter regions. meta is very useful for characterizing promoter regions from orthologous genes, or from co-regulated genes in microarrays, as it significantly improves the signal-to-noise ratio while still detecting the real functional sites.
MetaPhOrs
MetaPhOrs is a public repository of phylogeny-based orthology and paralogy predictions that were computed using resources available in seven popular homology prediction services (PhylomeDB, EnsemblCompara, EggNOG, OrthoMCL, COG, Fungal Orthogroups, and TreeFam). Currently, more than 306 million unique homologous protein pairs are deposited in the MetaPhOrs database. These predictions were retrieved from 705,123 phylogenetic trees for 829 genomes. For each prediction, MetaPhOrs provides a Consistency Score and an Evidence Level describing its reliability, together with the number of trees and links to their source databases.
Metazoan mt-tRNAs
A database of tetrapod mitochondrial tRNAs. The database includes secondary-structure-based alignments of mt-tRNAs from 277 species with completely sequenced tetrapod mitochondrial genomes as of 2007.
mmeta
mmeta is a program to produce and to align the TF-maps of multiple promoter regions. mmeta is very powerful for characterizing promoter regions from multiple orthologous genes, or from co-regulated genes in microarrays, as it significantly improves the signal-to-noise ratio while still detecting the real functional sites.
overlap
overlap is a program that computes the overlap between two sets of genomic features. More precisely, it takes two GFF files of genomic features as input and, for each feature of the first set, reports whether it is overlapped by a feature of the second set. This is the basic mode; more detailed information can also be retrieved.
EXECUTION INSTRUCTIONS:
After getting the overlap executable, type:
./overlap
without any argument to get the help.
This will give you all the possible options of overlap. Basically, the output is the first input file (file1) with additional information about the overlap of its features with the features of the second file (file2). There are 4 basic modes:
- Mode 0 (option -m 0) reports boolean overlap: 1 if the feature of file1 is overlapped by a feature of file2, 0 otherwise.
- Mode 1 (option -m 1) reports quantitative overlap: the number of file2 features overlapping a file1 feature.
- Negative mode (option -m -1 for example) reports the list of coordinates of file2 features overlapping a file1 feature.
- Value mode (option -m n where n>=10 and n is even) reports the list of values located in field 10 in file2 associated to file2 features overlapping a file1 feature.
You can also ask for inclusion instead of general overlap (option -i), or for a stranded overlap (option -st).
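To make the boolean and quantitative modes concrete, here is a minimal Python sketch of the idea. It is not the overlap program itself: the Feature type and the function names are hypothetical, the input is assumed to be tab-separated GFF lines with 1-based inclusive coordinates, and the all-against-all comparison is chosen for clarity rather than speed.

```python
from typing import List, NamedTuple


class Feature(NamedTuple):
    """The GFF fields needed to test overlap (hypothetical helper type)."""
    seqname: str
    start: int   # 1-based, inclusive
    end: int     # inclusive
    strand: str


def parse_gff_line(line: str) -> Feature:
    """Extract seqname, start, end and strand from one tab-separated GFF line."""
    fields = line.rstrip("\n").split("\t")
    return Feature(fields[0], int(fields[3]), int(fields[4]), fields[6])


def overlaps(a: Feature, b: Feature, stranded: bool = False) -> bool:
    """True if a and b share at least one base on the same sequence."""
    if a.seqname != b.seqname:
        return False
    if stranded and a.strand != b.strand:
        return False
    return a.start <= b.end and b.start <= a.end


def count_overlaps(set1: List[Feature], set2: List[Feature],
                   stranded: bool = False) -> List[int]:
    """Quantitative overlap: number of set2 features overlapping each set1 feature."""
    return [sum(overlaps(a, b, stranded) for b in set2) for a in set1]


def boolean_overlaps(set1: List[Feature], set2: List[Feature],
                     stranded: bool = False) -> List[int]:
    """Boolean overlap: 1 if any set2 feature overlaps the set1 feature, else 0."""
    return [int(n > 0) for n in count_overlaps(set1, set2, stranded)]


file1 = [parse_gff_line(l) for l in (
    "chr1\tsrc\texon\t100\t200\t.\t+\t.",
    "chr1\tsrc\texon\t500\t600\t.\t-\t.",
)]
file2 = [parse_gff_line(l) for l in (
    "chr1\tsrc2\texon\t150\t250\t.\t+\t.",
    "chr1\tsrc2\texon\t190\t300\t.\t-\t.",
    "chr2\tsrc2\texon\t100\t200\t.\t+\t.",
)]

print(boolean_overlaps(file1, file2))
print(count_overlaps(file1, file2))
print(count_overlaps(file1, file2, stranded=True))
```

For large files, sorting both sets by sequence and start coordinate and sweeping through them once would avoid comparing every pair.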
Do not hesitate to contact sarahqd at gmail dot com for more information.
PATRONUS
PATRONUS (from "PATtern Recognition by Optimized Numerical Universal Scoring") is a program designed to compute very quickly the exact probability of observing a given number of occurrences of a simple motif (that is, a continuous word without gaps) in a sequence. Its intended scope is the analysis of very long biological sequences, like chromosomes or whole genomes of complex organisms. The probability is computed on the basis of the Markovian statistics of order m for the sequence, that is, the recorded number of occurrences of all the submotifs of length m + 1 in the sequence. Contrary to what many people believe, computing such a probability for a generic motif is a computationally demanding task, mainly because motifs can overlap in non-trivial ways.
A detailed description of both the PATRONUS algorithm and its excellent performance can be found here.
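As a small illustration of the two ingredients mentioned above, the following Python sketch records the number of occurrences of all submotifs of length m + 1 in a sequence and counts the occurrences of a motif, overlapping ones included. It only shows the inputs to the problem; it does not compute the probability and is not the PATRONUS algorithm. The function names are hypothetical.

```python
from collections import Counter


def submotif_counts(sequence: str, m: int) -> Counter:
    """Count every word of length m + 1 in the sequence (order-m statistics)."""
    k = m + 1
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def count_occurrences(sequence: str, motif: str) -> int:
    """Count occurrences of motif in sequence, allowing overlaps."""
    count = 0
    pos = sequence.find(motif)
    while pos != -1:
        count += 1
        # Restart one position later so overlapping matches are found too.
        pos = sequence.find(motif, pos + 1)
    return count


print(submotif_counts("ACGACG", 1))
print(count_occurrences("AAAA", "AA"))
```

The second call shows why overlaps matter: the word AA occurs three times in AAAA, whereas a non-overlapping count such as Python's str.count would report only two.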
PhylomeDB
PhylomeDB is a public database for complete collections of gene phylogenies (phylomes). It allows users to interactively explore the evolutionary history of genes through the visualization of phylogenetic trees and multiple sequence alignments. Moreover, PhylomeDB provides genome-wide orthology and paralogy predictions, which are based on the analysis of the phylogenetic trees. The automated pipeline used to reconstruct trees aims at providing a high-quality phylogenetic analysis of different genomes, including Maximum Likelihood or Bayesian tree inference, alignment trimming and evolutionary model testing. PhylomeDB also includes a public download section with the complete set of trees, alignments and orthology predictions.
Predictor of Interactions with Molecular Chaperones
We use a series of stringent relationships between abundance, solubility and chaperone usage of proteins. Based on these relationships, we show that the need of Escherichia coli proteins for the chaperonin GroEL can be predicted with 86% accuracy. Furthermore, from the observation that the abundance and solubility of proteins depend on the physicochemical properties of their amino acid sequences, we demonstrate that the requirement for GroEL can also be predicted directly from the sequences with 90% accuracy. These results indicate that the physicochemical properties of the amino acid sequences represent an essential component of the cellular quality control system that ensures the maintenance of protein homeostasis in living systems.
Predictor of Protein Amyloidogenicity
Protein aggregation causes many devastating neurological and systemic diseases and represents a major problem in the preparation of recombinant proteins in biotechnology. Major advances in understanding the causes of this phenomenon have been made through the realisation that the analysis of the physico-chemical characteristics of the amino acids can provide accurate predictions about the rates of growth of the misfolded assemblies and the specific regions of the sequences that promote aggregation. More recently it has also been shown that the toxicity in vivo of protein aggregates can be predicted by estimating the propensity of polypeptide chains to form protofibrillar assemblies.
Predictor of Protein Folding Propensities
With the advent of proteomics, there is an increasing need for tools that predict the properties of large numbers of proteins from the information provided by their amino acid sequences, even when their structures are unknown. One of the most important types of predictions concerns whether proteins will fold or aggregate. The corresponding folding and aggregation propensity profiles are calculated, respectively, using the CamFold method, which we introduce in this server, and the Zyggregator method. Our results indicate that the kinetic behavior of proteins is, to a large extent, determined by the interplay between regions of low folding and high aggregation propensities.
Predictor of protein-RNA interactions
Fast predictions of RNA-Protein interactions and domains
Online service: catRAPID v2.0
Please note:
Query sequences should be pasted in the 'Protein sequence' and 'RNA sequence' text area.
The catRAPID server accepts only amino acid and nucleic acid sequences, given as lines of sequence data without a FASTA definition line, sequence identifiers or other symbols, e.g.:
MSEYIRVTEDENDEPIEIPSEDDGTVLLSTVTAQFPGACGLRYRNPVSQCMRGVRLVEGILHAPDAGWGNLVYVVNYPKDNKRKMDETDASSAVKVKRAVQKTSDLIVLGLPW KTTEQDLKDYFSTFGEVLMVQVKKDLKTGHSKGFGFVRFTEYETQVKVMSQRHMIDGRWCDCKLPNSKQSPDEPLRSRKVFVGRCTEDMTAEELQQFFC
General guidelines are reported here below:
1. Use one word to identify your protein and your RNA
2. Use standard IUB/IUPAC amino acid and nucleic acid codes.
The nucleic acid codes supported are:
A adenosine; C cytidine; G guanosine; U uridine.
The accepted amino acid codes are:
A alanine; C cysteine; E glutamate; D aspartate; F phenylalanine; G glycine; H histidine; I isoleucine; K lysine; L leucine;
M methionine; N asparagine; P proline; Q glutamine; R arginine; S serine; T threonine; V valine; W tryptophan; Y tyrosine.
3. Blank lines are not allowed in the middle of the sequence input (no spaces or empty lines).
4. Sequence identifiers, such as an accession, accession.version or gi number (e.g., p01013, AAA68881.1, 129295), are not supported in the protein and RNA sequence fields.
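The guidelines above can be checked before submitting a query. The Python sketch below is a hypothetical pre-submission check, not part of the catRAPID server: the function name and the messages are made up for illustration, and it assumes upper-case input restricted to the codes listed above.

```python
# Alphabets copied from the lists of accepted codes above.
PROTEIN_CODES = set("ACDEFGHIKLMNPQRSTVWY")
RNA_CODES = set("ACGU")


def validate(sequence: str, kind: str) -> list:
    """Return a list of problems found in a query; an empty list means it looks fine.

    kind is "protein" or "rna" (hypothetical parameter).
    """
    allowed = PROTEIN_CODES if kind == "protein" else RNA_CODES
    problems = []
    # Ignore newlines at the very start or end; only inner blank lines matter.
    lines = sequence.strip("\n").split("\n")
    if any(line.startswith(">") for line in lines):
        problems.append("FASTA definition line found")
    if any(line.strip() == "" for line in lines):
        problems.append("blank line found")
    # Check every remaining character against the accepted alphabet.
    residues = "".join(line for line in lines if not line.startswith(">"))
    bad = sorted(set(residues) - allowed)
    if bad:
        problems.append("unsupported symbols: " + ", ".join(bad))
    return problems


print(validate("MSEYIRVTEDENDEPIEIPSEDDGTVLLSTV", "protein"))
print(validate("ACGT", "rna"))
print(validate(">my_protein\nMSEY", "protein"))
```

Note that a DNA sequence fails the RNA check because T is not among the supported nucleic acid codes, and that spaces are reported as unsupported symbols.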
Predictor of Soluble Expression
Each step in the process of gene expression, from the transcription of DNA into mRNA to the folding and posttranslational modification of proteins, is regulated by complex cellular mechanisms. At the same time, stringent conditions on the physicochemical properties of proteins, and hence on the nature of their amino acids, are imposed by the need to avoid aggregation at the concentrations required for optimal cellular function. A relationship is therefore expected to exist between mRNA expression levels and protein solubility in the cell. By investigating such a relationship, we formulate a method that enables the prediction of the maximal levels of mRNA expression in Escherichia coli with an accuracy of 83% and of the solubility of recombinant human proteins expressed in E. coli with an accuracy of 86%.
project
project is a program that projects genomic features onto their sequences. Please contact Sarah Djebali (sarah dot djebali at crg dot es) with any questions.
SECISaln
SECISaln will predict a SECIS element in the query sequence, split it into its constituent parts and align these against a precompiled database of eukaryotic SECIS elements. The user can choose whether the database sequences are sorted by protein family or by species, thereby offering the possibility of comparing the submitted sequence to other, known SECISes. In addition, SECISaln returns a graphical image of the predicted structure of the user-submitted sequence as well as a multiple structural alignment of all SECIS elements of that type already present in the database.
Selenoprofiles
Selenoproteins are a group of proteins that contain selenocysteine (Sec), a rare amino acid inserted co-translationally into the protein chain. The Sec codon is UGA, which is normally a stop codon. In selenoproteins, UGA is recoded to Sec in the presence of specific signals on selenoprotein gene transcripts. Due to the dual role of the UGA codon, gene prediction programs fail to predict selenoproteins correctly. Selenoprofiles is a homology-based in silico tool able to scan genomes for members of the known selenoprotein families, thus finding both selenoproteins and cysteine homologues. Selenoprofiles is built in Python, and it internally runs psitblastn, exonerate, genewise and SECISearch.
Selenoprofiles is tuned to search for selenoprotein genes, and comes out of the box with profile alignments for each known selenoprotein and selenocysteine-related family (Note: profiles will be released soon. The current release contains only the program and a single example profile).
Selenoprofiles can be used to search for any protein family (also non-selenoprotein), given an input profile alignment. This pipeline combines standard gene prediction tools to provide a clean and fast way to scan genomes for protein families, and provides a wide repertoire of output formats which can also be extended by the user. The program allows for a deep level of customization, and provides many built-in methods to filter spurious hits.
Selenoprofiles version 2.2 is now available. Version 1 is no longer maintained -- if you need help with version 1, contact us by mail. This version features major improvements on the previous ones, such as:
- improved workflow control
- prediction by blast can be output, allowing use of selenoprofiles in bacterial genomes (exonerate and genewise are eukaryote specific)
- lazy computing implemented
- pre-clustering of the profile alignment: multiple blast are run if the profile is highly variable
- an SQLite database is used to store results, which allows searching for a large number of families without producing an enormous number of files, since they can be deleted at the end of computation
- improved customization of the options used with the slave programs, which can potentially be different for each profile
- improved filtering of results: all filtering procedures are defined as pieces of Python code which are run internally in selenoprofiles. Several methods useful for filtering are provided. Filtering can be customized for each family
- intra-family and inter-family redundancy of results is removed
- tag blast and gene ontology extensions implemented for filtering (see manual)
Tools for graphical representation of selenoprofiles results are under development and will be released in the next few months.
MANUAL
Download the latest version of the selenoprofiles manual here: http://genome.crg.es/~mmariotti/selenoprofiles_manual.pdf
INSTALLATION:
For selenoprofiles to work, all the slave programs that it uses (blastall, exonerate, genewise) must already be installed on your machine. You will also need some external Python modules if you want to use all its functionalities. These additional modules are needed if you want to scan genomes for selenoproteins, but may be omitted if you only want to scan for your protein family of interest. On this page you can find help with installing the slave programs and the additional Python modules.
You can either use svn to fetch the latest selenoprofiles package:
svn co --username guest --password guest svn://svn.crg.es/big/selenoprofiles/trunk selenoprofiles
or download this tarball. Then, follow the instructions in the README.
CITATION:
Selenoprofiles was published in Bioinformatics. To read the article or access the online data, check this page. Please cite:
Mariotti M, Guigo R - Selenoprofiles: profile-based scanning of eukaryotic genome sequences for selenoprotein genes. Bioinformatics. 2010 Nov 1;26(21):2656-63. Epub 2010 Sep 21
Contact: marco.mariotti@crg.eu
sgp2
sgp2 is a program to predict genes by comparing anonymous genomic sequences from different species. It combines tblastx, a sequence similarity search program, with geneid, an ab initio gene prediction program.
You will also find whole genome annotations for different species obtained with sgp2 in our "Gene Predictions" web pages.
SymCurv
SymCurv is a computational ab initio method for nucleosome positioning prediction. It is based on a structural property of natural nucleosome-forming sequences: they are symmetrically curved around a local minimum of curvature. The method takes the primary DNA sequence as input and calculates the expected curvature, from which it deduces possible centers of nucleosomal sequences by imposing symmetry constraints. SymCurv's performance is comparable to that of existing tools, but it offers the additional advantage of predicting nucleosome positions under two assumed states (stationary and dynamic), providing insight into the remodelling potential of nucleosomes with a possible regulatory function.
The Flux Capacitor
The Flux Capacitor predicts abundances for transcript molecules and alternative splicing events from RNAseq experiments. Additionally, there is a simulation pipeline that is able to simulate whole transcriptome sequencing experiments.
The GEM (GEnome Multi-tool) Library
The GEM (GEnome Multi-tool) Library is a set of highly optimized tools for indexing and querying huge genomes and files. Provided so far are a very fast exhaustive mapper (the GEM mapper), an unconstrained split mapper (the GEM split mapper), and a very fast program to compute genome mappability (the GEM mappability).
treeKO
TreeKO is a Python package used to compare phylogenetic trees. Currently it contains two different programs:
Tree comparison The tree comparison algorithm has been designed in order to be able to compare trees that have undergone gene loss and gene duplication processes and therefore do not necessarily have the same number of leaves. TreeKO computes all the possible pruned trees in each original tree by splitting the trees by the duplication nodes and reassembling the trees with combinations of pruned trees. The pruned trees are then compared all against all, pruning the leaves that are not common in both pruned trees. TreeKO offers two distance measures that are modifications of the Robinson & Foulds distance. The speciation distance will compute the distance between two trees without penalizing for gene loss and duplication events. On the other hand the strict distance will compute the distance between the complete structures of the two trees.
Phylome support The phylome support algorithm has been designed as a way to identify conflicting nodes and to incorporate genome-wide information on species trees. The algorithm is able to map gene-tree variability levels of large groups of gene trees (e.g. a whole phylome) on the nodes of the species tree.
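For readers unfamiliar with the Robinson & Foulds distance on which the tree comparison builds, here is a minimal Python sketch of a plain, clade-based version for rooted trees, applied after pruning the leaves that the two trees do not share. It is not TreeKO's speciation or strict distance and does not handle duplication nodes; the nested-tuple tree representation and all function names are assumptions made for this example.

```python
def leaves(tree) -> frozenset:
    """Leaf names of a tree written as nested tuples of strings."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def prune(tree, keep):
    """Restrict the tree to the leaves in keep, collapsing single-child nodes."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    children = [c for c in (prune(child, keep) for child in tree) if c is not None]
    if not children:
        return None
    if len(children) == 1:
        return children[0]
    return tuple(children)


def clades(tree) -> set:
    """Set of leaf sets found below every internal node."""
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        found.add(below)
        return below

    visit(tree)
    return found


def rf_distance(tree1, tree2) -> int:
    """Number of clades present in exactly one of the two pruned trees."""
    common = leaves(tree1) & leaves(tree2)
    if not common:
        raise ValueError("the trees share no leaves")
    return len(clades(prune(tree1, common)) ^ clades(prune(tree2, common)))


t1 = ((("A", "B"), "C"), "D")
t2 = ((("A", "C"), "B"), "D")
t3 = ((("A", "B"), "C"), ("D", "E"))

print(rf_distance(t1, t2))
print(rf_distance(t1, t3))
```

The trees t1 and t3 differ only by the extra leaf E, so after pruning to the shared leaves they are identical and their distance is zero.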
trimAl
A tool for automated alignment trimming. Its speed and the possibility of automatically adjusting the parameters to optimize the phylogenetic signal-to-noise ratio for different families make trimAl especially suited for large-scale phylogenomic analyses involving thousands of large multiple sequence alignments.
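To give an idea of what alignment trimming means, the Python sketch below removes the columns of an alignment in which the fraction of gaps exceeds a threshold. This is only one simple criterion, shown under assumptions: it is not trimAl's implementation or its automated parameter selection, and the function name, the dictionary-based alignment and the default threshold are hypothetical.

```python
def trim_gappy_columns(alignment: dict, max_gap_fraction: float = 0.5) -> dict:
    """Drop alignment columns whose gap fraction is above max_gap_fraction.

    alignment maps sequence names to aligned sequences of equal length,
    with "-" marking a gap.
    """
    if not alignment:
        return {}
    rows = list(alignment.values())
    length = len(rows[0])
    if any(len(row) != length for row in rows):
        raise ValueError("all aligned sequences must have the same length")
    # Indices of the columns that are kept.
    keep = [
        i for i in range(length)
        if sum(row[i] == "-" for row in rows) / len(rows) <= max_gap_fraction
    ]
    return {name: "".join(seq[i] for i in keep) for name, seq in alignment.items()}


alignment = {
    "s1": "AC-GT",
    "s2": "A--GT",
    "s3": "ACTG-",
}
print(trim_gappy_columns(alignment))
```

In this example only the third column is removed, since two of its three positions are gaps.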






