PANDITplus


Why PANDITplus?

PANDITplus is built as an extension of PANDIT, the database of Pfam alignments and phylogenetic trees for known protein domains and families spanning lineages from the three domains of life. It takes a step towards integrating data from a variety of reliable, curated bioinformatics sources. Along with DNA and amino acid sequence data for homologs, PANDITplus provides access to pre-computed estimates from evolutionary codon models, data on protein interactions, functional and chemical pathway annotation, gene expression, and association with disease. The idea behind the PANDIT database was to encourage evolution-centric analyses of protein domains and families, based on reliable sets of HMM-based alignments and associated phylogenetic trees.


Structure of PANDITplus

PANDITplus is a MySQL database built upon PANDIT version 17.0. PANDIT is a database of homologous sequence alignments accompanied by estimates of their corresponding phylogenetic trees. It provides a valuable resource to those studying phylogenetic methodology and the evolution of coding-DNA and protein sequences. All alignments in PANDIT are based on the Pfam-A seed alignments, which are manually curated and therefore comparable with the carefully crafted alignments used to study molecular evolution. Each PANDIT entry includes amino acid and codon-based alignments (with reliability estimates for each column) and estimates of phylogenetic trees inferred from these alignments.

The figure below displays the integration of different biological resources. The color of each link shows where the mapping was taken from. For example, the ID mapping between KEGG and Human Proteinpedia was taken from KEGG, so that link is shown in yellow.

Pfam at the core

Annotations for each PANDIT entry were taken from the Pfam database version 24.0. Pfam is a large collection of protein families, each represented by multiple sequence alignments and profile hidden Markov models. We used Pfam-A families, each of which consists of a curated seed alignment containing a small set of representative members of the family, a profile hidden Markov model (profile HMM) built from the seed alignment, and an automatically generated full alignment, which contains all detectable protein sequences from the family, as defined by profile HMM searches of primary sequence databases. The sequences in the Pfam-A entries are taken from the most recent releases of UniProtKB and NCBI GenPept. PANDITplus includes general information from the Pfam database about protein families and domains with corresponding literature references, and groupings of related families and domains into clans. Information on family interactions and structural complexes that Pfam members can form, as well as links to the corresponding PDB structures, is also taken from the Pfam database. A more detailed presentation of the 3D structural information will be provided in future updates.

Pre-computed estimates from Markov codon substitution models

Studies of positive or negative selection based on codon substitution models are a powerful means of studying the evolution of genes and gene families (for a review see [1]). We therefore believe that maximum likelihood estimates from several standard codon models may provide illuminating details about the mode of evolution of a protein family or domain. Selective pressure is measured by the ratio ω = dN/dS of the nonsynonymous to synonymous substitution rates. PANDITplus includes the estimates and log-likelihood scores from models M0, M1, M2, M7, and M8 [2], implemented in PAML v.4.1 [3], as well as from codon models with constant and variable synonymous rates [4], implemented in HyPhy [5]. Family entries in PANDITplus are classified as positively selected or conserved based on maximum likelihood (ML) estimation and likelihood ratio tests (LRTs) comparing pairs of nested models, one of which (the null hypothesis) does not allow for positive selection (so ω ≤ 1), whereas the other (the alternative hypothesis) does. Positive selection is detected if a model allowing sites or lineages under positive selection fits the data significantly better than the model restricting selective pressure to ω ≤ 1 at all sites and lineages. Positions identified as being under positive selection by the different models are also stored in the database. All optimizations were done assuming the PANDIT trees, with branch lengths fixed to their ML estimates under the constant selective pressure model M0. This practice is commonly used to reduce computation time, since M0 typically provides robust estimates of branch lengths. To minimize the effect of inaccuracies in automatic alignments, the analyses of selective pressure were performed only for sites where the alignment was deemed reliable based on HMM analyses [6].
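The likelihood ratio test described above can be sketched in a few lines. This is a minimal illustration, not the PANDITplus pipeline: the log-likelihood values are made up, and the use of a chi-square reference distribution with 2 degrees of freedom (a common convention for comparisons such as M1 vs M2 or M7 vs M8) is an assumption rather than something stated here. The closed-form tail probability is valid only for even degrees of freedom, which avoids any external dependency.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable (closed form, even df only)."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form only holds for positive even df")
    if x <= 0:
        return 1.0
    half = x / 2.0
    term, total = 1.0, 1.0
    # sf(x) = exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total

def likelihood_ratio_test(lnl_null, lnl_alt, df=2, alpha=0.05):
    """Compare two nested models by their maximized log-likelihoods.

    lnl_null: log-likelihood of the model that restricts omega to <= 1
    lnl_alt:  log-likelihood of the model that allows omega > 1
    df and alpha are assumptions chosen for illustration.
    """
    stat = max(2.0 * (lnl_alt - lnl_null), 0.0)  # guard against tiny negative values
    p_value = chi2_sf_even_df(stat, df)
    return stat, p_value, p_value < alpha

# Hypothetical log-likelihoods for one family
stat, p_value, significant = likelihood_ratio_test(-1000.0, -990.0)
print(stat, p_value, significant)
```

A significant result only says that the model allowing positive selection fits better; which sites are affected is a separate inference.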

The availability of pre-computed results aims to increase the transparency and reproducibility of statistical analyses, and the continuity of biological studies. Note that some concerns have been expressed in the literature about the suitability of codon models for evolutionary estimation on alignments of distant sequences (e.g. [7]), since synonymous substitutions may become saturated at deep divergences. We therefore recommend that users discount entries where divergence is too large, e.g., average branch length > 2 expected amino acid substitutions per branch. However, such high divergences are rarely encountered in PANDITplus: only 8 entries fall beyond the above-mentioned threshold. Equally, we recommend cautious use of ω estimates when either dN or dS is close to zero (see also the PAML manual). Note that Anisimova et al. (2001) [8] found that the likelihood ratio test for positive selection remained accurate for both saturated and very similar data, although the power of the test decreased. Moreover, Seo and Kishino (2008) [9] investigated the effect of applying a codon model to divergent data sets. Surprisingly, they found that the codon model never performed worse than the amino acid model, despite substantial saturation of synonymous substitutions in their data.

  1. Anisimova, M., and C. Kosiol. 2009. Investigating protein-coding sequence evolution with probabilistic codon substitution models. Mol Biol Evol 26:255-271.
  2. Yang, Z., R. Nielsen, N. Goldman, and A. M. Pedersen. 2000. Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics 155:431-449.
  3. Yang, Z. 2007. PAML 4: phylogenetic analysis by maximum likelihood. Mol Biol Evol 24:1586-1591.
  4. Kosakovsky Pond, S. L., and S. V. Muse. 2005. Site-to-site variation of synonymous substitution rates. Mol Biol Evol 22:2375-2385.
  5. Kosakovsky Pond, S. L., S. D. Frost, and S. V. Muse. 2005. HyPhy: hypothesis testing using phylogenies. Bioinformatics 21:676-679.
  6. Whelan, S., P. I. de Bakker, E. Quevillon, N. Rodriguez, and N. Goldman. 2006. PANDIT: an evolution-centric database of protein and associated nucleotide domains with inferred trees. Nucleic Acids Res 34:D327-331.
  7. Maynard Smith, J., and N. H. Smith. 1996. Synonymous nucleotide divergence: what is saturation? Genetics 142:1033-1036.
  8. Anisimova, M., J. P. Bielawski, and Z. Yang. 2001. Accuracy and power of the likelihood ratio test in detecting adaptive molecular evolution. Mol Biol Evol 18:1585-1592.
  9. Seo, T. K., and H. Kishino. 2008. Synonymous substitutions substantially improve evolutionary inference from highly diverged proteins. Syst Biol 57:367-377.
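The two cautions given above (very high divergence, and dN or dS close to zero) could be applied to a downloaded entry roughly as follows. This is a sketch under stated assumptions: the function names are hypothetical, the branch lengths and rates are invented, and the cutoff used for "close to zero" (0.001) is an arbitrary illustrative choice. Only the threshold of 2 for the average branch length comes from the text.

```python
def average_branch_length(branch_lengths):
    """Mean branch length of a tree, given its branch lengths as a list."""
    return sum(branch_lengths) / len(branch_lengths)

def omega_caveats(branch_lengths, dn, ds, max_avg_branch=2.0, near_zero=1e-3):
    """Return a list of reasons to treat the omega estimate of an entry with caution.

    max_avg_branch: threshold on the average branch length (from the text)
    near_zero:      hypothetical cutoff for 'dN or dS close to zero'
    """
    notes = []
    if average_branch_length(branch_lengths) > max_avg_branch:
        notes.append("high divergence")
    if dn < near_zero or ds < near_zero:
        notes.append("dN or dS close to zero")
    return notes

# Invented values for three entries
print(omega_caveats([0.1, 0.2, 0.3], dn=0.05, ds=0.2))
print(omega_caveats([3.0, 2.5], dn=0.05, ds=0.2))
print(omega_caveats([0.1], dn=0.0, ds=0.2))
```

An empty list means that neither caution applies; it does not by itself validate the estimate.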

Gene annotation

PANDITplus includes gene annotation from Gene Ontology (GO) and from Kyoto Encyclopedia of Genes and Genomes (KEGG).

GO is a database of standardized biological terms for gene products. GO contains more than 26,000 terms, divided into three branches: Molecular Function, Biological Process and Cellular Component. Each branch can be represented as a directed acyclic graph relating terms of different degrees of specificity, with directed links from less specific to more specific terms. Each node in the graph can have several parents (broader related terms) and children (more specific related terms). GO annotations for each PANDIT sequence were taken from the Pfam database, and the GO database was then parsed to obtain the proper hierarchical structure of the GO terms.

KEGG is a knowledge base for systematic analysis of gene functions, linking genomic information with higher-order functional information. PANDITplus integrates information from the KEGG PATHWAY database, which contains manually drawn pathway maps representing current knowledge of molecular interaction and reaction networks. KEGG pathways are organized as a directed acyclic graph hierarchy with three levels. The top level consists of the following five categories: Metabolism, Genetic Information Processing, Environmental Information Processing, Cellular Processes and Human Diseases. The next levels divide the five functional categories into finer sub-categories. The KEGG database provides a direct mapping from KEGG gene IDs to Pfam entries, as well as to many other biological data sources. One KEGG gene can be mapped to many Pfam entries. Each PANDITplus family record contains pathway information for all the genes that are associated with the family.
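Because a GO term can have several parents, recovering all broader terms for an annotation means walking a directed acyclic graph rather than a simple tree. The sketch below illustrates that traversal on a toy graph. The term names are placeholders rather than real GO identifiers, and the child-to-parents dictionary is an assumed in-memory representation, not the PANDITplus schema.

```python
def ancestors(term, parents):
    """Collect every broader term reachable from `term` by following parent links.

    parents: dict mapping a term to the list of its direct parents.
    """
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:  # a term reachable by two paths is visited once
                seen.add(parent)
                stack.append(parent)
    return seen

# Toy graph: "specific" has two parents that share one root
TOY_PARENTS = {
    "specific": ["broad_1", "broad_2"],
    "broad_1": ["root"],
    "broad_2": ["root"],
    "root": [],
}

print(sorted(ancestors("specific", TOY_PARENTS)))
```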
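Since one KEGG gene can map to many Pfam entries, building the pathway list for a family amounts to inverting the gene-to-Pfam mapping and pooling the pathways of every associated gene. The following sketch shows one way to do this. All identifiers are placeholders, and the two input dictionaries are an assumed representation, not the actual KEGG or PANDITplus data structures.

```python
from collections import defaultdict

def pathways_per_family(gene_to_pfam, gene_to_pathways):
    """Pool the pathways of all genes associated with each Pfam family.

    gene_to_pfam:     dict gene ID -> list of Pfam entries (one gene, many entries)
    gene_to_pathways: dict gene ID -> list of pathway IDs
    """
    family_pathways = defaultdict(set)
    for gene, families in gene_to_pfam.items():
        for family in families:
            family_pathways[family].update(gene_to_pathways.get(gene, ()))
    # Sorted lists give a stable, readable result
    return {family: sorted(paths) for family, paths in family_pathways.items()}

# Placeholder identifiers
gene_to_pfam = {"gene_1": ["FAM_A", "FAM_B"], "gene_2": ["FAM_A"]}
gene_to_pathways = {"gene_1": ["path_1"], "gene_2": ["path_1", "path_2"]}

print(pathways_per_family(gene_to_pfam, gene_to_pathways))
```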

Expression data

Gene expression data provides further insight into the dynamics of protein function across different species or over a series of tissues and conditions. Information on expression patterns of all the genes that code for one protein or protein domain could shed light on the function of the domain in specific tissues. For example, one can evaluate the functional importance of the domain of interest in a specific tissue by observing the number of coding genes with this domain that are expressed in that tissue. Furthermore, when species-specific expression data is available, it can be useful for studies of evolutionary patterns in genes and gene families. Together with the evolutionary model estimation, contrasting gene expression over several species or among tissues may also give insights into functional variation and the evolutionary forces acting upon a gene.

PANDITplus contains expression data from two recent resources, the Bgee database and Human Proteinpedia, together with an overview of the number of genes expressed in various tissues.

Gene expression data for five species (Human, Zebrafish, Drosophila, Mouse and Xenopus) were extracted from Bgee release 6.0. Bgee focuses on the developmental aspect of gene expression and contains EST data from UniGene, Affymetrix data from ArrayExpress, and in situ hybridization data from ZFIN and MGI. Each data entry is manually annotated by Bgee curators and mapped onto anatomical and developmental ontologies. The mapping to the Bgee database was done via UniProt: individual sequences from the PANDIT seed alignment were mapped to UniProt IDs, and the UniProt IDs were mapped to the IDs used in Bgee, giving the full path PANDIT→UniProt→Bgee. Each PANDITplus family record presents Bgee tissue expression data for the genes that code for sequences from the PANDIT seed alignment, which was also used to obtain the ML estimates from codon models (see above). Therefore, Bgee expression data is displayed only for the species that contribute a sequence to the corresponding PANDIT alignment. These expression data may be used in combination with ML estimates from phylogenetic analyses. Since one Pfam domain can be present in several genes, each with its own expression pattern, PANDITplus counts the number of these genes expressed in each tissue, and sorts the tissues by the number of associated genes expressed in them. Note that interpreting inferences based on links between gene-wise information and protein domains may not be straightforward. However, we believe that certain observations from such links may be very useful (e.g., prevalence of a domain in genes expressed in a particular tissue).

Gene expression data for healthy and diseased human tissues were extracted from Human Proteinpedia, a community portal for sharing and integration of human protein data. Human Proteinpedia allows research laboratories to contribute and maintain protein annotations. All human data in Human Proteinpedia is derived experimentally and comes from different laboratories. The mapping to Human Proteinpedia gene IDs was done for the sequences from the full Pfam alignment via the KEGG database, with the whole mapping path being PANDIT→Pfam→KEGG→HumanProteinpedia. Each PANDITplus family record presents Human Proteinpedia tissue expression data for all the genes that are associated with the full alignment of the Pfam family. For each tissue, the number of genes expressed in that tissue is also displayed. The expression data in PANDITplus derived from Human Proteinpedia covers human tissues only. Since the mapping of the expression data from this resource is based on the sequences from the full alignment, such data is shown when the full alignment contains at least one human sequence, even if the seed alignment does not contain any. Because our Markov codon model estimates are based on the PANDIT seed alignment, we recommend that Human Proteinpedia expression data be used only when a human sequence is contained in the PANDIT seed alignment. The families that do not contain any human sequence in the PANDIT seed alignment are marked in PANDITplus.
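The PANDIT→UniProt→Bgee mapping and the per-tissue gene counts described above can be illustrated with a small sketch. Everything here is hypothetical: the identifiers are placeholders, the dictionaries stand in for the real mapping tables, and breaking ties alphabetically is an illustrative choice that the text does not specify.

```python
from collections import Counter

def genes_per_tissue(seed_sequences, seq_to_uniprot, uniprot_to_bgee, bgee_expression):
    """Count, for each tissue, the distinct genes of a family expressed in it.

    seed_sequences:  sequence IDs from the seed alignment
    seq_to_uniprot:  dict sequence ID -> UniProt ID
    uniprot_to_bgee: dict UniProt ID -> gene ID used in the expression resource
    bgee_expression: dict gene ID -> list of tissues where the gene is expressed
    """
    genes = set()
    for seq in seed_sequences:
        gene = uniprot_to_bgee.get(seq_to_uniprot.get(seq))
        if gene is not None:  # sequences that cannot be mapped are skipped
            genes.add(gene)
    counts = Counter()
    for gene in genes:
        for tissue in set(bgee_expression.get(gene, ())):
            counts[tissue] += 1
    # Most frequently associated tissues first; ties broken alphabetically
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

# Placeholder data: two sequences map to the same gene, one cannot be mapped
seed = ["seq_a", "seq_b", "seq_c", "seq_d"]
seq_to_uniprot = {"seq_a": "U1", "seq_b": "U2", "seq_c": "U3"}
uniprot_to_bgee = {"U1": "G1", "U2": "G2", "U3": "G1"}
expression = {"G1": ["brain", "liver"], "G2": ["brain"]}

print(genes_per_tissue(seed, seq_to_uniprot, uniprot_to_bgee, expression))
```

Counting distinct genes rather than sequences keeps a gene that contributes several seed sequences from being counted more than once.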

Disease association

PANDITplus incorporates information on associations of human genes with genetic diseases and disorders extracted from the Genetic Association Database. The data in this database come from published scientific papers, and the database serves as an archive of human genetic association studies of complex diseases and disorders. Information on associations with human disease is also available via Online Mendelian Inheritance in Man (OMIM). Links to OMIM records were taken from the KEGG database.

The disease information should be interpreted with caution, since one Pfam domain can be present in several genes, each linked to different diseases and disorders. For this reason, PANDITplus counts the number of these genes per disease, and sorts the disease records by the number of domain-coding genes linked to them. Furthermore, PANDITplus includes literature references providing information on each gene linked to a given disease record.


PANDITplus schema diagram (ER)

Family and clan information

Family and clan information tables

Markov codon model estimates

Markov codon model estimates tables

Interactions, PDB links, Gene ontologies and KEGG pathways

Interactions, PDB links, Gene ontologies and KEGG pathways tables

Gene expression information

Gene expression data tables

Diseases and disorders information

Diseases and disorders tables

Glossary

Bgee

A database to retrieve and compare gene expression patterns between animal species.

Clan

A collection of related Pfam entries. The relationship may be defined by similarity of sequence, structure or profile-HMM.

Codon models

Evolutionary models, widely used to infer the selection forces acting on a protein.

Domain

A structural unit which can be found in multiple protein contexts.

Family

A collection of related proteins.

GO

Gene Ontology database, a major bioinformatics initiative that provides an ontology of defined terms representing gene product properties.

Hidden Markov model (HMM)

An HMM is a probabilistic model. In Pfam, HMMs are used to transform the information contained within a multiple sequence alignment into a position-specific scoring system. HMMs are searched against the (UniProt) protein database to find homologous sequences.

Motif

A short unit found outside globular domains.

KEGG

Kyoto Encyclopedia of Genes and Genomes is a bioinformatics resource for linking genomes to life and the environment.

Pfam-A

An HMM-based, hand-curated Pfam entry which is built using a small number of representative sequences. A threshold value is set manually for each HMM, and the models are searched against the UniProt database. All of the sequences which score above the threshold for a Pfam entry are included in the entry's full alignment.

Positive selection

A particular mode or mechanism of natural selection. The selective pressure on the protein-coding level can be measured through the comparison of the nonsynonymous and synonymous substitution rates. In the traditional framework, positive selection favours adaptive nonsynonymous changes, thus the nonsynonymous substitution rate is higher than the synonymous substitution rate, and their ratio is >1.

Repeat

A short unit which is unstable in isolation but forms a stable structure when multiple copies are present.

Seed alignment

An alignment of a set of representative sequences for a Pfam entry. Our maximum likelihood estimates from Markov codon models are based on the seed alignment from PANDIT.

Synonymous substitution

Evolutionary substitution of one base for another in an exon of a protein coding gene, such that the amino acid sequence produced is not modified.


Credits

Data sources:

Software:

Server:


Citing PANDITplus

  • Dimitrieva, S., and M. Anisimova. 2009. PANDITplus: Towards better integration of evolutionary view on molecular sequences with supplementary bioinformatics resources. Trends in Evol Biol.


Linking to PANDITplus

Homepage

      http://www.panditplus.org

Linking a family

      Example of linking by id http://www.panditplus.org/show.php?in=PF02109&show=family

      Example of linking by name http://www.panditplus.org/show.php?in=7tm_1&show=family

Linking a clan

      Example of linking by id http://www.panditplus.org/show.php?in=CL0163&show=clan

      Example of linking by name http://www.panditplus.org/show.php?in=GPCR_A&show=clan
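For generating such links programmatically, the pattern in the examples above (a `show.php` page with an `in` parameter holding the ID or name and a `show` parameter set to `family` or `clan`) can be wrapped in a small helper. The helper name is hypothetical. It only reproduces the URL shape shown above and makes no claim about other parameters the site may accept.

```python
from urllib.parse import urlencode

BASE_URL = "http://www.panditplus.org/show.php"

def panditplus_link(identifier, kind):
    """Build a link to a family or clan page from its ID or name."""
    if kind not in ("family", "clan"):
        raise ValueError("kind must be 'family' or 'clan'")
    # urlencode escapes any characters that are not safe in a query string
    return BASE_URL + "?" + urlencode({"in": identifier, "show": kind})

print(panditplus_link("PF02109", "family"))
print(panditplus_link("GPCR_A", "clan"))
```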

Contact us

For any questions, requests or comments, please contact {slavica.dimitrieva or maria.anisimova}@isb-sib.ch.