In this article we will discuss the study of comparative genomics.

Not all of an organism's genome codes for functional genes, and the proportion varies between groups of organisms. For example, in bacteria only an estimated 3-5% of the genome is non-functional, whereas in humans as much as 97% of the genome has been described as non-functional (it does not code for proteins). Besides, the level of evolutionary conservation of microbial proteins is rather uniform, with about 70% of the gene products from each of the sequenced genomes having homologs in distant genomes.

Thus, the function of many of these genes can be predicted by comparing different genomes and by transferring functional annotation of proteins from better-studied organisms to their orthologs in lesser-studied organisms.

Based on the above facts, comparative genomics has proved a powerful approach for achieving a better understanding of genomes and, subsequently, of the biology of the respective organisms. The genomes of several microorganisms, viz. Haemophilus influenzae, Mycoplasma genitalium, Methanococcus jannaschii, Saccharomyces cerevisiae, Escherichia coli and Bacillus subtilis, have been fully sequenced.

Computational analysis of complete genomes requires databases (repositories of the gene structures of organisms) that store genomic information, together with bioinformatics tools. Studying completely sequenced genomes requires analysis of nucleic acids, proteins, etc. Nowadays, even the analysis of protein sets has proved to be a useful tool for genome analysis.

Thus, gene functions can be predicted by comparing different genomes and by transferring functional annotation of proteins from better-studied organisms to their orthologs in lesser-studied organisms. Orthologs are genes connected by vertical evolutionary descent (the "same" gene in different species), as opposed to paralogs, which are genes related by duplication within a genome.
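The idea of transferring annotation between orthologs can be illustrated with a small sketch. It uses the reciprocal best hit heuristic, a common but simplified way of proposing orthologs that is assumed here for illustration and is not described in this article. The gene names, similarity scores and annotations are all hypothetical.

```python
# Toy similarity scores between genes of genome A (well studied) and
# genome B (poorly studied). All names and values are hypothetical.
scores = {
    ("a1", "b1"): 95, ("a1", "b2"): 40,
    ("a2", "b1"): 38, ("a2", "b2"): 88,
    ("a3", "b2"): 60,
}

def best_hits(scores, forward=True):
    """Return the best-scoring partner for every gene on one side."""
    best = {}
    for (a, b), s in scores.items():
        query, hit = (a, b) if forward else (b, a)
        if query not in best or s > best[query][1]:
            best[query] = (hit, s)
    return {q: hit for q, (hit, _) in best.items()}

def reciprocal_best_hits(scores):
    """Pairs (a, b) where a's best hit is b and b's best hit is a."""
    a_to_b = best_hits(scores, forward=True)
    b_to_a = best_hits(scores, forward=False)
    return sorted((a, b) for a, b in a_to_b.items() if b_to_a.get(b) == a)

def transfer_annotation(pairs, known):
    """Copy annotations from genome A to the putative orthologs in genome B."""
    return {b: known[a] for a, b in pairs if a in known}

annotations_a = {"a1": "DNA polymerase", "a2": "ribosomal protein"}
pairs = reciprocal_best_hits(scores)
predicted_b = transfer_annotation(pairs, annotations_a)
print(pairs)
print(predicted_b)
```

Here gene a3 finds b2 as its best hit, but b2 prefers a2, so a3 is left without an ortholog. A pair that fails the reciprocal test in this way may reflect a paralog rather than an ortholog, which is why annotation is not transferred to it.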

This makes comparative genomics a powerful approach to achieving a better understanding of the genomes, and subsequently of the biology of the respective organisms.

Databases for Comparative Genomics:

The World Wide Web (WWW) is accessible to anyone with an Internet connection, and the following databases are available through it.

(a) PEDANT:

This database gives information about proteins, their three-dimensional structures, enzyme patterns, PROSITE patterns, Pfam domains, BLOCKS and SCOP domains, as well as PIR keywords and PIR superfamilies.

Figure: Mycoplasma, the genome of which has been fully sequenced.

(b) COGs:

Clusters of Orthologous Groups (COGs) are intended to simplify evolutionary studies of complete genomes and to improve functional assignments of individual proteins. The database comprises about 2,800 conserved families of proteins from the sequenced genomes.

Each COG contains an orthologous set of proteins from at least three phylogenetic lineages, which are assumed to have evolved from a single ancestral protein. Orthologs typically have the same function in all organisms.

The protein families in the COGs database are separated into 17 functional groups that include a group of uncharacterized yet conserved proteins as well as a group of proteins for which only a general functional assignment appeared appropriate.

Because the COGs database stores diverse data on proteins, similarity searches against it can also yield information on proteins that have no clear annotation in other databases. The database also serves as a tool for comparative analysis of complete genomes.

(c) KEGG:

The Kyoto Encyclopedia of Genes and Genomes (KEGG), which centres on cellular metabolism, was described by Kanehisa and Goto (2000). It gives a comprehensive set of metabolic pathway charts, both general and specific, for each sequenced organism. In these charts, the enzymes identified in a particular organism are colour-coded, so that one can easily trace the pathways.

It also lists the orthologous genes coding for the enzymes. Such genes, if located adjacent to one another, may form operon-like clusters. For example, two complete genomes can be compared to find genes that are located relatively close or adjacent to each other (within five genes) in both. This site is useful for obtaining information for the analysis of metabolism in various organisms.
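The "within five genes" comparison can be sketched as follows. This is only a minimal illustration under stated assumptions: each genome is reduced to an ordered list of gene names, positions are treated as linear rather than circular, and the gene names and ortholog table are hypothetical. It is not KEGG's own procedure.

```python
def conserved_neighbours(order_a, order_b, orthologs, window=5):
    """Find pairs of genes in genome A that lie within `window` positions of
    each other and whose orthologs in genome B are also within `window`."""
    pos_a = {gene: i for i, gene in enumerate(order_a)}
    pos_b = {gene: i for i, gene in enumerate(order_b)}
    # Keep only ortholog pairs whose genes appear in both gene orders.
    pairs = [(a, b) for a, b in orthologs.items() if a in pos_a and b in pos_b]
    found = []
    for i, (a1, b1) in enumerate(pairs):
        for a2, b2 in pairs[i + 1:]:
            close_in_a = abs(pos_a[a1] - pos_a[a2]) <= window
            close_in_b = abs(pos_b[b1] - pos_b[b2]) <= window
            if close_in_a and close_in_b:
                found.append((a1, a2))
    return found

# Hypothetical gene orders along two genomes.
genome_a = ["gA1", "gA2", "x1", "x2", "x3", "x4", "x5", "x6", "gA3"]
genome_b = ["y1", "gB2", "gB1", "y2", "y3", "y4", "y5", "y6", "y7", "gB3"]
orthologs = {"gA1": "gB1", "gA2": "gB2", "gA3": "gB3"}

clusters = conserved_neighbours(genome_a, genome_b, orthologs)
print(clusters)
```

Only gA1 and gA2 are reported, since gA3 lies more than five genes away from both of them in genome A. A real analysis would also have to handle circular chromosomes and gene orientation.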

(d) MBGD:

The Microbial Genome Database (MBGD) is maintained at the University of Tokyo, Japan. This database helps to search microbial genomes. MBGD accepts several sequences at once (about 2,000 residues) for searching against all of the complete genomes available, and displays the colour-coded functions of the detected homologs and their locations on a circular genome map. This database also gives information on functions, e.g. degradation of hydrocarbons or biosynthesis of nucleotides.

(e) WIT:

Similar to KEGG, WIT (the "What Is There" database) gives information on metabolic reconstruction for completely sequenced genomes. Its distinctive features are that it presents the sequence of reactions between two bifurcations and that it includes proteins from many partially sequenced genomes. These features of WIT provide much more information on the sequences of the same proteins or enzymes obtained from different organisms.

Bioinformatics Subgroups:

Bioinformatics has several subgroups, viz. networking, sequence databases and alignment theories, phylogenetic analysis, secondary structure prediction and DNA analysis, biomolecular structures, dynamics and function, protein motifs, modelling and analysis of 3-D structures of macromolecules, applications in the discovery of synthetic molecules to treat human diseases, and the molecular mechanisms involved in gene regulation, etc.

Steps of Sequence Formation:

The tools of bioinformatics provide for the analysis of sequence information.

This process involves:

i. Identifying the genes in the DNA sequences from various organisms.

ii. Developing methods to study the structure and/or functions of newly identified sequences and corresponding structural RNA sequences.

iii. Identifying families of related sequences and the development of models.

iv. Aligning similar sequences and generating phylogenetic trees to examine evolutionary relationships.
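Step iv can be illustrated with a minimal sketch. It assumes the sequences are already aligned to equal length and uses the simple p-distance (the proportion of differing sites) as the measure of divergence. The species labels and sequences are hypothetical, and finding the closest pair is only the first step of a distance-based tree-building method, not a full phylogenetic analysis.

```python
from itertools import combinations

def p_distance(seq1, seq2):
    """Proportion of differing sites between two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    # Ignore alignment columns where either sequence has a gap.
    columns = [(x, y) for x, y in zip(seq1, seq2) if x != "-" and y != "-"]
    if not columns:
        raise ValueError("no comparable columns")
    return sum(x != y for x, y in columns) / len(columns)

def distance_matrix(aligned):
    """Pairwise p-distances for a dict mapping name -> aligned sequence."""
    names = sorted(aligned)
    return {(a, b): p_distance(aligned[a], aligned[b])
            for a, b in combinations(names, 2)}

# Hypothetical aligned sequences from four species.
aligned = {
    "sp1": "ATGCCGTA",
    "sp2": "ATGCCGTT",
    "sp3": "ATGACGAA",
    "sp4": "TTGACGAC",
}

matrix = distance_matrix(aligned)
closest_pair = min(matrix, key=matrix.get)
print(closest_pair, matrix[closest_pair])
```

The pair with the smallest distance would be joined first when building a tree from such a matrix. Real analyses usually apply a substitution model to correct the raw distances.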

Decoding the Information of Biological Sequences:

Sequence information must be converted into biological and biophysical knowledge. This means deciphering the structural, functional and evolutionary clues encoded in the language of biological sequences. This language may be decomposed into sentences (proteins), words (motifs) and letters (amino acids), and the code may be tackled at any of these levels.

A single letter change within a word can sometimes change its meaning. For example, a single base change in a β-globin chain codon converts glutamic acid (GAG) to valine (GUG). In homozygous individuals, this minute difference results in a change from a normal healthy state to potentially fatal sickle cell anaemia.
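The effect of such a single-letter change can be shown with a short sketch. The codon table below is deliberately partial (only the standard-code entries needed here), and the nine-base mRNA fragment is an illustrative assumption. Both GAA and GAG encode glutamic acid, and an A to U change at the second position gives GUA or GUG, both of which encode valine.

```python
# Partial codon table (standard genetic code, mRNA codons). Only the
# entries needed for this illustration are included.
CODON_TABLE = {
    "GAA": "Glu", "GAG": "Glu",
    "GUA": "Val", "GUG": "Val",
    "CCU": "Pro",
}

def translate(mrna):
    """Translate an mRNA string codon by codon using the partial table."""
    if len(mrna) % 3 != 0:
        raise ValueError("length must be a multiple of three")
    return [CODON_TABLE[mrna[i:i + 3]] for i in range(0, len(mrna), 3)]

def point_mutation(mrna, position, new_base):
    """Return a copy of the sequence with one base replaced (0-based index)."""
    return mrna[:position] + new_base + mrna[position + 1:]

normal = "CCUGAGGAG"                      # illustrative fragment
mutant = point_mutation(normal, 4, "U")   # A -> U in the second codon

print(translate(normal))
print(translate(mutant))
```

Only one of the nine letters differs between the two fragments, yet the second amino acid changes from glutamic acid to valine.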

Basic Requirements:

Following are some of the requirements:

a. Biological research on the web.

b. Sequence analysis, pairwise alignment and database searching.

c. Multiple sequence alignments, trees and profiles.

d. Visualizing protein structures and computing structural properties.

e. Predicting protein structure and function from sequence.

f. Tools for genomics and proteomics.

The well-known software packages for DNA and protein sequence analysis include Staden and Gene world (for DNA and protein sequences); Gene Thesaurus (access to public data and integration with proprietary data); Lasergene (for coding analysis, pattern and site matching, structure and comparative analysis, restriction site analysis, PCR primer and probe design, sequence editing, assembly and analysis, etc.); CINEMA (provides facilities for motif identification using BLAST); EMBOSS (for nucleotide sequence pattern analysis, codon usage analysis, gene identification tools, protein motif identification and rapid database searching with sequence patterns); and EGCG (for fragment assembly, mapping, multiple sequence analysis, pattern recognition, nucleotide and protein sequence analysis, etc.).

The biological data and information storage are given below in Table 27.12:

Table 27.12: Biological Data and their Sources of Information Storage

Classification of Databases:

Databases are broadly classified into two categories: sequence databases (covering both protein and nucleic acid sequences) and structural databases (covering only proteins).

Moreover, they are also classified into three categories:

(a) Primary database,

(b) Secondary database, and

(c) Composite database.

Primary databases contain information on sequence or structure alone, for either proteins or nucleic acids, e.g. PIR for protein sequences, and GenBank and DDBJ for nucleotide sequences. Secondary databases contain information derived from the primary databases, for example information on conserved sequences, signature sequences and active site residues of protein families, as in SCOP, eMOTIF, etc.

A composite database combines several primary sources, obviating the need to search multiple resources. SCOP is the Structural Classification of Proteins, in which proteins are classified into hierarchical levels such as classes, folds and superfamilies.

Comparative Modelling or Homology Modelling:

Comparative modelling starts by aligning two sequences to identify segments that share similarity, and then uses this to derive the structure of the desired protein. Once a homolog of known structure has been identified, a rigid-body assembly approach is applied to assemble a structure representing the core, loop regions, side chains, etc. In the segment matching procedure, coordinates are calculated from the approximate positions of conserved atoms in the templates.

Alternatively, the alignment of the sequence of interest with one or more structural templates can be used to derive a set of distance constraints, which are then used in distance geometry, restrained energy minimization or restrained molecular dynamics to obtain the structure.
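The alignment step on which comparative modelling depends can be sketched with a basic global alignment. The example below uses the Needleman-Wunsch dynamic programming algorithm with a simple match/mismatch/gap scoring scheme. The scoring values and the two short peptide strings are assumptions chosen for illustration; real modelling pipelines use substitution matrices and more elaborate gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of two sequences.

    Returns (score, aligned_a, aligned_b), with '-' marking gaps.
    """
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align two residues
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))

target = "MKTAYIAK"    # hypothetical sequence of interest
template = "MKTAIAK"   # hypothetical sequence of a known structure

best_score, aligned_target, aligned_template = needleman_wunsch(target, template)
print(best_score)
print(aligned_target)
print(aligned_template)
```

Aligned columns without gaps indicate which residues of the target could take their coordinates from the template, while gapped regions would have to be modelled separately, for instance as loops.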

(a) Threading:

Threading is a technique for matching a sequence to a protein fold in the absence of any substantial sequence identity to proteins of known structure, whereas comparative modelling requires detectable sequence similarity to a protein of known structure.

Threading is followed by scoring, which either creates a profile for each site or uses a potential based on pairwise interactions. Potential energy functions may be obtained from ab initio quantum mechanical calculations, from thermodynamic, spectroscopic or crystallographic data, or from a combination of these methods.

(b) Sequence analysis:

In order to understand protein and nucleic acid structure and evolution, analysis of their sequence data is required. Sequence analysis is the detection of homologous relationships, either orthologous (same function, different species) or paralogous (different but related functions within one organism), by means of routine database searches.

Some of the important resources are outlined in the following:

The primary structure of a protein is its amino acid sequence, which is stored in primary databases as a linear string of letters that denote the constituent residues. The secondary structure of a protein corresponds to regions of local regularity, e.g. α-helices and β-strands, which in sequence alignments are often apparent as well-conserved motifs.

These are stored in secondary databases of patterns, e.g. regular expressions, fingerprints, blocks, profiles, etc. The tertiary structural elements, which may form discrete domains within a fold or give rise to autonomous folding units or modules, are stored in structural databases as sets of atomic coordinates.
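To show how a pattern stored as a regular expression can be applied, the sketch below converts a PROSITE-style pattern into a Python regular expression and scans a sequence with it. The converter covers only the basic syntax (`x` for any residue, `[...]` for allowed residues, `{...}` for excluded residues, repetition counts in parentheses, and `<` or `>` anchors). The pattern used is the well-known N-glycosylation site motif, and the test sequence is hypothetical.

```python
import re

def prosite_to_regex(pattern):
    """Convert a basic PROSITE-style pattern into a Python regex string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):        # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):          # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        repeat = ""
        if "(" in element:                 # repetition such as x(3) or x(2,4)
            element, count = element.split("(", 1)
            repeat = "{" + count.rstrip(")") + "}"
        if element == "x":
            core = "."                     # any residue
        elif element.startswith("{"):
            core = "[^" + element[1:-1] + "]"   # excluded residues
        else:
            core = element                 # single residue or [ABC] class
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)

def find_motif(pattern, sequence):
    """Return (start, matched_text) for each non-overlapping match."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start(), m.group()) for m in regex.finditer(sequence)]

glyco = "N-{P}-[ST]-{P}."        # N-glycosylation site pattern
sequence = "MKNGTAANPSL"          # hypothetical protein fragment

print(prosite_to_regex(glyco))
print(find_motif(glyco, sequence))
```

The second asparagine in the fragment is followed by proline and is therefore not reported, which illustrates how an excluded-residue position narrows the matches.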

