Read this article to learn about databases, tools and implications of bioinformatics for biodiversity.

What is Bioinformatics?

Roughly, bioinformatics describes any use of computers to handle biological information.

In practice, a narrower definition is used: bioinformatics is a synonym for “computational molecular biology”—the use of computers to characterize the molecular components of living things.

Most biologists talk about “doing bioinformatics” when they use computers to store, retrieve, analyze or predict the composition or the structure of biomolecules.

As computers become more powerful, one could probably add “simulate” to this list of bioinformatics verbs. “Biomolecules” include the genetic material—nucleic acids—and the products of genes: proteins. These are the concerns of “classical” bioinformatics, dealing primarily with sequence analysis. Bioinformatics has been described as “the mathematical, statistical and computing methods that aim to solve biological problems using DNA and amino acid sequences and related information.” NIH has defined bioinformatics as “research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioural or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.”

Obviously, we have such overlapping disciplines as Computational Structural Biology, Molecular Structural Biology, Bioinformatics, Genomics, Structural Genomics, Proteomics, Computational Biology, Bioengineering and so on. However, we may consider the following scope for bioinformatics:

Bioinformatics methods = biology + computer science

Bioinformatics subject area = Sequence + Function + Structure of biomolecules. It is a mathematically interesting property of most large biological molecules that they are polymers: ordered chains of simpler molecular modules called monomers. Think of the monomers as beads or building blocks which, despite having different colours and shapes, all have the same thickness and the same way of connecting to one another. Monomers that can combine in a chain are of the same general class, but each kind of monomer in that class has its own well-defined set of characteristics.

Many monomer molecules can be joined together to form a single, far larger, macromolecule. Macromolecules can have exquisitely specific informational content and/or chemical properties. According to this scheme, the monomers in a given macromolecule of DNA or protein can be treated computationally as letters of an alphabet, put together in pre-programmed arrangements to carry messages or do work in a cell.

The greatest achievement of bioinformatics methods is the Human Genome Project. One of its fallouts is that the field of biology is changing from a descriptive to an analytical science. Accurate and consistent descriptions are now not just needed but vital to analysis. Because of this, the nature and priorities of bioinformatics research and applications are changing. People often talk portentously of our living in the “post-genomic” era.

One View is that this will Affect Bioinformatics in Several Ways:

(i) Now we possess multiple whole genomes and we can look for differences and similarities between all the genes of multiple species. From such studies we can draw particular conclusions about species and general ones about evolution. This kind of science is often referred to as comparative genomics.

(ii) There are now technologies designed to measure the relative number of copies of a genetic message (levels of gene expression) at different stages in development or disease or in different tissues. Such technologies for gene expression studies, such as DNA microarrays, will grow in importance.

(iii) Other, more direct, large-scale ways of identifying gene functions and associations (for example yeast two-hybrid methods) will grow in significance and will lead to the growth of accompanying bioinformatics of functional genomics.

(iv) There will be a general shift in emphasis (of sequence analysis especially) from genes themselves to gene products. This will lead to: attempts to catalogue the activities of and characterize interactions between all gene products (in humans), i.e. proteomics; and attempts to crystallize and/or predict the structures of all proteins (in humans), i.e. structural biology.

(v) What is often referred to as research or medical informatics, the management of all biomedical experimental data associated with particular molecules or patients (from mass spectroscopy, to in vitro assays, to clinical side-effects), will move from the concern of those working in drug company and hospital I.T. (information technology) into the mainstream of cell and molecular biology, and migrate from the commercial and clinical sectors to the academic sector.

It is worth noting that all of the above non-classical areas of research depend upon established sequence analysis techniques.

Principles of Sequence Similarity Searches:

The characterization of any new DNA or protein sequence starts with a database search to find out whether homologs of this gene (protein) are already available and, if so, how well they are characterized. At first glance, looking for a matching sequence is quite straightforward: take the first letter of the query sequence, search for its first occurrence in the database, and then check if the second letter of the query is the same in the subject.

If the two letters match, check the third, then the fourth, and continue this comparison to the end of the query. If the match for the second letter fails, search for the next occurrence of the first letter and start over, and so on. This will identify all the sequences in the database that are identical to the query sequence (or include it).
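
This letter-by-letter scan is easy to express in code. The following is a minimal Python sketch of the procedure just described; the function name and the toy sequences are ours, for illustration only.

```python
def exact_matches(query, database):
    """Find all start positions where `query` occurs in `database`,
    using the letter-by-letter comparison described above."""
    hits = []
    for start in range(len(database) - len(query) + 1):
        for i, letter in enumerate(query):
            if database[start + i] != letter:
                break           # mismatch: resume the scan at the next offset
        else:
            hits.append(start)  # every letter matched
    return hits

# Toy "database": a single concatenated protein sequence.
print(exact_matches("KVRASV", "MKVRASVKKLCRNCKIVKRDGVIRVICSAEPKHKQRQG"))  # [1]
```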


Here we looked only for sequences that exactly match the query. To find matches that do not include the first letter of the query, the same analysis may be conducted with fragments starting from the second letter of the original query, then from the third one, and so on:

Query 1:1 KVRASVKKLCRNCKIVKRDGVIRVICSAEPKHKQRQG

Query 2:1 VRASVKKLCRNCKIVKRDGVIRVICSAEPKHKQRQG

Query 3:1 RASVKKLCRNCKIVKRDGVIRVICSAEPKHKQRQG

Query 4:1 ASVKKLCRNCKIVKRDGVIRVICSAEPKHKQRQG

These searches, at a larger scale, become time-consuming, and finding more distant relatives raises additional conceptual and technical problems. Assume that sequences that are 99% identical are definitely homologous. Then, what is the threshold below which sequences should be considered non-homologous: 50% identity, 33%, or perhaps 25%? The example of two lysozymes shows that sequences with as low as 8% identity may belong to orthologous proteins and perform the same function.

Following the information theory of C. E. Shannon [The Mathematical Theory of Communication, 1949], we can calculate the information content of nucleic acids and of proteins. Two bits (each bit being 0 or 1) can encode four distinct values (00, 01, 10, 11), which is sufficient to represent one base position in DNA or RNA.

However, two bases (4^2 = 16 combinations) are not sufficient to code for the 20 amino acids that constitute the various protein molecules. Taking three bases (4^3) gives a code space of 64, which is more than the requisite 20. This redundancy leads to multiple codons for each amino acid, a degree of error tolerance, and special codon assignments (such as the stop codons: TAA, TAG, TGA).
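
The arithmetic behind these two paragraphs can be checked in a few lines of Python:

```python
import math

# Information per symbol is log2 of the alphabet size.
print(math.log2(4))    # one nucleotide: 2.0 bits
print(math.log2(20))   # one amino acid: ~4.32 bits

# Coding capacity of words of 1, 2, and 3 bases:
for n in (1, 2, 3):
    print(n, 4 ** n)   # 4, 16, 64 -> only triplets cover all 20 amino acids
```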

Another aspect is the execution of the “Central Dogma.” This is interesting in that it leads to the introduction of noise from such sources as vector sequences, heterologous sequences, rearranged and deleted sequences, repetitive-element contamination, frame-shift errors, and sequencing errors or natural polymorphism.

As a matter of fact, all four nucleotides, A, T, C, and G, are found in the database with approximately the same frequencies and have roughly the same probability of mutating one into another. As a result, DNA-DNA comparisons are largely based on simple text matching, which makes them fairly slow and not particularly sensitive, although a variety of heuristics have been devised to overcome this.

In Contrast, Amino Acid Sequence Comparisons have Several Distinct Advantages, which, at least Potentially, Lead to a Much Greater Sensitivity:

(i) There are 20 amino acids but only four bases. Hence, an amino acid match carries more than 4 bits of information (log2 20 ≈ 4.3) as opposed to only 2 bits for a nucleotide match. Thus, statistical significance can be established for much shorter sequences in protein comparisons than in nucleotide comparisons;

(ii) Owing to the redundancy of the genetic code, almost one-third of the bases in coding regions are under weak (if any) selective pressure and represent noise, which adversely affects the sensitivity of nucleotide searches;

(iii) Nucleotide sequence databases are much larger than protein databases because of the vast amounts of non-coding sequences coming out of eukaryotic genome projects, and this further lowers the search sensitivity; and

(iv) Probably most importantly, unlike nucleotide substitutions, different amino acid substitutions occur during evolution with substantially different likelihoods, and taking this into account greatly improves the performance of database search methods. Given all these advantages, comparisons of any coding sequences are typically carried out at the level of protein sequences; even when the goal is to produce a DNA-DNA alignment (e.g. for analysis of substitutions in silent codon positions), the alignment is usually first made with protein sequences, which are then replaced by the corresponding coding sequences. Direct nucleotide sequence comparison is indispensable only when non-coding regions are analyzed.

The laboratory- and research-based sequencing data and other types of information relating to nucleic acids and proteins are collected in bioinformatics databases of two broad categories: central repositories (such as NCBI for nucleotide sequences, Swiss-Prot for protein sequences, PDB for protein structures, and smaller ones like FlyBase, MGD for the mouse genome, and RGD for the rat genome) and combined/secondary databases (such as KEGG for pathways and genomes, and PROSITE for annotated protein motifs and families).

These databases are among the most sophisticated in the computing world and hence require organizational as well as voluntary support for maintenance and upkeep. In fact, the databases are not mere collections of sequences. For example, the PDB (Protein Data Bank) is the single largest worldwide repository for three-dimensional structures of large biological molecules; as of early September 2006, it stored 38,620 structures.

Thus it houses the sequence, atomic coordinates, derived geometric data, and secondary structure content, as well as annotations and literature references. The PDB was established with 7 structures in 1971 at Brookhaven National Laboratory, and in 1998 the Research Collaboratory for Structural Bioinformatics (RCSB) was assigned to manage its affairs.

Substitution Scores and Substitution Matrices:

The fact that each of the 20 standard protein amino acids has its own unique properties means that the likelihood of the substitution of each particular residue for another residue during evolution should be different. Generally, the more similar the physico-chemical properties of two residues, the greater is the chance that the substitution will not have an adverse effect on the protein’s function and, accordingly, on the organism’s fitness.

Hence, in sequence comparisons, such a substitution should be penalized less than a replacement of amino acid residue with one that has dramatically different properties. This is an oversimplification, because the effect of a substitution depends on the structural and functional environment where it occurs.

But, in general, we do not have a priori knowledge of the structural and functional environment of a particular residue, and even with such knowledge, incorporating it into a database search algorithm is an extremely complex task.

Thus, a generalized measure of the likelihood of amino acid substitutions is applied, so that each substitution is given an appropriate value or score (weight) to be used in sequence comparisons. The score for a substitution between amino acids i and j can be expressed by the following intuitively plausible formula, which shows how likely a particular substitution is, given the frequencies of each of the two residues in the analyzed database:

S_ij = K ln (q_ij/p_i p_j) (I)

where K is a coefficient, q_ij is the observed frequency of the given substitution, and p_i, p_j are the background frequencies of the respective residues. Obviously, here the product p_i p_j is the expected frequency of the substitution, and, if q_ij = p_i p_j (S_ij = 0), the substitution occurs just as often as expected. In practice, the scores used are scaled such that the expected score for aligning a random pair of amino acid sequences is negative.
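
Equation (I) is straightforward to compute. The sketch below uses invented frequencies purely to illustrate the sign convention: the score is positive when a substitution is observed more often than chance predicts, and zero when q_ij = p_i p_j.

```python
import math

def substitution_score(q_ij, p_i, p_j, K=1.0):
    """Equation (I): S_ij = K * ln(q_ij / (p_i * p_j))."""
    return K * math.log(q_ij / (p_i * p_j))

# Two residues with background frequencies of 5% each (invented numbers):
print(substitution_score(0.0100, 0.05, 0.05))  # observed 4x more often than expected: positive
print(substitution_score(0.0025, 0.05, 0.05))  # observed exactly as often as expected: 0.0
```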

There are two fundamental ways to design a substitution score matrix, i.e. a triangular table containing 210 numerical score values for each pair of amino acids, including identities (the diagonal elements of the matrix). As in many other situations in computational biology, the first approach works ab initio, whereas the second one is empirical.

One ab initio approach calculates the score as the number of nucleotide substitutions that are required to transform a codon for one amino acid in a pair into a codon for the other. In this case, the matrix is obviously unique (as long as alternative genetic codes are not considered) and contains only four values: 0, 1, 2, or 3.

Accordingly, this is a very coarse-grained matrix that is unlikely to work well. The other ab initio approach assigns scores on the basis of similarities and differences in the physico-chemical properties of amino acids.

Under this approach, the number of possible matrices is infinite, and they may have as fine a granularity as desirable, but a degree of arbitrariness is inevitable because our understanding of protein physics is insufficient to make informed decisions on what set of properties “correctly” reflects the relationships between amino acids.

Empirical approaches, which came first, attempt to derive the characteristic frequencies of different amino acid substitutions from actual alignments of homologous protein families. In other words, these approaches strive to determine the actual likelihood of each substitution occurring during evolution. Obviously, the outcome of such efforts critically depends on the quantity and quality of the available alignments, and even now, any alignment database is far from being complete or perfectly correct.

Furthermore, simple counting of different types of substitutions will not suffice if alignments of distantly related proteins are included because, in many cases, multiple substitutions might have occurred in the same position. Ideally, one should construct the phylogenetic tree for each family, infer the ancestral sequence for each internal node, and then count the substitutions exactly. This is not practicable in most cases, and various shortcuts need to be taken.

Several solutions to these problems have been proposed, each resulting in a different set of substitution scores. The first substitution matrix, constructed by Dayhoff and Eck (1968), was based on an alignment of closely related proteins, so that the ancestral sequence could be deduced and each amino acid replacement could be considered to have occurred just once.

This model was then extrapolated to account for more distant relationships, which resulted in the PAM series of substitution matrices. PAM (Accepted Point Mutation) is a unit of evolutionary divergence of protein sequences, corresponding to one amino acid change per 100 residues.

Thus, for example, the PAM30 matrix is supposed to apply to proteins that differ, on average, by 0.3 change per aligned residue, whereas PAM250 should reflect evolution of sequences with an average of 2.5 substitutions per position.

Accordingly, the former matrix should be employed for constructing alignments of closely related sequences, whereas the latter is useful in database searches aimed at detection of distant relationships. Using an approach similar to that of Dayhoff, combined with rapid algorithms for protein sequence clustering and alignment, Jones, Taylor, and Thornton produced the series of so-called JTT matrices, which are essentially an update of the PAMs.

The PAM and JTT matrices, however, have limitations arising out of the fact that they have been derived from alignments of closely related sequences and extrapolated to distantly related ones. This extrapolation may not be fully valid because the underlying evolutionary model might not be adequate, and the trends that determine sequence divergence of closely related sequences might not apply to the evolution at larger distances.

In 1992, Steven and Jorja Henikoff developed a series of substitution matrices using conserved ungapped alignments of related proteins from the BLOCKS database. The use of these alignments offered three important advantages over the alignments used for constructing the PAM matrices.

First, the BLOCKS collection obviously included a much larger number and, more importantly, a much greater diversity of protein families than the collection that was available to Dayhoff and coworkers in the 1970’s.

Second, coming from rather distantly related proteins, BLOCKS alignments better reflected the amino acid changes that occur over large phylogenetic distances and thus produced substitution scores that represented sequence divergence in distant homologs directly, rather than through extrapolation.

Third, in these distantly related proteins, BLOCKS included only the most confidently aligned regions, which are likely to best represent the prevailing evolutionary trends. These substitution matrices, named the BLOSUM (= BLOCKS Substitution Matrix) series, were tailored to particular evolutionary distances by ignoring the sequences that had more than a certain percent identity.

In the BLOSUM62 matrix, for example, the substitution scores were derived from the alignments of sequences that had no more than 62% identity, whereas the substitution scores of the BLOSUM45 matrix were calculated from alignments containing sequences with no more than 45% identity.

Accordingly, BLOSUM matrices with high numbers, such as BLOSUM80, are best suited for comparisons of closely related sequences (it is also advisable to use BLOSUM80 for database searches with short sequences), whereas low-number BLOSUM matrices, such as BLOSUM45, are better for distant relationships.

In addition to the general purpose PAM, JTT, and BLOSUM Series, some specialized substitution matrices were developed, for example, for integral membrane proteins, but they never achieved comparable recognition.

Several early studies found that the PAM matrices based on empirical data consistently resulted in greater search sensitivity than any of the ab initio matrices. An extensive empirical comparison showed that: (i) BLOSUM matrices consistently outperformed PAMs in BLAST searches, and (ii) on average, BLOSUM62 performed best in the series; this matrix is currently used as the default in most sequence database searches.

It is remarkable that, so far, empirical matrices have consistently outperformed those based on theory, either physico-chemical or evolutionary. This perhaps points out that we do not yet have an adequate theory to describe protein evolution.

Statistics of Protein Sequence Comparison:

Let us consider the same protein sequence (E. coli RpsJ) as above:

Query: 1 MKVRASVKKLCRNCKIVKRDGVIRVICSAEPKHKQRQG 38

and check how many times segments of this sequence of different lengths are found in the database (we chose fragments starting from the second position in the sequence because nearly every protein in the database starts with a methionine). Not unexpectedly, we find that the larger the fragment, the smaller the number of exact matches in the database.

With the decrease in the number of database hits, the likelihood that these hits are biologically relevant, i.e. belong to homologs of the query protein, increases. Thus, 13 of the 23 occurrences of the string KVRASV and all 8 occurrences of the string KVRASVK are from RpsJ orthologs.

The number of occurrences of a given string in the database can be roughly estimated as follows. The probability of matching one amino acid residue is 1/20 (assuming equal frequencies of all 20 amino acids in the database; this not being the case, the probability is actually slightly greater). The probability of matching two residues in a row is then (1/20)^2, and the probability of matching n residues is (1/20)^n. Given that the protein database currently contains N ≈ 2 × 10^8 letters, one should expect a string of n letters to match approximately N × (1/20)^n times.
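
This estimate is easy to reproduce; below is a small sketch using the database size quoted above (the function name is ours):

```python
def expected_matches(n, db_letters=2e8, alphabet_size=20):
    """Expected chance occurrences of an n-letter string in the
    database: N * (1/20)**n, assuming equal residue frequencies."""
    return db_letters * (1.0 / alphabet_size) ** n

for n in range(4, 9):
    print(n, round(expected_matches(n), 4))
# A 6-mer is still expected ~3 times by chance, an 8-mer only ~0.008 times,
# which is why longer exact matches are more likely to be biologically relevant.
```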

Searching for perfect matches is the simplest but insufficient form of sequence database search. However, it is important as one of the basic steps in currently used search algorithms.

Further, the goal of a search is to find homologs, including distant homologs in which only a small fraction of the amino acid residues are identical or even similar. Even in close homologs, a region of high similarity is usually flanked by dissimilar regions, as in the following alignment of E. coli RpmJ with its ortholog from Vibrio cholerae.

In this example, the region of highest similarity is in the middle of the alignment, but including the less conserved regions on both sides improves the overall score.

Further along the alignment, the similarity almost disappears, so that inclusion of additional letters into the alignment would not increase the overall score or would even decrease it. Such fragments of the alignment of two sequences, whose similarity score cannot be improved by adding or trimming any letters, are referred to as high-scoring segment pairs (HSPs). For this approach to work, the expectation of the score for random sequences must be negative, and the scoring matrices used in database searches are scaled accordingly.

So, instead of looking for perfect matches, sequence comparison programs actually search for HSPs. Once a set of HSPs is found, different methods, such as Smith-Waterman, FASTA, or BLAST, deal with them in different fashions.

However, the principal issue that any database search method needs to address is identifying those HSPs that are unlikely to occur by chance and, by inference, are likely to belong to homologs and to be biologically relevant. This problem has been solved by Samuel Karlin and Stephen Altschul, who showed that maximal HSP scores follow the extreme value distribution. Accordingly, if the lengths of the query sequence (m) and the database (n) are sufficiently high, the expected number of HSPs with a score of at least S is given by the formula

E = Kmn e^(-λS) (II)

Here, S is the so-called raw score calculated under a given scoring system, and K and λ are natural scaling parameters for the search space size and the scoring system, respectively. Normalizing the score according to the formula:

S’ = (λS – ln K)/ln 2 (III)

gives the bit score, which is expressed in a standard unit accepted in information theory and computer science. Then,

E = mn 2^(-S’) (IV)

and, since it can be shown that the number of random HSPs with score ≥ S’ is described by a Poisson distribution, the probability of finding at least one HSP with bit score ≥ S’ is

P = 1 − e^(-E) (V)

Equation (V) links two commonly used measures of sequence similarity, the probability (P-value) and expectation (E-value). For example, if the score S is such that three HSPs with this score (or greater) are expected to be found by chance, the probability of finding at least one such HSP is 1 − e^(-3) ≈ 0.95.

By definition, P-values vary from 0 to 1, whereas E-values can be much greater than 1. The BLAST programs report E-values, rather than P-values, because E-values of, for example, 5 and 10 are much easier to comprehend than P-values of 0.993 and 0.99995. However, for E < 0.01, the P-value and E-value are nearly identical.
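
Equations (IV) and (V) can be combined in a few lines; the query length and bit score below are invented for illustration.

```python
import math

def e_value(m, n, bit_score):
    """Equation (IV): expected number of chance HSPs, E = m*n*2^(-S')."""
    return m * n * 2.0 ** (-bit_score)

def p_value(E):
    """Equation (V): probability of at least one chance HSP, P = 1 - e^(-E)."""
    return 1.0 - math.exp(-E)

E = e_value(m=38, n=2e8, bit_score=40)  # a 38-residue query, 2 x 10^8-letter database
print(E, p_value(E))  # ~0.0069 for both: for E < 0.01 the two are nearly identical
print(p_value(3.0))   # E = 3 -> P ~ 0.95, as in the example above
```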

The product mn defines the search space, a critically important parameter of any database search. Equations (II) and (IV) codify the intuitively obvious notion that the larger the search space, the higher the expectation of finding an HSP with a score greater than any given value. There are two corollaries of this that might take some more time in getting used to: (i) the same HSP may come out statistically significant in a small database and not significant in a large database ; with the natural growth of the database, any given alignment becomes less and less significant (but by no means less important because of that) and (ii) the same HSP may be statistically significant in a small protein (used as a query) and not significant in a large protein.

Clearly, one can easily decrease the E-value and the P-value associated with the alignment of the given two sequences by lowering n in equation (II), i.e. by searching a smaller database. However, the resulting increase in significance is false, although such a trick can be useful for detecting initial hints of subtle relationships that should be subsequently verified using other approaches.

It is the experience of the author that the simple notion of E (P)-value is often misunderstood and interpreted as if these values applied just to a single pairwise comparison (i.e., as if an E-value of 0.001 for an HSP with score S meant that, in a database of just a few thousand sequences, one should expect to find a score > S by chance).

It is critical to realize that the size of the search space is already factored in these E-values, and the reported value corresponds to the database size at the time of search (thus, it is certainly necessary to indicate, in all reports of sequence analysis, which database was searched, and desirably, also on what exact date).

The Karlin-Altschul statistics has been rigorously proved to apply only to sequence alignments that do not contain gaps, whereas a statistical theory for the more realistic gapped alignments remains an open problem. However, extensive computer simulations have shown that these alignments also follow the extreme value distribution to high precision; therefore, at least for all practical purposes, the same statistical formalism is applicable.

Protein Sequence Complexity: Compositional Bias:

The existence of a robust statistical theory of sequence comparison, in principle, should allow one to easily sort search results by statistical significance and accordingly assign a level of confidence to any homology identification. However, a major aspect of protein molecule organization substantially complicates database search interpretation and may lead to gross errors in sequence analysis.

Many proteins, especially in eukaryotes, contain low (compositional) complexity regions, in which the distribution of amino acid residues is non-random, i.e. deviates from the standard statistical model. In other words, these regions typically have biased amino acid composition, e.g. are rich in glycine or proline, or in acidic or basic amino acid residues.

The notion of compositional complexity was encapsulated in the SEG algorithm and the corresponding program, which partitions protein sequences into segments of low and high (normal) complexity.

Low-complexity regions represent a major problem for database searches. Since the λ parameter of equation (II) is calculated for the entire database, Karlin-Altschul statistics breaks down when the composition of the query, a database sequence, or both significantly deviates from the average composition of the database.

The result is that low-complexity regions of similar composition (e.g. acidic or basic) often produce “statistically significant” alignments that have nothing to do with homology and are completely irrelevant. The SEG program can be used to overcome this problem in a somewhat crude manner: the query sequence, the database, or both can be partitioned into normal-complexity and low-complexity regions, and the latter are masked (i.e. amino acid symbols are replaced with the corresponding number of X’s).
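
The actual SEG algorithm partitions sequences using formal measures of compositional complexity; the sketch below is a deliberately simplified stand-in that masks windows of low Shannon entropy, just to make the idea of masking concrete (the window size and threshold are arbitrary choices of ours).

```python
import math
from collections import Counter

def entropy(window):
    """Shannon entropy of a sequence window, in bits per residue."""
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def mask_low_complexity(seq, window=12, threshold=2.2):
    """Replace residues falling in low-entropy windows with 'X'."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if entropy(seq[start:start + window]) < threshold:
            masked[start:start + window] = "X" * window
    return "".join(masked)

# A proline-rich low-complexity insert is masked; the flanks are kept.
print(mask_low_complexity("MKVAEWDTLCPPPPPPPPPPPPPPKHKQRQGLEIVN"))
```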

For the purpose of a database search, such filtering is usually done using short windows, so that only segments with a strong compositional bias are masked. Low-complexity filtering has been indispensable for making database search methods, in particular BLAST, into reliable tools.

Without masking low-complexity regions, false results would have been produced for a substantial fraction of proteins, especially eukaryotic ones (an early estimate held that low-complexity regions comprise ~15% of the protein sequences in the SWISS-PROT database). These false results would have badly polluted any large-scale database search, and the respective proteins would have been refractory to any meaningful sequence analysis.

For these reasons, for several years, SEG filtering had been used as the default for BLAST searches to mask low-complexity segments in the query sequence. However, this procedure is not without its drawbacks. Not all low-complexity sequences are captured, and false-positives still occur in database searches.

The opposite problem also hampers database searches for some proteins, when short low-complexity sequences are parts of conserved regions. In such cases, the statistical significance of an alignment may be underestimated, at times grossly.

In a recent work of Alejandro Schaffer and colleagues, a different, less arbitrary approach for dealing with compositionally biased sequences was introduced. This method, called composition-based statistics, recalculates the λ parameter and, accordingly, the E values for each query and each database sequence, thus correcting the inordinately low (“significant”) E-values for sequences with similarly biased amino acid composition. This improves the accuracy of the reported E-values and eliminates most false-positives.

Gene Identification and Software Tools:

As discussed in the previous section, recognizing genes in DNA sequences remains one of the most pressing problems in genome analysis. Several different approaches to gene prediction have been developed, and several popular programs are most commonly used for this task: (i) some tools perform gene prediction ab initio, relying only on statistical parameters of the DNA sequence for gene identification; (ii) alternatively, homology-based methods rely primarily on identifying homologous sequences in other genomes and/or in public databases using BLAST or Smith-Waterman algorithms. Many of the commonly used methods combine these two approaches.

The absence of introns and the relatively high gene density in most genomes of prokaryotes and some unicellular eukaryotes provide for effective use of sequence similarity searches as the first step in genome annotation. Genes identified by homology can be used as the training set for one of the statistical methods for gene recognition, and the resulting statistical model can then be used for analyzing the remaining parts of the genome.

In most eukaryotes, the abundance of introns and long intergenic regions makes it difficult to use homology-based methods as the first step unless, of course, one can rely on similarity between several closely related genomes (e.g. human, mouse, and rat). As a result, gene prediction for genome sequences of multicellular eukaryotes usually starts with ab initio methods, followed by similarity searches with the initial exon assemblies.

One should remember that each of these methods has its own advantages and limitations, and none of them is perfect. A comparison of predictions generated by different programs reveals the cases where a given program performs the best and helps in achieving consistent quality of gene prediction.

Such a comparison can be performed, for example, using the TIGR Combiner program, which employs a voting scheme to combine predictions of different gene-finding programs, such as GeneMark, GlimmerM, GRAIL, GenScan, and Fgenes.

The computational tools that are most commonly used for gene prediction in large-scale genome annotation projects are described below.

GeneMark:

GeneMark was developed by Mark Borodovsky and James McIninch in 1993. It was the first tool for finding prokaryotic genes that employed a non-homogeneous Markov model to classify DNA regions into protein-coding, non-coding, and non-coding but complementary to coding.

Like other gene prediction programs, GeneMark relies on organism-specific recognition parameters to partition the DNA sequence into coding and non-coding regions and thus requires a sufficiently large training set of known genes from a given organism for best performance.

The program has been repeatedly updated and modified and now exists in separate variants for gene prediction in prokaryotic, eukaryotic, and viral DNA sequences.

Glimmer:

Gene Locator and Interpolated Markov Modeler, developed by Steven Salzberg and colleagues at Johns Hopkins University and TIGR, is a system for finding genes in prokaryotic genomes. To identify coding regions and distinguish them from non-coding DNA, Glimmer uses interpolated Markov models, i.e. a series of Markov models with the order of the model increasing at each step and the predictive power of each model separately evaluated.
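
Glimmer’s interpolated models combine several orders and weight them by reliability; the sketch below collapses this to a single fixed-order Markov chain, just to show the core idea of scoring a window by its log-likelihood ratio under coding versus non-coding models (the training sequences and the order are invented for illustration).

```python
import math
from collections import defaultdict

def train_markov(seqs, order=2):
    """Estimate P(next base | preceding `order` bases) with add-one
    pseudocounts from a set of training sequences."""
    counts = defaultdict(lambda: defaultdict(int))
    for s in seqs:
        for i in range(order, len(s)):
            counts[s[i - order:i]][s[i]] += 1
    return {ctx: {b: (nxt.get(b, 0) + 1) / (sum(nxt.values()) + 4) for b in "ACGT"}
            for ctx, nxt in counts.items()}

def coding_log_odds(seq, coding, noncoding, order=2):
    """Positive score: the window looks more like the coding model."""
    score = 0.0
    for i in range(order, len(seq)):
        ctx, b = seq[i - order:i], seq[i]
        score += math.log2(coding.get(ctx, {}).get(b, 0.25)
                           / noncoding.get(ctx, {}).get(b, 0.25))
    return score

# Toy training data; real gene finders are trained on genuine gene sets.
coding = train_markov(["ATGGCTGCTGGTGCTAAA", "ATGGCAGCAGGAGCTAAG"])
noncoding = train_markov(["TTTTATATATTTTAAATA", "ATATATTTTATATAAATT"])
print(coding_log_odds("GCTGCTGGT", coding, noncoding))  # > 0: "coding-like"
```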

Like GeneMark, Glimmer requires a training set, which is usually selected among known genes, genes coding for proteins with strong database hits, and/or simply long ORFs. Glimmer is used as the primary gene finder tool at TIGR, where it has been applied to the annotation of numerous microbial genomes.

Recently, Salzberg and coworkers developed GlimmerM, a modified version of Glimmer specifically designed for gene recognition in small eukaryotic genomes, such as the malaria parasite Plasmodium falciparum.

Grail:

Gene Recognition and Assembly Internet Link, developed by Ed Uberbacher and coworkers at the Oak Ridge National Laboratory, is a tool that identifies exons, polyA sites, promoters, CpG islands, repetitive elements, and frameshift errors in DNA sequences by comparing them to a database of known human and mouse sequence elements. Exon and repetitive element prediction is also available for Arabidopsis and Drosophila sequences.

Grail has been recently incorporated into the Oak Ridge genome analysis pipeline, which provides a unified web interface to a number of convenient analysis tools.

For prokaryotes, it offers gene prediction using Glimmer and Generation programs, followed by BLASTP searches of predicted ORFs against SWISS-PROT and NR databases and a HMMer search against Pfam. There is also an option of BLASTN search of the submitted DNA sequence against a variety of nucleotide sequence databases.

For human and mouse sequences, the Oak Ridge pipeline offers gene prediction using GrailEXP and GenScan, also followed by BLASTP searches of predicted ORFs against SWISS-PROT and NR databases and a HMMer search against Pfam. Again, the user can perform BLASTN search of the submitted DNA sequence against a variety of nucleotide sequence databases, as well as search for CpG islands, repeat fragments, tRNAs, and BAC-end pairs.

GenScan:

GenScan was developed by Chris Burge and Samuel Karlin at Stanford University and is currently hosted in the Burge laboratory at the MIT Department of Biology. This program uses a complex probabilistic model of the gene structure that is based on actual biological information about the properties of transcriptional, translational, and splicing signals.

In addition, it utilizes various statistical properties of coding and non-coding regions. To account for the heterogeneity of the human genome that affects gene structure and gene density, GenScan derives different sets of gene models for genome regions with different GC content.

Its high speed and accuracy make GenScan the method of choice for the initial analysis of large (in the megabase range) stretches of eukaryotic genomic DNA. GenScan has been used as the principal tool for gene prediction in the International Human Genome Project.

GeneBuilder:

GeneBuilder performs ab initio gene prediction using numerous parameters, such as GC content, dicodon frequencies, splicing site data, CpG islands, repetitive elements, and others. It also utilizes a unique approach that is based on evaluating relative frequencies of synonymous and non-synonymous substitutions to identify likely coding sequences.

In addition, it performs BLAST searches of predicted genes against protein and EST databases, which helps to refine the boundaries of predicted exons using the BLAST hits as guides. The program allows the user to change certain parameters, which permits interactive gene structure prediction. As a result, GeneBuilder is sometimes able to predict the gene structure with a good accuracy, even when the similarity of the predicted ORF to a homologous protein sequence is low.

Splice Site Prediction Software:

Programs for predicting intron splice sites, which are commonly used as subroutines in the gene prediction tools, can also be used as stand-alone programs to verify positions of splice sites or predict alternative splicing sites. Such programs can be particularly useful for predicting non-coding exons, which are commonly missed in the gene prediction studies. Recognition of the splice sites by these programs usually relies on statistical properties of exons and introns and on the consensus sequences of splicing signals.

Sequence Alignment and Similarity Search:

The Basic Alignment Concepts and Principal Algorithms:

Similarity searches aim at identifying the homologs of a given query protein (or nucleotide) sequence in the database. In principle, the only way to identify homologs is by aligning the query sequence against all the sequences in the database (some important heuristics that allow an algorithm to skip sequences that are obviously unrelated to the query are discussed below), sorting the hits by degree of similarity, and assessing their statistical significance, which is likely to be indicative of homology. Let us briefly discuss alignment methods first.

It is important to make a distinction between a global (i.e. full-length) alignment and a local alignment, which includes only parts of the analyzed sequences (subsequences). Although, in theory, a global alignment is best for describing relationships between sequences, in practice, local alignments are of more general use for two reasons: (i) it is common that only parts of compared proteins are homologous (e.g. they share one conserved domain, whereas other domains are unique), and (ii) often, only a portion of the sequence is conserved enough to carry a detectable signal, whereas the rest have diverged beyond recognition. Optimal global alignment of two sequences was first implemented in the Needleman-Wunsch algorithm, which employs dynamic programming.

Later, the notion of an optimal local alignment (the best possible alignment of two subsequences from the compared sequences) and the corresponding dynamic programming algorithm were introduced by Smith and Waterman. The cost of both of these algorithms is O(n^2), i.e. the time and memory required to generate an optimal alignment are proportional to the product of the lengths of the compared sequences (for convenience, the sequences are assumed to be of equal length n in this notation).
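
Below is a minimal sketch of this dynamic programming idea, in its local (Smith-Waterman) form, with a toy match/mismatch score and a linear gap penalty standing in for a real substitution matrix; dropping the max(0, ...) clamp turns the recurrence into the global Needleman-Wunsch one.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Optimal local alignment score by dynamic programming. H[i][j]
    holds the best score of an alignment ending at a[i-1], b[j-1];
    clamping at zero lets an alignment start anywhere (locality)."""
    H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best

print(smith_waterman_score("KVRASVKKL", "AVRASVKEL"))
```

The two nested loops fill an n × n matrix, which is exactly where the O(n^2) time and memory cost comes from.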

Optimal alignment algorithms for multiple sequences have O(n^k) complexity (where k is the number of compared sequences). Such algorithms are not feasible for k > 3 on any existing computers; therefore, all available methods for multiple sequence alignment produce only approximations and do not guarantee the optimal alignment.

It might be useful, at this point, to clarify the notion of optimal alignment. Algorithms like Needleman-Wunsch and Smith-Waterman guarantee the optimal alignment (global and local, respectively) for any two compared sequences.

It is important to keep in mind, however, that this optimality is a purely formal notion, which means that, given a scoring function, the algorithm outputs the alignment with the highest possible score. However, the statistical significance of the alignment and its biological relevance has to be estimated separately.

For better or worse, alignment algorithms treat protein or DNA as simple strings of letters without recourse to any specific properties of biological macromolecules. Therefore, it might be useful to illustrate the principles of local alignment using a text free of biological context as an example. Let us review the example provided on the NCBI website (the alignable regions are shown in bold):

I

“Once upon a midnight dreary, while I pondered, weak and weary,

Over many a quaint and curious volume of forgotten lore,

While I nodded, nearly napping, suddenly there came a tapping,

As of someone gently rapping, rapping at my chamber door.

“‘Tis some visitor,” I muttered, “tapping at my chamber door—

Only this, and nothing more.”

IV

“Presently my soul grew stronger, hesitating then no longer,

“Sir,” said I, “or Madam, truly your forgiveness I implore;

But the fact is I was napping, and so gently you came rapping,

And so faintly you came tapping, tapping at my chamber door,

That I scarce was sure I heard you”—here I opened wide the door,—

Darkness there, and nothing more.”

It is easy to see that, in the first two lines of the two stanzas, the longest common string consists of only five letters, with one mismatch:

… I pondered … (I)

… stronger…

The second lines align better, with two similar blocks separated by spacers of variable lengths, which requires gaps to be introduced, in order to combine them in one alignment:

…of forgotten ………. lore (II)

your forgive-ness I implore

In the third lines, there are common words of seven, four, and six letters, again separated by gaps:

…napping sud — den-ly there came a tapping, (III)

…napping and so gently you—came — rapping

The fourth lines align very well, with a long string of near identity at the end:

As of some one gently………………… rapping rapping at my chamber door (IV)

An d-so ………. f aintly you came tapping tapping at my chamber door

In contrast, there is no reasonable alignment between the fifth lines, except for the identical word ‘door’. Obviously, however, the fourth line of the second stanza may be aligned not only with the fourth (IV), but also with the fifth line of the first stanza:

… I muttered tapping at my chamber door (IV’)

… came tapping tapping at my chamber door

Alignments (IV) and (IV’) can thus be combined to produce a multiple alignment:

…rapping rapping at my chamber door (IV”)

…tapping tapping at my chamber door

………………. tapping at my chamber door

Finally, the sixth lines of the two stanzas could be aligned at their ends:

Only this- and nothing more (V)

Darkness there and nothing more

Now, which alignments actually reflect homology of the respective lines? The alignments III, IV, IV’ (and the derivative IV”), and V seem to be relevant beyond reasonable doubt. However, are they really correct? In particular, aligning en-ly/ently in III and ntly/ntly in IV requires introducing gaps into both sequences. Is this justified? We cannot answer this simple question without a statistical theory for assessing the significance of an alignment, including a way to introduce some reasonable gap penalties.

The treatment of gaps is one of the hardest and still unsolved problems of alignment analysis. There is no theoretical basis for assigning gap penalties relative to substitution penalties (scores). Deriving these penalties empirically is a much more complicated task than deriving substitution penalties as in PAM and BLOSUM series because, unlike the alignment of residues in highly conserved blocks, the number and positions of gaps in alignments tend to be highly uncertain.

Thus, gap penalties typically are assigned on the basis of the existing understanding of protein structure and from empirical examinations of protein family alignments: (i) deletion or insertion resulting in a gap is much less likely to occur than even the most radical amino acid substitution and should be heavily penalized, and (ii) once a deletion (insertion) has occurred in a given position, deletion or insertion of additional residues (gap extension) becomes much more likely.

Therefore a linear function:

G = a + bx, a >> b (VI)

where a is the gap opening penalty, b is the gap extension penalty, and x is the length of the gap, is used to deal with gaps in most alignment methods. Typically, a = 10 and b = 1 is a reasonable choice of gap penalties to be used in conjunction with the BLOSUM62 matrix. Using these values, the reader should be able to find out whether gaps should have been introduced in alignments III and IV above.

In principle, objective gap penalties could be produced through analysis of distributions of gaps in structural alignments, and such a study suggested using convex functions for gap penalties. However, this makes alignment algorithms much costlier computationally, and the practical advantages remain uncertain, so linear gap penalties are still universally employed.

The feasibility of alignments (IV) and (IV’) creates the problem of choice: which of these is the correct alignment? Alignment (IV) wins because it clearly has a longer conserved region. What, then, is the origin of line 5 in the first stanza and, accordingly, of alignment (IV’)? It is not too difficult to figure out that this is a repeat, a result of duplication of line 4 (this is what we have to conclude, given that line 4 is more similar to the homologous line in the second stanza). Such duplications are common in protein sequences, too, and often create major problems for alignment methods.

We concluded that lines 3, 4, and 6 in each stanza of “The Raven” are homologous, i.e. evolved from common ancestors with some subsequent divergence. In this case, the conclusion is also corroborated by the fact that we recognize the English words in these lines and see that they are indeed nearly the same and convey similar meanings, albeit differing in nuances. What about alignments (I) and (II)? The content here tells us that no homology is involved, even though alignment (II) looks “believable”.

However, it would not have been recognized as statistically significant in a search of any sizable database. Is this similarity purely coincidental, then? Obviously, it is not. This is a case of convergence.

Most of the existing alignment methods utilize modifications of the Smith-Waterman algorithm. One recent modification is BALSA, a Bayesian local alignment algorithm that explores series of substitution matrices and gap penalty values and assesses their posterior probabilities, thus overcoming some of the shortcomings of the Smith-Waterman algorithm.

Pairwise alignment methods are important largely in the context of a database search. For analysis of individual protein families, multiple alignment methods are critical. Feng and Doolittle introduced the idea of hierarchical clustering that roughly approximates the phylogenetic tree and guides the multiple alignment.

The sequences are first compared using a fast method (e.g. FASTA, see below) and clustered by similarity scores to produce a guide tree. Sequences are then aligned step-by-step in a bottom-up succession, starting from terminal clusters in the tree and proceeding to the internal nodes until the root is reached.

Once two sequences are aligned, their alignment is fixed and treated essentially as a single sequence, using a modification of dynamic programming. Thus, the hierarchical algorithms essentially reduce the O(n^k) multiple alignment problem to a series of O(n^2) problems, which makes the approach feasible, but potentially at the price of alignment quality.

The hierarchical algorithms attempt to minimize this problem by starting with the most similar sequences, for which the likelihood of incorrect alignment is minimal, in the hope that the increased weight of correctly aligned positions precludes errors in the subsequent steps.

The most commonly used method for hierarchical multiple alignment is Clustal, currently used in its ClustalW or ClustalX variants. The T-Coffee program is a more recent modification of Clustal that incorporates heuristics partially solving these problems.

Sequence Database Search Algorithms:

Smith-Waterman:

Any pairwise sequence alignment method in principle can be used for database search in a straightforward manner. All that needs to be done is to construct alignments of the query with each sequence in the database, one by one, rank the results by sequence similarity, and estimate statistical significance.

The classic Smith-Waterman algorithm is a natural choice for such an application, and it has been implemented in several database search programs, the most popular one being SSEARCH written by William Pearson and distributed as part of the FASTA package. It is currently available on numerous servers around the world.

The major problem preventing SSEARCH and other implementations of the Smith-Waterman algorithm from becoming the standard choice for routine database searches is the computational cost, which is orders of magnitude greater than it is for the heuristic FASTA and BLAST methods.

Since extensive comparisons of the performance of these methods in detecting structurally relevant relationships between proteins failed to show a decisive advantage of SSEARCH, the fast heuristic methods dominate the field. Nevertheless, on a case-by-case basis, it is certainly advisable to revert to a full Smith-Waterman search when other methods do not reveal a satisfactory picture of homologous relationships for a protein of interest. A modified, much faster version of the Smith-Waterman algorithm has been implemented in the MPSRCH program.

FASTA:

FASTA, introduced in 1988 by William Pearson and David Lipman, was the first database search program that achieved search sensitivity comparable to that of Smith-Waterman but was much faster. FASTA looks for biologically relevant local alignments by first scanning the sequence for short exact matches called “words”; a word search is extremely fast.

The idea is that almost any pair of homologous sequences is expected to have at least one short word in common. Under this assumption, the great majority of the sequences in the database that do not have common words with the query can be skipped without further examination, with a minimal waste of computer time. The sensitivity and speed of the database search with FASTA are inversely related and depend on the “k-tuple” variable, which specifies the word size; typically, searches are run with k = 3, but, if high sensitivity at the expense of speed is desired, one may switch to k = 2.
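
The word-lookup stage can be sketched as follows: a hypothetical index maps each k-tuple of a database sequence to its positions, so that a query sharing no words with a subject is rejected after cheap dictionary lookups.

```python
from collections import defaultdict

def word_index(seq, k=3):
    """Map every k-tuple ("word") of seq to its start positions."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index

def common_words(query, subject, k=3):
    """All (query_pos, subject_pos) pairs sharing a word; an empty
    result allows the search to skip this subject entirely."""
    index = word_index(subject, k)
    return [(q, s) for q in range(len(query) - k + 1)
            for s in index.get(query[q:q + k], [])]

print(common_words("KVRASVKKL", "AVRASVKEL"))  # [(1, 1), (2, 2), (3, 3), (4, 4)]
```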

Subsequently, Pearson introduced several improvements to the FASTA algorithm, which are implemented in the FASTA3 program.

BLAST:

Basic Local Alignment Search Tool (BLAST) is the most widely used method for sequence similarity search; it is also the fastest one and the only one that relies on a complete, rigorous statistical theory.

Like FASTA, and in contrast to the Smith-Waterman algorithm, BLAST employs word search heuristics to quickly eliminate irrelevant sequences, which greatly reduces the search time. The program initially searches for a word of a given length W (usually 3 amino acids or 11 nucleotides) that scores at least T when compared to the query using a given substitution matrix.

Word hits are then extended in either direction in an attempt to generate an alignment with a score exceeding the threshold S. The W and T parameters dictate the speed and sensitivity of the search, which can thus be varied by the user.
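
The extension step can be sketched as follows: starting from a word hit, an ungapped alignment is grown in both directions, remembering the best score seen so far and abandoning the extension once the running score falls too far below it (the “X-drop” trick); the parameter names and toy scores below are ours.

```python
def extend_hit(query, subject, q, s, k, score, xdrop=5):
    """Ungapped extension of a word hit at query[q:q+k] / subject[s:s+k]."""
    def extend(step, qi, si):
        best = running = 0
        while 0 <= qi < len(query) and 0 <= si < len(subject):
            running += score(query[qi], subject[si])
            best = max(best, running)
            if best - running > xdrop:  # the score has dropped too far: stop
                break
            qi += step
            si += step
        return best

    seed = sum(score(query[q + i], subject[s + i]) for i in range(k))
    # Extend to the right of the seed, then to the left of it.
    return seed + extend(+1, q + k, s + k) + extend(-1, q - 1, s - 1)

score = lambda x, y: 2 if x == y else -1  # toy substitution "matrix"
print(extend_hit("KVRASVKKL", "AVRASVKEL", 1, 1, 3, score))  # 13
```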

The original version of BLAST (known as BLAST 1.4) produced only ungapped local alignments, for which rigorous statistical theory is available. Although this program performed well for many practical purposes, it repeatedly demonstrated lower sensitivity than the Smith-Waterman algorithm and the FASTA program, at least when run with the default parameters. The new generation of BLAST makes alignments with gaps, for which extensive simulations have demonstrated the same statistical properties as proved for ungapped alignments.

The BLASTX, TBLASTN, and TBLASTX programs are used when either the query or the database or both are uncharacterized sequences and the location of protein-coding regions is not known. These programs translate the nucleotide sequence of the query in all six possible frames and run a protein sequence comparison analogous to that in BLASTP.

A version of gapped BLAST known as WU-BLAST, with a slightly different statistical model that, in some cases, may lead to greater search sensitivity, is supported by Warren Gish at Washington University in St. Louis. Recently, the BLAST suite was supplemented with BLAST 2 Sequences, a tool for comparing just two nucleotide or protein sequences.

Because of its speed, high selectivity, and flexibility, BLAST is the first choice program in any situation when a sequence similarity search is required, and importantly, this method is used most often as the basis for genome annotation. Therefore, we may consider the practical aspects of BLAST use in some detail. Before that, however, we need to introduce some additional concepts that are critical for protein sequence analysis.

Motifs, Domains, and Profiles:

Protein Sequence Motifs and Methods for Motif Detection:

Often we face a very general question: what distinguishes biologically important sequence similarities from spurious ones? By looking at just one alignment of the query and its database hit, showing more or less scattered identical and similar residues, it might be hard to tell one from the other.

However, as soon as we align more homologous sequences, particularly from distantly related organisms, we get a clue as to the nature of the distinction. The constellation of conserved amino acid residues associated with a particular function is called a sequence motif. Typically, motifs are confined to short stretches of protein sequences, usually spanning 10 to 30 amino acid residues.

The notion of a motif, arguably one of the most important concepts in computational biology, was first explicitly introduced by Russell Doolittle in 1981. The following year, John Walker and colleagues described probably the most prominent sequence motif in the entire protein universe, the phosphate-binding site of a vast class of ATP/GTP-utilizing enzymes, which has now been named P-loop. Discovery of sequence motifs characteristic of a vast variety of enzymatic and binding activities of proteins proceeded first at an increasing and then, apparently, at a steady rate, and the motifs, in the form of amino acid patterns, were swiftly incorporated by Amos Bairoch in the PROSITE database.

There are two strictly conserved residues in the P-loop and two positions where one of two residues is allowed. By running this pattern against the entire protein sequence database, one immediately realizes just how general and how useful this pattern is.

Indeed, such a search retrieves sequences of thousands of experimentally characterized ATPases and GTPases and their close homologs. However, only about one-half of the retrieved sequences are known or predicted NTPases of the P-loop class, whereas the rest are false-positives. This is not surprising given the small number of residues in this pattern, which results in a probability of chance occurrence of about

(1/10) × (1/20) × (1/20) × (1/10) = 2.5 × 10^(-5)

With the current database size of about 3.2 × 10^8 residues, the expected number of matches is about 8,000!
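
The arithmetic is worth checking explicitly:

```python
# Two positions allowing one of two residues, two strictly conserved positions:
p_chance = (2/20) * (1/20) * (1/20) * (2/20)
print(p_chance)          # 2.5e-05
print(3.2e8 * p_chance)  # 8000.0 expected chance matches
```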

This simple calculation shows that this and many other similar patterns, although they include the most conserved amino acid residues of important motifs, are insufficiently selective to be good diagnostic tools. Still, this does not solve the problem of motif identification. Obviously, not even a single amino acid residue is conserved across all the homologs of a given protein.

Given this lack of strict conservation of amino acid residues in an enzymatic motif, the trend is even more pronounced in motifs associated with macromolecular interactions, in which invariant residues are the exception rather than the norm. Pattern search remains a useful first-approximation method for motif identification, especially because a rich pattern collection, PROSITE, can be searched using a rapid and straightforward program like SCANPROSITE. However, by the very nature of the approach, patterns are either insufficiently selective or too specific and, accordingly, are not adequate descriptions of motifs.

The way to properly capture the information contained in sequence motifs is to represent them as amino acid frequency profiles, which incorporate the frequencies of each of the 20 amino acid residues in each position of the motif.

Even in the absence of invariant residues, non-randomness of a motif may be quite obvious in a profile representation. Utilization of frequency profiles for database searches had a profound effect on the quality and depth of sequence and structure analysis. The principles and methods that made this possible are discussed in the next section.
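As an illustration of the profile representation, per-column residue frequencies can be computed from a multiple alignment with a few lines of code. The sketch below uses a tiny invented alignment and is meant only to convey the idea:

# Compute a per-column amino acid frequency profile from a toy
# (invented) ungapped alignment of homologous segments.
alignment = ["GKSGS", "GKTAS", "GRSGT"]
profile = []
for col in range(len(alignment[0])):
    counts = {}
    for seq in alignment:
        counts[seq[col]] = counts.get(seq[col], 0) + 1
    profile.append({aa: n / len(alignment) for aa, n in counts.items()})
print(profile[0])   # {'G': 1.0} -- an invariant column
print(profile[1])   # K and R frequencies (~0.67 and ~0.33): conserved, not invariant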

Protein Domains, PSSMs, and Advanced Methods for Database Search:

Sequence motifs are extremely convenient descriptors of conserved, functionally important short portions of proteins. However, motifs are not the natural units of protein structure and evolution. Such distinct units are protein domains. In structural biology, domains are defined as structurally compact, independently folding parts of protein molecules.

In comparative genomics and sequence analysis in general, the central, “atomic” objects are parts of proteins that have distinct evolutionary trajectories, i.e. occur either as stand-alone proteins or as parts of variable domain architectures (we refer to the linear order of domains in protein sequences as domain or multidomain architecture), but are never split into parts. Very often, probably in the majority of cases, such units of protein evolution exactly correspond to structural domains.

However, in some groups of proteins, an evolutionary unit may consist of two or more domains. On rare occasions, a domain consists of a single motif, as in the case of AT-hooks, but much more often, domains are relatively large, comprising 100 to 300 amino acid residues and including two or more distinct motifs. Motifs are highly conserved patches in multiple alignments of domains that tend to be separated by regions of less pronounced sequence conservation and often of variable length.

The notion of protein motifs has been employed directly in algorithms that construct multiple sequence alignments as a chain of motifs separated by unaligned regions. The first of such methods, Multiple Alignment Construction and Analysis Workbench (MACAW), originally used a BLAST-like method for approximately delineating conserved sequence blocks (motifs) and then allowed the user to determine whether inclusion of additional alignment columns increases the significance of the block alignment. MACAW is a very convenient, accurate, and flexible alignment tool; however, the block-search algorithm is O(n^k) (for k sequences of length n) and, accordingly, becomes prohibitively computationally expensive for a large number of sequences. MACAW is an interactive tool that embodies the important notion that completely automatic methods are unlikely to capture all important motifs in cases of subtle sequence conservation, particularly in proteins that substantially differ in length.

On many occasions, it remains the method of choice when careful alignment analysis is required although, given the current explosive growth of sequence data, the computational cost severely limits MACAW’s utility. Subsequently, Charles Lawrence, Andrew Neuwald, and co-workers adapted the Gibbs sampling strategy for motif detection and developed the powerful (if not necessarily user-friendly) PROBE method, which allows delineation of multiple, subtle motifs in large sets of sequences. Importantly, the Gibbs sampler is an O(n) algorithm, which allows analysis of large numbers of sequences. Gibbs sampling has been incorporated into MACAW as one of the methods for conserved block detection.

In principle, this should enable MACAW to efficiently align numerous sequences. In practice, the authors find it problematic to identify relevant motifs among the numerous blocks detected by the Gibbs sampler.

Arguably, the most important methodological advance based on the concepts of domains and motifs was the development of position-specific scoring matrices (PSSMs) and their use in database searches as an incomparably more powerful substitute for regular substitution matrices, such as BLOSUMs and PAMs. A PSSM is a rectangular table that consists of n columns (where n is the number of positions in the multiple alignment for which the PSSM is made) and 20 rows and contains, in each cell, the score (weight) for the given amino acid in the given position of the multiple alignment.

In the simplest case, this score can be the frequency of the amino acid in the given position. It is easy to realize, however, that, on most occasions, residue frequencies taken from any given alignment are unlikely to adequately describe the respective domain family. First, we never know the full range of family members, and there is no guarantee that the sequences available to us form a representative set.

Therefore, if a residue is missing in a particular alignment column, this does not justify a 0 score in a PSSM. In reality, a PSSM never includes a score of exactly 0, although scores for some residues might be extremely low, and rounding sometimes may result in 0 values.

Instead, a finite score is assigned to the missing residue using so-called regularizers, i.e. various mathematical techniques that strive to derive the correct distribution of amino acids for a given position on the basis of a limited sample. It is easy to realize that the score given to a missing residue depends on two factors: the distribution actually found in the sample of available superfamily members and the size of the sample.
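The simplest regularizer adds pseudocounts derived from background residue frequencies, so that a residue missing from a small sample still receives a small nonzero frequency. The sketch below uses a uniform background purely for simplicity; real implementations rely on more sophisticated schemes, such as Dirichlet mixtures:

# A minimal pseudocount regularizer for a single alignment column.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def regularized_frequencies(column, pseudocounts=1.0):
    # Blend observed counts with a uniform background distribution;
    # the smaller the sample, the stronger the background's influence.
    background = 1.0 / len(AMINO_ACIDS)
    n = len(column)
    return {aa: (column.count(aa) + pseudocounts * background)
                / (n + pseudocounts)
            for aa in AMINO_ACIDS}

freqs = regularized_frequencies("KKKR")
print(freqs["K"])   # high, but less than the raw 3/4
print(freqs["W"])   # small but nonzero, as discussed above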

Another aspect of PSSM construction that requires formal treatment beyond calculating and regularizing amino acid residue scores stems from the fact that many protein families available to us are enriched with closely related sequences (this might be the result of a genuine proliferation of a particular subset of a family or could be caused by sequencing bias).

Obviously, an overrepresented subfamily will sway the entire PSSM toward detection of additional closely related sequences and hamper its performance. To overcome this problem, different weighting schemes are applied during PSSM construction to downweight closely related sequences and increase the contribution of diverse ones. Optimal PSSM construction remains an important problem in sequence analysis, and even small improvements have the potential of significantly enhancing the power of database search methods.
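One widely used scheme of this kind is the position-based sequence weighting of Henikoff and Henikoff. A bare-bones version of the idea, applied to an invented ungapped toy alignment, might look as follows:

# Position-based sequence weights: each sequence receives 1/(r*s) per
# column, where r is the number of distinct residues in the column and
# s is the number of sequences sharing that sequence's residue.
alignment = ["GKS", "GKS", "GRT"]   # first two sequences are identical
weights = [0.0] * len(alignment)
for col in range(len(alignment[0])):
    column = [seq[col] for seq in alignment]
    r = len(set(column))
    for i, aa in enumerate(column):
        weights[i] += 1.0 / (r * column.count(aa))
total = sum(weights)
weights = [w / total for w in weights]
print(weights)   # the diverged third sequence gets the largest weight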

Once a PSSM is constructed, using it in a database search is straightforward and not particularly different from using a single query sequence combined with a regular substitution matrix, e.g. BLOSUM62. The common database search methods, such as BLAST, can work equally well with a PSSM, and the same statistics apply.

A decisive breakthrough in the evolution of PSSM-based methods for database searching was the development of the Position-Specific Iterated (PSI)-BLAST program. This program first performs a regular BLAST search of a protein query against a protein database. It then uses all the hits with scores greater than a certain cut-off to generate a multiple alignment and create a PSSM, which is used for the second search iteration.

The search goes on until convergence or for a desired number of iterations. Obviously, the first PSI-BLAST iteration must employ a regular substitution matrix, such as BLOSUM62, to calculate HSP scores. For the subsequent iterations, the PSSM regularization procedure was designed in such a way that the contribution of the initial matrix to the position-specific scores decreases, whereas the contribution of the actual amino acid frequencies in the alignment increases with the growing number of retrieved sequences. PSI-BLAST also employs a simple sequence-weighting scheme, which is applied during PSSM construction at each iteration.
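Schematically, the iteration logic just described can be summarized in pseudocode. In the Python-style sketch below, blast_search, build_multiple_alignment, and construct_pssm are hypothetical placeholder functions, not a real API:

# Pseudocode sketch of the PSI-BLAST iteration loop (placeholder functions).
def psi_blast(query, database, inclusion_threshold=0.005, max_iterations=5):
    pssm = None                       # iteration 1 uses BLOSUM62, not a PSSM
    included = set()
    for _ in range(max_iterations):
        hits = blast_search(query, database, pssm=pssm)
        passing = {h for h in hits if h.evalue <= inclusion_threshold}
        if passing == included:
            break                     # convergence: no new sequences found
        included = passing
        alignment = build_multiple_alignment(query, included)
        pssm = construct_pssm(alignment)  # regularized and sequence-weighted
    return included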

Since its appearance in 1997, PSI-BLAST has become the most common method for in-depth protein sequence analysis. The method owes its success to its high speed (each iteration takes only slightly longer than a regular BLAST run), the ease of use (no additional steps are required, the search starts with a single sequence, and alignments and PSSMs are constructed automatically on the fly), and high reliability, especially when composition-based statistics are invoked.

Hidden Markov Models (HMMs) of multiple sequence alignments are a popular alternative to PSSMs. HMMs can be trained on unaligned sequences or pre-constructed multiple alignments and, similarly to PSI-BLAST, can be iteratively run against a database in an automatic regime. A variety of HMM-based search programs are included in the HMMer2 package. HMM search is slower than PSI-BLAST, but there have been reports of greater sensitivity of HMMs. In the authors’ extensive experience, the results of protein superfamily analysis using PSI-BLAST and HMMer2 are remarkably similar.
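For reference, a typical HMMer2 session follows the build–calibrate–search pattern shown below; the file names are placeholders:

hmmbuild family.hmm family_alignment.sto    # build an HMM from an alignment
hmmcalibrate family.hmm                     # fit the E-value statistics
hmmsearch family.hmm protein_db.fasta       # search a sequence database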

The availability of techniques for constructing models of protein families and using them in database searches naturally leads to a vision of the future of protein sequence analysis. The methods discussed above, such as PSI-BLAST and HMMer, start with a protein sequence and gradually build a model that allows detection of homologs with low sequence similarity to the query. Clearly, this approach can be reversed such that a sequence query is run against a pre-made collection of protein family models.

In principle, if models were developed for all protein families, the problem of classifying a new protein sequence would be essentially solved. In addition to family classification, regular database searches like BLAST also provide information on the most closely related homologs of the query, thus giving an indication of its evolutionary affinity.

In itself, a search of a library of family models does not yield such information, but an extension of this approach is easily imaginable whereby a protein sequence, after being assigned to a family through PSSM and HMM search, is then fit into a phylogenetic tree. Searching the COG database may be viewed as a rough prototype of this approach.

Such a system seems to have the potential of largely replacing current methods with an approach that is both much faster and more informative. Given the explosive growth of sequence databases, transition to searching databases of protein family models as the primary sequence analysis approach seems inevitable in the relatively near future.

Only for discovering new domains will it be necessary to revert to searching the entire database, and since the protein universe is finite, these occasions are expected to become increasingly rare.

Presently, sequence analysis has not reached such an advanced state, but searches against large, albeit far from complete, databases of domain-specific PSSMs and HMMs have already become extremely useful approaches in sequence analysis. Pfam, SMART, and CDD are the principal tools of this type. Pfam and SMART perform searches against HMMs generated from curated alignments of a variety of protein domains.

The CDD server compares a query sequence to the PSSM collection in the CDD using the Reversed Position-Specific (RPS)-BLAST program. Algorithmically, RPS-BLAST is similar to BLAST, with minor modifications; Karlin-Altschul statistics apply to E-value calculation for this method. RPS-BLAST searches the library of PSSMs derived from the CDD, finding single- or double-word hits and then performing ungapped extension on these candidate matches.

If a sufficiently high-scoring ungapped alignment is produced, a gapped extension is done, and the alignments with E-values below the cut-off are reported. Since the search space is equal to nm, where n is the length of the query and m is the total length of the PSSMs in the database (which, at the time of writing, contains ~5,000 PSSMs), RPS-BLAST is ~100 times faster than regular BLAST.

Pattern-Hit-Initiated BLAST (PHI-BLAST) is a variant of BLAST that searches for homologs of the query that contain a particular sequence pattern. As discussed above, pattern search often is insufficiently selective. PHI-BLAST partially rectifies this by first selecting the subset of database sequences that contain the given pattern and then searching this limited database using the regular BLAST algorithm.

Although the importance of this method is not comparable to that of PSI-BLAST, it can be useful for detecting homologs with a very low overall similarity to the query that nevertheless retain a specific pattern.

Stand-Alone (Non-Web) BLAST:

The previous discussion applied to the web version of BLAST, which is indeed most convenient for analysis of small numbers of sequences and is, typically, the only form of database search used by experimental biologists. However, the web-based approach is not suitable for large-scale searches requiring extensive post-processing, which are common in genome analysis.

For these tasks, one has to use the stand-alone version of BLAST, which can be obtained from NCBI via FTP and installed locally under the Unix or Windows operating systems. Although the stand-alone BLAST programs do not offer all the conveniences available on the web, they do provide some additional and useful opportunities. In particular, stand-alone PSI-BLAST can be automatically run for a specified number of iterations or until convergence.

With the help of simple additional scripts, the results of stand-alone BLAST can be put to many uses beyond the straightforward database search. Searches with thousands of queries can be run automatically, followed by various post-processing steps.
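As an illustration, a minimal driver script of this kind is sketched below in Python. It assumes the legacy stand-alone blastall executable and a locally formatted protein database named nr; both names, as well as the directory layout, are placeholders:

# Run a protein BLAST search for every query in a directory and save
# one report per query; post-processing can then be added as needed.
import glob
import subprocess

for query_file in glob.glob("queries/*.fasta"):
    report_file = query_file.replace(".fasta", ".blast")
    subprocess.run(
        ["blastall", "-p", "blastp",   # protein-protein search
         "-i", query_file,             # query sequence file
         "-d", "nr",                   # locally formatted database
         "-e", "0.001",                # E-value cut-off
         "-o", report_file],           # output report
        check=True,
    )
    # parse and filter report_file here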

The BLASTCLUST program (written by Ilya Dondoshansky in collaboration with Yuri Wolf and E.V.K.), which is also available from NCBI via ftp and works only with stand-alone BLAST, allows clustering sequences by similarity using the results of an all-against-all BLAST search within an analyzed set of sequences as the input.

It identifies clusters using two criteria: (i) the level of sequence similarity, which may be expressed either as percent identity or as score density (number of bits per aligned position), and (ii) the length of the HSP relative to the lengths of the query and subject (e.g. one may require that, for two sequences to be clustered, the HSP(s) should cover at least 70% of each sequence). BLASTCLUST can be used, for example, to eliminate protein fragments from a database or to identify families of paralogs.
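For example, a clustering run implementing these two criteria might be invoked as follows (file names are placeholders, and the option letters are those of the legacy blastclust program; consult its documentation for the authoritative list):

blastclust -i proteins.fasta -o clusters.txt -p T -S 1.75 -L 0.7 -b T
# -p T    : input consists of protein sequences
# -S 1.75 : similarity threshold as score density (bits per position)
# -L 0.7  : HSP(s) must cover at least 70% of the sequence length
# -b T    : apply the coverage requirement to both sequences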

The core of NCBI’s BLAST services is BLAST 2.0, otherwise known as “Gapped BLAST”. This service is designed to take protein and nucleic acid sequences and compare them against a selection of NCBI databases.

The BLAST algorithm was designed to balance speed against increased sensitivity for distant sequence relationships. Instead of relying on global alignments (common in multiple sequence alignment programs), BLAST emphasizes regions of local alignment in order to detect relationships among sequences that share only isolated regions of similarity (Altschul et al., 1990).

BLAST is therefore more than a tool for viewing sequences aligned with each other or for finding homology; it is a program for locating regions of sequence similarity with a view to comparing structure and function.

Choosing BLAST Parameters: Composition-Based Statistics and Filtering:

As noted above, low-complexity sequences (e.g., acidic-, basic- or proline-rich regions) often produce spurious database hits in non-homologous proteins. Currently, this problem is addressed by using composition-based statistics as the default for NCBI BLAST; filtering with SEG is available as an option but is turned off by default. As shown in large-scale tests, composition-based statistics eliminates spurious hits for all but the most severe cases of low sequence complexity.

Expect Value, Word Size, Gap Penalty, Substitution Matrix:

Expect (E) value can be any positive number; the default value is 10. It is the number of matches with similar or better scores that one should expect to find in the database merely by chance. Typically, there is no reason to change this value. However, in cases when extremely low similarity needs to be analyzed, the threshold may be increased (e.g. to 100) and, conversely, when it is desirable to limit the size of the output, lower E-values may be used.

Word size (W) must be an integer; the default values are 3 for protein sequences and 11 for nucleotide sequences. This parameter determines the length of the initial seeds picked up by BLAST in search of HSPs. Currently supported values for the protein search are only 3 and 2. Changing word size to 2 increases sensitivity but considerably slows down the search. This is one of the last resorts for cases when no homologs are detected for a given query with regular search parameters.

BLASTN has a default word size of 11, i.e. it detects an HSP only where there is a run of 11 identical nucleotides. Even decreasing the word size to 7, the lowest value currently allowed for BLASTN, would not change the result if the longest stretch of identical nucleotides in the alignment in question is only 6 bases long.

This example not only shows once more why protein searches are superior to DNA-DNA searches; it also demonstrates that establishing that two given sequences are not homologous requires as much caution as proving that they are.

Accordingly, the statement that a reported sequence is “novel” and has no homologs in GenBank, often found in the scientific literature, should always be treated with a healthy dose of skepticism.

As described above, different amino acid substitution matrices are tailored to detect similarities among sequences with different levels of divergence. However, a single matrix, BLOSUM62, is reasonably efficient over a broad range of evolutionary change, so that situations when a matrix change is called for are rare.

For particularly long alignments with very low similarity, a switch to BLOSUM45 may be attempted, but one should be aware that this could also trigger an increase in the false-positive rate. In contrast, PAM30, PAM70, or BLOSUM80 matrices may be used for short queries.

Each substitution matrix should be used with the corresponding set of gap penalties. Since there is no analytical theory for calculating E-values for gapped alignments, parameters of equation II had to be determined by extensive computer simulations separately for each combination of a matrix, gap opening penalty, and gap extension penalty.

Therefore, only a limited set of combinations is available for use. However, there is no indication that substantial changes in these parameters would have a positive effect on the search performance.
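To make the above concrete, a stand-alone search of a short query with adjusted parameters might be set up as shown below; the option letters are those of the legacy blastall program and are given for illustration only:

blastall -p blastp -i short_query.fasta -d nr -e 100 -W 2 -M PAM30
# -e 100   : raised Expect threshold, useful for marginal similarities
# -W 2     : word size 2 increases sensitivity at a cost in speed
# -M PAM30 : a shallow matrix appropriate for short queries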

A useful feature that has been recently added to NCBI BLAST is the ability to save and bookmark the URL with a particular BLAST setup using the ‘Get URL’ button at the bottom of the page. For a habitual BLAST user, it pays off to save several setups customized for different tasks.

Running BLAST and Formatting the Output:

A BLAST search can be initiated with either a GI number or the sequence itself. In the current implementation at the NCBI web page, the user can run a BLAST search and then try several different ways of formatting the output. The default option involves toggling between two windows, which may become confusing; it may be convenient to switch to a one-window format using the Layout toggle and save the setup as indicated above.

CDD search is run by default in conjunction with BLAST. As discussed above, this search is much faster than regular BLAST and is often more sensitive. The CDD search is normally completed long before the results of conventional BLAST become available. This allows the user to inspect the CDD search output and get an idea of the domain architecture of the query protein while waiting for the BLAST results.

On many occasions, all one really needs from a database search is to recognize a particular protein through its characteristic domain architecture or to make sure that a protein of interest does not contain a particular domain. In such situations, there may be no reason to even wait for the regular BLAST to finish.

The CDD search may also be run as a stand-alone program from the main BLAST page. In this mode, it is possible to change the E-value threshold for reporting domain hits (default 0.01), which can be helpful for detecting subtle relationships and new versions of known domains.

The current BLAST setup includes limits on the number of descriptions and the number of alignments included in the output; the current defaults are 250 and 100, respectively. With the rapidly growing database size, there is often a need to increase these limits in order to investigate a particular protein family. Doing so, however, will likely result in large outputs that are hard to download and navigate. Limiting the search space as outlined above could be a viable and often preferable option.

The graphical overview option allows the user to select whether a pictorial representation of the database hits aligned to the query sequence is included in the output. Although it slows loading the page, this option is essential for quick examination of the output to get an idea of the domain architecture of the query. Each alignment in the graphical view window is color-coded to indicate its similarity to the query sequence.

The Alignment views menu allows the user to choose the mode of alignment presentation. The default Pairwise alignment is the standard BLAST view of the alignments between the query sequence and each of the database hits.

All other views are pseudo-multiple alignments produced by parsing the HSPs using the query as a template. Query-anchored with identities shows residues identical to those in the query as dashes; Query-anchored without identities is the same view with all residues shown explicitly. Flat query-anchored with identities is a multiple alignment that allows gaps in the query sequence; residues that are identical to those in the query sequence are shown as dashes. Flat query-anchored without identities also allows gaps in the query sequence but shows all the residues.

Pairwise alignment is definitely the most convenient for inspection of sequence similarities, but the “flat query-anchored without identities” option allows one to generate multiple alignments of reasonable quality that can be saved for further analysis. This option is best used with the numbers of descriptions and alignments (see above) limited to a manageable number (typically, no more than 50).

The Taxonomy Reports option allows the user to produce a taxonomic breakdown of the BLAST output. Given that many BLAST outputs are quite large these days, this is extremely helpful, allowing one to promptly assess the phyletic distribution of the given protein family and identify homologs from distant taxa.

Formatting for PSI-BLAST:

The output of BLAST can be used as the input for PSI-BLAST. The critical parameter that is typically set before starting the initial BLAST run is the inclusion threshold; the current default is E = 0.005. This parameter determines the E-value required to include an HSP into the multiple alignment that is used to construct the PSSM. Combined with composition-based statistics, the E-value of 0.005 is a relatively conservative cut-off. Spurious hits with lower E-values are uncommon: they are observed more or less as frequently as expected according to Karlin-Altschul statistics, i.e. approximately once in 200 searches.

Therefore, carefully exploring the results with higher E-values set as the inclusion threshold often allows one to discover subtle relationships that are not detectable with the default cut-off. When studying new or poorly understood protein families, we routinely employ thresholds up to 0.1.
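In the stand-alone version, this parameter is set explicitly; for the legacy blastpgp program, an exploratory run with a permissive inclusion threshold might look like this (option letters for illustration only):

blastpgp -i query.fasta -d nr -j 5 -h 0.1
# -j 5   : run up to five iterations (or until convergence)
# -h 0.1 : E-value threshold for inclusion of hits into the PSSM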

In the web version of PSI-BLAST, each new iteration has to be launched by the user. New sequences detected in the latest iteration with E-values above the cut-off are highlighted in the PSI-BLAST output. PSI-BLAST also has the extremely useful option of manually selecting or deselecting sequences for inclusion into the PSSM.

Selecting “hopeful” sequences with E-values below the cut-off may help in a preliminary exploration of an emerging protein family; deselecting sequences that appear to be spurious despite E-values above the cut-off may prevent corruption of the PSSM. The PSSM produced by PSI-BLAST at any iteration can be saved and used for subsequent database searches.

We realize that the above recommendation to investigate results that are not reported as statistically significant is an invitation to controversy. However, we believe there are several arguments in favour of this approach. First, such analyses of subtle similarities have repeatedly proved useful, including the original tests of PSI-BLAST effectiveness. Second, as in other types of research, what is really critical is the original discovery.

Once one gets the first glimpse of what might be an important new relationship, statistical significance often can be demonstrated using a combination of additional methods. Third, we certainly do not advocate lowering the statistical cut-off for any large-scale searches, let alone automated searches. This is safe only when applied in carefully controlled case studies.

Analysis and Interpretation of BLAST Results:

In spite of the solid statistical foundation, including composition-based statistics, BLAST searches inevitably produce both false positives and false negatives. The main cause of the appearance of false positives, i.e. database hits that have “significant” E-values but, upon more detailed analysis, turn out not to reflect homology, seems to be subtle compositional bias missed by composition-based statistics or low-complexity filtering.

The reason why false negatives are inevitable is, in a sense, more fundamental: in many cases, homologs really have low sequence similarity that is not easily captured in database searches and, even if reported, may not cross the threshold of statistical significance. In an iterative procedure like PSI-BLAST, both the opportunities to detect new and interesting relationships and the pitfalls are further exacerbated.

Beyond the (conceptually) straightforward issues of selectivity and sensitivity, functional assignments based on database search results require careful interpretation if we want to extract the most out of this type of analysis while minimizing the chance of false predictions. Below we consider both the issues of search selectivity and sensitivity and functional interpretation.

No cut-off value is capable of accurately partitioning the database hits for a given query into relevant ones, indicative of homology, and spurious ones. By considering only database hits with very high statistical significance (e.g. E < 10⁻¹⁰) and applying composition-based statistics, false positives can be eliminated for the overwhelming majority of queries, but the price to pay is high: numerous homologs, often including those that are most important for functional interpretation, will be missed.

This brief discussion certainly cannot cover all the “trade secrets” of sequence analysis. However, the above seems to be sufficient to formulate a few rules of thumb that help a researcher extract the maximum amount of information from database searches while minimizing the likelihood of false “discoveries”.

Some Finer Points are Noted Below:

1. Searching a domain library is often easier and more informative than searching the entire sequence database. However, the latter yields complementary information and should not be skipped if details are of interest.

2. Varying the search parameters, e.g. switching composition-based statistics on and off, can make a difference.

3. Using subsequences, preferably chosen according to objective criteria, e.g. separation from the rest of the protein by a low-complexity linker, may improve search performance.

4. Trying different queries is a must when analyzing protein (super) families.

5. Even hits below the threshold of statistical significance often are worth analyzing, albeit with extreme care.

6. Transferring functional information between homologs on the basis of a database description alone is dangerous. Conservation of domain architectures, active sites, and other features needs to be analyzed (hence automated identification of protein families is difficult and automated prediction of functions is extremely error-prone).

Bioinformatics for Learning the Intricacies of Biodiversity:

A domain is the smallest unit of evolution by the definition of the SCOP (Murzin et al., 1995) database of known protein structures. Small proteins consist of a single domain, and some larger proteins consist of more than one domain. A part of a protein is only considered a domain in its own right if it is observed elsewhere in nature on its own or in combination with different partner domains. Domains with structural, functional, and sequence evidence for a common evolutionary ancestor are classified within the same superfamily in SCOP.

The domain architecture of a protein is described by the order of the domains and the superfamilies to which they belong. The repertoire of architectures present in the genomes has arisen by duplication and recombination (Miyata and Suga, 2001; Ohno, 1970) of the ancestral superfamily domains (Chothia et al., 2003; Qian et al., 2001), often forming larger multi-domain proteins (Rossmann et al., 1974).

Recently, Julian Gough (2005) raised the primary question of the extent to which the architectures observed in the genomes are due to functional necessity as opposed to evolutionary descent, i.e. to what extent stringent selective requirements have led to identical architectures arising independently on multiple occasions. Convergent evolution is defined here as more than one independent evolutionary event (recombination) leading to the same domain architecture in different genomes.

If the shuffling of domains is functionally driven then we expect to find a great deal of evidence of convergent evolution, since the same architecture would be arrived at independently in several different genomes. A failure to detect convergent evolution points to evolutionary descent being the explanation for the observed presence of architectures in the genomes.

His findings include the fact that between 0.4% and 4% of sequences are involved in convergent evolution of domain architectures, with the actual number expected to be close to the lower bound. He further observed that the events leading to convergent evolution appear to be random, with no functional or structural preferences, and that changes in the number of tandem repeat domains occur more readily than changes that alter the domain composition. Hence, the principal conclusion is that the observed domain architectures of the sequences in the genomes are driven by evolutionary descent rather than functional necessity.