Here is a compilation of notes on Bioinformatics. After reading these notes you will learn about: 1. Definition of Bioinformatics 2. Bioinformatics in Institutes, Websites, Databases, Tools 3. Bioinformatics in Industry 4. Areas of Bioinformatics.


Note # 1. Definition of Bioinformatics:

Bioinformatics is currently defined as the study of information content and information flow in biological systems and processes. It serves as the bridge between observations (data) in diverse biologically-related disciplines and the derivation of understanding (information) about how the systems or processes function, and subsequently its application (knowledge).

Hwa Lim, often called the Father of Bioinformatics, coined the word ‘bio/informatique’ in 1987, while Temple Smith used the term ‘Bioinformatics’ in 1991.

In Silico Biology, a new area of Biology, has developed in recent years because data in the field of genetics are being generated at an unprecedented, exponential rate; managing and using these data requires increasing use of computers and the relevant software.

Computational Biology is another term often used interchangeably with bioinformatics, although the former typically focuses on algorithm development and specific computational methods, while the latter focuses more on hypothesis testing and discovery in the biological domain.

Systems Biology, another area of research, emerged because the availability of an enormous amount of molecular data and bioinformatics tools created unprecedented opportunities to assemble and integrate these data into networks of genes, proteins and biochemical pathways.

Bioinformatics involves collection, storage, retrieval and analysis of biological data that has a lot of applications in pharmaceutical, agricultural and food industries, and in molecular genetics research.

Biological data are generated by various genome sequencing projects and obtained by different techniques such as DNA sequencing (genome and EST), 2D gel electrophoresis, mass spectrometry (MS, MALDI, LC-MS), protein crystallization, microarrays (e.g., cDNA, oligos, peptide) and molecular markers (e.g., RFLP, RAPD, AFLP, SNP).

Thus bioinformatics is an interface of biological sciences, mathematics, physical sciences and computer sciences, i.e., the integrated field of biology and information technology.


Note # 2. Bioinformatics in Institutes, Websites, Databases, Tools:

i. Institutes:

Major public domain bioinformatics facilities are (Fig. 19.2):

Major Publicly Available Databases and Data Mining Tools

(a) NCBI – National Center for Biotechnology Information, USA.

(b) EBI – European Bioinformatics Institute, UK.

(c) SIB – Swiss Institute of Bioinformatics, Switzerland.

(d) GenomeNet (KEGG & DDBJ), Japan.

ii. Websites:

Some important websites commonly used for bioinformatics are depicted in Table 19.1.

Websites Commonly Used for Bioinformatics

iii. Databases:

Bioinformatics is involved in storing sequence information in different nucleic acid and protein databases, which can be accessed by people all over the world through network technology.

Protein databases:

The major protein databases are:

PDB, SWISS-PROT, PROSITE, ExPASy, PIR, PRINTS, BLOCKS, ProDom, Pfam, InterPro.

Nucleic acid databases:

The major nucleic acid databases are: GenBank, DDBJ, RefSeq, dbEST, NDB, CSD, EMBL.

iv. Tools for Genetic Studies:

In order to deal with molecular data, a wide range of software is now available, which facilitates the analysis of data in a user-friendly manner.

These tools are classified into four classes:

a. Statistical-analysis tools (Table 19.2)

b. Genome analysis tools (Table 19.3)

c. Sequence alignment tools (Table 19.4)

d. Genome annotation tools (Fig. 19.3)


Note # 3. Bioinformatics in Industry:

Bioinformatics has a great impact on agriculture, health care and the environment, which is expected to bring about a bio-industrial revolution.

Two Major Tools for Statistical Analysis of Data for Genetic Studies

Food industry:

Functional genomics is playing a major role in the food biotechnology industry. The complete genome sequence information available in different databases can be used for finding metabolic pathways, improving cell factories and developing novel preservation methods.

Tools for Genome Analysis

Agricultural industry:

Crops are improved by producing plants that carry genes for resistance to pathogens such as fungi and bacteria. Homology searches, the finding of conserved motifs and molecular modelling are useful in identifying disease resistance genes. Fungicides that can efficiently kill the pathogens are designed by molecular modelling.
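The search for conserved motifs mentioned above can be sketched with a PROSITE-style pattern turned into a regular expression. This is only a minimal illustration: the function names are hypothetical, the converter handles just the basic syntax (single residues, x, [..] sets, {..} exclusions and (n) or (n,m) repeats, not the < and > anchors), and the test sequence is made up. The N-glycosylation pattern N-{P}-[ST]-{P} is used as the example.

```python
import re


def prosite_to_regex(pattern):
    """Convert a basic PROSITE-style pattern to a Python regex string.

    Supported (assumption: a simplified subset of the syntax):
    single residues, x (any residue), [ABC] sets, {ABC} exclusions,
    and repeat counts such as x(2) or x(2,4).
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # Split an element into its core and an optional repeat count.
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = m.group(1), m.group(2), m.group(3)
        if core == "x":
            token = "."
        elif core.startswith("{") and core.endswith("}"):
            token = "[^" + core[1:-1] + "]"
        else:
            token = core  # a single residue or an [ABC] set
        if low and high:
            token += "{%s,%s}" % (low, high)
        elif low:
            token += "{%s}" % low
        parts.append(token)
    return "".join(parts)


def find_motif(sequence, pattern):
    """Return 0-based start positions of all (possibly overlapping) matches."""
    # A lookahead lets finditer report overlapping occurrences.
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [m.start() for m in regex.finditer(sequence)]


if __name__ == "__main__":
    protein = "MKNVSAANPTG"  # made-up sequence for illustration
    print(prosite_to_regex("N-{P}-[ST]-{P}"))
    print(find_motif(protein, "N-{P}-[ST]-{P}"))
```

In the made-up sequence, the asparagine followed by proline is rejected by the {P} exclusion, so only the first site is reported. Real motif databases and their scanning tools apply further rules that this sketch leaves out.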

Pharmaceutical industry:

Chemoinformatics is playing a key role in the pharmaceutical industry in identifying new drug targets from genomic data at a much faster rate. Disease-causing genes are identified using the tools of genomics and proteomics. Drug lead identification and drug optimization have also become easier with these tools.

The pharmaceutical industry is also using sequence information in the production of vaccines and therapeutic proteins.

List of Sequence Alignment Tools and their Specific Features

Basic Outline of an Automatic Genome Annotation Pipeline and Delivery System


Note # 4. Areas of Bioinformatics:

i. Genomics:

The complete genetic content of an organism is its genome, and the DNA obtained from it is called genomic DNA. The genomic DNA of a prokaryote contains all the coding regions and can be sequenced, whereas the DNA of eukaryotes includes both intron and exon sequences (coding sequences) as well as non-coding regulatory sequences such as promoter and enhancer sequences.

Genomics is the complete analysis of the entire genome of a chosen organism. It involves the study of the physical structure of the organism’s genome, i.e., its genetic makeup, to determine the number and types of genes present and to study the functions of the different genes.

Whole Genome Sequence Data:

Complete nucleotide sequences of nuclear, mitochondrial and chloroplast genomes have already been worked out for a large number of prokaryotes and several eukaryotes. By the year 2005, approximately 1400 viral genomes, 250 prokaryotic genomes (230 eubacteria and 20 archaea), 500 mitochondrial genomes and 35 chloroplast genomes had been fully sequenced.

Among the eukaryotes, the whole genomes of yeast (Saccharomyces cerevisiae), a nematode (Caenorhabditis elegans), the fruitfly (Drosophila melanogaster), human (Homo sapiens), a crucifer weed (Arabidopsis thaliana) and rice (Oryza sativa) have already been sequenced, and the data are available for annotation studies.

The sequence data of eukaryotic nuclear genomes are an important source for the identification, discovery and isolation of important genes. These data are very helpful in a variety of applications relevant to animal, plant and microbial biotechnology.

Functional and Structural Genomics:

Once the whole genome sequence becomes available, the next step is to assign functions to the different regions of the genome. Functional genomics is based on the use of genetic information to delineate protein structure, function, pathways and networks.

Function may be determined by ‘knocking out’ and ‘knocking in’ expressed genes in model organisms such as worm, fruitfly, yeast or mouse. Structural genomics involves solving the experimental structures of all possible protein folds which is playing an important role in high throughput function assignment.

Significance of Genomics:

Handling all this information requires input from probability theory, database management and manipulation, and computer science.

This will help in:

(a) Identification of open reading frame sequences,

(b) Gene splicing sites (introns),

(c) Gene annotation (inter-genomic comparisons) and

(d) Determination of sequence patterns of regulatory sites and gene regulations.
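As an illustration of item (a), the following is a minimal sketch of an open reading frame scan. It assumes the standard start codon ATG and the stop codons TAA, TAG and TGA, looks only at the three forward-strand reading frames, and uses a made-up DNA string; the function name and the min_codons parameter are hypothetical.

```python
# Minimal forward-strand ORF scan (assumption: standard start and stop codons).
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) tuples for ORFs on the forward strand.

    start is the 0-based index of the ATG, end is the index just past
    the stop codon, and min_codons counts codons before the stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Walk the sequence codon by codon in this reading frame.
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START:
                start = i
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


if __name__ == "__main__":
    dna = "CCATGAAATTTGGGTAACC"  # made-up example sequence
    for start, end, frame in find_orfs(dna):
        print(frame, start, end, dna[start:end])
```

A fuller scan would also examine the reverse complement strand; that step is left out here to keep the sketch short.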

ii. Proteomics:

The entire protein component of a given organism is called the ‘proteome’, a term coined by Wasinger in 1995. A proteome is the quantitatively expressed protein complement of a genome; it provides information on the gene products that are translated, the amounts of those products and any post-translational modifications.

Proteomics is an emerging area of research in the post-genomic era, which involves identifying the structures and functions of all proteins of a proteome. It is sometimes also treated as structure-based functional genomics.

Methods of Proteome Analysis:

Resolution and identification of proteins are possible by 2D-PAGE (Polyacrylamide Gel Electrophoresis) and Mass Spectrometry; a comparative 2-D gel approach or a protein chip approach helps to identify the proteins that are up- or down-regulated in a system.

A variety of other techniques are used for protein identification, the most common being matrix assisted laser desorption ionization time-of-flight mass spectrometry (MALDI-TOF MS). The hybrid electrospray ionization (ESI) method of quadrupole TOF-MS, with its increased mass accuracy, is becoming increasingly established.
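Mass spectrometric identification usually works by comparing measured peptide masses with masses predicted from database sequences. The notes do not spell out this workflow, so the following is only a sketch under assumptions: trypsin is taken to cleave after K or R unless the next residue is P, masses are monoisotopic residue masses plus one water per peptide, and the function names and example sequence are hypothetical.

```python
# Monoisotopic residue masses in daltons (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # added once per peptide for the free termini


def tryptic_digest(protein):
    """Split a protein after K or R, except when the next residue is P."""
    peptides, start = [], 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def peptide_mass(peptide):
    """Monoisotopic mass of an unmodified, uncharged peptide."""
    return sum(RESIDUE_MASS[residue] for residue in peptide) + WATER


if __name__ == "__main__":
    protein = "GAKRPGKAG"  # made-up sequence for illustration
    for peptide in tryptic_digest(protein):
        print(peptide, round(peptide_mass(peptide), 4))
```

Missed cleavages, modifications and ion charge states, which matter in real spectra, are deliberately ignored here.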

Scope of Proteomics:

Proteomics deals with significant problems like:

(a) Identification of functional domains in protein sequences.

(b) Single, multiple protein alignment (homology).

(c) Determining sequence-structure, sequence-function relationships (structural bioinformatics).

(d) Discovery of protein pattern and providing the framework for the analysis of signalling networks.
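The alignment task in item (b) can be illustrated with a small global alignment by dynamic programming (the Needleman-Wunsch approach). This is a sketch under simplifying assumptions: it uses a flat match/mismatch score and a linear gap penalty with hypothetical default values, whereas protein alignment in practice normally relies on substitution matrices.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align two sequences; return (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align the two residues
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )

    # Trace back from the bottom-right corner to recover one alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(needleman_wunsch("ACGT", "ACT"))
```

Multiple alignment is generally built up from such pairwise comparisons, but that step is beyond this sketch.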

Reverse Genetics:

Research in proteomics has made it possible to gain knowledge of all the proteins produced in an organism. These may or may not be directly responsible for a phenotypic trait, but the knowledge helps in determining the functions of all the genes in that organism. This has made the approach of reverse genetics feasible, because from the study of proteins one can deduce the function of the gene and the trait.

Significance of Proteomics:

The knowledge of proteomics is complementary to genomics and has become a major thrust area of genetics, molecular biology and biotechnology research. From the whole genome sequence, functional genes are identified as open reading frames (ORFs) having an initiation and a termination codon, but an ORF does not always represent a functional gene.

Verification of the gene product by proteome analysis serves a very useful purpose for ‘annotation of the genome’. Post-translational modification, expression and function of proteins are all regulated by various activities of cellular metabolism, and are also affected by proteolysis and protein-protein interactions.

Yeast Proteome:

The complete sequence of the whole genome of yeast was worked out in 1996; nearly 6200 genes are present in this small organism. In 2001, the functions of 93% of the proteins (5800 proteins) encoded by these genes were also elucidated. Later (2002-2005), networks involving their interactions were also studied.

A detailed study of the yeast proteome will be very useful for studying the functions of genes of higher organisms, including humans, since yeast is among the simplest eukaryotes.

iii. Transcriptomics:

Once genome sequences are completed, new questions arise about the functional roles of different genes; the cellular processes in which they participate; the mechanisms that regulate the interaction of genes and gene products; and changes in the level of gene expression in different cell types and states.

To answer these questions, a new area of science, transcriptomics, has emerged. The transcription of genes to produce RNA is the first stage of gene expression. Although mRNA is not the ultimate product of a gene, transcription is the first step of gene regulation, and information about transcript levels is needed for understanding gene regulatory networks.

The transcriptome is the complete set of mRNA transcripts produced by the genome at any one time. Unlike the genome, the transcriptome is extremely dynamic: all the cells of an organism contain the same genome, but the transcriptome varies considerably between cells and circumstances because of different patterns of gene expression.

Techniques for Transcriptome Analysis:

High throughput techniques based on DNA chip/microarray technology (i.e., cDNA microarrays, oligo microarrays), cDNA-AFLP (cDNA-amplified fragment length polymorphism) analysis, SAGE (serial analysis of gene expression) and a new technique, MPSS (massively parallel signature sequencing), are used for transcriptome analysis.

The cDNA microarray technique is based on the ability of the mRNA molecule to bind specifically to, or hybridize to, its original DNA coding sequence in the form of a cDNA template spotted on an array. A DNA chip is prepared on a silicon or glass based surface with regions of known sequence of the chosen target DNA, which can hybridize with an unknown labelled DNA sample.

Besides using cDNA clones as probes on an array, oligonucleotides of around 20 nucleotides can also be used as probe. Microarray experiments allow for comparison of gene expression profiles between two mRNA samples (e.g., treatment vs. control, or treatment 1 vs. treatment 2).

The most important advantage of microarray-based technology is that large data sets from different experiments can be combined together in a single database, which allows gene expression profiles from either different samples or samples from different treatments to be compared with each other and analysed together.
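The treatment-versus-control comparison described above is commonly summarised as a log2 ratio per gene. The sketch below assumes that each sample is already available as a mapping from gene name to a normalised intensity; the gene names, intensities, pseudocount and two-fold threshold are all hypothetical choices for illustration.

```python
import math


def log2_fold_changes(treatment, control, pseudocount=1.0):
    """Return gene -> log2 ratio for genes measured in both samples.

    The pseudocount (an assumed value) avoids division by zero for
    genes with no detectable signal.
    """
    return {
        gene: math.log2(
            (treatment[gene] + pseudocount) / (control[gene] + pseudocount)
        )
        for gene in treatment
        if gene in control
    }


def differentially_expressed(fold_changes, threshold=1.0):
    """Split genes into up- and down-regulated lists.

    A threshold of 1.0 on the log2 scale corresponds to a two-fold change.
    """
    up = sorted(g for g, fc in fold_changes.items() if fc >= threshold)
    down = sorted(g for g, fc in fold_changes.items() if fc <= -threshold)
    return up, down


if __name__ == "__main__":
    # Made-up intensities for three hypothetical genes.
    treatment = {"geneA": 799.0, "geneB": 99.0, "geneC": 49.0}
    control = {"geneA": 199.0, "geneB": 99.0, "geneC": 199.0}
    changes = log2_fold_changes(treatment, control)
    print(changes)
    print(differentially_expressed(changes))
```

Real analyses add replicate handling, normalisation and statistical testing, none of which is attempted here.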

Significance of Transcriptomics:

As the transcriptome includes all mRNA transcripts in the cell, it reflects the genes that are being actively expressed at any given time, with the exception of mRNA degradation phenomena such as transcriptional attenuation. Transcriptomics examines the expression level of mRNA in a given cell population.

Many DNA sequences that have been isolated have no known function. However, if they show expression patterns similar to those of a characterized gene, it is likely that their functions are similar. It is sometimes possible to identify conserved regulatory sequences of such genes.
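The idea of inferring function from similar expression patterns can be sketched by correlating expression profiles. The example below uses the Pearson correlation coefficient as one possible similarity measure (an assumption; the notes do not name a measure), and the gene names and expression values are made up.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have equal length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant profile has no defined correlation")
    return cov / math.sqrt(var_x * var_y)


def most_similar_gene(unknown_profile, known_profiles):
    """Return the characterized gene whose profile correlates best."""
    return max(
        known_profiles,
        key=lambda gene: pearson(unknown_profile, known_profiles[gene]),
    )


if __name__ == "__main__":
    # Made-up expression levels across four conditions.
    known = {
        "heat_shock_gene": [1.0, 5.0, 9.0, 2.0],
        "housekeeping_gene": [4.0, 4.1, 3.9, 4.0],
    }
    unknown = [2.0, 10.5, 17.0, 4.5]
    print(most_similar_gene(unknown, known))
```

A high correlation only suggests a shared function; it is a hypothesis to be tested, not a proof.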

Ultimately, these studies promise to expand the size of existing gene families, reveal new patterns of co-ordinated gene expression across gene families and uncover entirely new categories of genes.

Furthermore, the product of any one gene usually interacts with those of many others; transcriptomics will therefore provide precise knowledge of the coordination among genes and their inter-relationships.

It will also help to understand the integration of gene expression and function at the cellular level, revealing how multiple gene products work together to produce physical and chemical responses to both static and changing cellular needs.

iv. Metabolomics:

Whereas genomics is concerned with the total complement of genes and proteomics with the analysis of the entire set of proteins, metabolomics has been defined as the qualitative and quantitative measurement of all low molecular weight metabolites in a given sample, cell or tissue and the integration of the data in the context of gene function analysis.

In the ‘post-genomic era’, the genome expression profiling methods have come up at the level of the transcriptome, proteome and the metabolome.

The comprehensive measurements of the working parts of biological systems at these different levels of organisation will allow a full and global comparison of the differences between cell types, tissues, organs and whole organisms (plants, animals and microbes) to probe unknown aspects of gene function, physiology and metabolism.

Areas of Metabolomics:

Metabolic analysis can be divided into four general areas:

(a) Target Compound analysis:

The quantification of specific metabolites.

(b) Metabolic Profiling:

Quantitative and qualitative determination of a group of related compounds or of members of specific metabolic pathways.

(c) Metabolomics:

Quantitative and qualitative analysis of all metabolites.

(d) Metabolic Fingerprinting:

Sample classification by rapid global analysis, without extensive compound identification.

Techniques for Analysis of Metabolites:

Because cellular systems contain many different kinds of metabolites, no single analytical method can detect all the metabolites present in an extract.

Detection also depends on the solvent used to obtain the tissue extract. A mixture of techniques such as gas chromatography, high pressure liquid chromatography and capillary electrophoresis is used to separate different metabolites according to their various chemical and physical properties.

Proton (1H) NMR can detect any metabolite containing hydrogen. Gas chromatography (GC) provides high resolution compound separations and can be used in conjunction with a flame ionization detector (GC/FID) or a mass spectrometer (GC/MS). HPLC with UV detection is a common method used for targeted analysis of plant materials and for metabolic profiling of individual classes.

LC/MS and LC/NMR are powerful instruments for structure determination. Mass analysers such as Fourier transform ion cyclotron resonance instruments (FT-ICR-MS) can help to obtain ‘mass profiles’ from the crude extract without any chromatographic separation.

Significance of Metabolomics:

Metabolomics is a relatively new discipline, and techniques for high throughput metabolic profiling are still under development. The advantage of metabolomic analysis is that the biochemical consequences of mutations, changes in the environment and treatment with drugs can be observed directly.