Read this article to learn about the nature, concept and types of gene.
The different types of gene are: (1) Pseudo-Genes (2) Overlapping Genes (3) Multi-Gene Families (4) Split Genes and (5) Cryptic Genes.
Nature of Gene:
The characteristics of every organism are governed by factors of inheritance termed genes. As scientists have learned more about the nature of inheritance, our concept of the gene has undergone a remarkable evolution. Over the century following Mendel's work, these hereditary factors were shown to reside on chromosomes and to consist of deoxyribonucleic acid (DNA), a macromolecule with extraordinary properties.
Of the total DNA in the genome of an organism, about 25% defines genes, but only 1.5% codes directly for proteins. The rest is made up of RNA genes and non-coding sequences, many of which either serve no function or have a function that is still unknown.
Possibly the largest part of the genome (over 50% in larger eukaryotes) is not transcribed and may be partly functionless. In this brief introduction to the gene, our aim is to make clear what a gene is in a physical sense and in a functional sense, and how it makes proteins or governs a character.
These questions are important for understanding how a gene functions, and for understanding that different components, such as the regulator, the promoter, and the structural gene with its start and stop signals, make a gene functional (Fig. 2.1) and differentiate it from a simple DNA fragment or polynucleotide chain. Only then can we obtain a gene, make copies of it (cloning) and transfer it to another organism (producing a transgenic organism) together with a promoter that can regulate its expression.
Details of historical developments are given to trace the evolution of the science of genetics and of our understanding of how genes function. The gene has been defined as the unit of genetic information that controls a specific aspect of the phenotype. Such a description, though accurate, does not provide a precise, unambiguous definition that can be used to identify a gene at the molecular level.
Definition 1860s-1900s: Gene as a Discrete Unit of Heredity:
The concept of the “gene” has evolved and become more complex since it was first proposed. There are various definitions of the term, although common initial descriptions include the ability to determine a particular characteristic of an organism and the heritability of this characteristic. In particular, the word gene was first used by Wilhelm Johannsen in 1909, based on the concept developed by Gregor Mendel in 1866.
Definition 1910s: Gene as a Distinct Locus:
In the next major development, the American geneticist Thomas Hunt Morgan and his students studied the segregation of mutations in Drosophila melanogaster. They were able to explain their data with a model in which genes are arranged linearly and the frequency of crossing over between two genes is proportional to the distance that separates them. The first genetic map was created in 1913, and Morgan and his students published The Mechanism of Mendelian Inheritance in 1915.
To the early geneticists, a gene was an abstract entity whose existence was reflected in the way phenotypes were transmitted between generations. The methodology used by early geneticists involved mutations and recombination, so the gene was essentially a locus whose size was determined by mutations that inactivated (or activated) a trait of interest and by the size of the recombining regions. The fact that genetic linkage corresponded to physical locations on chromosomes was shown later, in 1929, by Barbara McClintock, in her cytogenetic studies on maize.
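The proportionality between crossing over and distance that Morgan's group used for mapping can be illustrated with a small calculation. The following is a minimal sketch, not the method of the original studies: the function names and the offspring counts are hypothetical, and the conversion of 1% recombinant offspring to one map unit (centimorgan) is a reasonable approximation only for genes that lie fairly close together.

```python
def recombination_frequency(parental: int, recombinant: int) -> float:
    """Fraction of scored offspring that carry a recombinant genotype."""
    total = parental + recombinant
    if total <= 0:
        raise ValueError("need at least one scored offspring")
    return recombinant / total


def map_distance_cm(parental: int, recombinant: int) -> float:
    """Approximate map distance: 1% recombinants is taken as 1 map unit (cM)."""
    return 100.0 * recombination_frequency(parental, recombinant)


# Hypothetical offspring counts for two pairs of linked genes.
close_pair = map_distance_cm(parental=900, recombinant=100)
distant_pair = map_distance_cm(parental=700, recombinant=300)

# More recombinants between two genes implies that they lie further apart.
print(close_pair, distant_pair)
```

Comparing pairwise distances in this way is what allows genes to be placed in a linear order along a chromosome.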
Definition 1940s: Gene as a Blueprint for a Protein:
Beadle and Tatum (1941), who studied Neurospora metabolism, discovered that mutations in genes could cause defects in steps in metabolic pathways. This was stated as the “one gene, one enzyme” view, which later became “one gene, one polypeptide.”
Definition 1950s: Gene as a Physical Molecule:
The fact that heredity has a physical, molecular basis was demonstrated by the observation that X-rays could cause mutations. Griffith’s (1928) demonstration that something in virulent but dead Pneumococcus strains could be taken up by live non-virulent Pneumococcus and transform them into virulent bacteria was further evidence in this direction. It was later shown that this substance could be destroyed by the enzyme DNase (Avery et al. 1944).
In 1952, Hershey and Chase established that the substance actually transmitted by bacteriophage to their progeny is DNA and not protein. A practical view of the gene was that of the cistron, a region of DNA defined by mutations that, in trans, could not genetically complement each other.
Definition 1960s: Gene as Transcribed Code:
It was the solution of the three-dimensional structure of DNA by Watson and Crick in 1953 that explained how DNA could function as the molecule of heredity. Base pairing explained how genetic information could be copied, and the existence of two strands explained how occasional errors in replication could lead to a mutation in one of the daughter copies of the DNA molecule. From the 1960s on, molecular biology developed at a rapid pace.
The RNA transcript of the protein-coding sequences was translated, using the genetic code (solved in 1965 by Nirenberg and co-workers), into an amino acid sequence. Francis Crick (1958) summarized the flow of information in gene expression (Fig. 2.2) as from nucleic acid to protein (the beginnings of the “Central Dogma”).
Definition 1970s-1980s: Gene as Open Reading Frame (ORF) Sequence Pattern:
The development of cloning and sequencing techniques in the 1970s, combined with knowledge of the genetic code, revolutionized the field of molecular biology by providing a wealth of information on how genes are organized and expressed. The first gene to be sequenced was from the bacteriophage MS2, which was also the first organism to be fully sequenced. The parallel development of computational tools led to algorithms for the identification of genes based on their sequence characteristics.
In many cases, a DNA sequence could be used to infer structure and function for the gene and its products. This situation created a new concept of the “nominal gene,” which is defined by its predicted sequence rather than as a genetic locus responsible for a phenotype. The identification of most genes in sequenced genomes is based either on their similarity to other known genes, or the statistically significant signature of a protein-coding sequence. In many cases, the gene effectively became identified as an annotated ORF in the genome.
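The idea of identifying a gene as an ORF can be sketched in a few lines. This is only an illustrative toy under stated assumptions: the function name `find_orfs`, the `min_codons` threshold and the test sequence are hypothetical, only one strand is scanned, and real gene-finding programs also rely on sequence similarity and statistical signatures, as described above.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 3) -> list[tuple[int, int, int]]:
    """Return (start, end, frame) for every ATG...stop stretch on one strand.

    `end` is the index just past the stop codon, so dna[start:end] is the ORF.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                # Count the codons before the stop codon.
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


# Made-up sequence with one short ORF in the third reading frame.
sequence = "CCATGAAATTTGGGTAACC"
for start, end, frame in find_orfs(sequence):
    print(frame, sequence[start:end])
```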
Definition 1990s-2000s: Annotated Genomic Entity, Enumerated in the Databanks:
The current definition of a gene used by scientific organizations that annotate genomes still relies on the sequence view. Thus, a gene was defined by the Human Genome Nomenclature Organization as “a DNA segment that contributes to phenotype/function. In the absence of demonstrated function a gene may be characterized by sequence, transcription or homology”.
Recently, the Sequence Ontology Consortium reportedly called the gene a “locatable region of genomic sequence, corresponding to a unit of inheritance, which is associated with regulatory regions, transcribed regions and/or other functional sequence regions”.
A Current Computational Image:
Genes as “subroutines” in the genomic operating system:
Given that counting genes in the genome is such a large-scale computational endeavor and that genes fundamentally deal with information processing, the vocabulary of computer science has naturally been applied more and more to describing them.
In particular, people in the computational biology community have used the description of a formal language to describe the structure of genes in very much the same way that grammars are used to describe computer programs with a precise syntax of upstream regulation, exons, and introns. Moreover, one image that is increasingly popular for describing genes is to think of them in terms of subroutines in a huge operating system (OS).
That is, insofar as the nucleotides of the genome are put together into a code that is executed through the process of transcription and translation, the genome can be thought of as an operating system for a living being. Genes are then individual subroutines in this overall system that are repeatedly called in the process of transcription.
Concepts of Gene:
1. One Gene-One Enzyme:
George Beadle and Edward Tatum conducted experiments on Neurospora crassa. They saw that this pink bread mold can grow on a medium containing only inorganic salts, a simple sugar, and one vitamin (biotin), which is called minimal medium. This shows that the mold is capable of synthesizing de novo all the other essential metabolites required for growth, such as purines, pyrimidines, amino acids, and other vitamins.
Furthermore, they reasoned that the biosynthesis of these growth factors must be under genetic control, and that mutations in genes whose products are involved in the biosynthesis of essential metabolites would be expected to produce mutant strains with additional growth-factor requirements.
From these experiments they concluded that each mutation resulted in a requirement for one growth factor. By correlating their genetic analyses with biochemical studies of the mutant strains, they demonstrated in several cases that one mutation resulted in the loss of one enzyme activity.
This work, for which Beadle and Tatum received a Nobel Prize in 1958, was soon verified by similar studies of many other organisms in many laboratories. The one gene-one enzyme concept thus became a central part of molecular genetics.
2. One Gene-One Polypeptide:
Subsequent to the work of Beadle and Tatum, many enzymes and structural proteins were shown to be heteromultimeric, that is, to contain two or more different polypeptide chains, with each polypeptide encoded by a separate gene. For example, in E. coli, the enzyme tryptophan synthetase is a heterotetramer composed of two α polypeptides encoded by the trpA gene and two β polypeptides encoded by the trpB gene.
Similarly, the hemoglobins, which transport oxygen from our lungs to all other tissues of our bodies, are tetrameric proteins that contain two α-globin chains and two β-globin chains, as well as four oxygen-binding heme groups. Other enzymes, for example, E. coli DNA polymerase III and eukaryotic RNA polymerase II, contain many different polypeptide subunits, each encoded by a separate gene. Thus the one gene-one enzyme concept was modified to one gene-one polypeptide.
3. Central Dogma:
By the mid-1950s DNA had been established as the genetic material, and its structure had been worked out by Francis Crick and James Watson (1953). Crick stated the ‘Central Dogma’ of molecular biology: the linear sequence of nucleotides in a segment of a DNA molecule determines the linear sequence of nucleotides in an RNA molecule (transcription), and the RNA molecule in turn determines the sequence of amino acids in a protein by ‘informational specificity’ (translation; Fig. 2.3). The whole process is known as the flow of information from DNA to protein, or the Central Dogma.
This is one of the functions of the DNA molecule (the other is replication). The principle of the flow of information from DNA to protein is well established. However, after the discovery of reverse transcriptase by Temin and Baltimore, and of the proteinaceous infectious particles called prions, scientists have looked for evidence of a completely reversed Central Dogma.
We know that several RNA viruses, the retroviruses (e.g., Rous sarcoma virus and HIV), can produce DNA from their RNA genomes with the help of the enzyme reverse transcriptase. How prions multiply is not known. If this becomes known, both steps of the Central Dogma might be shown to function in both directions. Temin, Baltimore and Dulbecco received the Nobel Prize in Physiology or Medicine in 1975.
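The forward flow of information described in this section (DNA to RNA to protein) can be made concrete with a short sketch. It uses the standard genetic code. The function names and the example sequence are hypothetical, and the model is simplified: it starts from the coding strand, ignores promoters and RNA processing, and begins translation at the first base.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon).replace("T", "U"): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_strand: str) -> str:
    """The mRNA matches the coding strand of the DNA, with U in place of T."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Read the mRNA codon by codon until a stop codon ('*') is reached."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = transcribe("ATGGCCAAATAG")  # made-up coding sequence
print(mrna)             # AUGGCCAAAUAG
print(translate(mrna))  # MAK (Met-Ala-Lys, one-letter amino acid codes)
```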
Different Types of Gene:
1. Pseudo-Genes:
Pseudo-genes are nonfunctional copies of genes. They are defunct relatives of known genes that have lost their protein-coding ability or are otherwise no longer expressed in the cell. Although they may have some gene-like features (such as promoters, CpG islands, and splice sites), they are nonetheless considered nonfunctional, because they have lost their protein-coding ability through various genetic changes (stop codons, frame shifts, or a lack of transcription) or are unable to encode RNA (as with rRNA pseudo-genes). For example, the human globin gene clusters contain five genes that are no longer active. Pseudo-genes are an indication that genomes are continually undergoing change.
There are mainly two types of pseudo-genes:
(a) Conventional Pseudo-Gene:
This is a gene that is inactivated because its nucleotide sequence has changed by mutation. Many mutations have only minor effects on the activity of a gene, but some can result in a totally non-functional gene, even if the change is in only one nucleotide. Once a pseudo-gene has become nonfunctional it will degrade through the accumulation of more mutations, and eventually it will no longer be recognizable as a former gene. The globin pseudo-genes are examples of conventional pseudo-genes.
(b) Processed Pseudo-Genes:
A processed pseudo-gene arises as an abnormal by-product of gene expression. The cell synthesizes mRNA during transcription. Such an mRNA is sometimes copied into cDNA by reverse transcription, and this cDNA copy subsequently reinserts into the genome (Fig. 2.4). Because a processed pseudo-gene is a copy of an mRNA molecule, it does not contain any introns that were present in its parent gene.
It also lacks the nucleotide sequences immediately upstream of the parent gene, which is the region in which the signals used to switch on expression of the parent gene are located. The absence of these signals means that a processed pseudo-gene is inactive.
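Why a processed pseudo-gene lacks both introns and upstream signals can be seen by following the steps in a toy model. This is a sketch under loud assumptions: the segment sequences are invented, the function names are hypothetical, and reverse transcription is simplified to returning the sense-strand DNA copy of the mRNA rather than modelling first- and second-strand synthesis.

```python
# A made-up parent gene as an ordered list of (kind, sequence) segments.
parent_gene = [
    ("promoter", "TATAAA"),
    ("exon", "ATGGCC"),
    ("intron", "GTAAGTTTTCAG"),
    ("exon", "AAATAG"),
]


def mature_mrna(gene: list[tuple[str, str]]) -> str:
    """Only exons survive splicing; the promoter is never transcribed here."""
    spliced = "".join(seq for kind, seq in gene if kind == "exon")
    return spliced.replace("T", "U")


def reverse_transcribe(mrna: str) -> str:
    """Simplified: return the DNA (sense-strand) copy of the mRNA."""
    return mrna.replace("U", "T")


processed_pseudogene = reverse_transcribe(mature_mrna(parent_gene))
print(processed_pseudogene)  # ATGGCCAAATAG
```

The copy that reinserts into the genome carries the exon sequences only, so it has neither the intron nor the promoter needed to switch on its expression.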
2. Overlapping Genes:
Overlapping genes are defined as a pair of adjacent genes whose coding regions are partially overlapping. In other words, a single stretch of DNA codes for portions of two separate proteins. For two genes to overlap, the signal to begin transcription for one must reside inside the second gene, whose transcriptional start site is further “upstream.”
In addition, the “stop” signal for the second gene must not be read by the ribosome during translation of the RNA copy of the gene. This is possible because RNA is read in triplets, meaning that any stretch of RNA contains three possible series of triplets that can be “read” by the cell’s protein-making machinery. Such series of nucleotide triplets are called reading frames, and they are different in the RNA transcripts of the overlapping genes.
Two genes overlap in a more subtle manner when the same sequence of DNA is shared between two non-homologous proteins (Fig. 2.5). This situation arises when the same sequence of DNA is translated in more than one reading frame.
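The three reading frames of a single sequence, and the way a stop codon in one frame is invisible in another, can be shown with a short sketch. The sequence and the function names are hypothetical and chosen only for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def codons_in_frame(dna: str, frame: int) -> list[str]:
    """Split the sequence into triplets, starting at offset 0, 1 or 2."""
    return [dna[i:i + 3] for i in range(frame, len(dna) - 2, 3)]


def first_stop(dna: str, frame: int):
    """Position (in codons) of the first stop codon in a frame, or None."""
    for position, codon in enumerate(codons_in_frame(dna, frame)):
        if codon in STOP_CODONS:
            return position
    return None


sequence = "ATGAGTGATAAC"  # made-up sequence
for frame in range(3):
    print(frame, codons_in_frame(sequence, frame), first_stop(sequence, frame))
```

In this example frame 0 reads straight through, while frames 1 and 2 each meet a stop codon, which is why the same stretch of DNA can carry different messages depending on the frame in which it is read.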
In cellular genes, a DNA sequence usually is read in only one of the three potential reading frames. In some viral and mitochondrial genes, however, there is an overlap between two adjacent genes that are read in different reading frames. The distance of overlap is usually relatively short so that most of the sequence representing the proteins retains a unique coding function. In some cases, a single gene may generate a variety of mRNA products that differ in their content of exons. The difference may be that certain exons are optional i.e. they may be included or spliced out.
Alternatively, one or the other of two exons may be included, but not both. The alternative forms produce proteins in which one part is common and the other part is different; in effect, one exon is substituted for another. In Fig. 2.6, the proteins produced by the two mRNAs contain sequences that overlap extensively but differ within the alternatively spliced region.
The 3′ half of the troponin T gene of rat muscle contains five exons, but only four are used to construct an individual mRNA. Three exons, W, X and Z, are the same in both expression patterns. However, in one pattern the α exon is spliced between X and Z, and in the other pattern the β exon is used. The α and β forms of troponin T therefore differ in the sequence of amino acids present between sequences W and Z, depending on which of the alternative exons is used. Either exon can be used to form an individual mRNA, but both cannot be used in the same mRNA.
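The troponin T pattern, with shared exons and one mutually exclusive pair, can be written out as a small sketch. Exons are represented only by their names as given in the text. The function name and its parameters are hypothetical.

```python
def exclusive_isoforms(before: list[str], exclusive: list[str],
                       after: list[str]) -> list[list[str]]:
    """Build one exon order per member of a mutually exclusive exon group."""
    return [before + [choice] + after for choice in exclusive]


# W, X and Z are shared; exactly one of alpha or beta sits between X and Z.
troponin_t_forms = exclusive_isoforms(["W", "X"], ["alpha", "beta"], ["Z"])
for exon_order in troponin_t_forms:
    print("-".join(exon_order))
```

Each mRNA uses four of the five exons, and no isoform contains both the α and the β exon.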
One benefit of overlapping genes is that they enable the production of more proteins from a given region of DNA than would be possible if the genes were arranged sequentially. Indeed, for the bacteriophage PhiX174, overlapping of genes is necessary. The amount of DNA present in the circular, single-stranded DNA genome of this virus is not sufficient to encode the eleven bacteriophage proteins if transcription occurs in a linear fashion, one gene after another.
The genome economy afforded by overlapping genes extends to the human genome. Early analyses of the human genome sequence suggested between 30,000 and 70,000 genes, although later annotations place the number of protein-coding genes closer to 20,000. Yet evidence suggests that the human genome encodes 100,000 to 200,000 proteins. At least part of the information for the extra proteins may come from the presence of hitherto undiscovered overlapping genes, although more may come from alternative splicing of exons in a single gene.
3. Multi-Gene Families:
Eukaryotic genomes contain related groups of genes. These related gene groups, consisting of two or more genes with similar or identical DNA sequences, are called gene families. Gene families, such as the genes encoding rRNA or the histone proteins, have descended by duplication and divergence from common ancestral genes. The DNA sequence similarity within a family can range from identical, or nearly identical, to quite different.
In fact, within a family, some sequences may share only 50% DNA sequence identity yet still be similar enough to have clearly evolved from a common ancestral gene. Most members of a gene family are clustered in close chromosomal proximity to one another; however, some are located on different chromosomes. These dispersed gene family members are presumed to have been translocated to their different locations subsequent to, or possibly during, the process of gene duplication.
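Sequence identity between two family members is simply the share of aligned positions that match. The sketch below assumes that the two sequences have already been aligned, have equal length and contain no gaps. The function name and both sequences are hypothetical.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percentage of positions at which two pre-aligned sequences agree."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(seq_a.upper(), seq_b.upper()))
    return 100.0 * matches / len(seq_a)


# Two made-up, pre-aligned sequences that differ at three of ten positions.
print(percent_identity("ATGGCCAAAT", "ATGGACTAAA"))  # 70.0
```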
Generally, members of a gene family have the same or related functions. For example, all the members of the mammalian hemoglobin gene family encode proteins whose work is to carry oxygen. However, even when members of a gene family have the same basic function, they are not always expressed at the same time during development.
Different members may be expressed at different life stages and/or in different tissues, reflecting the fact that evolutionary divergence has occurred at the level of gene regulation. For example, some members of the mammalian hemoglobin gene family are expressed in adults, whereas others are expressed only at the fetal stage of development.
In some cases, the DNA sequences of some members of a gene family may diverge over time to the point that the encoded protein acquires a new function. For example, the lactalbumin gene, which encodes a subunit of the enzyme that catalyzes the synthesis of lactose, is in the same family as the gene encoding lysozyme, an enzyme that degrades the polysaccharide component of certain bacterial cell walls. These two enzymes do retain a functional similarity, however: they both act on carbohydrates.
4. Split Genes:
In some genes, the coding sequence is interrupted by non-coding (untranslated) sequences known as introns. Such genes are known as split genes, and the parts of these genes that are translated are known as exons. Split genes are rare in prokaryotes, although they are more common in archaebacteria than in eubacteria. They are common in eukaryotes, where the number of such genes, and the number and size of introns per gene, increase with genome complexity. For example, the chicken collagen gene has over 50 exons, the human dystrophin gene has 78 introns, and the Dscam gene in Drosophila has over 100 introns.
Furthermore, in these organisms the introns are much larger than the exons. The dystrophin gene is the most extreme known example of this: the gene has a size of 2.5 Mb but the coding sequence is only 14 kb in length. The longest human intron is 480 kb, which is similar in size to the smallest bacterial genomes.
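The dystrophin figures quoted above imply that only a tiny fraction of the gene is coding sequence. A minimal calculation, using the sizes from the text and a hypothetical function name, is:

```python
def coding_fraction(coding_bp: int, gene_bp: int) -> float:
    """Fraction of a gene's length that is coding sequence."""
    if gene_bp <= 0:
        raise ValueError("gene length must be positive")
    return coding_bp / gene_bp


DYSTROPHIN_GENE_BP = 2_500_000   # 2.5 Mb, as quoted in the text
DYSTROPHIN_CODING_BP = 14_000    # 14 kb, as quoted in the text

fraction = coding_fraction(DYSTROPHIN_CODING_BP, DYSTROPHIN_GENE_BP)
print(f"{fraction:.2%} of the dystrophin gene is coding sequence")
```

On these numbers, well under 1% of the gene is coding sequence, and the remainder is almost entirely intron.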
In genes that are related by evolution the exons are of similar size, although the genes themselves may differ greatly in length. This means that the introns must be in the same positions but can be of different sizes. Furthermore, if a split gene has been cloned, it is possible to sub-clone either the exon or the intron sequences. If these clones are used as probes in genomic Southern blots, one can determine whether the same sequences are present elsewhere in the genome.
Often the exon sequences of one gene are found to be related to sequences in one or more other genes. Multiple copies of an exon also may be found in several apparently unrelated genes.
Exons that are shared by several unrelated genes are likely to encode polypeptide regions (domains) that provide the unrelated proteins with related properties, e.g., ATP or DNA binding. Some genes appear to be mosaics that were constructed by patching together copies of individual exons recruited from different genes, a phenomenon known as exon shuffling.
There is a second degree of complexity originating from split genes. After split genes are transcribed, the introns are excised and the exons are spliced together by the spliceosome, a complex of snRNAs and some 145 different proteins. As the number of introns within a gene increases, it becomes possible for the pre-mRNA (unspliced messenger RNA) to be spliced in different ways.
This is known as alternative splicing, and it provides a mechanism for producing a wide variety of proteins from a small number of genes. For example, Drosophila’s Dscam gene contains 108 exons, and alternative splicing theoretically could generate 38,016 different proteins.
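The figure of 38,016 is a product of independent choices. The text does not give the breakdown, so the sketch below assumes the commonly cited organization of Dscam: four clusters of mutually exclusive alternative exons (exons 4, 6, 9 and 17) with 12, 48, 33 and 2 variants respectively, one of which is chosen per cluster in each mRNA. The dictionary and function names are hypothetical.

```python
from math import prod

# Assumed variant counts per alternative-exon cluster (not from this article).
DSCAM_ALTERNATIVE_EXONS = {"exon 4": 12, "exon 6": 48, "exon 9": 33, "exon 17": 2}


def isoform_count(variants_per_cluster: dict[str, int]) -> int:
    """One variant is chosen per cluster, so the choices multiply."""
    return prod(variants_per_cluster.values())


print(isoform_count(DSCAM_ALTERNATIVE_EXONS))  # 38016
```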
5. Cryptic Genes:
Cryptic genes are phenotypically silent DNA sequences that are not normally expressed during the life cycle of an organism. However, in a few individuals of a population they may be reactivated by genetic mechanisms such as mutation, recombination and transposition. Since cryptic genes are not expressed, and thus do not contribute to fitness, it is expected that they would be permanently inactivated by accumulated mutations, and would thus be rare in populations.
Some scientists have shown that, in yeast, gene function can be predicted accurately by examining the levels of various known metabolites. Metabolite concentrations can be determined comprehensively using NMR together with sophisticated statistical methods. When the metabolic profile of yeast carrying silent genes is compared with that of normal yeast, the presence of cryptic genes can be detected.
The observation that cryptic genes are commonly found in microorganisms indicates that there is selection for their preservation in microbial populations. Cryptic genes differ from other silent DNA sequences, collectively referred to as ‘selfish DNA’, which spread in the genome by making additional copies of themselves without conferring any phenotypic advantage. They also differ from pseudo-genes, which are homologous copies of active genes (often derived from mRNA by reverse transcription) but may carry so many mutations that their reversion to an active form is not possible.
An example of a cryptic gene is found in E. coli K12: the wild-type strain cannot grow on any β-glucoside sugar as its sole carbon and energy source. Mutations in several loci can activate the cryptic bgl operon and allow growth on the aryl β-glucosides arbutin and salicin.