Genome comparisons are behind the powerful new annotation methods being developed to find all human genes, as well as genes from other genomes. Genomes are now frequently being studied in pairs to provide cross-comparison datasets. This 'Noah's Ark' approach often reveals unsuspected genes and may support the deletion of false-positive predictions. Joining mouse and human, the first two mammals to form a cross-comparison dataset, are: two Drosophila species, D. melanogaster and D. pseudoobscura; two sea squirts, Ciona intestinalis and Ciona savignyi; four yeast (Saccharomyces) species; two nematodes, Caenorhabditis elegans and Caenorhabditis briggsae; and two pufferfish, Takifugu rubripes and Tetraodon nigroviridis. Even genomes like yeast and C. elegans, which have been known for more than five years, are now being significantly improved. Methods developed for yeast or nematodes will now be applied to mouse and human, and soon to additional mammals such as rat and dog, to identify all the mammalian protein-coding genes. Current large disparities between human Unigene predictions (127,835 genes) and gene-scanning methods (45,000 genes) still need to be resolved. This will be the challenge during the next few years.
Keywords: human genome; mouse genome; Caenorhabditis elegans genome; Caenorhabditis briggsae genome; Saccharomyces genomes; comparative genomics; gene discovery; gene-prediction algorithms
Introduction and background
The monumental sequence of a composite human genome conjures up images of Arthur C. Clarke's monolith in the film 2001: A Space Odyssey -- a beautiful, awe-inspiring structure with a hidden message. Researchers' ignorance is laid bare by the simple fact that they cannot, with any confidence, extract from this (human genome) structure the total number of its genes. What is needed is a Carl Sagan (SETI, Contact), an Alan Turing (WWII code breaker) or a Jean-François Champollion (Rosetta stone decoder) to break the codes. Or perhaps, what is really needed is a Rosetta stone for genomes: just two or three translations of the same message, laid side by side. Unfortunately, there is not even one full translation available. James Watson put it this way in a 1992 interview: 'The goal of the Human Genome Project is to understand the genetic instructions for human beings ... Getting the instructions is a big job; understanding those instructions can consume many hundreds of years ...'.
In December 1999, an analysis of the human chromosome (Chr) 22 sequence was published; 545 protein-coding genes and 134 pseudogenes were identified. In January 2003, a reanalysis of the Chr 22 sequence by the same group reported 546 protein-coding genes and 234 pseudogenes, with an increase of 74 per cent in the total length of exons in the annotation. A third, microarray-based, study doubled the number of Chr 22 base pairs in transcribed sequences. The National Center for Biotechnology Information (NCBI) human genome map-viewer build 34 version 1 (Nov 2003) has 673 genes on Chr 22 and an unspecified number of pseudogenes. Since the true number of genes and pseudogenes has not changed in the past four years -- it is merely researchers' ability to detect them that has improved -- how many more genes will be found and how will they be detected?
Finding protein-coding genes
The best method for documenting genes is with a full-length cDNA. Even shorter expressed-sequence tags (ESTs), if not from the same species then from a closely related species, are useful. The EST database dbEST (21st November, 2003) lists 5,427,257 Homo sapiens ESTs and 3,948,029 Mus musculus ESTs. The Unigene database clusters these ESTs into unique contigs representing 127,835 human (build 163) and 93,645 mouse transcripts. The human number is similar to the TIGR Gene Index prediction of 120,000 genes in humans.
According to the NCBI Handbook 2003, Unigene clusters may contain more than one alternative-splice form. Furthermore, Unigene clusters are required to have evidence of a 3' terminus, to avoid forming two or more clusters from a single long gene; this restriction prevents some ESTs in dbEST from being included in Unigene. The logical interpretation of these facts is that 'each Unigene cluster contains sequences that represent a unique gene'.
This leaves researchers with a problem. Conservative gene annotation of the human genome only identified 25,642 genes. More relaxed estimates predict about 40,000 to 45,000 genes; yet, these numbers are about threefold lower than the Unigene cluster count. At some point, these values should converge on the true number of genes -- defined as full-length, expressed messages from any cell type at any time, from germ cells to embryo to adult. Currently, this point is some way away.
Under the comparative genomics approach, the mouse genome is supposed to compensate for this weakness in finding genes in the human genome. By comparing mouse and human genomic sequences, all orthologous genes and many paralogous genes should be detectable, exon by exon. Preliminary efforts with small sets of known genes were highly successful. The ROSETTA program identified 94 per cent of internal coding exons from 117 mouse-human orthologous gene pairs perfectly at both exon ends, and another 4 per cent at one of the two ends. It did less well for initial and terminal coding exons.
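The size of the gap can be checked with simple arithmetic. The sketch below uses only the counts quoted above; the labels and the helper name `fold_excess` are illustrative choices and do not come from any of the cited databases.

```python
# Counts quoted in the text; labels are illustrative only.
UNIGENE_HUMAN_CLUSTERS = 127_835
GENE_ESTIMATES = {
    "conservative annotation": 25_642,
    "relaxed estimate (low)": 40_000,
    "relaxed estimate (high)": 45_000,
}


def fold_excess(cluster_count, gene_count):
    """Return how many times larger the cluster count is than the gene count."""
    return cluster_count / gene_count


# Ratio of Unigene clusters to each gene-count estimate, rounded to 2 places.
ratios = {
    label: round(fold_excess(UNIGENE_HUMAN_CLUSTERS, count), 2)
    for label, count in GENE_ESTIMATES.items()
}

for label, ratio in ratios.items():
    print(f"{label}: Unigene count is {ratio}-fold higher")
```

The relaxed estimates give ratios close to three, matching the 'about threefold' figure, whereas the conservative annotation is nearly fivefold below the Unigene count.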
Including conserved sequence elements
We now find the problem grows more complex, however, because there are thousands of non-expressed conserved sequence elements (CSEs) in the two mammals, sequences whose function we do not understand. Some are possibly promoter regions, some pseudogenes or RNA genes and some are new undocumented genes, but it is clear that these categories do not account for all of the CSEs. Thus, the comparative genomics approach may over-predict when comparing two mammals, since they may be phylogenetically too close. The distance between species for optimal gene identification has been studied, and mouse-human is generally good, but a mammal more distant than mouse from human might be even better.
An alternative approach has been to use fish as a more distant relative. The EXOFISH program compared human and Tetraodon nigroviridis (freshwater pufferfish) for conserved regions (presumably exons) and found 28,000-34,000 genes. Due to the greater evolutionary distance between human and fish, there is a cleaner background, but the many mammal-specific genes and human brain-specific genes may not be identified, so the gene number predicted by EXOFISH is almost certainly an underestimate.
Another approach is exemplified by the analysis of sequences from 12 species, all derived from a 1.8 Mb region orthologous to a human Chr 7 segment containing ten genes. In this instance, coding exons were already well documented, but substantial numbers of CSEs -- beyond those previously identified experimentally -- were discovered. This approach might be more fruitful for human gene discovery if applied to areas of the human genome that are more poorly characterised than the Chr 7 segment chosen.
Whereas ~1.5 per cent of the human genome comprises protein-coding genes, another ~3.5 per cent of the genome contains CSEs that are more conserved than protein-coding-gene regions. Possible functions for these CSEs (termed CNGs by Dermitzakis et al. and CNSs by Inada et al.) include control regions that: (a) regulate gene expression; (b) govern developmental-, cell type- and organ-specific expression, in trans, of genes located far away; (c) lock in regulatory decisions; and (d) act as structural components of chromosomes when alignment and chromosome movement occurs during meiosis or mitosis. There appear to be at least twice as many CSEs as protein-coding genes in the genome. A recent comparison of 43 species -- including vertebrates, insects, worms, plants, fungi, yeast, eubacteria and archaebacteria -- revealed noteworthy increases in genome size and complexity from prokaryote to mammals, again emphasising the innumerable highly-conserved CSEs that are likely to have essential functions and critical effects on an organism's phenotype.
Learning from the worms
Nevertheless, the comparative approach is very powerful -- as illustrated by the recent comparative genomics study of Caenorhabditis elegans and Caenorhabditis briggsae. This study increased the signal-to-noise ratio by using four gene-prediction algorithms on each genome, comparing results between genomes and selecting the most informative dataset (Figure 1). The power of this method was demonstrated by the prediction of 1,275 new genes in C. elegans that had not been detected in the previous five years of annotation on this genome (Table 1). Huge numbers of previously predicted C. elegans genes were also revised, due to the identification of many new exons, based on these novel findings.
Figure 1. Use of the four gene-prediction algorithms to characterise the acy-4 gene in both worm genomes. Of the 12 possible combinations of predictions, the Genefinder-Genefinder prediction pair was chosen as the best model, because this pair showed the greatest similarity to each other, excluding terminal exons. Coding sequence conservation between the two has provided evidence for as many as 12 additional N-terminal exons in the Genefinder Caenorhabditis elegans acy-4 prediction, compared with that of T01C2.1, the WormBase WS77 C. elegans acy-4 prediction. Four of the additional N-terminal exons (those marked with asterisks) that were predicted by FGENESH and Genefinder have subsequently been confirmed by new EST data (modified from Ref. )
Table 1. Several comparisons of Caenorhabditis briggsae, Caenorhabditis elegans WS77* and Caenorhabditis elegans hybrid
Different algorithms for predicting protein-coding genes give similar results in predicting exons but tend to disagree on the grouping of exons into genes. Four different gene-prediction programs can give four very different answers across the same region of a genome. Stein et al. used the concordance of prediction between C. elegans and C. briggsae to predict the most likely gene model -- using Genefinder (version 980506, P. Green, unpublished data, 2003; see also Ref. ), FGENESH, TWINSCAN and the Ensembl annotation pipeline. The output of the four gene-prediction programs (Figure 1) was largely concordant with respect to the position of C. briggsae exons (80 per cent of exons predicted identically by two or more programs; 26 per cent predicted identically by all four programs), but discordant with regard to gene predictions (38 per cent of genes called identically by two or more programs; just 4 per cent called identically by all four programs). A similar pattern was seen in C. elegans.
Stein et al. termed the gene sets produced by their analysis 'hybrid gene sets', because the final gene sets are a mixture of gene predictions from multiple programs; applying a transposon- and pseudogene-filtering step to the WormBase WS77 set, they removed 619 genes to create a 'pruned' WS77 set, termed WS77*. The constitution of the final gene sets was: C. briggsae, 19,507 genes; the C. elegans WS77*, 18,808 genes; and the hybrid C. elegans, 20,621 genes (Table 1).
Stein et al. compared the C. elegans hybrid gene set (20,621 genes) to the WS77* set (18,808 genes) derived from WormBase and derived 1,275 well-supported suggestions for new C. elegans genes, 1,763 new exons in 1,100 existing genes, 2,093 exon deletions in 1,583 genes, 1,675 exon truncations in 1,502 existing genes and 1,115 exon extensions in 1,008 existing genes. These data underscore the value of comparative genomics between total-genome sequences from two species in establishing a more accurate count of protein-coding genes.
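Concordance figures of this kind can be computed by counting how many programs report exactly the same exon coordinates. The following is a minimal sketch under stated assumptions: exons are represented as (start, end) tuples, the coordinates are invented toy data, and the denominator is the set of distinct exons predicted by any program, which may not be the denominator used by Stein et al.

```python
from collections import Counter


def exon_concordance(predictions):
    """Summarise agreement between programs on exact exon coordinates.

    predictions: dict mapping program name -> set of (start, end) tuples.
    Returns the number of distinct exons and the fractions predicted
    identically by two or more programs and by every program.
    """
    support = Counter()
    for exons in predictions.values():
        support.update(set(exons))  # each program counts once per exon

    total = len(support)
    if total == 0:
        return {"total": 0, "two_or_more": 0.0, "all": 0.0}

    n_programs = len(predictions)
    two_or_more = sum(1 for n in support.values() if n >= 2)
    by_all = sum(1 for n in support.values() if n == n_programs)
    return {
        "total": total,
        "two_or_more": two_or_more / total,
        "all": by_all / total,
    }


# Invented coordinates, for illustration only.
toy = {
    "Genefinder": {(100, 200), (300, 400), (500, 600)},
    "FGENESH": {(100, 200), (300, 400), (520, 600)},
    "TWINSCAN": {(100, 200), (300, 410), (500, 600)},
    "Ensembl": {(100, 200), (300, 400), (700, 800)},
}
result = exon_concordance(toy)
print(result)
```

The same counting applied to whole gene models (for example, a tuple of all exon coordinates per gene) would be expected to give lower agreement, since one shifted exon boundary is enough to make two gene calls differ.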
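The idea of assembling a hybrid set, keeping at each locus the prediction pair that agrees best across the two genomes, can be sketched as follows. This is an illustration rather than the published procedure: `difflib.SequenceMatcher` stands in for a real protein-alignment score, the peptide strings are made up, and the function name `best_prediction_pair` is hypothetical.

```python
from difflib import SequenceMatcher
from itertools import product


def best_prediction_pair(preds_a, preds_b):
    """Return (program_a, program_b, score) for the most similar pairing.

    preds_a, preds_b: dicts mapping program name -> predicted protein
    sequence for the same locus in genome A and genome B.
    """
    best = None
    for (prog_a, seq_a), (prog_b, seq_b) in product(
        preds_a.items(), preds_b.items()
    ):
        # Stand-in similarity score; a real pipeline would use an alignment.
        score = SequenceMatcher(None, seq_a, seq_b, autojunk=False).ratio()
        if best is None or score > best[2]:
            best = (prog_a, prog_b, score)
    return best


# Made-up peptide fragments, for illustration only.
genome_a = {"Genefinder": "MKTLLVAGSW", "FGENESH": "MKTLLV"}
genome_b = {"Genefinder": "MKTLIVAGSW", "TWINSCAN": "MKSLLVAG"}

winner = best_prediction_pair(genome_a, genome_b)
print(winner[0], winner[1])
```

In this toy case the two longest, most similar predictions are paired; the shorter models, which would correspond to predictions missing exons, score lower against every partner.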
Comparing C. elegans/C. briggsae divergence and mouse/human divergence
The two worms diverged ~100 million years ago (MYA) and the two mammals diverged ~75 MYA. Similar levels of amino acid identity exist between C. briggsae and C. elegans orthologues (80 per cent) and between mouse and human orthologues (78.5 per cent). In the mouse/human comparison, 80 per cent of predicted proteins can be assigned to a 1:1 orthologue pair, whereas <65 per cent of C. briggsae genes could be assigned a C. elegans orthologue. The protein families are thus more dynamic in the two nematodes -- several hundred either being novel or having diverged so far that their common origin cannot be recognised, and another ~200 having expanded or contracted by more than twofold. The C. briggsae/C. elegans pair is also evolving more rapidly at the nucleotide level: 1.78 synonymous substitutions per synonymous site, compared with 0.6 in the mouse/human pair.
Many of these striking differences between the two worms and the two mammals can probably be explained by differences in generation time. The generation time in the nematodes is ~3 days, compared with ~3 months and ~20 years for the mouse and human, respectively.
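A rough calculation shows how large the generation-time effect could be. The sketch below assumes, as a deliberate simplification, that each lineage kept its present-day generation time constant since divergence; the divergence and generation times are the approximate values given above, and the helper name is illustrative.

```python
DAYS_PER_YEAR = 365.25


def generations_elapsed(divergence_years, generation_time_years):
    """Approximate generations along one lineage since divergence."""
    return divergence_years / generation_time_years


# (time since divergence in years, generation time in years)
lineages = {
    "nematode": (100e6, 3 / DAYS_PER_YEAR),  # ~3 days
    "mouse": (75e6, 0.25),                   # ~3 months
    "human": (75e6, 20.0),                   # ~20 years
}

generations = {
    name: generations_elapsed(years, gen_time)
    for name, (years, gen_time) in lineages.items()
}

for name, count in generations.items():
    print(f"{name}: ~{count:.2e} generations")
```

Under these assumptions a nematode lineage has passed through roughly 40 times as many generations as the mouse lineage and several thousand times as many as the human lineage, which is consistent in direction with the higher synonymous substitution rate reported for the worm pair.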
Approaching a stable gene count in yeast: Hope for mammals
Improved annotation does not always increase gene number. Detailed comparison of four Saccharomyces species resulted in revision of 15 per cent of known yeast genes and a net decrease in the S. cerevisiae gene count of about 500; this is a case where 'less is more'. This illustrates the power of adding more closely related sequences to the analysis, especially since the yeast genome had been known for seven years prior to this analysis.
Tremendous progress has been made in the eight years since the baker's yeast genome sequence appeared. There is still a large gap, however, between gene predictions and Unigene clusters. This must be accounted for by improvement of comparative genomics methods such as: (a) using the ROSETTA program to include three or more species; (b) obtaining more comprehensive EST collections from mouse, rat, human and other species, possibly by purchase of these resources from private companies that have already amassed the information; and/or (c) utilising consensus prediction methods, as was done in the C. elegans/C. briggsae study. Special attention will need to be given to the first- and last-exon predictions, as well as allowance of non-canonical intron-exon boundaries (GC versus GT, etc) -- if supported by EST data. Verification of predictions by reverse transcriptase polymerase chain reaction, as was demonstrated in the study by Guigo et al., will confirm the expression of questionable genes and enhance genome annotation. One can only hope that Dr Watson's prediction of 12 years ago was a slight exaggeration.
The writing of this article was funded, in part, by NIH grant P30 ES06096 (D.W.N.). The authors very much appreciate assistance with the graphics from Dr Marian Miller.