The $3 billion project was founded in 1990 by the U.S. Department of Energy and the U.S. National Institutes of Health, and was expected to take 15 years. Owing to widespread international cooperation, advances in the field of genomics (especially in sequence analysis), and huge advances in computing technology, a rough draft of the genome was finished in 2000 (announced by U.S. President Bill Clinton on June 26, 2000), two years earlier than planned. The full, high-quality genome is still being sequenced and is expected to be released in 2003.
Another reason for the accelerated work was competition from a commercially financed genome project at Celera Genomics, which used a new method called shotgun sequencing. Celera also planned to patent the genes it found, whereas the gene sequences produced by the original government-funded HGP are available without cost.
Although the working draft was announced in June 2000, it was not until February 2001 that Celera and the HGP scientists published actual details of the draft. Special issues of Nature and Science contained the working draft together with an analysis of it. The draft is hoped to provide a 'scaffold' covering about 90% of the genome, upon which the remaining gaps can be closed.
Each draft sequence has been checked at least four to five times to increase its 'depth of coverage', and with it the accuracy. Approximately 47% of the draft consisted of high-quality sequence. The final version will have been checked eight to nine times, giving an error rate of just 1 in 10,000 bases.
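The arithmetic behind these figures can be sketched as follows. This is a minimal illustration only: it assumes a genome of 3 billion bases (the figure used in this article) and a uniform error rate, and the function names and the 24-billion-base read total are hypothetical values chosen for the example, not project data.

```python
GENOME_SIZE = 3_000_000_000  # approximate number of bases, as quoted in this article


def expected_errors(genome_size: int, bases_per_error: int) -> float:
    """Expected number of wrong bases if one base in `bases_per_error` is wrong."""
    return genome_size / bases_per_error


def depth_of_coverage(total_bases_read: int, genome_size: int) -> float:
    """Average number of times each base of the genome has been read."""
    return total_bases_read / genome_size


if __name__ == "__main__":
    # An error rate of 1 in 10,000 still leaves many errors in a genome this large.
    print(expected_errors(GENOME_SIZE, 10_000))
    # Hypothetical read total, chosen so that each base is read eight times on average.
    print(depth_of_coverage(24_000_000_000, GENOME_SIZE))
```

Under these assumptions, even the final error rate corresponds to several hundred thousand uncertain bases across the whole genome, which is why repeated checking matters.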
The Human Genome Project is one of a number of international genome projects in biology, each aimed at sequencing the DNA of a specific organism. While the human DNA sequence offers the most tangible benefits, important developments in biology and medicine are also predicted to result from the sequencing of model organisms, including mice, fruit flies, zebrafish, yeast, nematodes and many microbial organisms and parasites.
The goals of the original HGP were not only to determine all 3 billion base pairs in the human genome with a minimal error rate, but also to identify all the genes in this vast amount of data. This part of the project is still ongoing, although a preliminary count indicates about 30,000 genes in the human genome, far fewer than most scientists had predicted.
Today, the human DNA sequence is stored in databases and is available to everyone on the Internet. Computer programs are being developed to analyse the data, because the data itself is next to useless without interpretation.
The process of identifying the boundaries of genes and other features in raw DNA sequence is called annotation and is the domain of bioinformatics. While expert biologists make the best annotators, manual annotation proceeds slowly, and computer programs are increasingly used to meet the high-throughput demands of genome sequencing projects. The best current technologies for annotation make use of statistical models that take advantage of parallels between DNA sequences and human language, using concepts from computer science such as formal grammars.
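To give a flavour of what automated annotation involves, the sketch below scans a DNA string for open reading frames, that is, stretches that begin with a start codon (ATG) and run in the same reading frame to a stop codon. This is a deliberately simple stand-in and not the statistical, grammar-based approach described above: it looks only at the forward strand, ignores introns, and the function name and the `min_codons` parameter are assumptions made for the example.

```python
START_CODON = "ATG"
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 3) -> list[tuple[int, int]]:
    """Return (start, end) positions of open reading frames on the forward strand.

    Positions are 0-based, `end` is exclusive and includes the stop codon.
    Only frames with at least `min_codons` codons (start and stop included) are kept.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):  # the three possible forward reading frames
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == START_CODON:
                start = i  # open a candidate gene region
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # close it and keep scanning this frame
    return sorted(orfs)


if __name__ == "__main__":
    example = "CCATGAAATGACC"  # made-up sequence, not real genomic data
    for begin, end in find_orfs(example):
        print(begin, end, example[begin:end])
```

Real gene finders must cope with both strands, interrupted genes and noisy signals, which is where the statistical models mentioned above come in.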
The work on automated interpretation of genome data has just begun. The knowledge gained from understanding the genome is hoped to boost the fields of medicine and biotechnology, eventually leading to cures for cancer, Alzheimer's disease and other diseases.
For example, a biological researcher investigating a certain form of cancer may have narrowed down their search to a particular gene. By visiting the human genome database on the World Wide Web, this researcher can examine what other scientists have written about this gene, including (potentially) its three-dimensional structure, its function(s), its evolutionary relationships to other human genes or to genes in mice, yeast or fruit flies, possible detrimental mutations, interactions with other genes, the body tissues in which the gene is activated, and the diseases associated with it. The list of datatypes is long, which is one reason why bioinformatics is so challenging.
One particularly exciting technology arising from genomics is the microarray, an array of probes for simultaneously measuring the activity (expression level) of each of the 30,000+ human genes in a given sample. This has aroused great interest as a potential diagnostic tool for science and medicine. It seems likely that many more downstream technologies will emerge as a result of the Human Genome Project.
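One common way to read microarray measurements is to compare the level of each gene in a sample against a reference and express the difference as a log ratio. The sketch below assumes this kind of two-sample comparison; the gene names, intensity values, pseudocount and threshold are all hypothetical and chosen only to make the example easy to follow.

```python
import math


def log2_fold_changes(sample: dict[str, float],
                      reference: dict[str, float],
                      pseudocount: float = 1.0) -> dict[str, float]:
    """Log2 ratio of sample to reference level for every gene present in both.

    The pseudocount avoids division by zero for genes with no measured signal.
    """
    return {
        gene: math.log2((sample[gene] + pseudocount) / (reference[gene] + pseudocount))
        for gene in sample
        if gene in reference
    }


def flag_changed(fold_changes: dict[str, float], threshold: float = 1.0) -> list[str]:
    """Genes whose level differs at least 2**threshold-fold in either direction."""
    return sorted(g for g, fc in fold_changes.items() if abs(fc) >= threshold)


if __name__ == "__main__":
    # Made-up probe intensities for three hypothetical genes.
    tumour = {"GENE_A": 799.0, "GENE_B": 99.0, "GENE_C": 49.0}
    healthy = {"GENE_A": 99.0, "GENE_B": 99.0, "GENE_C": 199.0}
    changes = log2_fold_changes(tumour, healthy)
    print(changes)
    print(flag_changed(changes))
```

A positive value means the gene is more active in the sample than in the reference, and a negative value means it is less active. A diagnostic use would look for patterns of such changes across many genes at once.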
On a more philosophical level, the analysis of similarities between DNA sequences from different organisms is opening new avenues in the study of evolution. In many cases, evolutionary questions can now be framed in terms of molecular biology; indeed, many major evolutionary milestones (the emergence of the ribosome and organelles, the development of embryos with body plans, the vertebrate immune system) can be related to the molecular level. Many questions about the similarities and differences between humans and our closest relatives (the primates, and indeed the other mammals) are expected to be illuminated by the data from this project.
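The simplest measure of similarity between two DNA sequences is the percentage of positions at which they agree once they have been lined up. The sketch below assumes the two sequences are already aligned to equal length, with '-' marking gaps; producing the alignment is a separate and harder problem. The function name and the example sequences are invented for illustration.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percentage of aligned, non-gap positions at which two sequences match."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue  # skip positions where either sequence has a gap
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


if __name__ == "__main__":
    # Made-up aligned fragments, not real genomic data.
    print(percent_identity("ACGTACGT", "ACGTTCGT"))
```

Comparing such scores across many genes and many species is one of the basic ingredients in reconstructing evolutionary relationships from sequence data.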