Bioinformaticians worldwide owe a debt to David Haussler for his significant contributions to the design of algorithms they use every day. In this, he is like his two immediate predecessors as Senior Scientist Accomplishment Award winners, Temple D. Smith and Michael S. Waterman, the eponymous joint authors of the Smith–Waterman algorithm for local alignment of DNA or protein sequence fragments. “Haussler's group was one of the pioneers of machine learning in bioinformatics, introducing Hidden Markov Models for the statistical analysis of patterns in biological data,” says Brunak. However, Haussler's recent achievements have been more in the application of bioinformatics methods than in their development. Since 1999, he has been one of the principal figures in sequencing, and later analysing, the human genome and those of other mammals, and in mining this genomic information for insight into vertebrate evolutionary history.
Haussler originally trained as a mathematician, graduating magna cum laude from Connecticut College and obtaining prizes for mathematics at both Bachelor's and Master's levels. His first encounter with computational biology came in graduate school, at the University of Colorado at Boulder, where he had the good fortune to study for his Ph.D. under Andrzej Ehrenfeucht. “Andrzej is an extraordinary man,” says Haussler. “In our weekly research seminars, we would discuss topics ranging from dinosaur flight to abstract graph theory. He taught me that I should never be constrained by disciplinary boundaries, and never be frightened to tackle big problems. The word ‘bioinformatics’ didn't exist when I was a graduate student, but we were doing it.” Two of his fellow students, Gary Stormo and Gene Myers, have also gone on to have distinguished careers in the field. Stormo, now professor of genetics at Washington University in St. Louis, and Deputy Editor-in-Chief of PLoS Computational Biology, has made significant contributions to the study of DNA–protein interactions and the prediction of nucleic acid structure and function; Myers was one of the inventors of the BLAST program, a key innovator in shotgun sequencing, and a principal architect of Celera's draft sequence of the human genome.
Haussler's first years as an independent investigator were devoted to rather abstruse studies in pattern recognition and machine learning, focusing on modelling the way the brain learns. He only shifted from computational neuroscience back to bioinformatics when Anders Krogh joined him at Santa Cruz as a post-doc. Characteristically, Haussler underestimates his own role in their joint achievements. “Anders was an exceptional post-doc, who has gone on to have an exceptional career as an independent scientist. He came to my lab to work on machine learning, but soon discovered that these methods could be applied to biological sequence analysis, to classifying proteins into families and recognising genes in fragments of DNA.” Krogh is co-author of the acclaimed and popular textbook Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Other members of Haussler's group applied machine learning techniques to the classification of microarray data, including the development of one of the first expression-based methods for distinguishing tumor from normal cells.
Late in 1999, a phone call changed the direction of Haussler's research. “I was called by Eric Lander, one of the leaders of the public human genome sequencing project, and asked to apply my HMM methodology to identifying the genes in the then newly sequenced human DNA,” he explains. At that time, the public project was in a “full-on race” with Celera to publish an initial working draft of the sequence. Haussler joined the international public effort, rapidly recruiting a team of talented young bioinformaticians that included Jim Kent, winner of ISCB's 2003 Overton Prize.
Barely six months after Haussler joined the project, both teams, the publicly funded one and Celera's, were ready to release their first genome drafts into the public domain. Haussler well recalls July 7, 2000, when the complete draft genome sequence was posted on the University of California, Santa Cruz's Web server. “Seeing the waterfall of As, Gs, Cs, and Ts pouring off our server was an emotional moment,” he says. “We were witnessing the product of more than three billion years of evolution, sequences passed down from the beginning of life to present-day humans.” This excitement was shared by the worldwide scientific community; Internet traffic on the Santa Cruz server reached 0.5 terabytes per day at the time, a record that still stands.
Raw DNA sequence, however, is not much use on its own, and Haussler has dedicated the first years of the new millennium to mapping and analysing that sequence. The first release of Santa Cruz's genome browser went online shortly after the human sequence was released, and it now includes twenty complete vertebrate genome sequences, plus those of a few representative invertebrates. “The publication of the second vertebrate genome, that of the mouse, gave us the first real sequence-based insights into the mechanisms of vertebrate evolution,” he says. “And we could also use evolutionary theory and sequence analysis to answer a central question: how much of the mammalian genome is ‘junk’?” Assuming that fewer inter-species substitutions are found in functional DNA than in non-functional DNA, Haussler's team in the mouse genomics consortium were able to estimate that at least 5% of a mammalian genome is functionally important. This value has been confirmed as more complete sequences have emerged. “We may think that 5% is a small value, but it is particularly interesting in that less than 1.5% of the genome codes for proteins. There is still a question over the function of much of the 3.5% that is conserved but does not form protein-coding genes.” Other questions that have attracted Haussler's attention include the analysis of hyper-conserved DNA sequences that remain virtually unchanged in divergent species, and the genetic changes that distinguish humans from apes. While most researchers in this field have concentrated on gene gain during evolution, Haussler and his team recently identified twenty-six genes that are well-established in the vertebrate lineage but that were lost in the later stages of human evolution.