by J. William Bell
Natalia Maltsev calls bioinformatics a "science of big numbers." Instead of focusing on a single cell, protein, or organism in the lab or using computer simulation, bioinformatics looks at numerous organisms computationally. It is the search for similarities and differences in hundreds of thousands of genome sequences, protein structures, and other features of biological systems and particles. When properly understood, these variations can show researchers the function of a given piece of the system.

Labs around the globe constantly churn out new sequences, running the gamut from the simplest virus to hugely complex organisms like ourselves. According to GOLD, an online guide to published genomic data, some 140 species' genomes have been completed and nearly 600 other organisms are currently being sequenced. In August 2002, GenBank, one of the main sequence databases, contained maps for some 22 billion nucleotide bases, the individual building blocks that make up a gene sequence.

These data are a boon for bioinformatics experts like Maltsev; they're the big numbers that make a science of big numbers possible. But working with them requires tedious search sessions and cumbersome analysis procedures. A new "analysis pipeline" that relies on grid computing promises to automate the genome-analysis process and make it much easier.

The Genome Analysis and Database Update system, or GADU, is being developed by Argonne National Laboratory's computational biology group. The team includes Maltsev, Dinanath Sulakhe, and Alex Rodriguez, a PhD student in the University of Illinois at Chicago's bioengineering department, and is part of the Alliance's data quest expedition. The data quest expedition builds tools for data-intensive applications, like those in bioinformatics, that run on Alliance and TeraGrid resources. Further help, support, and guidance come from throughout the Alliance by way of the scientific workspaces for the future expedition and the scientific portals expedition.

"The amount of data is increasing exponentially," says Maltsev. "It dictates a need to really be able to scale up the analysis capabilities… The Grid and distributed computing provide an ideal match for the type of problems that bioinformatics is facing."

After a series of test runs on the Alliance's Condor system at the University of Wisconsin and the Chiba City cluster at Argonne, GADU was fully put through its paces in April 2003. The application analyzed 59 microbial genomes in about a day. The run required submitting more than 10,000 jobs and represented a five-fold improvement in turnaround time, according to Maltsev. The runs were completed on the Department of Energy's Science Grid using NCSA network bandwidth and storage space.

Another run in June further solidified the system's value. The team compared 1.8 million protein sequences to one another using 200 processors on a cluster at Argonne. What would have taken about seven years to complete on a single desktop system was finished in about three days.

"This is a great success for the field of bioinformatics--one of the first examples of the discipline taking full advantage of a Grid-based system," says Dan Reed, director of NCSA and the Alliance. "One of the things we have learned over the life of the Alliance is that we're at our best when we form multidisciplinary teams and give those teams a clear mission as they focus on the deployment of technology. So this is also a great example of what the Alliance expeditions teams can do and how those teams can make contributions to a project from end to end."

Access Online | Posted 11-4-2003