Computerworld

Biotechnology: Writing the Book of Life

FRAMINGHAM (03/03/2000) - It was early spring in Cambridge, England, in 1953.

James Watson and Francis Crick were frantically racing against one of the world's most renowned researchers--Linus Pauling--to determine the chemical structure of DNA. As with many scientific discoveries, everything seemed to fall into place suddenly. "The brightly shining metal plates were then immediately used to make a model in which for the first time all the DNA components were present," Watson wrote in The Double Helix: A Personal Account of the Discovery of the Structure of DNA. "In about an hour I had arranged the atoms in positions which satisfied both the X-ray data and the laws of stereochemistry."

As you read Watson's elegant description, it's easy to overlook all of the painstaking research that went into the discovery. Yet as any scientist will quickly point out, developing sound theory relies on the meticulous acquisition and analysis of data--in most cases, huge amounts of it.

That's as true today as it was in 1953. But now scientists are tackling an even grander challenge than the one Watson and Crick faced. Today researchers all over the world are on a quest to unravel one of the greatest scientific mysteries of all time--the genetic code that makes each of us unique, also known as the human genome.

Two key players in this rapidly unfolding drama are the federally funded Human Genome Project and the private company Celera Genomics, and their success may be determined as much by IT as it is by science. Indeed, much of the work behind mapping the sequence of the human genome depends on developing an IT infrastructure that can acquire, analyze and store enormous amounts of data quickly and accurately.

The stakes are high. Researchers say that sequencing the human genome will revolutionize health care. Not only will scientists learn more about the origins of certain diseases and why certain people are predisposed to developing them, but pharmaceutical and biotechnology companies will be able to dramatically reduce the amount of time and money needed to develop new drugs.

The new drugs, in turn, should cause fewer side effects--a major benefit, given that an estimated 100,000 patients die in U.S. hospitals each year because of adverse reactions to their medications, according to a recent article in the Journal of the American Medical Association.

DIVING INTO THE DATA

Work on sequencing the human genome began in earnest in 1990, when the U.S. government launched the Human Genome Project. The goal was to deliver a map of the entire human genome within 15 years. The National Human Genome Research Institute (www.nhgri.nih.gov), which heads up the Human Genome Project for the National Institutes of Health, now says that a working draft will be available this spring. Meanwhile, scientists in six countries--in government organizations, universities and private corporations--are collaborating and tackling different pieces of the sequence.

The sheer volume of data is staggering: The human genome comprises an estimated 3 billion base pairs. (In case you've forgotten your genetics, DNA is a double helix whose two strands are joined by pairs of the bases adenine and thymine, and guanine and cytosine: A-T and G-C.) If you find it difficult to visualize that amount of data, imagine reading the base pairs out loud at a rate of three per second--without stopping. It would take you more than 10 years to recite all 3 billion pairs, according to the Human Genome Project.
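
The arithmetic behind that figure is easy to verify. Here is a minimal Python sketch, using the three-per-second pace quoted above:

```python
# Back-of-the-envelope check of the recitation figure quoted above.
BASE_PAIRS = 3_000_000_000          # estimated size of the human genome
RATE_PER_SECOND = 3                 # reading pace assumed above

seconds = BASE_PAIRS / RATE_PER_SECOND
years = seconds / (60 * 60 * 24 * 365)
print(f"about {years:.0f} years")   # about 32 years -- well over 10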

But before scientists can begin looking at the arrangement of the base pairs within the human genome, they need pieces of the genome. So every researcher begins with the same basic series of steps. Laboratories are set up with robots to automate the daily preparation of thousands of DNA samples obtained from anonymously donated specimens of blood and semen. Once the samples are readied, the data is extracted with sophisticated machines known as sequencers. The next step is to convert the data from analog to digital signals for processing by computers. The data is then cleaned up, compared with other known sequence data via standard search algorithms such as Blast (basic local alignment search tool) and then stored in a database.
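
In code, that pipeline might be outlined as follows. This is a toy Python sketch, not the software any genome center actually runs: every helper below is a simplified stand-in for a lab robot, a base caller, a quality filter or a real Blast installation.

```python
# Toy model of the daily sequencing pipeline described above.
# Every function is a hypothetical stand-in, not real lab software.

KNOWN_SEQUENCES = {"reference_clone": "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA"}

def base_call(trace):
    """Stand-in for analog-to-digital conversion: a real base caller
    interprets fluorescence traces; here the 'trace' is already letters."""
    return trace.strip().upper()

def clean_up(read, min_length=20):
    """Stand-in for quality trimming: reject reads that are too short."""
    return read if len(read) >= min_length else None

def blast_like_search(read, database):
    """Crude substring match standing in for a Blast comparison."""
    return [name for name, seq in database.items() if read in seq]

def run_pipeline(sample_id, trace):
    read = clean_up(base_call(trace))
    if read is None:
        return None                      # failed quality control
    hits = blast_like_search(read, KNOWN_SEQUENCES)
    return {"id": sample_id, "read": read, "hits": hits}  # stored in a database

print(run_pipeline("sample_001", "atggccattgtaatgggccgc"))
```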

Then comes one of the biggest challenges: reassembling the data to create a picture of the human genome. Since a sequencer can look at only about 500 bases at a time, the DNA must be assembled--in much the same way that you would try to reassemble numerous Sunday newspapers that have been shredded into thousands of tiny pieces.
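
The basic operation in that reassembly is detecting how much the tail of one fragment overlaps the head of another. A minimal Python sketch of that check (a toy illustration, not production assembly code):

```python
def overlap(a, b, min_length=3):
    """Length of the longest suffix of `a` that matches a prefix of `b`,
    or 0 if they share no overlap of at least `min_length` bases."""
    start = 0
    while True:
        start = a.find(b[:min_length], start)   # candidate start point in a
        if start == -1:
            return 0
        if b.startswith(a[start:]):             # rest of a matches start of b
            return len(a) - start
        start += 1                              # keep scanning a

print(overlap("TTACGT", "CGTACCGT"))  # 3: "CGT" ends one read, begins the next
```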

Researchers face several key challenges when it comes to working with the data.

First, there's the issue of managing it. Not only must researchers acquire and store the data being pulled from the sequencers, but they must also track data associated with each step of the process (for example, temperature, movement from team A to team B, etc.). It's not uncommon for a lab to process 80,000 samples of DNA each day; that alone translates into about 15GB of sequence data per day. Meanwhile, because the data has such enormous value, it's stored indefinitely. And since many of the applications needed to run a genomics center are either not available commercially or were not designed to handle the huge amounts of data required to map the human genome, IT staff must often customize publicly available software (e.g., Blast) or develop new applications in-house.
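
Those two figures imply a per-sample footprint that is easy to sanity-check; the sketch below simply divides the numbers quoted above:

```python
# How much data does each DNA sample contribute per day?
samples_per_day = 80_000
bytes_per_day = 15 * 10**9          # roughly 15GB of sequence data

per_sample = bytes_per_day / samples_per_day
print(f"about {per_sample / 1024:.0f}KB per sample")   # about 183KB
```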

Another challenge involves an issue that most CIOs know all too well: getting heterogeneous applications to talk to one another. Researchers must somehow integrate the applications that ship with the sequencers with their own applications.

BRAVE NEW WORLD

Scientists at MIT's Whitehead Institute for Biomedical Research (wi.mit.edu) are spearheading much of the federal effort to sequence the human genome. Led by Eric Lander, the Whitehead/MIT Center for Genome Research in Cambridge, Mass., has assembled an impressive array of staff and technology.

The sequencing bioinformatics staff of 18 has a wide range of expertise in such fields as physics, biology, neurobiology and computer science (see "An Emerging Industry," below). Nearly all have some experience with software engineering.

Meanwhile, the nine-person computer systems operations staff provides technical and user support.

The list of hardware and software is equally impressive: 123 sequencers; 17 four-processor SMP systems serving as pipeline, assembly, database and file servers; Compaq StorageWorks RAID arrays with 5 terabytes of storage; two Sybase production database environments; and a slew of custom-developed applications. The sequence data is stored in Unix flat files, while data gathered by the center's laboratory information management systems is stored in Sybase relational databases.

Each night the newly assembled sequence data is automatically updated and archived in-house. (Since receiving $35 million from the National Human Genome Research Institute last March, the Whitehead Institute has been scaling up its DNA sequencing from 750 million base pairs per year to 17 billion base pairs annually.) The data is also sent via the internet to GenBank, a public database maintained by the National Center for Biotechnology Information (www.ncbi.nlm.nih.gov). (NCBI also develops software tools for analyzing genome data, conducts research in computational biology and distributes biomedical information.) NCBI then replicates the new data to other public databases in Europe and Japan. The emphasis is clear: Make the data available as soon as possible to as many people as possible.

Many of the challenges encountered by researchers at the Whitehead Institute may sound familiar. "We face the same issues that anyone building a large production facility faces--reliability, availability and scale," says Jill Mesirov, director for bioinformatics and research computing. "Biology is very new to production on this kind of scale, so we're constantly talking about the best ways to do this work." Since the amount of data continues to grow at a phenomenal rate, scalability is a moving target. "Our vision of what something will look like in six months often changes," says K.M. Peterson, manager of computer systems operations.

Even with all of the challenges, researchers have been able to scale up production dramatically. "Most of this would have been impossible 10 years ago," says Lauren Linton, codirector of the Whitehead/MIT Center for Genome Research. "When I was in graduate school, we had to call out the [base] letters and write them down. We currently generate about 50 million letters each day. Now the software handles, stores and sifts through all the data."
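
Linton's daily figure squares with the center's 17-billion-base annual target, as a quick check shows (assuming round-the-clock operation every day of the year):

```python
# Cross-check: 17 billion base pairs a year vs. ~50 million letters a day.
per_year = 17 * 10**9
per_day = per_year / 365
print(f"about {per_day / 10**6:.0f} million letters per day")  # about 47 million
```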

The Human Genome Project has already made significant progress. Last November, government scientists announced that they had identified, sequenced and published one-third of the human genome. Less than a month later came another spectacular announcement: the sequencing--for the first time ever--of the DNA of an entire human chromosome.

THE BUSINESS OF BIOTECH

Sequencing the human genome also has a corporate side. Companies such as PE Corp. (www.pe-corp.com) were quick to realize that pharmaceutical companies would be willing to pay handsomely for information that allows them to develop drugs more quickly. PE jumped into the arena two years ago when it convinced Dr. J. Craig Venter to leave his position as president of the nonprofit Institute for Genomic Research (www.tigr.org) and launch yet another genomics center--this time, a company called Celera Genomics (www.celera.com). (Venter remains chairman of TIGR's board.)

Celera says that it will sequence the human genome faster and more cheaply than the government's Human Genome Project. While some characterize the efforts of the two groups as a race, others dismiss the notion. In fact, the Human Genome Project and Celera have chosen different approaches to solving the problem--in terms of both science and IT.

At the heart of Celera's approach is a sequencing method that Venter pioneered. Known as "shotgun sequencing," the approach calls for blasting the entire genome into small pieces of DNA and then assembling the fragments into the proper order by matching the overlapping sequences at the ends of the fragments. The method is faster than the one used by the federal government--which involves sequencing the genome one known large fragment at a time--and is controversial, with some researchers concerned that Celera's results will not be accurate.

Since opening its doors, Celera has moved quickly. Venter assembled a team of well-known researchers, including Samuel Broder, former director of the National Cancer Institute; Nobel laureate Hamilton Smith, who discovered the Type II restriction enzyme used in gene cloning; and Eugene Myers, who codeveloped the aforementioned Blast algorithm. Last September, Celera announced that it had finished sequencing the genome of Drosophila melanogaster, the fruit fly. One month later, the Rockville, Md.-based company announced yet another major milestone: the sequencing and delivery of approximately 1.2 billion base pairs of human DNA to its subscribers. (Celera filed provisional patent applications on 6,500 of the genes at the same time. See "A Question of Ethics," below.)

Ultimately, Celera wants to become the definitive source of all genomic information by giving subscribers the tools to access and analyze its data via the internet. To meet that goal, Celera has worked closely with Compaq Computer Corp. to develop what company officials refer to as the world's second-largest supercomputer facility. It has already installed more than 200 Compaq AlphaServer ES40 systems running 500MHz Alpha processors, 11 GS140 servers and 50 terabytes of StorageWorks storage. On order are an additional 28 ES40s and three WildFires, one with 128GB of RAM. Meanwhile, Celera has also leased 300 of PE Biosystems' 3700 DNA sequencers. This IT arsenal runs on a switched backbone that supports throughput of 500GB per second.

AN E-BUSINESS MODEL

In many ways, Celera's pipeline for acquiring and processing data is similar to that of the Whitehead Institute. There are, however, several key differences. For one thing, Celera has developed its IT infrastructure with e-commerce in mind. Celera customers, for example, tap into their own databases running on Celera servers via the internet. (Each customer database is updated weekly. Three copies of the data are archived forever on tape.) Celera maintains customer data in more than 60 separate databases.

As the company broadens its subscriber base, that model is expected to change. By the end of the year, smaller customers will go to a single database, where they will access the specific data that interests them. Digital certificates will verify key information such as identity, access levels and billing. Meanwhile, larger companies will probably continue to have their own databases maintained by Celera.

Celera has signed up four database customers so far: Pharmacia & Upjohn, Amgen, Novartis and Pfizer. The first three ponied up $5 million for five years to gain early access to Celera's database last year. The company refuses to disclose how that fee structure may change as other companies join the fold.

Not-for-profit research organizations, however, will be charged less. "We have said that we will charge between $5,000 and $20,000 per year per lab for a university," says Paul Gilman, director of policy planning. (The amount per lab will be smaller for universities with more labs.) And what do customers get for these fees? In addition to the actual data, they gain access to annotation information (for example, details about whether a certain gene has been seen before, whether it has been patented, etc.), comparative genomics (comparisons with the fruit fly and mouse genomes), Celera's computational facilities and a wide assortment of software tools. Celera also saves customers the time and expense of accessing dozens of databases worldwide. "We are not just a DNA sequence database," says Gilman. "We provide extensive annotation and the best tools for manipulating and analyzing the data."

Because e-commerce is the order of the day, Celera faces certain business pressures. "There's so much data that performance and bandwidth are sometimes an issue," says Marshall Peterson, vice president of infrastructure technology. "Customers can either download their tools and data to our servers or they can pull the database to their site. We want to convince them to analyze the data here." Since it's often impossible to predict what customers will need in six months, flexibility is crucial. "We're seeing huge increases in the numbers and types of customers," says Peterson. "We still don't know how customers will want to access our data in the future, so we want to offer flexibility in terms of billing--whether it's by user, CPU, number of queries or other methods."

Flexibility will be especially important as Celera eventually adds data from public databases such as GenBank to its stockpile of information.

RETURN ON INVESTMENT

Celera's business model quickly captured the interest of the business world. Wall Street welcomed the new venture with open arms, sending the price of Celera's stock from 14 3/16 in May 1999 to over 190 by the end of the year. Meanwhile, financial gurus Tom and David Gardner of the personal investment website Motley Fool (www.fool.com) told followers that they plan to invest $50,000 in the company. And according to biotechnology analyst Eric Schmidt at investment bank S.G. Cowen in New York City, the future looks even brighter. "We think that Celera will win big time because of their dream team, the facilities and their proprietary advantage that will allow them to gain market share," says Schmidt.

Industry observers are also betting on Celera. "There are many companies out there that are sequencing, but Celera is the only one that has tackled the entire genome," says Lynn Arenella, associate professor of biology at Bentley College in Waltham, Mass. Arenella, whose expertise includes the commercialization of biomedical technologies, believes that Celera has an important edge. "Companies that have the ability to organize and interpret the data will come out on top. Craig Venter is talking about becoming the [Michael] Bloomberg of genetics. And he has always made good on his promises."

It's not clear when Celera or the Human Genome Project will be able to deliver on all of their promises. When they do, however, the results could affect the lives of everyone you know.

Louise Fickel is a freelance writer based in Denver. She can be reached at RiceKid@ix.netcom.com.

AN EMERGING INDUSTRY

IT opportunities abound in biological research.

Bioinformatics--the application of IT to biological research--is a rapidly evolving field. Not only are organizations such as the government's Human Genome Project and Celera Genomics developing their own bioinformatics capabilities in-house, but new companies that provide bioinformatics services to the pharmaceutical and biotechnology industries are sprouting up everywhere.

One such company is DoubleTwist (www.doubletwist.com). Founded in 1993 as Pangea Systems by two Stanford University graduate students, the company reinvented itself late last year as an application service provider (ASP). Log on to the company's website and you'll find the tools needed to retrieve and interpret genomic data. You can, for example, compare DNA and protein sequences or monitor the status of sequence patents. Access to basic functionality and research tools is free; access to more detailed views of the data will be governed by a fee structure that DoubleTwist is currently developing.

"The hard thing is processing, interpreting and annotating data, and we're doing all of it on the web," says Robert Williamson, chief operating officer at the Oakland, Calif.-based company. "We're not doing the sequencing. We're simply trying to make all of that data usable. If we're successful, we'll democratize genetic research."

The company has equally lofty business goals. "Now instead of selling to 300 huge companies, I can directly empower scientists worldwide," says CEO John Couch. Couch, who held top-level positions at Apple Computer during its early years, says that this industry reminds him of a different time. "When I looked at our software, I said, 'This is déjà vu.' The internet is equivalent to the microprocessor. I knew that if I could host that environment, the cost would fall dramatically."

Others agree that the field is ripe with opportunities--especially for IT professionals. "This is an extremely hot area," says Ty Rabe, director of Compaq's High Performance Technical Computing Solutions Group in Marlboro, Mass. "We're finding that virtually all pharmaceutical, biotechnology and chemical companies are now creating bioinformatics centers. The field is up for grabs."

And what if you think you might be interested in branching out into bioinformatics? Learn some biology, advises Jill Mesirov, director for bioinformatics and research computing at MIT's Whitehead Institute in Cambridge, Mass. And be prepared to make history. "This whole project will create a new paradigm for biology research," says Mesirov. "I can't repeat often enough how exciting it is to be in this field, in this place, at this time." -L. Fickel

A QUESTION OF ETHICS

By patenting the genes it sequences, Celera is causing a stir in the industry.

Among the more controversial actions that Celera Genomics has taken is filing for more than 6,000 patents on genes. Patenting those genes will ensure that Celera has exclusive rights, for a certain period of time, to develop products based on the discovery of those genes. (Rather than developing new drugs or tests itself, Celera plans to license its intellectual property to customers such as Pfizer.) While some view the patenting of genes as immoral and antithetical to research, others see it as a necessity. "Patenting has nothing to do with ethics," says Chuck Ludlam, vice president for government relations at the Biotechnology Industry Organization (www.bio.org) in Washington, D.C. "Patenting confers protection, not ownership. And that protection has a limited use. We need patents to provide an incentive to companies before they will invest hundreds of millions of dollars in developing new products. Otherwise, we won't get them involved."

"I'm not among those who believe that patenting is necessarily a bad idea," says Ronald Green, director of Dartmouth College's Institute for the Study of Applied and Professional Ethics in Hanover, N.H. "However, if companies unreasonably charge for materials they have patents on, they can actually deter other researchers from moving ahead." According to Green, the federal government should step in and beat the private companies at their own game.

"Too often, the federal government gives away its own rights," says Green. "The Human Genome Project should retain patents that will allow them to control utilization and research. That would also ensure that private companies don't reap benefits from federally funded research." -L. Fickel