Scaling new heights

You think you've got problems. Your computer systems have been turned outwards, enabling you to interact digitally with customers and business partners.

As a result, your company is now able to take orders online and streamline business processes. However, nobody counted on the fact that all this meant your mission-critical servers might be swamped with thousands of simultaneous users, all demanding 24x7 uptime.

It all used to be so much easier when computer systems were designed solely for internal users. It was simple to calculate the demands that would be placed on the system and therefore how much computing power was required. Not only that, if the system went down for whatever reason, while inconvenient it was hardly the end of the world.

Now that systems increasingly face outwards, it's a whole new ball game. A new marketing campaign, for example, could result in a surge in traffic beyond anyone's wildest expectations. What's more, if the system chokes, while it still might not mean the end of the world, it could mean the end of a promising IT career. Any such system failure would be likely not only to damage a company's invaluable goodwill and reputation, but might also cost the company sales and revenue.

Suddenly, enterprise users are facing the same challenges that were in the past more commonly faced by the scientific and research communities. That is, how do you maintain very high availability while ensuring systems are capable of scaling to meet ever-increasing performance demands?

Welcome to John O'Callaghan's world. O'Callaghan is executive director for the Australian Partnership for Advanced Computing (APAC), which operates the most powerful computing facility in Australia. When the National Facility is upgraded in October it will be one of the world's 60 most powerful computer systems.

So, if you think that 32-way Unix server you've been scoping has some impressive grunt, consider that APAC's Compaq Alphaserver SC system will, when complete, incorporate 450 Alpha processors. That equates to a teraflop of computing power, or about a trillion floating point operations per second.

Now that's scalability.

While impressive, APAC's high-performance computing system isn't so different to the servers you would find in any large corporation. While the 'supercomputers' that were used by the scientific and research communities in the past, manufactured by vendors like Cray, bore very little similarity to your everyday garden-variety enterprise servers, that's not the case today, O'Callaghan said.

"Supercomputers used to be described as islands of computing capability. They had their own operating systems, languages and components and were very hard to use."

Today, high-performance computers are described as "advanced computing" systems, rather than supercomputing systems, recognising the fact that they use generic processors and disk technologies and so on - just on a much larger scale.

"Really, they're just an extension of the kinds of systems you would find on any desktop," O'Callaghan said.

What that means, according to O'Callaghan, is that advances being made in mass scalability and high availability in the advanced computing realm will eventually filter down into the enterprise arena. Indeed, it's not implausible that somewhere down the track we will all plug into the same high-performance computing 'grid'.

The new APAC National Facility is hosted at the Australian National University (ANU) in Canberra. APAC is a partnership between consortium partners in each Australian state as well as the ANU and CSIRO. In total, there are 29 Australian universities that are a part of APAC.

Between the partners and the Federal Government, around $80 million will be invested in APAC over the next three years. The cost of the National Facility, including staff, over that time will be about $22 million.

In its initial months of operation, more than 300 researchers are to make use of the National Facility. O'Callaghan said the core use of the system would be in computational chemistry and physics; however, its application would also extend to areas like astrophysics, geophysics, environmental modelling and even data mining and financial applications.

Some of the early applications of the APAC facility have included:

- Studying the porous flow of fluids for the recovery of oil from underground reservoirs;
- Analysing the complex folding patterns of proteins to help design drugs;
- Simulating turbulent flow for more efficient jet engines;
- Imaging the earth through seismic data processing;
- Developing new techniques to combat credit card fraud.

Naturally, such complex applications require enormous computing power, so performance and scalability were high on the selection criteria when APAC went computer shopping.

Of particular importance was sustained performance which, according to O'Callaghan, is very different to peak performance. Sustained performance is how fast the processors work on actual code and can only be established by benchmarking, while peak performance is basically the theoretical sum of a computer system's processing power.
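As a rough illustration of the distinction, the Python sketch below computes a theoretical peak from processor count, clock speed and floating point operations per cycle, then compares it with a sustained figure. The 1GHz clock, the two operations per cycle and the sustained figure are assumptions chosen for illustration, not APAC specifications, and the function names are hypothetical. The toy benchmark simply times a loop of real arithmetic, since timing actual code is the only way to obtain a sustained figure.

```python
import time


def peak_flops(processors, clock_hz, flops_per_cycle):
    """Theoretical peak: every processor doing its maximum work every cycle."""
    return processors * clock_hz * flops_per_cycle


def measure_sustained_flops(iterations=200_000):
    """Toy benchmark: time a loop of real multiply-add work.

    In Python the interpreter overhead dominates, so the result sits far
    below any hardware peak. The point is only that a sustained figure
    comes from timing actual code.
    """
    x = 0.0
    start = time.perf_counter()
    for _ in range(iterations):
        x = x * 0.5 + 1.0  # one multiply and one add per iteration
    elapsed = time.perf_counter() - start
    return 2 * iterations / elapsed


def efficiency(sustained, peak):
    """Fraction of the theoretical peak achieved on real code."""
    return sustained / peak


if __name__ == "__main__":
    # Assumed figures for illustration only: 450 processors at 1GHz,
    # each completing two floating point operations per cycle.
    peak = peak_flops(450, 1_000_000_000, 2)
    print(f"theoretical peak: {peak / 1e12:.2f} teraflops")

    # A hypothetical sustained result at half of peak (a two-to-one gap).
    print(f"efficiency: {efficiency(peak / 2, peak):.0%}")

    print(f"toy benchmark here: {measure_sustained_flops():.3g} flops")
```

The peak figure is pure arithmetic and can be quoted from a specification sheet. The sustained figure changes with the code being run, which is why it has to be benchmarked on the workloads that matter.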

"The difference between sustained performance and peak performance can be as much as two to one and it's something that differs dramatically between vendors."

Also important was the overall balance of the system. Processors are only one component of any computer system and the speeds achieved in processing need to be matched by the interconnection path between processors and the access to memory.
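One simple way to picture that balance is to cap a processor's achievable speed by how quickly memory can feed it. The Python sketch below does this under stated assumptions: the function name, the bandwidth figure and the "operations per byte" figures are all hypothetical, chosen only to show the effect.

```python
def attainable_flops(peak_flops, memory_bytes_per_sec, flops_per_byte):
    """Achievable rate is the lower of the compute limit and the memory-fed limit."""
    memory_limit = memory_bytes_per_sec * flops_per_byte
    return min(peak_flops, memory_limit)


if __name__ == "__main__":
    peak = 2_000_000_000       # assumed: 2 gigaflops per processor
    bandwidth = 1_000_000_000  # assumed: 1 gigabyte per second from memory

    # Code that does little arithmetic per byte fetched is limited by memory.
    print(attainable_flops(peak, bandwidth, 0.25))

    # Code that reuses each byte heavily can run at the processor's peak.
    print(attainable_flops(peak, bandwidth, 4))
```

With these made-up figures the first case reaches only an eighth of the processor's peak, however fast the processor is.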

"There's not much point having a super-fast processor if it takes a long time for data to get between the memory and the processor," O'Callaghan said.

Finally, the ability to scale was an important criterion as "it's very important that we are able to upgrade the system over time to meet future requirements," he said.

There are no real limits to how far APAC's advanced computing architectures can scale, O'Callaghan said. If one system runs out of gas, an external switch could be used to interconnect additional resources.

"If you were forced to have multiple switches between a certain number of nodes you might see performance begin to degrade but it would be very dependent on how you allocated processors to each job," he said.

"A study was done a couple of years ago on whether it was possible to build petaflop systems and the conclusion was that with the current trends in technological development, today's current system architectures would scale that far."

While performance and scalability are critical for APAC, high availability is just as important.

"In the enterprise space, you might shoot for 99.999 per cent availability. We want that too, but for different reasons," O'Callaghan said.

"The difference with our systems is that we run very large jobs that are executed over a large amount of time. We might have a job running across several hundred processors and if we lose a processor in that time we could lose the whole job.

"So for us, availability is absolutely imperative or it means we're running all these massive jobs again and again."

It would be a mistake, though, to think that the scale and availability APAC achieves are just a result of advanced hardware. While a system must achieve balance between its hardware components, software is just as crucial in the overall technology solution.
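A back-of-the-envelope sketch shows why hardware alone cannot close the gap. The Python below converts an availability percentage into downtime per year, then estimates the chance that a job spanning many processors finishes without losing any of them. The failure model (independent processors failing at a constant rate) and the 50,000-hour mean time between failures are assumptions for illustration, not APAC figures, and the function names are hypothetical.

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_minutes_per_year(availability):
    """Minutes of downtime per year allowed by an availability fraction."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def job_survival_probability(processors, job_hours, mtbf_hours):
    """Chance that no processor fails during the job.

    Assumes processors fail independently at a constant rate, so the
    combined failure rate is simply processors / mtbf_hours.
    """
    return math.exp(-processors * job_hours / mtbf_hours)


if __name__ == "__main__":
    # "Five nines" availability, the enterprise target quoted above.
    print(f"{downtime_minutes_per_year(0.99999):.2f} minutes of downtime a year")

    # Assumed: a 24-hour job on 300 processors, each with a
    # 50,000-hour mean time between failures.
    p = job_survival_probability(300, 24, 50_000)
    print(f"{p:.1%} chance the job completes uninterrupted")
```

Under these assumptions, 99.999 per cent availability allows a little over five minutes of downtime a year, yet a day-long job on 300 processors still has roughly a 13 per cent chance of being interrupted. However reliable each processor is, spreading a long job across hundreds of them multiplies the exposure.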

"One of the major issues in high-performance computing is the robustness of the overall software solution and its ability to handle problems," O'Callaghan said.

"It's not an area that is all that well developed in high performance computing, but we need capabilities likes checkpointing and being able to restart jobs, load balancing and so on.

"While these sorts of capabilities are getting much better, there's still a lot of room for improvement."

As well as the underlying operating system and systems management software, the user code or application must also have been designed to make use of the massive multiprocessing capability of APAC's computer system. While intelligent compilers were making this easier, again there was much improvement needed.
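Application code can also carry some of that robustness itself. As a minimal sketch of the checkpoint-and-restart idea O'Callaghan describes, the Python below saves a job's progress to a file at intervals and picks up from the last saved point when run again. The function name, the JSON file format and the `fail_at` parameter (which simulates a lost processor) are hypothetical; a real high-performance system would have to coordinate this across many processors and far larger state.

```python
import json
import os


def run_job(total_steps, checkpoint_path, checkpoint_every=100, fail_at=None):
    """Run a long job, saving progress so a rerun resumes instead of restarting."""
    # Resume from the last checkpoint if one exists; otherwise start fresh.
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)
    else:
        state = {"step": 0, "total": 0}

    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            # Hypothetical stand-in for losing a processor mid-job.
            raise RuntimeError("simulated processor failure")

        state["total"] += state["step"]  # stand-in for the real computation
        state["step"] += 1

        if state["step"] % checkpoint_every == 0:
            # Write to a temporary file and rename it, so a crash during
            # the write cannot corrupt the previous checkpoint.
            tmp_path = checkpoint_path + ".tmp"
            with open(tmp_path, "w") as f:
                json.dump(state, f)
            os.replace(tmp_path, checkpoint_path)

    return state["total"]
```

If the job fails at step 250 with checkpoints every 100 steps, a rerun resumes from step 200 and repeats only 50 steps rather than all 250. Checkpointing more often reduces the repeated work but adds overhead to every run.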

O'Callaghan said around 20 staff ran the APAC National Facility. About half of these looked after systems management, dealing with load-balancing issues and how best to assign jobs. The other half were dedicated to supporting users, helping them to design efficient code and resolving any problems they were having.

"It's a very complicated, overall system and you can't just tune one part of it and expect that it's going to make a difference," he said.

The next step for advanced computing is to begin linking the world's advanced computing facilities via broadband in order to create virtual supercomputers.

"The current wave of interest within the research community is this concept of The Grid," said O'Callaghan. "The goal is you can just plug in various computer devices and get access to all these high-performance computer resources and complex applications."

The GrangeNet (Grid and Next Generation Network) consortium, of which APAC is a key member, was granted $14 million by the Federal Government at the end of May to build a 2.5Gbit/sec backbone that would link its high-performance computing systems in Canberra, Sydney, Melbourne and Brisbane to all of AARNet's Points-of-Presence (POPs).

GrangeNet will be part of the global research and education network and will develop and deploy "grid services" including distributed computing, collaborative visualisation, cooperative environments and digital libraries.

Again, these are the types of applications and advances being pioneered by the research and scientific communities that will more than likely find their way into the world of enterprise computing, O'Callaghan said. After all, that was how the Internet came to be.