Checkpoint and Restart

Have you ever toiled for hours on a document only to have your computer crash just as the last line gets written? From that type of experience, most users learn to save early and often so their work can be restored in the event of a crash.

The mechanism for protecting operating systems and enterprise applications from crashes is no different. But instead of being called save and restore, it's called checkpoint and restart (CPR).

A checkpoint/restart mechanism periodically saves the state of a running system or program. If the machine crashes and is then restarted, work continues from the last checkpoint, and only what was done after that checkpoint has to be repeated.

"It's important in all computational-intensive programming," says Ed Hall, a research computing system analyst at the University of Virginia in Charlottesville. "You need to have the ability in case the system crashes or, if it's a batch system with a time limit, to terminate at a given time."

At some firms and supercomputing centers, it's common practice to break up long-running computational programs into several batches. A gene-sequencing application, for example, searches through enormous databases and executes complex algorithms that can take several weeks to complete.

Typically, at certain time intervals or at the beginning of the business day, these long-running programs get intentionally stopped after a checkpoint so that smaller jobs can then be processed. When the smaller jobs have finished, the larger program restarts at the last checkpoint.
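The stop-at-a-checkpoint pattern can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name run_until_deadline, the state dictionary and the injectable clock are all hypothetical, and the state is kept in memory where a real job would write it to disk.

```python
import time

def run_until_deadline(state, total_steps, time_limit_s, clock=time.monotonic):
    """Advance a job until it finishes or its time slot runs out.

    Returns (state, finished). The state dict stands in for the checkpoint:
    it is only ever handed back between steps, never in the middle of one.
    """
    deadline = clock() + time_limit_s
    while state["step"] < total_steps:
        if clock() >= deadline:
            # Time slot is over: stop at a clean step boundary.
            return state, False
        state["acc"] += state["step"]   # placeholder for one unit of real work
        state["step"] += 1
    return state, True
```

Calling the function again with the returned state resumes the job where it left off, which is the restart half of the cycle described above.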

But while the concept is easy to understand, the technical mechanism to checkpoint and restart an operating system or application is quite complex.

OS Vs. Application

Checkpointing can occur either within the operating system or at the application level.

Most mainframes and high-end server operating systems, such as Mountain View, Calif.-based Silicon Graphics Inc.'s Irix or Seattle-based Cray Inc.'s Unicos, have automated CPR utilities. CPR at the operating system level saves the complete state of a running application at periodic checkpoints and allows the system to restart from the last checkpoint. This type of checkpoint enables a user to shut down a computer and bring it up again without losing any work.

However, on very large computers with hundreds or thousands of processes running, saving the entire state of an operating system can take a long time. Restarting the machine from that state later is also slow; on large jobs, it could take several hours. The recovery is delayed because a large amount of data must be stored and read back, whether or not the application needs that information to restart.

"Checkpointing at the operating system is useful but very costly, in that the operating system does not know what data the application really needs to restore it later, so it blindly saves everything," explains James Kasdorf, director of special projects at the Pittsburgh Supercomputing Center.

"If you imagine a machine with the same data replicated on 512 processors, a systemwide checkpoint does not know that and it saves everything, so you end up with hundreds of unneeded copies of data, program code and system libraries," he says.
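Kasdorf's point comes down to simple arithmetic, sketched below in Python. The function name and the byte counts are illustrative assumptions, not measurements from any real system; the sketch also assumes an application-level checkpoint that writes the replicated data once and each processor's private data separately.

```python
def checkpoint_sizes(n_procs, shared_bytes, private_bytes):
    """Compare rough checkpoint sizes for data replicated on every processor.

    system_level: every processor's full memory image is saved blindly.
    application_level: the replicated part is saved once, plus each
    processor's private data.
    """
    system_level = n_procs * (shared_bytes + private_bytes)
    application_level = shared_bytes + n_procs * private_bytes
    return system_level, application_level

# Hypothetical figures: 512 processors, 1,000 bytes replicated, 10 bytes private each.
system_level, application_level = checkpoint_sizes(512, 1000, 10)
```

With these made-up figures the systemwide checkpoint is 517,120 bytes and the application-level one is 6,120 bytes. The gap grows with the number of processors holding the same data.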

Checkpointing at the application level is the other option. To perform CPR within the application, the application uses operating system hooks that enable it to save the relevant resources and data needed for a restart.

At the application level, a developer can pick an optimal point, typically the end of an iterative cycle, to perform a checkpoint to make the process more efficient, according to Reagan Moore, principal investigator for scientific computing at the San Diego Supercomputer Center.

This type of application-centric CPR can be more efficient because only the needed data gets saved, which makes the application both easier to checkpoint and quicker to restart.
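A minimal Python sketch of application-level checkpointing at the end of each iterative cycle might look like the following. It is not how any of the systems mentioned here implement CPR. The function names, the pickle file format and the crash_at parameter (used only to simulate a failure) are assumptions made for illustration.

```python
import os
import pickle
import tempfile

def save_checkpoint(path, state):
    """Write the state to a temp file, then rename it over the old checkpoint.

    The rename is atomic, so a crash during the write leaves the previous
    checkpoint intact.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)

def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)

def run_sum_of_squares(path, total_iterations, crash_at=None):
    """Sum i*i for i in range(total_iterations), checkpointing each cycle."""
    # Only the data needed for a restart is kept: loop position and running total.
    state = load_checkpoint(path) or {"next_i": 0, "total": 0}
    for i in range(state["next_i"], total_iterations):
        if crash_at is not None and i == crash_at:
            raise RuntimeError("simulated crash")
        state["total"] += i * i
        state["next_i"] = i + 1
        save_checkpoint(path, state)   # end of the iterative cycle
    return state["total"]
```

If the first run fails partway through, calling run_sum_of_squares again with the same path picks up at the saved iteration instead of starting over. Checkpointing every cycle keeps the sketch short; a real job would more likely save every so many cycles.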

The challenge with application CPR is that it's difficult to do in some cases, for example when the application has an open communications channel to an external device or runs on a clustered computer. In these cases, it's difficult to save the state of an application because that state is spread across several network nodes that are still exchanging messages, Moore says.

It also takes longer to checkpoint and restart applications that hold large buffers in memory.

According to Kasdorf, the best way to optimize the checkpoint and restart process is to have fast I/O speeds on the computer. If the checkpoint data can be written quickly to the disk, then frequent checkpoints won't result in a long CPR process.
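The effect of I/O speed can be estimated with a back-of-the-envelope calculation, sketched here in Python. The function name and the figures are hypothetical, and the model is deliberately crude: it assumes the job pauses while the checkpoint is written and that the disk sustains a constant write rate.

```python
def checkpoint_overhead(size_bytes, write_bytes_per_s, interval_s):
    """Estimate the cost of periodic checkpoints.

    Returns (write_seconds, overhead_fraction), where overhead_fraction is
    the share of wall-clock time spent writing checkpoints rather than computing.
    """
    write_s = size_bytes / write_bytes_per_s
    overhead = write_s / (interval_s + write_s)
    return write_s, overhead

# Hypothetical job: a 100 GB checkpoint once an hour.
slow_disk = checkpoint_overhead(100e9, 1e9, 3600)    # 1 GB/s
fast_disk = checkpoint_overhead(100e9, 10e9, 3600)   # 10 GB/s
```

Under these assumptions each checkpoint takes 100 seconds on the slower disk and 10 seconds on the faster one, so the faster system could checkpoint far more often for the same overhead.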
