Chapter 2. Administering Checkpoint and Restart


This chapter describes how to install and administer IRIX Checkpoint and Restart (CPR), and how to configure statefiles.

Responsibilities of the Administrator

The system administrator is responsible for the following CPR tasks:

Install CPR software on server systems as required
Help users employ CPR on server systems and workstations
Prevent statefiles from filling up available disk space
Delete, or encourage users to delete, unneeded old statefiles

Installing CPR

The subsystems that make up CPR are listed in Table 2-1.

Table 2-1. CPR Product Subsystems

Subsystem Name	Contents
`eoe.sw.cpr`	Checkpoint and restart software
`eoe.man.cpr`	CPR reference manual pages
`eoe.books.cpr`	This guide as an IRIS InSight document

If CPR is not already installed, follow this procedure to install the software:

Load the IRIX software distribution CD-ROM.
On the server, become superuser and invoke the inst command, specifying the location of the CD-ROM software distribution:
$ /bin/su -
Password:
# inst -f /CDROM/dist
Prevent installation of all default subsystems using the keep subcommand:
Inst> keep *
For additional information on inst, see the IRIX Admin: Software Installation and Licensing Guide, or the inst(1M) man page.
Make subsystem selections. To install CPR software, the man pages, and the CPR manuals for IRIS InSight, enter the following commands:
Inst> install eoe.*.cpr
Inst> list i
Inst> go
The list subcommand with the i argument displays all the subsystems marked for installation. The go subcommand starts installation, which takes some time.

For additional information on available subsystems, see the IRIX Release Notes.
Ensure that the following line exists in the /var/sysgen/system/irix.sm file (change cprstub to cpr if necessary):
USE: cpr
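
The last step can be checked with a short script. The following is a minimal sketch, not part of CPR: the function name has_cpr_enabled is hypothetical, and it assumes the directive appears as the first two whitespace-separated fields of a line.

```shell
# Hypothetical helper: succeed if the system file named by $1 contains
# an active "USE: cpr" line (a "USE: cprstub" line does not count).
has_cpr_enabled() {
    awk '$1 == "USE:" && $2 == "cpr" { found = 1 }
         END { exit !found }' "$1" 2>/dev/null
}

if has_cpr_enabled /var/sysgen/system/irix.sm; then
    echo "cpr is enabled in irix.sm"
else
    echo "no USE: cpr line found; look for a cprstub line to change"
fi
```

The check only reads the file; editing the line remains a manual step.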

Managing Checkpoint Images

Because of their potential size and longevity, checkpoint images (statefiles) are one aspect of CPR where intervention by the system administrator may be required.

Statefile Location and Content

The statefile can exist anywhere on a filesystem where the user has write permission, provided there is enough disk space to store it. A statefile tends to be slightly larger than the process it checkpoints.
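
Before checkpointing a large process, it can help to compare the expected statefile size with the free space on the target filesystem. The following is a minimal sketch under stated assumptions: the function name has_room_for_statefile is hypothetical, the required size in kilobytes is an estimate you supply, and df is assumed to accept -P and -k and to print available space in the fourth column (adjust the column if your df prints a different layout).

```shell
# Hypothetical helper: succeed if the filesystem holding directory $1
# has at least $2 kilobytes available.
has_room_for_statefile() {
    dir=$1
    need_kb=$2
    # Assumption: in "df -Pk" output, the fourth field of the last line
    # is the available space in kilobytes.
    avail_kb=$(df -Pk "$dir" | awk 'END { print $4 }')
    [ "$avail_kb" -ge "$need_kb" ]
}

# Example: is there room in /tmp for a statefile of about 1 MB?
if has_room_for_statefile /tmp 1024; then
    echo "enough space"
else
    echo "not enough space"
fi
```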

As the system administrator, you might want to create a policy saying that checkpoint images stored in temporary directories (such as /tmp or /var/spool) are not guaranteed to remain there. If users want to preserve a statefile indefinitely, they should place it in a permanent directory that they own themselves, such as their home directory.

Checkpoint images contain much information about a process, including process set IDs, copies of user data and stack memory, kernel execution states, signal vectors, a list of open files and devices, pipeline setup, shared memory, array job states, and so on.

Monitoring a Checkpoint

To obtain information about a statefile directory, run the cpr command with the -i option:

$ cpr -i statefile ...

This displays information about the statefile revision number, process names, credential information for the processes, the current working directory, open file information, the time when the checkpoint was done, and so forth.

There is no automated way to tell whether a user has restarted from a statefile; you need to ask the user.

Removing Statefiles

First, ask the checkpoint owner to remove unneeded statefiles. If there is no reply, and checkpoints are overflowing disk space, look for the oldest statefiles, especially ones in a series, as the best candidates for removal.
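
To find removal candidates, you can list statefile directories oldest first along with their size. The following is a minimal sketch, assuming the statefiles are collected under one parent directory; the function name list_oldest_first is hypothetical. It parses ls output, so it does not handle names that contain newlines.

```shell
# Hypothetical helper: print the disk usage in kilobytes of each entry
# in directory $1, ordered from oldest to newest modification time.
list_oldest_first() {
    # ls -tr sorts by modification time, oldest entry first.
    ls -tr "$1" | while read -r name; do
        du -sk "$1/$name"
    done
}

# Example (the path is illustrative only):
#   list_oldest_first /usr/people/user1/checkpoints
```

The entries at the top of the output are the oldest; confirm with the owner before deleting any of them.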

To delete an entire statefile directory, run the cpr command with the -D option:

$ cpr -D statefile ...

Only the checkpoint owner and the superuser may delete a statefile. Once a checkpoint has been deleted, it cannot be restarted until the statefile is restored from backups.

Disabling User Checkpoints

If you want to restrict user access to CPR, or if some users abuse the facility by leaving around large statefile directories, you can follow this procedure:

Create a “cpr” group in the CPR server's /etc/group file, listing the users who should have access to CPR.
cpr::100:user1,user2,user3,user4,user5,user6

Change the group of the cpr command to “cpr” and set its mode to 4750.

# chgrp cpr /usr/sbin/cpr 
# chmod 4750 /usr/sbin/cpr
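
After editing the group file, you may want to confirm that a given user appears in the “cpr” entry. The following is a minimal sketch; the function name in_cpr_group is hypothetical, and it inspects only the member list in the fourth field of the group entry, not a user's primary group from the password file.

```shell
# Hypothetical helper: succeed if user $1 is listed as a member of the
# "cpr" entry in the group file $2 (for example /etc/group).
in_cpr_group() {
    awk -F: -v user="$1" '
        $1 == "cpr" {
            # The fourth field is a comma-separated member list.
            n = split($4, members, ",")
            for (i = 1; i <= n; i++)
                if (members[i] == user) found = 1
        }
        END { exit !found }
    ' "$2"
}

# Example:
#   in_cpr_group user3 /etc/group && echo "user3 may run cpr"
```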

To temporarily disable CPR, set the mode of the /usr/sbin/cpr command to 000. To permanently shut off CPR, use the inst command to remove the eoe.sw.cpr subsystem.

Checkpointable Objects

The following system objects are checkpoint safe. See “Checkpoint-Safe Objects” in Chapter 3 for complete coverage of checkpoint safety.

UNIX processes, process groups, terminal control sessions, IRIX array sessions, process hierarchies, sproc() groups (see the sproc(2) man page), and arbitrary process sets
All user memory areas, including user stack and data regions
System states, including process and user information, signal disposition and signal mask, scheduling information, owner credentials, accounting data, resource limits, current directory, root directory, locked memory, and user semaphores
System calls, if applications handle return values and error numbers correctly, although slow system calls may return partial results
Undelivered and queued signals, which are saved at checkpoint and delivered at restart
Open files (including NFS-mounted files), mapped files, file locks, and inherited file descriptors; this includes open pipes with pipeline data
Special files /dev/tty, /dev/console, /dev/zero, /dev/null, and ccsync(7M)
UNIX System V shared memory (but the original shared memory ID is not restored); see the shmop(2) man page
IRIX jobs; see the job_limits(5) man page
Jobs started with ChallengeArray services, provided they have a unique ASH number; see the array_services(5) man page
Applications using the prctl() PR_ATTACHADDR option; see the prctl(2) man page
Applications using blockproc() and unblockproc(); see the blockproc(2) man page
The Power Fortran join synchronization accelerator; see the ccsync(7M) man page
R10000 counters; see the libperfex(3c) and perfex(1) man pages

Non-Checkpointable Objects

The following system objects are not checkpoint safe. See “Limitations and Caveats” in Chapter 3 for more complete coverage of unsupported system objects.

Network socket connections; see the socket(2) man page
X terminals and X11 client sessions
Special devices such as tape drives and CD-ROMs
Files opened with setuid credentials that cannot be reestablished
UNIX System V semaphores and messages (as opposed to System V shared memory); see the semop(2) and msgop(2) man pages

Troubleshooting

This section provides a guide to various error messages that could appear during checkpoint and restart operations, and what these messages might indicate.

Failure to Checkpoint

Checkpointing can fail for any of the reasons shown in Table 2-2.

Table 2-2. Checkpoint Failure Messages

Error Message	Problem Indicated
Permission denied	Search permission denied on a pathname component of statefile.
Resource busy	A resource required by the target process is in use by the system.
Checkpoint error	An uncheckpointable resource is associated with the target process.
File exists	The pathname designated by statefile already exists.
Invalid argument	An invalid argument was passed to a function call.
Too many symbolic links	A symbolic link loop occurred during pathname resolution.
No such file or directory	The pathname to statefile is nonexistent.
Not a directory	A component of the path prefix is not a directory.
Filename too long	The pathname to statefile exceeds the maximum length allowed.
No space left on device	Space remaining on disk is insufficient for the statefile.
Operation not permitted	The calling process does not have appropriate privileges.
Read-only file system	The requested statefile would reside on a read-only filesystem.
No such process	The process or process group specified by ID does not exist.
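
When a checkpoint fails inside a script, a first triage step is to sort the message into a broad category. The following is a minimal sketch; the function name classify_checkpoint_error and the category names are this example's own grouping of the messages in Table 2-2, not something CPR reports.

```shell
# Hypothetical helper: print a rough category for a checkpoint failure
# message given as $1. Messages are those listed in Table 2-2.
classify_checkpoint_error() {
    case "$1" in
        *"Permission denied"*|*"Operation not permitted"*)
            echo permission ;;
        *"No space left on device"*)
            echo space ;;
        *"File exists"*|*"No such file or directory"*|*"Not a directory"*|\
        *"Filename too long"*|*"Too many symbolic links"*|*"Read-only file system"*)
            echo path ;;
        *"No such process"*)
            echo process ;;
        *)
            echo other ;;
    esac
}

# Example:
#   classify_checkpoint_error "No space left on device"
```

A "path" result points at the statefile pathname, "space" at the target filesystem, and "permission" at the caller's privileges or directory permissions.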

Failure to Restart

Restart can fail for any of the reasons shown in Table 2-3.

Table 2-3. Restart Failure Messages

Error Message	Problem Indicated
Permission denied	Search permission denied on a pathname component of statefile.
Resource temporarily unavailable	Total number of processes for user exceeds system limit.
Checkpoint error	An unrestartable resource is associated with the target process.
Resource deadlock avoided	Attempted locking of a system resource would have resulted in a deadlock situation.
Invalid argument	An invalid argument was passed to the function call.
Too many symbolic links	A symbolic link loop occurred during pathname resolution.
Filename too long	The pathname to statefile exceeds the maximum length.
No such file or directory	The pathname to statefile is nonexistent.
Not enough space	Restarting the target process requires more memory than allowed by the hardware or by available swap space.
Not a directory	A component of the path prefix is not a directory.
Operation not permitted	The real user ID of the calling process does not match the real user ID of one or more processes recorded in the checkpoint, or the calling process does not have appropriate privileges to restart one or more of the target processes.
