Chapter 2. Administering Checkpoint and Restart

This chapter describes how to install and administer IRIX Checkpoint and Restart (CPR), and how to configure statefiles.

Responsibilities of the Administrator

The system administrator is responsible for the following CPR tasks:

  • Install CPR software on server systems as required

  • Help users employ CPR on server systems and workstations

  • Prevent statefiles from filling up available disk space

  • Delete, or encourage users to delete, unneeded old statefiles

Installing CPR

The subsystems that make up CPR are listed in Table 2-1.

Table 2-1. CPR Product Subsystems

Subsystem Name    Contents
eoe.sw.cpr        Checkpoint and restart software
eoe.man.cpr       CPR reference manual pages
eoe.books.cpr     This guide as an IRIS InSight document

If CPR is not already installed, follow this procedure to install the software:

  1. Load the IRIX software distribution CD-ROM.

  2. On the server, become superuser and invoke the inst command, specifying the location of the CD-ROM software distribution:

    $ /bin/su - 
    Password:
    # inst -f /CDROM/dist 
    

  3. Prevent installation of all default subsystems using the keep subcommand:

    Inst> keep *
    

    For additional information on inst, see the IRIX Admin: Software Installation and Licensing Guide, or the inst(1M) man page.

  4. Make subsystem selections. To install CPR software, the man pages, and the CPR manuals for IRIS InSight, enter the following commands:

    Inst> install eoe.*.cpr 
    Inst> list i
    Inst> go 
    

    The list subcommand with the i argument displays all the subsystems marked for installation. The go subcommand starts installation, which takes some time.

    For additional information on available subsystems, see the IRIX Release Notes.

  5. Ensure that the following line exists in the /var/sysgen/system/irix.sm file (change cprstub to cpr if necessary):

    USE: cpr 
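The check in step 5 can be scripted. The following is a minimal sketch for a POSIX shell; the function name check_cpr_use is hypothetical, and the script assumes that the relevant line starts with USE: and is not commented out.

```shell
# Report whether a system file enables the CPR kernel module.
# The default path is the one named in step 5; pass another path to check a copy.
check_cpr_use() {
    sm_file=${1:-/var/sysgen/system/irix.sm}
    if grep -q '^[[:space:]]*USE:[[:space:]]*cpr[[:space:]]*$' "$sm_file"; then
        echo "cpr enabled"
    elif grep -q '^[[:space:]]*USE:[[:space:]]*cprstub[[:space:]]*$' "$sm_file"; then
        # The stub is configured instead of the real module.
        echo "cprstub found: change cprstub to cpr"
        return 1
    else
        echo "no USE: cpr line found"
        return 1
    fi
}
```

Calling check_cpr_use with no argument inspects /var/sysgen/system/irix.sm; a nonzero exit status means the file needs editing.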
    

Managing Checkpoint Images

Because of their potential size and longevity, checkpoint images (statefiles) are one aspect of CPR where intervention by the system administrator may be required.

Statefile Location and Content

A statefile can reside anywhere on a filesystem where the user has write permission, provided there is enough disk space to store it. Statefiles tend to be slightly larger than the processes they checkpoint.
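Because a statefile needs at least as much disk space as the process it captures, it can help to check free space before checkpointing. The sketch below is one possible approach, not part of CPR: the function name has_space_for_statefile is hypothetical, the required size is an estimate supplied by the caller, and the script assumes that df accepts the POSIX -k and -P options.

```shell
# Succeed if the filesystem holding directory $1 has at least $2 KB available.
has_space_for_statefile() {
    dir=$1
    needed_kb=$2
    # With -P, df prints a header plus one line per filesystem;
    # field 4 of that line is the available space in KB (because of -k).
    avail_kb=$(df -k -P "$dir" | awk 'NR == 2 { print $4 }')
    [ -n "$avail_kb" ] &&
        awk -v a="$avail_kb" -v n="$needed_kb" 'BEGIN { exit !(a + 0 >= n + 0) }'
}
```

For example, has_space_for_statefile "$HOME" 200000 || echo "not enough space" warns when fewer than 200000 KB are available in the filesystem that holds the home directory.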

As the system administrator, you might want to create a policy saying that checkpoint images stored in temporary directories (such as /tmp or /var/spool) are not guaranteed to remain there. If users want to preserve a statefile indefinitely, they should place it in a permanent directory that they own themselves, such as their home directory.

Checkpoint images contain much information about a process, including process set IDs, copies of user data and stack memory, kernel execution states, signal vectors, a list of open files and devices, pipeline setup, shared memory, array job states, and so on.

Monitoring a Checkpoint

To obtain information about a statefile directory, run the cpr command with the -i option:

$ cpr -i statefile ... 

The command displays the statefile revision number, process names, credential information for the processes, the current working directory, open file information, the time when the checkpoint was taken, and so forth.

There is no automated way to tell whether a user has restarted from a statefile; you need to ask the owner.

Removing Statefiles

First check with the checkpoint owner to request that they remove unneeded statefiles. If there is no reply, and checkpoints are overflowing disk space, look for the oldest statefiles, especially ones in a series, as the best candidates for removal.
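One way to find old candidates is to search by modification time. The sketch below assumes a site convention in which statefile directories share a recognizable name; the default pattern *.ckpt and the function name old_statefiles are hypothetical. A directory's modification time changes whenever entries are added or removed, so treat the result as an approximation of checkpoint age.

```shell
# List directories under $1 whose names match pattern $3 (default '*.ckpt',
# a hypothetical naming convention) and that were last modified more than
# $2 days ago.
old_statefiles() {
    find "$1" -type d -name "${3:-*.ckpt}" -mtime +"$2" -print
}
```

For example, old_statefiles /var/tmp 30 lists matching directories under /var/tmp that have not changed in more than 30 days. Review the list with the owners before removing anything.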

To delete an entire statefile directory, run the cpr command with the -D option:

$ cpr -D statefile ... 

Only the checkpoint owner and the superuser may delete a statefile. A deleted checkpoint cannot be restarted unless the statefile is restored from backup.

Disabling User Checkpoints

If you want to restrict user access to CPR, or if some users abuse the facility by leaving large statefile directories around, follow this procedure:

  1. Create a “cpr” group in the CPR server's /etc/group file, listing the users who should have access to CPR.

    cpr::100:user1,user2,user3,user4,user5,user6 
    

  2. Change the group of the cpr command to “cpr” and set its mode to 4750.

    # chgrp cpr /usr/sbin/cpr 
    # chmod 4750 /usr/sbin/cpr 
    

To disable CPR temporarily, set the mode of the /usr/sbin/cpr command to 000. To shut off CPR permanently, use the inst command to remove the eoe.sw.cpr subsystem.
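After restricting access, you may want to confirm which users the “cpr” group entry admits. The following sketch parses a file in /etc/group format; the function name user_may_checkpoint is hypothetical, and the check looks only at the member list, not at a user's primary group in /etc/passwd.

```shell
# Succeed if user $1 appears in the member list of group "cpr"
# in group file $2 (default /etc/group).
user_may_checkpoint() {
    awk -F: -v user="$1" '
        $1 == "cpr" {
            # Field 4 holds the comma-separated member list.
            n = split($4, members, ",")
            for (i = 1; i <= n; i++)
                if (members[i] == user) found = 1
        }
        END { exit !found }
    ' "${2:-/etc/group}"
}
```

For example, user_may_checkpoint user3 && echo "user3 can run cpr" prints the message only if user3 is listed in the group entry.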

Checkpointable Objects

The following system objects are checkpoint safe. See “Checkpoint-Safe Objects” in Chapter 3 for complete coverage of checkpoint safety.

  • UNIX processes, process groups, terminal control sessions, IRIX array sessions, process hierarchies, sproc() groups (see the sproc(2) man page), and arbitrary process sets

  • All user memory areas, including user stack and data regions

  • System states, including process and user information, signal disposition and signal mask, scheduling information, owner credentials, accounting data, resource limits, current directory, root directory, locked memory, and user semaphores

  • System calls, if applications handle return values and error numbers correctly, although slow system calls may return partial results

  • Undelivered and queued signals, which are saved at checkpoint and delivered at restart

  • Open files (including NFS-mounted files), mapped files, file locks, and inherited file descriptors; this includes open pipes with pipeline data

  • Special files /dev/tty, /dev/console, /dev/zero, /dev/null, and ccsync(7M)

  • UNIX System V shared memory (but the original shared memory ID is not restored); see the shmop(2) man page

  • IRIX jobs; see the job_limits(5) man page

  • Jobs started with CHALLENGEarray services, provided they have a unique array session handle (ASH) number; see the array_services(5) man page

  • Applications using the prctl() PR_ATTACHADDR option; see the prctl(2) man page

  • Applications using blockproc() and unblockproc(); see the blockproc(2) man page

  • The Power Fortran join synchronization accelerator; see the ccsync(7M) man page

  • R10000 counters; see the libperfex(3c) and perfex(1) man pages

Non-Checkpointable Objects

The following system objects are not checkpoint safe. See “Limitations and Caveats” in Chapter 3 for more complete coverage of unsupported system objects.

  • Network socket connections; see the socket(2) man page

  • X terminals and X11 client sessions

  • Special devices such as tape drives and CD-ROMs

  • Files opened with setuid credentials that cannot be reestablished

  • UNIX System V semaphores and messages (as opposed to System V shared memory); see the semop(2) and msgop(2) man pages

Troubleshooting

This section provides a guide to various error messages that could appear during checkpoint and restart operations, and what these messages might indicate.

Failure to Checkpoint

Checkpointing can fail for any of the reasons shown in Table 2-2.

Table 2-2. Checkpoint Failure Messages

Error Message               Problem Indicated
Permission denied           Search permission denied on a pathname component of statefile.
Resource busy               A resource required by the target process is in use by the system.
Checkpoint error            An uncheckpointable resource is associated with the target process.
File exists                 The pathname designated by statefile already exists.
Invalid argument            An invalid argument was passed to a function call.
Too many symbolic links     A symbolic link loop occurred during pathname resolution.
No such file or directory   The pathname to statefile is nonexistent.
Not a directory             A component of the path prefix is not a directory.
Filename too long           The pathname to statefile exceeds the maximum length allowed.
No space left on device     Space remaining on disk is insufficient for the statefile.
Operation not permitted     The calling process does not have appropriate privileges.
Read-only file system       The requested statefile would reside on a read-only filesystem.
No such process             The process or process group specified by ID does not exist.
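Several of these failures depend only on the statefile pathname, so they can be anticipated before a checkpoint is attempted. The sketch below is a hypothetical helper (preflight_statefile is not part of CPR) for a POSIX shell. It covers only four path-related rows of Table 2-2 and assumes that the shell's file tests reflect what the checkpoint would encounter.

```shell
# Print the Table 2-2 message that a checkpoint to path $1 would probably
# produce, or "ok" if none of the path checks below fails.
preflight_statefile() {
    statefile=$1
    parent=$(dirname "$statefile")
    if [ -e "$statefile" ]; then
        msg="File exists"
    elif [ ! -e "$parent" ]; then
        msg="No such file or directory"
    elif [ ! -d "$parent" ]; then
        msg="Not a directory"
    elif [ ! -x "$parent" ]; then
        # Search permission on a directory is the execute bit.
        msg="Permission denied"
    else
        echo "ok"
        return 0
    fi
    echo "$msg"
    return 1
}
```

A result of "ok" does not guarantee success: the remaining rows of the table, such as insufficient disk space or an uncheckpointable resource, are not tested here.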


Failure to Restart

Restart can fail for any of the reasons shown in Table 2-3.

Table 2-3. Restart Failure Messages

Error Message                     Problem Indicated
Permission denied                 Search permission denied on a path component of statefile.
Resource temporarily unavailable  Total number of processes for user exceeds system limit.
Checkpoint error                  An unrestartable resource is associated with target process.
Resource deadlock avoided         Attempted locking of a system resource would have resulted in a deadlock situation.
Invalid argument                  An invalid argument was passed to the function call.
Too many symbolic links           A symbolic link loop occurred during pathname resolution.
Filename too long                 The pathname to statefile exceeds the maximum length.
No such file or directory         The pathname to statefile is nonexistent.
Not enough space                  Restarting the target process requires more memory than allowed by the hardware or by available swap space.
Not a directory                   A component of the path prefix is not a directory.
Operation not permitted           The real user ID of the calling process does not match the real user ID of one or more processes recorded in the checkpoint, or the calling process does not have appropriate privileges to restart one or more of the target processes.