Appendix B. Troubleshooting System Configuration Using System Error Messages

This appendix lists error messages you may receive from the IRIX operating system and points you to the parts of the documentation that describe the system functions most likely related to each message.

System errors are often accompanied by several error messages that together help you determine the actual problem. Read the section for each message and use your best judgment to decide which messages reflect the core problem and which are side effects of that problem on other parts of the system.

For error messages not covered in this guide or related IRIX administration guides, see the file /usr/include/sys/errno.h, the intro(2) man page, and the owner's guide for your system.

Some error messages are customized with specific information from your system configuration. In those cases, the message listed in this appendix either contains an ellipsis (...) in place of the system-specific information, or is introduced with a note that the message you receive may be similar to the listed message rather than an exact match.

Disk Space Messages

The following messages relate to standard disk operations, for example, messages indicating that you are low on or out of available disk space.

unix: <disk id>: Process ... ran out of disk space
unix: <disk id>: Process ... ran out of contiguous space

If the disk becomes completely full, you will not be able to create new files or write additional information to existing files. If the system disk becomes full, your system may not respond to commands.


Note: Do not shut down or restart the system until you free up some disk space. Without free disk space, the system may not be able to restart properly.

Free disk space in one or more of these ways:

  1. Empty your dumpster by choosing Empty Dumpster from the Desktop toolchest.

  2. Remove or archive old or large files or directories.

    • To find old or large files, double-click the launch icon to start the Search tool, then use its online help.

    • It's a good idea to search for files named core; these are often very large, and are created by an application when it encounters a problem.

    • If you remove the files from the desktop, empty your dumpster again.

    • To archive files (copy them onto a backup tape), use the Backup & Restore tool; start it by double-clicking the launch icon.

  3. If your system disk is almost full, check:

    • /var/tmp and /tmp are public directories that often become full; delete unwanted files or directories that you find here.

    • If the /var/adm/SYSLOG file seems very large (over 200 KB), remove all but a few lines of it; do not remove the entire file.

    • When the system has a serious failure (crash), it places information into two /var/adm/crash files: vmcore.number and unix.number. If you find files with these names, back them up to tape so you can give the files to your local support organization. Then remove the files from your system.

    • If you remove the files from the desktop, empty your dumpster again.

    • Each home directory can contain an mbox file. If these files are large, ask their owners to delete all but critical messages.

  4. Remove optional or application software; to start the Software Manager, choose Software Manager from the System toolchest.
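Some of the checks in the list above can also be made from a shell. The following is a minimal sketch, assuming a POSIX shell with the standard find and tail commands. The function names list_core_files and trim_log are hypothetical helpers written for this example; they are not IRIX commands.

```shell
# Hypothetical helper: list regular files named "core" under a directory
# so that you can review them before deleting anything.
list_core_files() {
    dir=${1:-.}
    # -type f skips directories that happen to be named "core"
    find "$dir" -type f -name core -print 2>/dev/null
}

# Hypothetical helper: keep only the last N lines of a log file.
# The log is rewritten in place rather than removed and re-created.
# A temporary copy is needed, so a little free space must be available;
# if the copy cannot be written, the log is left untouched.
trim_log() {
    log=$1
    keep=${2:-100}
    tmp="${log}.trim.$$"
    tail -n "$keep" "$log" > "$tmp" && cat "$tmp" > "$log"
    rm -f "$tmp"
}
```

For example, list_core_files /var/tmp prints any core files under /var/tmp, and trim_log /var/adm/SYSLOG 50 would keep the last 50 lines of the system log. Review the output of list_core_files before you remove any file.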

If you want to be notified at a different level of disk use (for example, to be notified when the disk is 90% full), a privileged user can follow the steps in Procedure B-1:

Procedure B-1. Disk Usage Notification

  1. Start the Disk Manager by double-clicking the launch icon.

  2. Click the button beneath the photo of the disk whose warning threshold you want to change (the system disk is labeled 0,1).

  3. Highlight the number in the Notify when field, type 90, then click OK.

Also, see “Disk Usage Commands” in Chapter 6.

General System Messages

The following messages indicate general system configuration issues that should be noted or attended to.

File Permission Issues

You may see the message:

unix: WARNING: inode 65535: illegal mode 177777

This message indicates that a program attempted to access a file to which it has no access permission.

Use the find command to identify the filename, permissions, and directory path of the file. Specify the inode number with the -inum option of find to locate the file. Once the file has been located, its permissions can be changed by its owner or by the superuser (root).
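The search by inode number might look like the following sketch. It assumes a POSIX shell; find_by_inode is a hypothetical wrapper name, and the starting directory is an argument you supply. Inode numbers are unique only within one filesystem, so start the search at the mount point of the filesystem you suspect.

```shell
# Hypothetical helper: print the path of every file under a directory
# that has the given inode number.
find_by_inode() {
    inum=$1
    dir=${2:-/}
    # Errors from unreadable directories are discarded to keep the output clean.
    find "$dir" -inum "$inum" -print 2>/dev/null
}
```

For the message shown above, find_by_inode 65535 / would search from the root directory. Running ls -l on each path that is printed shows the owner and permissions of the file.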

IP (Network) Address Issues

The following issues pertain to the basic configuration of your system for the network and the immediate networking hardware.

Default Internet Address

You may see the message:

unix: network: WARNING IRIS'S Internet address is the default

The IP address is not set on this system. To set the IP address, see “Setting the Network Address” in Chapter 4.

Duplicate IP Address

You may see the messages:

unix: arp: host with ethernet address ... is using my IP address ...
unix: arp: host with ethernet address ... is still using my IP address ...

Your system's IP address is the same as another system's address. Each system on the network must have a unique IP address so the network can send information to the correct location. Typically an address conflict occurs when a new system is added to the network and the system administrator assigns an IP address that is already in use.

Verify that your system is using the correct IP address, then contact the owner of the other system to determine which system needs to change its address.

Ethernet Cable Issues

You may see the following messages:

unix: ec0: no carrier: check Ethernet cable

This message indicates that the Ethernet cable has become loose, or some connection has been lost. This message may appear if the other systems on the network have been shut down and turned off.

unix: ec0: late xmit collision

This message indicates that the Ethernet interface detected a transmission from another system on the network that fell outside the transmission timing allowed by the Ethernet standard.

Late collisions are most commonly caused by a network that is configured outside the limits of the network specification, or by failing hardware:

  • Long AUI transceiver cables. The length of the AUI cable should not exceed 50 meters.

  • New or recently added network segments that extend the network's total length.

  • Faulty or failing transceivers.

  • Faulty or failing Ethernet controllers.

If the problem persists, contact your network administrator or your local support provider.

root Filesystem Not Found

If your root filesystem is not found at boot time, check whether an incorrect ROOT variable is set in the PROM. If it is, enter unsetenv ROOT at the monitor prompt and then reboot.

login and su Issues

These messages are typically informational. They appear when another user attempts to log on to the system or use another account and the attempt either fails or succeeds.

login Messages

You may see the message:

login: failed: <user> as <user>

The login failures indicate that the user did not specify the correct password or misspelled the login name. The /var/adm/SYSLOG file contains the hostname that the user attempted to log in from and the account (username) the user attempted to log in to on the local system.

su Messages

You may see the message:

su: failed: ttyq# changing from <user> to root 

The su messages are typically informational. They appear when a user attempts to switch user accounts. Typically users are attempting to become root (superuser) when this message appears.

The su failures indicate that the user did not specify the correct password or misspelled the login name. The /var/adm/SYSLOG file contains the hostname, tty port number (ttyq#), the name of the user that attempted to perform the su command, and the account the user attempted to use.
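To review failed login and su attempts together, you can filter the system log for both message types. The following is a minimal sketch, assuming a POSIX shell and grep; failed_auth_lines is a hypothetical function name, and the default log path is the /var/adm/SYSLOG file named above.

```shell
# Hypothetical helper: print every failed login or su attempt in a log file.
failed_auth_lines() {
    log=${1:-/var/adm/SYSLOG}
    # Each -e adds one fixed pattern; a line matching either one is printed.
    grep -e 'login: failed' -e 'su: failed' "$log"
}
```

The function exits with a nonzero status when the log contains no failures, which makes it easy to use in a script that only reports when something is found.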

Network Bootup Issues

This message indicates that the bootp program was remotely invoked from your system:

bootp: reply boot file /dev/null 

Usually, the filename given after the bootp command names a file containing code that can start up a remote system. This startup file can be used to restart a diskless system, boot an installation program (sa), boot a system into sash, or boot X-terminals. The bootp program handles communication between your system and the remote network device, and lets the remote device read the startup file across the network.

Operating System Rebuild Issues

You may see the following message:

lboot: Automatically reconfiguring the operating system

This informational message indicates that there has been a change to one or more of the operating system files or to the system hardware since the system last restarted. The system may automatically build a new kernel to incorporate the changes, and the changes should take place once the system has been rebooted.

The operating system file changes that can cause this message to be displayed include installing additional software that requires kernel modifications and additions or changes in some kernel tunable parameters. If this message appears every time the system boots, then check the date on one of the operating system files. The date on the file may have been set to a date in the future.
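One way to look for files dated in the future is to compare modification times against a file created at the current moment. The following is a minimal sketch, assuming a POSIX shell and find; future_files is a hypothetical function name, and the directory to search is an argument you supply (this appendix does not name the directory that holds the operating system files).

```shell
# Hypothetical helper: list regular files under a directory whose
# modification time is later than the current time.
future_files() {
    dir=$1
    ref="${TMPDIR:-/tmp}/now.$$"
    : > "$ref"                          # reference file stamped with the current time
    find "$dir" -type f -newer "$ref" -print
    rm -f "$ref"
}
```

Any file that the function prints has a timestamp ahead of the system clock. Before changing file dates, confirm that the system clock itself is correct, because a clock that is set in the past produces the same symptom.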

Power Failure Detected

You may see the following message:

unix: WARNING: Power failure detected

This message indicates that the system has detected that the AC input voltage has fallen below an acceptable level. This is an informational message that is logged to /var/adm/SYSLOG.

Although this is an informational message, it's a good idea to check all of the AC outlets and connections, and check the system components for disk drive failures or overheated boards. On Challenge and Onyx systems, the system controller attempts to gracefully shut down the system; this includes stopping all processes and synchronizing the disk.

Redundant Power Supply Unit Failure Detected

You may see the following message:

MAINT-NEEDED: module xx MSC: Power Supply Redundancy is Lost

This message indicates that the system has detected a failure in a redundant power supply unit in module xx. The unit may have failed during operation or when installed. The system controller interrupt handler monitors the failure and sends this informational message to /var/adm/SYSLOG.

The failing redundant power supply unit must be repaired or replaced. Contact your support provider.

When the redundant power supply unit has been fixed, a subsequent information message is sent to /var/adm/SYSLOG:

module xx MSC: Power Supply Redundancy is Restored

SCSI Controller Reset

You may see messages similar to the following:

unix: wd93 SCSI Bus=0 ID=7 LUN=0: SCSI cmd=0x12 timeout after 10 sec. Resetting SCSI
unix: Integral SCSI bus ... reset

This message indicates that the SCSI controller has made an inquiry of the device (where the ID number is located) and it did not respond.

This message notifies you that the system encountered a problem accessing the SCSI device. Possible causes include:

  • The SCSI device that was being accessed does not support the type of inquiry that the controller has made.

  • There is a physical problem with the SCSI device or controller.

If this message continues to appear, look at the /var/adm/SYSLOG file and see if there are any messages that follow this one to help isolate or identify the problem, or contact your local support provider.
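To see the messages that follow each SCSI reset in the log, you can print every matching line together with the few lines after it. The following is a minimal sketch, assuming a POSIX shell and an awk that supports the -v option; lines_after_match is a hypothetical function name, and the default of five following lines is an arbitrary choice.

```shell
# Hypothetical helper: print each line that matches a pattern, plus the
# next N lines, so that follow-up messages can be read in context.
lines_after_match() {
    pattern=$1
    log=${2:-/var/adm/SYSLOG}
    n=${3:-5}
    awk -v pat="$pattern" -v n="$n" '
        $0 ~ pat { left = n + 1 }   # restart the window on every match
        left > 0 { print; left-- }
    ' "$log"
}
```

For example, lines_after_match 'Resetting SCSI' /var/adm/SYSLOG 5 prints each reset message and the five log lines that follow it.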

syslogd Daemon Issues

You may see the following message:

syslogd: restart

The syslogd messages are typically informational only. The messages indicate the start and stop of the syslogd daemon. These messages are written to /var/adm/SYSLOG when the system is shut down or rebooted.

System Clock and Date Issues

There are messages that inform you of system clock and date issues.

System Off Message

You may see this message:

unix: WARNING: clock gained ... days

This is an informational message that indicates that the system has been physically turned off for x number of days (where x is indicated by the message found in /var/adm/SYSLOG).

To correct this problem, you should reset the system date and time. For more information on setting the system time, see the date(1) man page and “Changing the System Date and Time” in Chapter 4 of this guide.

Time and Date Messages

You may see this message:

unix: WARNING: CHECK AND RESET THE DATE!

This message is typically preceded by several different types of time and date messages. Some of the messages are informational, and others indicate a problem with the system date, time, or hardware. Check the log file /var/adm/SYSLOG for other clock, date, or time-related problems. If you do not see any other date, time, or clock messages, try setting the verbose option of chkconfig on.

For more information on setting the date and time, see the date(1) man page and “Changing the System Date and Time” in Chapter 4 of this guide. For chkconfig options, refer to the chkconfig(1M) man page.

Time Server Daemon Message

timed: slave to gate-host-name

The time server daemon (timed) logs informational entries into /var/adm/SYSLOG. No action is required by the user. The timed daemon will automatically perform the necessary adjustments. Refer to the timed(1M) man page for more information about the time server daemon.

System Restarting Information

You may see the following messages:

INFO: The system is shutting down.
INFO: Please wait.
syslogd: going down on signal 15
syslogd: restart
unix: [IRIX Release ...
unix: Copyright 1987-2000 Silicon Graphics, Inc
unix: All Rights Reserved.

The messages logged during system startup contain information about the operating system environment that the system is using. The startup messages include the version of the IRIX operating system that is loaded on the system, and Silicon Graphics copyright information. The operating system version information can be helpful to support providers when you report any problems that the system may encounter.

The messages that are logged during system shutdown are also sent to /var/adm/SYSLOG. These are informational messages that are broadcast to all users who are logged onto the system and the system console. There is no action required.

Trap Held or Ignored

You may see these messages:

unix: WARNING: Process ... generated trap, but has signal 11 held or ignored
unix: Process ... generated trap, but has signal 11 held or ignored

This message indicates that the process is in an infinite loop, so the signal/trap message that followed was held or ignored.

This message is usually caused by a temporary out-of-resources condition or a bug in the program. If it is a resource issue, you should be able to execute the program again without seeing this message again. If the message reappears after executing the program, you might have encountered a bug in the program.

Memory and Swap Messages

The following messages deal with system memory and swap space, and with the way the system manages these resources.

growreg Insufficient Memory

You may see this message:

unix: growreg - temporarily insufficient memory to expand region

This message indicates that there is not enough memory (both real and virtual) available for use by programs running on your system. There is no memory available to start any new processes or programs.

If this message continues to appear, you can correct the problem as directed in the troubleshooting section titled “Swapping and Paging Messages”.

Panic Page Free

You may see this message:

PANIC: panic pagefree: page use 0 

This indicates that the system thinks that an address of a page of memory is out of the legal range of values, or that the system is trying to free a page of memory that is already marked as free.

This panic message results in the system halting immediately and ungracefully. When the system halts, it attempts to save the contents of the kernel stack and memory usage information in a crash file. The page free panic is usually caused by a physical memory problem or possible disk corruption. If this message occurs again, contact your local support provider.

Physical Memory Problems

You may see a message similar to this:

unix: CPU Error/Addr 0x301 <RD PAR>: 0xbd59838 

Your system contains several modular banks of random-access memory; each bank contains a SIMM. One or more of these SIMMs is either loose or faulty. You must correct the problem so that your system and software applications can run reliably.

Follow the steps in Procedure B-2:

Procedure B-2. SIMM Checklist

  1. Shut down the system.

  2. Refer to your owner's guide.

    • If it shows you how to visually identify a loose SIMM and the SIMM is loose, reseat it. If the SIMM is not loose, you may need to replace it. Contact your local support organization.

    • If your owner's guide does not contain information about SIMMs, contact your local support organization.

Recoverable Memory Errors

The following are informational messages. They indicate that the hardware detected a memory parity error and was able to recover from the parity condition. No action is required unless the frequency of this message increases. Please note the hexadecimal number, which represents the memory location in a SIMM.

unix: Recoverable parity error at or near physical address 0x9562f68 <0x308>, Data: 0x8fbf001c/0x87b00014

This message indicates that the system tried to read a program's allotted memory space and an error was returned. The error usually indicates a memory parity error.

unix: Recoverable memory parity error detected by CPU at 0x9cc4960 <0x304> code:30

This is an informational message that indicates that the central processing unit (CPU) detected a memory parity error and is reporting it to /var/adm/SYSLOG. No action is required unless the frequency of this message increases. Please note the hexadecimal number, which represents the memory location in a SIMM.

unix: Recoverable memory parity error corrected by CPU at 0x9cc4960 <0x304> code:30

This is an informational message that indicates that the central processing unit (CPU) detected a memory parity error and was able to recover from the parity condition. No action is required unless the frequency of this message increases. Please note the hexadecimal number, which represents the memory location in a SIMM.
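Because these messages matter only when they become more frequent, it can help to count how often each memory location is reported. The following is a minimal sketch, assuming a POSIX shell and awk, and assuming the log lines have the same layout as the three examples above; parity_error_counts is a hypothetical function name.

```shell
# Hypothetical helper: count recoverable parity errors per physical address.
# Output is one "count address" pair per line, most frequent first.
parity_error_counts() {
    log=${1:-/var/adm/SYSLOG}
    awk '/Recoverable (memory )?parity error/ {
        # The first field that is a bare hexadecimal number is taken
        # to be the physical address.
        for (i = 1; i <= NF; i++)
            if ($i ~ /^0x[0-9a-fA-F]+$/) { count[$i]++; break }
    }
    END { for (a in count) print count[a], a }' "$log" | sort -rn
}
```

An address whose count keeps growing between runs points to a SIMM that may need the checks described in “Physical Memory Problems”.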

savecore I/O Error

You may see this message:

savecore: read: I/O Error

This message indicates that when the system rebooted after a system crash, the program savecore was not able to read /dev/swap in order to save the memory core image at the time of the crash.

After the system restarts, the savecore program runs with superuser permissions. If savecore was not able to read the memory core image (/dev/swap), you may have disk problems within the swap partition. This message might be followed by disk error messages. If the problem persists, contact your local support provider.

Swapping and Paging Messages

Swapping and paging are the methods by which the operating system manages limited memory resources in order to accommodate multiple processes and system activities. In general, the operating system keeps in RAM only those portions of the running programs that are currently or recently in use. As new sections of programs are needed, they are paged (or “faulted”) into memory, and as old sections are no longer needed, they are paged out.

Swapping is similar to paging, except that entire processes, rather than individual memory pages, are swapped out. The system maintains a section of hard disk for swapping. If this space is filled, no further programs can be swapped out, and thus no further programs can be created.

The following messages may indicate a swapping or paging problem:

Swap out failed on logical swap

For some reason, the operating system was unable to write the program to the swap portion of the disk. No action is necessary as the process is still in memory. See “Swap Space” in Chapter 6.

Paging Daemon (vhand) not running - NFS server down?

The system has determined that vhand is not executing, possibly because it is waiting for an I/O transfer to an NFS server to complete (especially if the NFS filesystem is hard mounted). No action should be necessary, as the system will restart vhand when needed.

bad page in process (vfault)

The page being faulted into memory is not a recognized type. The recognized types are demand fill, demand zero, in file, or on swap. Reboot your system and if the error persists, check your application and your disk.

unix: WARNING: Process ... killed due to bad page read

The page being faulted into memory could not be read for some reason and the process was killed. Restart the program or reboot the system, and if the error persists, check your application and your disk.

unix: Process ... killed due to insufficient memory/swap

Your system uses a combination of physical memory (RAM) and a small portion of the disk (swap space) to run applications. Your system does not have enough memory and swap space at this time. It had to stop a program from running to free up some memory.

unix: ... out of logical swap space during fork - see swap(1M)

Your system does not have enough memory and swap space at this time. It could not start a new process.

If you run out of swap space or memory frequently, you should take the steps in Procedure B-3:

Procedure B-3. Not Enough Memory and Swap Space Checklist

  1. Exit from applications when you are not using them. Remember to check all your desks.

  2. Order additional memory from your local support or sales organization.

  3. Turn on virtual swap space. Refer to the swap(1M) man page and “Swap Space” in Chapter 6 first.

    The administrator should log in as root and enter the command:

    chkconfig

    If the chkconfig listing shows a line that says vswap off, give the commands:

    chkconfig vswap on

    /etc/init.d/swap start

    If vswap was already on, go on to the next step.

  4. Create a file that the system can use for additional swap space. Note that this decreases your available disk space by the size of the file. If you create a 10 MB swap file, you will no longer have access to that 10 MB of disk space.

    To create a 10 MB swap file, the administrator should log in as root and enter these commands:

    mkdir -p /var/swap

    /usr/sbin/mkfile 10m /var/swap/swap1

    /sbin/swap -a /var/swap/swap1

    To make this permanent, so you have the swap space available every time you restart the system, add this line to the /etc/fstab file:

    /var/swap/swap1 swap swap pri=3

    For more information, see the swap(1M) man page or “Swap Space” in Chapter 6.

  5. You can permanently increase swap space by repartitioning the disk. You can find instructions to do this in IRIX Admin: Disks and Filesystems.

Other Memory Messages

You may see the following error messages or similar messages from time to time:

unix: Memory Parity Error in SIMM S9 (Bank 0, Byte 2)

The CPU detected a memory parity error in the listed SIMM. A parity error indicates that some or all of the individual memory bits may have been incorrectly read or written. Note the SIMM information and reboot the system. If the same SIMM shows repeated errors, check the SIMM as described in “Physical Memory Problems”.

unix: Process...sent SIGBUS due to Memory Error in SIMM S9

Note the SIMM information and reboot the system. If the same SIMM shows repeated errors, check the SIMM as described in “Physical Memory Problems”.

Ran out of action blocks

A resource used by the multiprocessor kernel for inter-CPU interrupts has run out. If this happens frequently, use the systune(1M) command to increase the value of the nactions parameter as described in the section titled “Operating System Tuning” in Chapter 10.

mfree map overflow

Fragmentation of some resource (such as message queues) contributed to the loss of some of the resource. No action is necessary.

System Panic Messages

The following messages indicate problems that should be resolved by rebooting the system. You should not be overly concerned about these instances unless they become frequent. The kernel generates other PANIC messages that are not listed here. Follow these instructions for all PANIC messages.

bumprcnt - region count list overflow

This message indicates an unresolvable problem with the operating system. Reboot your system.

PANIC: kernel fault

This message indicates that the kernel experienced an unresolvable problem and shut itself down. By the time you see this message in the system message log, you will have rebooted the system. Write the message down exactly in your system log book for reference in case it happens again.

The system is said to have panicked if the software or hardware reached a state where it could no longer operate. If the system fails again, or if you receive an unusually large number of error messages, please contact your local support provider. It helps the support provider if you save this information:

  • If there are any files in the /var/adm/crash directory, back them up to tape. Double-click the launch icon to start the Backup & Restore tool.

  • After you back up the files, you can remove them.

  • Check the System Log Viewer and save all the messages that you see. Double-click the launch icon to start the System Log Viewer.

  • Have you changed any kernel tunable parameters recently? If so, try resetting them to their former, default, or self-configuring settings. See “Operating System Tuning” in Chapter 10.
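Before you back up the crash files mentioned in the first item above, you can list them from a shell. The following is a minimal sketch, assuming a POSIX shell and find, and assuming the vmcore.number and unix.number naming described in “Disk Space Messages”; list_crash_files is a hypothetical function name.

```shell
# Hypothetical helper: list saved crash files (vmcore.N and unix.N)
# in a crash directory, sorted by name.
list_crash_files() {
    dir=${1:-/var/adm/crash}
    # The parentheses group the two name tests so that -type f applies to both.
    find "$dir" -type f \( -name 'vmcore.*' -o -name 'unix.*' \) -print | sort
}
```

If the function prints nothing, there are no saved crash files to back up.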