Chapter 5. Having Trouble?

This chapter contains hardware-specific information that can be helpful if you are having trouble with your Onyx rackmount graphics workstation.

Maintaining Your Hardware and Software

This section gives you some basic guidelines to follow to keep your hardware and software in good working order.

Hardware Do's and Don'ts

To keep your system in good running order, follow these guidelines:

  • Do not enclose the system in a small, poorly ventilated area (such as a closet), crowd other large objects around it, or drape anything (such as a jacket or blanket) over the system chassis.

  • Do not place terminals on top of the system chassis.

  • Do not connect cables or add other hardware components while the system is turned on.

  • Do not power off the system frequently; leave it running over nights and weekends, if possible.

  • Do not leave the key switch in the Manager position.

  • Do not place liquids, food, or heavy objects on the system, terminal, or keyboard.

  • Degauss the monitor every few days by pressing the Degauss button on the front of the monitor.

  • Ensure that all cables are plugged in completely.

  • Ensure that the system has power surge protection.

  • Route all external cables away from foot traffic.

Software Do's and Don'ts

When your system is up and running, follow these guidelines:

  • Do not turn off power to a system that is currently running software.

  • Do not use the root account unless you are performing administrative tasks.

  • Make regular backups (weekly for the whole system, nightly for individual users) of all information.

  • Protect all accounts with a password. Refer to the IRIX Admin: System Configuration and Operation manual for information about installing a root password.

System Behavior

The behavior of a system that is not working correctly falls into three broad categories:

Operational 

You can log in to the system, but it doesn't respond as usual. For example, the screen looks strange or the windows don't respond to input from the mouse or keyboard.

Marginal 

You are not able to start up the system fully, but you can reach the System Maintenance menu or PROM Monitor.

Faulty 

You cannot reach the System Maintenance menu or PROM Monitor.

If the behavior of your system is operational or marginal, first check for error messages on the System Controller display, then perform a physical inspection using the checklist in the following section. If all of the connections seem solid, restart the system. If the problem persists, run the diagnostic tests from the System Maintenance menu or PROM Monitor. See your IRIX Admin: System Configuration and Operation manual for more information about diagnostic tests.

If your system is faulty, turn the power to the main unit off and on. If this does not help, contact your system administrator.
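The three categories and the first response for each can be summarized as a short decision sketch. This is only an illustration of the text above; the names `Behavior`, `classify`, and `next_steps` are hypothetical and are not part of IRIX or the System Controller firmware.

```python
from enum import Enum


class Behavior(Enum):
    OPERATIONAL = "operational"
    MARGINAL = "marginal"
    FAULTY = "faulty"


def classify(can_log_in, can_reach_maintenance_menu):
    """Map two observations onto the three categories described above."""
    if can_log_in:
        # You can log in, but the system does not respond as usual.
        return Behavior.OPERATIONAL
    if can_reach_maintenance_menu:
        # Startup is incomplete, but the System Maintenance menu
        # or PROM Monitor is reachable.
        return Behavior.MARGINAL
    return Behavior.FAULTY


def next_steps(behavior):
    """Return the recommended actions, in order, for a category."""
    if behavior is Behavior.FAULTY:
        return [
            "Turn the power to the main unit off and on.",
            "If this does not help, contact your system administrator.",
        ]
    return [
        "Check the System Controller display for error messages.",
        "Perform the physical inspection checklist.",
        "If all connections seem solid, restart the system.",
        "If the problem persists, run the diagnostic tests.",
    ]


if __name__ == "__main__":
    for step in next_steps(classify(False, True)):
        print(step)
```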

Physical Inspection Checklist

Check every item on this list:

  • Make sure the monitor and main unit power switches are turned on.

  • If the system has power, check the System Controller display for any messages, then reset the system.

  • Make sure the mouse is connected and is on the mouse pad.

Before you continue, shut down the system and turn off the power.

Verify all of these cable connections:

  • The video cable is connected securely to the rear of the monitor and to the appropriate connector on the graphics I/O panel.

  • The power cable is securely connected to the monitor or terminal at one end and to the power source at the other end.

  • The keyboard cable is securely connected to the keyboard at one end and to the terminal at the other end.

  • The mouse cable is securely connected to the keyboard.

  • The system power cable is securely installed in the receptacle in the system chassis and in the proper AC outlet.

  • The network cable is connected to the appropriate port. The key or lock used to secure the network connection is engaged.

  • Serial port cables are securely installed in their corresponding connectors.

When you finish checking the hardware connections, turn on the power to the main unit and then to the terminal; then reboot the system. If your system continues to fail, restore the system software and files using the procedures described in the IRIX Admin: Software Installation and Licensing manual. If the system fails to respond at all, call your system administrator or service organization for assistance.

Using the System Controller

This section explains several ways to use the System Controller to diagnose system faults. The operator-selectable functions are described, as well as some common faults and the symptoms they exhibit.

You can select one of four menus when the System Controller key switch is in the On (middle) position:

  • CPU Activity Display menu

  • Boot Status menu

  • Event History Log menu

  • Master CPU Selection menu

The CPU Activity Display

The CPU Activity Display is a histogram that represents the activity of each system processor as a vertically moving bar. This is the default display and appears continuously unless an error occurs or a function key is pressed.

The Boot Status Menu

The Boot Status menu monitors the current state of the system during the boot arbitration process.

Table 5-1 lists the messages that may appear in this menu.

Table 5-1. Boot Status Menu Messages

Master CPU Selection Message               Context and Meaning of Message

BOOT ARBITRATION NOT STARTED               The system CPU board(s) have not begun the arbitration process.

BOOT ARBITRATION IN PROCESS                The System Controller is searching for the bootmaster CPU processor.

ARBITRATION COMPLETE BOARD 0xZZ PROC 0xZZ  The chosen bootmaster CPU has identified itself to the System Controller and communication is fully established.

BOOT ARBITRATION INCOMPLETE NO MASTER      An error has occurred in the boot process and no bootmaster CPU is communicating with the System Controller.
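If you copy a Boot Status message down for later reference, a small lookup such as the following can pair it with its meaning from Table 5-1. This is a sketch only. It assumes that the ZZ placeholders stand for hexadecimal board and processor numbers written with a 0x prefix, and the names `BOOT_STATUS_MEANINGS` and `interpret_boot_status` are hypothetical.

```python
import re

# Fixed messages and their meanings, taken from Table 5-1.
BOOT_STATUS_MEANINGS = {
    "BOOT ARBITRATION NOT STARTED":
        "The system CPU board(s) have not begun the arbitration process.",
    "BOOT ARBITRATION IN PROCESS":
        "The System Controller is searching for the bootmaster CPU processor.",
    "BOOT ARBITRATION INCOMPLETE NO MASTER":
        "An error has occurred in the boot process and no bootmaster CPU "
        "is communicating with the System Controller.",
}

# Assumption: ZZ is a hexadecimal number with a 0x prefix.
_COMPLETE = re.compile(
    r"^ARBITRATION COMPLETE BOARD 0x([0-9A-Fa-f]+) PROC 0x([0-9A-Fa-f]+)$"
)


def interpret_boot_status(message):
    """Return a one-line explanation of a Boot Status menu message."""
    # Collapse line breaks and repeated spaces from the display.
    text = " ".join(message.split())
    match = _COMPLETE.match(text)
    if match:
        board, proc = (int(group, 16) for group in match.groups())
        return (
            f"Bootmaster is processor {proc:#x} on board {board:#x}; "
            "communication with the System Controller is established."
        )
    return BOOT_STATUS_MEANINGS.get(
        text, "Unrecognized message; write it down exactly as shown."
    )


if __name__ == "__main__":
    print(interpret_boot_status("ARBITRATION COMPLETE BOARD 0x03 PROC 0x02"))
```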


The Event History Log

The System Controller assigns space in its nonvolatile random access memory (NVRAM) for ten error and/or status messages. This space is referred to as the Event History Log.

If the system cannot completely boot, or if there are system problems, or if the system has shut down, check the System Controller display. The histogram in the display will have been replaced with one or more error messages from the Event History Log. Write down any error messages for use by your system administrator or by qualified service personnel. Refer to Appendix C for a complete listing of the possible error messages.


Note: When the system is rebooted, the System Controller will transmit the errors it has logged in NVRAM to the bootmaster CPU. They are then placed in /usr/adm/SYSLOG.


The Master CPU Selection Menu

The Master CPU Selection menu displays the last message sent by the Master CPU after the bootmaster arbitration process has completed. The four possible messages are identical to the Boot Status menu messages listed in Table 5-1.

The Power-On Process

During a normal power-on sequence, both the green power-on LED and the amber fault LED light. When the System Controller initializes and completes its internal diagnostics, the amber LED goes out.


Note: If the amber fault LED stays on for more than a few seconds, a fault message should appear. If the LED stays on and no message appears, the display may be faulty or there may be a problem with the System Controller. Contact your system administrator or service provider.

The following steps describe what you should see when you bring up the system:

  1. When the System Controller completes its internal checks and the system begins to come up, two boot messages appear:

    
    BOOT ARBITRATION IN PROCESS
    ARBITRATION COMPLETE BOARD 0xZZ PROC 0xZZ
    

    A flag message appears: Onyx C. 1996

  2. The screen clears and the message STARTING SYSTEM appears.

  3. A series of status messages scroll by. Most messages pass by so quickly that they are unreadable. These messages indicate the beginning or completion of a subsystem test.

  4. After all of the system checks are complete, you receive a status message that looks similar to:

    
    PROCESSOR STATUS
    B+++
    

The B+++ shown in step 4 indicates that the bootmaster CPU is active, along with three other functioning processors on the CPU board. If the CPU board has only two processors (the bootmaster and one slave processor), you see


PROCESSOR STATUS
B+

If you receive a processor status message followed by B+DD, you have a CPU board with two of its processors disabled. Contact your system administrator to determine why the processors were disabled.

If you receive a processor status message like B+-- or B+XX, the CPU board has defective processors on board. Make a note of the exact message and contact your service provider for help.
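The status strings described above can be read one character per processor. The following sketch decodes such a string and suggests the follow-up action from the text. It assumes that each character maps to exactly one processor and that both - and X mark a defective processor; the function names are hypothetical.

```python
# Meaning of each character in a PROCESSOR STATUS string.
# Assumption: one character per processor; "-" and "X" both
# indicate a defective processor.
SYMBOLS = {
    "B": "bootmaster",
    "+": "functioning",
    "D": "disabled",
    "-": "defective",
    "X": "defective",
}


def decode_processor_status(status):
    """Translate a string such as 'B+DD' into a list of processor states."""
    status = status.strip()
    unknown = sorted(set(status) - set(SYMBOLS))
    if unknown:
        raise ValueError(f"unrecognized status characters: {unknown}")
    return [SYMBOLS[ch] for ch in status]


def recommended_action(status):
    """Return the follow-up action this chapter gives for a status string."""
    states = decode_processor_status(status)
    if "defective" in states:
        return "Note the exact message and contact your service provider."
    if "disabled" in states:
        return ("Contact your system administrator to determine why "
                "the processors were disabled.")
    return "No action needed."


if __name__ == "__main__":
    for example in ("B+++", "B+", "B+DD", "B+XX"):
        print(example, decode_processor_status(example),
              recommended_action(example))
```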

If the System Hangs

If the system does not complete step 3 in the power-on process, an error message will appear and remain on the System Controller's display. Make a note of the exact message where the system stops, and contact your service provider.


Note: The message displayed on the System Controller display can provide the service person with valuable information.


If an Over-Temperature Error Occurs

If the system shuts down because an OVER TEMP condition occurs, the entire system powers down, including the System Controller. To find the fault, turn the key switch off and then on again. The display should show the origin of the OVER TEMP error. If the system immediately shuts down again, wait for several minutes to allow the mechanical temperature sensor switch to cool below its trip point.

Recovering from a System Crash

Your system might have crashed if it fails to boot or respond normally to input devices such as the keyboard. The most common form of system crash is terminal lockup—a situation where your system fails to accept any commands from the keyboard. Sometimes when a system crashes, data becomes damaged or lost.

Using the methods described in the following paragraphs, you can fix most problems that occur when a system crashes. You can prevent additional problems by recovering your system properly after a crash.

The following list presents a number of ways to recover your system from a crash. The simplest method, rebooting the system, is presented first. If it fails, go on to the next method, and so on. Here is an overview of the different crash recovery methods:

  • rebooting the system

    Rebooting usually fixes problems associated with a simple system crash.

  • restoring system software

    If you do not find a simple hardware connection problem and you cannot reboot the system, a system file might be damaged or missing. In this case, you need to copy system files from the installation tapes to your hard disk. Some site-specific information might be lost.

  • restoring from backup tapes

    If restoring system software fails to recover your system fully, you must restore from backup tapes. Complete and recent backup tapes contain copies of important files. Some user- and site-specific information might be lost. Read the following section for information on file restoration.
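The escalation described in the list above (try the simplest method first, then move to the next one only if it fails) can be sketched as follows. The `attempt` callback is hypothetical: it stands for you carrying out a method by hand and reporting whether the system recovered.

```python
# Recovery methods in the order given in this section.
RECOVERY_METHODS = [
    "reboot the system",
    "restore system software from the installation tapes",
    "restore from backup tapes",
]


def recover(attempt):
    """Try each method in order until one succeeds.

    attempt(method) must return True if the system recovered.
    Returns (successful_method_or_None, methods_tried).
    """
    tried = []
    for method in RECOVERY_METHODS:
        tried.append(method)
        if attempt(method):
            return method, tried
    return None, tried


if __name__ == "__main__":
    # Simulated outcome: rebooting fails, restoring system software works.
    outcome = {"reboot the system": False}
    method, tried = recover(lambda m: outcome.get(m, True))
    print("Recovered by:", method)
    print("Methods tried:", tried)
```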

Restoring a Filesystem From the System Maintenance Menu

If your root filesystem is damaged and your system cannot boot, you can restore your system from the System Maintenance Menu. This is the menu that appears when you interrupt the boot sequence before the operating system takes over the system. To perform this recovery, you need two different tapes: your system backup tape and a bootable tape with the miniroot.

If a backup tape is to be used with the System Recovery option of the System Maintenance Menu, it must have been created with the System Manager or with the Backup(1) command. It must also be a full system backup, beginning in the root directory (/) and containing all the files and directories on your system. Although the Backup command is a front-end interface to the bru(1) command, Backup also writes the disk volume header on the tape so that the “System Recovery” option can reconstruct the boot blocks, which are not written to the tape using other backup tools. For information on creating the system backup, see the IRIX Admin: Backup, Security, and Accounting manual.

If you do not have a full system backup made with the Backup command or System Manager—and your root or usr filesystems are so badly damaged that the operating system cannot boot—you have to reinstall your system.

If you need to reinstall the system to read your tapes, install a minimal system configuration and then read your full system backup (made with any backup tool you prefer) over the freshly installed software.

This procedure should restore your system to its former state.


Caution: Existing files of the same pathname on the disk are overwritten during a restore operation, even if they are more recent than the files on tape.

When you first start up your machine, you see the following prompt:

Starting up the system....
To perform system maintenance instead, press <Esc>

  1. Press the <Esc> key. You see the following menu:

    System Maintenance Menu
    1   Start System
    2   Install System Software
    3   Run Diagnostics
    4   Recover System
    5   Enter Command Monitor
    

  2. Enter the numeral 4 and press <Return>. You see the message:

    System Recovery...
    Press Esc to return to the menu.
    

    After a few moments, you see the message

    Insert the installation tape, then press <enter>: 
    

  3. Insert your bootable tape and press the <Enter> key. You see some messages while the miniroot is loaded. Next you see the message

    Copying installation program to disk....
    

    Several lines of dots appear on your screen while this copy takes place.

  4. You see the message

    CRASH RECOVERY
    You may type sh to get a shell prompt at most questions.
    Remote or local restore: ([r]emote, [l]ocal): [l]
    

  5. Press <Enter> for a local restoration. If your tape drive is on another system accessible by the network, press r and then the <Enter> key. You are prompted for the name of the remote host and the name of the tape device on that host. If you press <Enter> to select a local restoration, you see the message

    Enter the name of the tape device: [/dev/tape] 
    

    You may need to enter the exact device name of the tape device on your system, since the miniroot may not recognize the link to the convenient /dev/tape filename. As an example, if your tape drive is drive #2 on your integral SCSI bus (bus 0), the most likely device name is /dev/rmt/tps0d2nr. If it is drive #3, the device is /dev/rmt/tps0d3nr.

  6. The system prompts you to insert the backup tape. When the tape has been read back onto your system disk, you are prompted to reboot your system.
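The two device names given in step 5 suggest a simple naming pattern. The sketch below builds a name from the SCSI bus and drive numbers. It assumes the pattern generalizes beyond the two examples in the text, and the function name `tape_device_name` is hypothetical. Confirm the result against the device files that actually exist on your system.

```python
def tape_device_name(bus, drive):
    """Build a tape device name from SCSI bus and drive numbers.

    Assumption: the pattern /dev/rmt/tps<bus>d<drive>nr is inferred
    from the two examples in step 5 and may not cover every drive.
    """
    if bus < 0 or drive < 0:
        raise ValueError("bus and drive numbers must be non-negative")
    return f"/dev/rmt/tps{bus}d{drive}nr"


if __name__ == "__main__":
    print(tape_device_name(0, 2))
    print(tape_device_name(0, 3))
```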

Recovery After System Corruption

From time to time you may experience a system crash due to file corruption. Systems cease operating (“crash”) for a variety of reasons. Most common are software crashes, followed by power failures of some sort, and least common are actual hardware failures. Regardless of the type of system crash, if your system files are lost or corrupted, you may need to recover your system from backups to its pre-crash configuration.

Once you repair or replace any damaged hardware, you are ready to recover the system. Regardless of the nature of your crash, refer to the information in the section “Restoring a Filesystem From the System Maintenance Menu” in the IRIX Admin: Backup, Security, and Accounting manual.

The System Maintenance Menu recovery command is designed for use as a full backup system recovery. After you have done a full restore from your last complete backup, you may restore newer files from incremental backups at your convenience. This command is designed to be used with archives made using the Backup utility or through the System Manager. The System Manager is described in detail in the Personal System Administration Guide. System recovery from the System Maintenance Menu is not intended for use with the tar, cpio, dd, or dump utilities. You can use these other utilities after you have recovered your system.
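The order described above (the last complete backup first, then any newer incremental backups) can be worked out from a list of tapes. The following is a sketch only. The tuple layout, the tape labels, and the function name `restore_order` are hypothetical, and applying the incremental backups oldest-first is an assumption that this chapter does not state.

```python
from datetime import date


def restore_order(backups):
    """Return the tapes to restore, in order.

    backups is a list of (date, kind, label) tuples, where kind is
    "full" or "incremental". The most recent full backup comes first,
    followed by newer incremental backups, oldest first (an assumption).
    """
    fulls = [b for b in backups if b[1] == "full"]
    if not fulls:
        raise ValueError("System Recovery requires a full system backup")
    last_full = max(fulls, key=lambda b: b[0])
    newer = sorted(
        b for b in backups
        if b[1] == "incremental" and b[0] > last_full[0]
    )
    return [last_full] + newer


if __name__ == "__main__":
    # Hypothetical tape inventory.
    tapes = [
        (date(1996, 3, 1), "full", "weekly-1"),
        (date(1996, 3, 5), "incremental", "nightly-a"),
        (date(1996, 3, 8), "full", "weekly-2"),
        (date(1996, 3, 10), "incremental", "nightly-c"),
        (date(1996, 3, 9), "incremental", "nightly-b"),
    ]
    for when, kind, label in restore_order(tapes):
        print(when.isoformat(), kind, label)
```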

You may also be able to restore filesystems from the miniroot. For example, if your root filesystem has been corrupted, you may be able to boot the miniroot, unmount the root filesystem, and then use the miniroot version of restore, xfs_restore, bru, cpio, or tar to restore your root filesystem. Refer to the reference (man) pages on these commands for details on their application.

Refer to the IRIX Admin: System Configuration and Operation manual for instructions on good general system administration practices.