Chapter 5. Having Trouble?

This chapter contains hardware-specific information that can be helpful if you are having trouble with your Challenge rackmount server.

Maintaining Your Hardware and Software

This section gives you some basic guidelines to follow to keep your hardware and software in good working order.

Hardware Do's and Don'ts

To keep your system in good running order, follow these guidelines:

  • Do not enclose the system in a small, poorly ventilated area (such as a closet), crowd other large objects around it, or drape anything (such as a jacket or blanket) over the system chassis.

  • Do not place terminals on top of the system chassis.

  • Do not connect cables or add other hardware components while the system is turned on.

  • Do not power off the system frequently; leave it running overnight and on weekends, if possible.

  • Do not leave the key switch in the Manager position.

  • Do not place liquids, food, or heavy objects on the system, terminal, or keyboard.

  • Ensure that all cables are plugged in completely.

  • Ensure that the system has power surge protection.

  • Route all external cables away from foot traffic.

Software Do's and Don'ts

When your system is up and running, follow these guidelines:

  • Do not turn off power to a system that is currently running software.

  • Do not use the root account unless you are performing administrative tasks.

  • Make regular backups (weekly for the whole system, nightly for individual users) of all information.

  • Protect all accounts with a password. Refer to the IRIX Advanced Site and Server Administration Guide for information about setting a root password.
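The backup guideline above (weekly for the whole system, nightly for individual users) can be expressed as a simple schedule check. The following Python sketch is illustrative only: the function name `backups_due` and the choice of Sunday for the weekly backup are assumptions, not part of any IRIX backup tool.

```python
import datetime
from typing import List

# Assumed site policy: run the weekly whole-system backup on Sunday.
# date.weekday() numbers Monday as 0 and Sunday as 6.
FULL_BACKUP_WEEKDAY = 6


def backups_due(day: datetime.date) -> List[str]:
    """Return the backups that the guideline calls for on the given day."""
    due = ["individual users"]          # nightly, every day
    if day.weekday() == FULL_BACKUP_WEEKDAY:
        due.insert(0, "whole system")   # weekly
    return due
```

Pick whichever weekday suits your site; the guideline asks only that the whole-system backup happen weekly.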

System Behavior

The behavior of a system that is not working correctly falls into three broad categories:

  • Operational: You are able to log in to the system, but it doesn't respond as usual.

  • Marginal: You are not able to start up the system fully, but you can reach the System Maintenance menu or PROM Monitor.

  • Faulty: You cannot reach the System Maintenance menu or PROM Monitor.

If the behavior of your system is operational or marginal, first check for error messages on the System Controller display, then perform a physical inspection using the checklist in the following section. If all of the connections seem solid, restart the system. If the problem persists, run the diagnostic tests from the System Maintenance menu or PROM Monitor. See your IRIX Admin: System Configuration and Operation Manual for more information about diagnostic tests.

If your system is faulty, turn the power to the main unit off and then on again. If this does not help, contact your system administrator.
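The triage described in this section can be summarized as a small decision function. The following Python sketch is illustrative only; the names `classify_behavior`, `next_steps`, `can_log_in`, and `can_reach_prom` are hypothetical and do not correspond to any IRIX command.

```python
from typing import List


def classify_behavior(can_log_in: bool, can_reach_prom: bool) -> str:
    """Map two observations onto the three categories in this section."""
    if can_log_in:
        return "operational"   # logs in, but does not respond as usual
    if can_reach_prom:
        return "marginal"      # System Maintenance menu or PROM Monitor only
    return "faulty"            # neither can be reached


def next_steps(category: str) -> List[str]:
    """Return the actions this section recommends, in order."""
    if category in ("operational", "marginal"):
        return [
            "Check the System Controller display for error messages.",
            "Work through the physical inspection checklist.",
            "Restart the system if all connections are solid.",
            "If the problem persists, run the diagnostic tests.",
        ]
    if category == "faulty":
        return [
            "Turn the power to the main unit off and then on again.",
            "If that does not help, contact your system administrator.",
        ]
    raise ValueError(f"unknown category: {category!r}")
```

For example, `next_steps(classify_behavior(False, True))` returns the four-step list for a marginal system.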

Physical Inspection Checklist

Check every item on this list:

  • Make sure the terminal and main unit power switches are turned on.

  • If the system has power, check the System Controller display for any messages, then reset the system.

Before you continue, shut down the system and turn off the power.

Verify all of these cable connections:

  • The video cable is connected securely to the rear of the terminal and to the appropriate connector on the main I/O panel.

  • The power cable is securely connected to the terminal at one end and to the power source at the other end.

  • The keyboard cable is securely connected to the keyboard at one end and to the terminal at the other end.

  • The system power cable is securely installed in the receptacle in the system chassis and in the proper AC outlet.

  • The network cable is connected to the appropriate port. The key or lock used to secure the network connection is engaged.

  • Serial port cables are securely installed in their corresponding connectors.

When you finish checking the hardware connections, turn on the power to the main unit and then to the terminal; then reboot the system. If your system continues to fail, restore the system software and files using the procedures described in the IRIX Advanced Site and Server Administration Guide. If the system fails to respond at all, call your service organization for assistance.

Using the System Controller

This section explains several ways to use the System Controller to diagnose system faults. The operator-selectable functions are described, as well as some common faults and the symptoms they exhibit.

You can select one of four menus when the System Controller key switch is in the On (middle) position:

  • CPU Activity Display menu

  • Boot Status menu

  • Event History Log menu

  • Master CPU Selection menu

The CPU Activity Display

The CPU Activity Display is a histogram that represents the activity of each system processor as a vertically moving bar. This is the default display and appears continuously unless an error occurs or a function key is pressed.

The Boot Status Menu

The Boot Status menu monitors the current state of the system during the boot arbitration process. Table 5-1 lists the messages that may appear in this menu.

Table 5-1. Boot Status Menu Messages

Each entry gives the Master CPU Selection message, followed by the context and meaning of the message.

  • BOOT ARBITRATION NOT STARTED

    The system CPU board(s) have not begun the arbitration process.

  • BOOT ARBITRATION IN PROCESS

    The System Controller is searching for the bootmaster CPU processor.

  • ARBITRATION COMPLETE BOARD 0xZZ PROC 0xZZ

    The chosen bootmaster CPU has identified itself to the System Controller and communication is fully established.

  • BOOT ARBITRATION INCOMPLETE NO MASTER

    An error has occurred in the boot process and no bootmaster CPU is communicating with the System Controller.
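If you record these messages, for example when writing them down for service personnel, a small lookup can attach the meaning from Table 5-1 to each one. The Python sketch below is illustrative only. It assumes that the ZZ placeholders stand for hexadecimal digits written with a 0x prefix; the function name `explain_boot_status` is hypothetical.

```python
import re
from typing import Optional

# Fixed messages from Table 5-1 and a short form of their meanings.
FIXED_MESSAGES = {
    "BOOT ARBITRATION NOT STARTED":
        "The CPU board(s) have not begun the arbitration process.",
    "BOOT ARBITRATION IN PROCESS":
        "The System Controller is searching for the bootmaster CPU.",
    "BOOT ARBITRATION INCOMPLETE NO MASTER":
        "Boot error: no bootmaster CPU is communicating with the System Controller.",
}

# Assumption: the board and processor fields are hex values such as 0x02.
COMPLETE_RE = re.compile(
    r"^ARBITRATION COMPLETE BOARD 0x([0-9A-Fa-f]+) PROC 0x([0-9A-Fa-f]+)$"
)


def explain_boot_status(line: str) -> Optional[str]:
    """Return the meaning of a Boot Status message, or None if unrecognized."""
    text = " ".join(line.split())  # collapse runs of whitespace
    if text in FIXED_MESSAGES:
        return FIXED_MESSAGES[text]
    match = COMPLETE_RE.match(text)
    if match:
        board, proc = (int(field, 16) for field in match.groups())
        return (f"Bootmaster is processor {proc} on board {board}; "
                "communication is established.")
    return None
```

An unrecognized line returns `None` rather than a guess, so you can still record the exact text for your service provider.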


The Event History Log

The System Controller assigns space in its nonvolatile random access memory (NVRAM) for ten error and/or status messages. This space is referred to as the Event History Log.

If the system cannot completely boot, or if there are system problems, or if the system has shut down, check the System Controller display. The histogram in the display will have been replaced with one or more error messages from the Event History Log. Write down any error messages for use by your system administrator or by qualified service personnel. Refer to Appendix C for a complete listing of the possible error messages.


Note: When the system is rebooted, the System Controller will transmit the errors it has logged in NVRAM to the bootmaster CPU. They are then placed in /usr/adm/SYSLOG.


The Master CPU Selection Menu

The Master CPU Selection menu displays the last message sent by the Master CPU after the bootmaster arbitration process has completed. The four possible messages are identical to the Boot Status menu messages listed in Table 5-1.

The Power-On Process

During a normal power-on sequence, both the green power-on LED and the amber fault LED light. When the System Controller initializes and completes its internal diagnostics, the amber LED goes out.


Note: If the amber fault LED stays on for more than a few seconds, a fault message should appear. If the LED stays on and no message appears, the display may be faulty or there may be a problem with the System Controller. Contact your system administrator or service provider.

The following steps describe what you should see when you bring up the system:

  1. When the System Controller completes its internal checks and the system begins to come up, two boot messages appear:

    BOOT ARBITRATION IN PROCESS
    ARBITRATION COMPLETE BOARD 0xZZ PROC 0xZZ
    

    A flag message appears: Onyx C. 1993

  2. The screen clears and the message STARTING SYSTEM appears.

  3. A series of status messages scroll by. Most messages pass by so quickly that they are unreadable. These messages indicate the beginning or completion of a subsystem test.

  4. After all of the system checks are complete, you receive a status message that looks similar to:

    PROCESSOR STATUS
    B+++
    

The B+++ shown in step 4 indicates that the bootmaster CPU is active, along with three other functioning processors on the CPU board. If the CPU board has only two processors, the bootmaster and one other functioning processor, you see:

PROCESSOR STATUS
B+

If you receive a processor status message followed by B+DD, you have a CPU board with two of its processors disabled. Contact your system administrator to determine why the processors were disabled.

If you receive a processor status message like B+-- or B+XX, the CPU board has defective processors on board. Make a note of the exact message and contact your service provider for help.
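The status characters described above can be decoded one at a time. The following Python sketch is illustrative only: it assumes that each character stands for exactly one processor, and the names `decode_processor_status` and `recommended_action` are hypothetical. The "no action" result for a healthy board is an addition for completeness, not a message from the system.

```python
from collections import Counter

# Meaning of each character in the PROCESSOR STATUS line, per this section.
SYMBOLS = {
    "B": "bootmaster",
    "+": "functioning",
    "D": "disabled",
    "-": "defective",
    "X": "defective",
}


def decode_processor_status(status: str) -> Counter:
    """Count the processors in each state, for example 'B+DD'."""
    counts = Counter()
    for symbol in status.strip():
        if symbol not in SYMBOLS:
            raise ValueError(f"unexpected status character: {symbol!r}")
        counts[SYMBOLS[symbol]] += 1
    return counts


def recommended_action(status: str) -> str:
    """Return the follow-up this section recommends for a status line."""
    counts = decode_processor_status(status)
    if counts["defective"]:
        return "Note the exact message and contact your service provider."
    if counts["disabled"]:
        return "Ask your system administrator why the processors were disabled."
    return "No action needed."
```

For example, `decode_processor_status("B+DD")` counts one bootmaster, one functioning processor, and two disabled processors.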

If the System Hangs

If the system does not complete step 3 in the power-on process, an error message will appear and remain on the System Controller's display. Make a note of the exact message where the system stops, and contact your service provider.


Note: The message displayed on the System Controller display can provide the service person with valuable information.


If an Over-Temperature Error Occurs

If the system shuts down because an OVER TEMP condition occurs, the entire system powers down, including the System Controller. To find the fault, turn the key switch off and then on again. The display should show the origin of the OVER TEMP error. If the system immediately shuts down again, wait for several minutes to allow the mechanical temperature sensor switch to cool below its trip point.

Recovering from a System Crash

Your system might have crashed if it fails to boot or respond normally to input devices such as the keyboard. The most common form of system crash is terminal lockup—a situation where your system fails to accept any commands from the keyboard. Sometimes when a system crashes, data becomes damaged or lost.

Using the methods described in the following paragraphs, you can fix most problems that occur when a system crashes. You can prevent additional problems by recovering your system properly after a crash.

The following list presents a number of ways to recover your system from a crash. The simplest method, rebooting the system, is presented first. If it fails, go on to the next method, and so on. Here is an overview of the different crash recovery methods:

  • rebooting the system

    Rebooting usually fixes problems associated with a simple system crash.

  • restoring system software

    If you do not find a simple hardware connection problem and you cannot reboot the system, a system file might be damaged or missing. In this case, you need to copy system files from the installation tapes to your hard disk. Some site-specific information might be lost.

  • restoring from backup tapes

    If restoring system software fails to recover your system fully, you must restore from backup tapes. Complete and recent backup tapes contain copies of important files. Some user- and site-specific information might be lost.

Refer to your IRIX Admin: Backup, Security, and Accounting Manual for instructions for each of the recovery methods listed above.
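The escalation order described above (try the simplest method first, then move to the next if it fails) can be sketched as a loop. This Python example is illustrative only: the `recover` function is hypothetical, and the lambda outcomes are placeholders rather than real recovery procedures.

```python
from typing import Callable, List, Optional, Tuple

# Each step is a (description, attempt) pair; attempt() returns True on success.
RecoveryStep = Tuple[str, Callable[[], bool]]


def recover(steps: List[RecoveryStep]) -> Optional[str]:
    """Try each method in order; return the first that succeeds, else None."""
    for description, attempt in steps:
        if attempt():
            return description
    return None


# Placeholder outcomes for illustration: here only the reboot fails.
steps = [
    ("rebooting the system", lambda: False),
    ("restoring system software", lambda: True),
    ("restoring from backup tapes", lambda: True),
]
print(recover(steps))
```

With these placeholder outcomes, the loop stops at the second method and never reaches the backup tapes, which matches the intent of trying the least disruptive recovery first.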