Chapter 5. Troubleshooting

This chapter contains hardware-specific information that can be helpful if you are having trouble with your Power Challenge or Challenge L deskside system.

Maintaining Your Hardware and Software

This section gives you some basic guidelines to help keep your hardware and the software that runs on it in good working order.

Hardware Do's and Don'ts

To keep your system in good running order, follow these guidelines:

  • Do not enclose the system in a small, poorly ventilated area (such as a closet), crowd other large objects around it, or drape anything (such as a jacket or blanket) over the system.

  • Do not connect cables or add other hardware components while the system is turned on.

  • Always remove the key from the front panel switch before shutting the drive door; otherwise, minor damage may result.

  • Do not leave the front panel key switch in the Manager position.

  • Do not lay the system on its side.

  • Do not power off the system frequently; leave it running over nights and weekends, if possible. The system console terminal can be powered off when it is not being used.

  • Do not place liquids, food, or extremely heavy objects on the system or keyboard.

  • Ensure that all cables are plugged in completely.

  • Ensure that the system has power surge protection.

Software Do's and Don'ts

When your system is up and running, follow these guidelines:

  • Do not turn off power to a system that is currently started up and running software.

  • Do not use the root account unless you are performing administrative tasks.

  • Make regular backups (weekly for the whole system, nightly for individual users) of all information.

  • Keep two sets of backup tapes to ensure the integrity of one set while doing the next backup.

  • Protect the root account with a password:

    • Check for other accounts with UID 0 (for example, diag) and set passwords for these accounts.

    • Consider giving passwords to courtesy accounts such as guest and lp.

    • Look for empty password fields in the /etc/passwd file.
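The account checks above can be partly automated. The following is a minimal Python sketch, not a tool supplied with the system. It assumes the conventional colon-separated passwd layout (name, password, UID as the first three fields); the function name audit_passwd is hypothetical.

```python
def audit_passwd(passwd_text):
    """Return (uid0_accounts, empty_password_accounts) from passwd-format text.

    Assumes the conventional layout name:password:uid:... on each line.
    """
    uid0_accounts = []
    empty_password_accounts = []
    for line in passwd_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split(":")
        if len(fields) < 3:
            continue  # skip malformed entries
        name, password, uid = fields[0], fields[1], fields[2]
        # Any account with UID 0 has root privileges, whatever its name.
        if uid.isdigit() and int(uid) == 0:
            uid0_accounts.append(name)
        # An empty second field means the account has no password.
        if password == "":
            empty_password_accounts.append(name)
    return uid0_accounts, empty_password_accounts
```

To use it, read the contents of /etc/passwd into a string, pass the string to audit_passwd, and review both returned lists by hand. The sketch only reports candidates; it does not change any account.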

System Behavior

The behavior of a system that is not working correctly falls into three broad categories:

Operational 

You can log in to the system, but it doesn't respond as usual. For example, the text looks strange, or the monitor doesn't respond to input from the keyboard.

Marginal 

You cannot start up the system fully, but you can reach the System Maintenance menu or PROM Monitor.

Faulty 

The system has shut down and you cannot reach the System Maintenance menu or PROM Monitor.

If the behavior of your system is operational, marginal, or faulty, first do a physical inspection using the checklist below. If all of the connections seem solid, go on to the section “Using the System Controller” and try to isolate the problem. If the problem persists, run the diagnostic tests from the System Maintenance menu or PROM Monitor. See the IRIX Admin: System Configuration and Operation manual for more information about diagnostic tests.

If this does not help, contact your system administrator or service provider.
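The three categories and the escalation order can be summarized as a small decision sketch. This is an illustration in Python only; the function and variable names are hypothetical, and it assumes you have already noticed that the system is misbehaving.

```python
def classify_behavior(can_log_in, can_reach_maintenance_menu):
    """Map two observations onto the three categories described above."""
    if can_log_in:
        # Login works, but the system does not respond as usual.
        return "operational"
    if can_reach_maintenance_menu:
        # No full startup, but the System Maintenance menu or
        # PROM Monitor is reachable.
        return "marginal"
    # The system is down and neither menu can be reached.
    return "faulty"


# Escalation order from the text; the same steps apply to every category.
NEXT_STEPS = [
    "Do a physical inspection using the checklist",
    "Use the System Controller to isolate the problem",
    "Run diagnostic tests from the System Maintenance menu or PROM Monitor",
    "Contact your system administrator or service provider",
]
```

For example, classify_behavior(False, True) returns "marginal".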

Physical Inspection Checklist

Check every item on this list:

  • The console terminal and main unit power switches are turned on.

  • The circuit breaker next to the main power cord is not tripped.

  • The fans are running and the fan inlets/outlets are not blocked.

  • The System Controller LCD screen shows no fault messages or warnings. Make a note of any that appear.

Before you continue, shut down the system and turn off the power.

Check all of the following cable connections:

  • The system console terminal power cable is securely connected to the terminal at one end and the power source at the other end.

  • The Challenge deskside server power cable is securely connected to the main unit at one end and plugged into the proper AC outlet at the other end.

  • The Ethernet cable is connected to the 15-pin connector port labeled Ethernet (and secured with the slide latch).

  • Serial port cables are plugged in securely to their corresponding connectors.

  • All cable routing is safe from foot traffic.

If you find any problems with hardware connections, have them corrected and turn on the power to the main unit. Use the System Controller to determine if internal system problems exist.

Using the System Controller

The System Controller has three basic operating modes:

  • It acts as a control conduit when directed by an operator to power off or boot up the system. It actively displays a running account of the boot process and flags any errors encountered. It sends the master CPU a message when a system event such as power off or a reboot is initiated.

  • When operating conditions are within normal limits, the System Controller is a passive monitor. Its front panel LCD offers a running CPU activity graph that shows the level of each on-board microprocessor's activity. Previously logged errors are available for inspection using the front panel control buttons to select menus.

  • The System Controller can also act independently to shut down the system when it detects a threatening condition, or to adjust electromechanical parameters (such as blower fan speed) to compensate for external change. Error information stored in the log is available in both the On and Manager positions. Service personnel can use the Manager key position functions to probe for system error information.

When a system fault occurs in the cardcage, ventilation system, or power boards, the System Controller turns off the power boards but leaves the 48 V and V5_AUX voltages on. This allows the yellow fault LED to remain lit and the System Controller to continue functioning. If, for example, the System Controller displays the error message POKA FAIL, your service provider can do a visual inspection of POKA indicator LEDs throughout the system to locate the failed component.


Note: If an OVER TEMP condition occurs, the entire system shuts down. To find the fault, turn the key off and then on again. The LCD screen should show the OVER TEMP error; however, if the system has not had enough time to cool below the switch-off point, the System Controller will shut the system down again.

The System Controller also shuts down the entire system if a 48 V overvoltage fault occurs. If the System Controller removes power due to an overvoltage condition, the operator must execute the log function, turn the power off, and then turn it back on again. These steps are necessary to successfully power on the system. The purpose of this function is to prevent the operator from repeatedly applying power when an overvoltage condition exists.
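The overvoltage recovery sequence can be pictured as a small state machine. The Python sketch below is a hypothetical model of the behavior described above, not the System Controller firmware; the class and method names are invented for illustration, and it assumes the three operator steps must occur in the order given.

```python
class OvervoltageLockout:
    """Hypothetical model of the power-on lockout after a 48 V overvoltage fault."""

    def __init__(self):
        self.locked = False          # set by an overvoltage shutdown
        self.log_executed = False    # operator has executed the log function
        self.key_cycled_off = False  # key turned Off after the log function

    def overvoltage_fault(self):
        # The System Controller removes power and arms the lockout.
        self.locked = True
        self.log_executed = False
        self.key_cycled_off = False

    def execute_log_function(self):
        if self.locked:
            self.log_executed = True

    def turn_key_off(self):
        # ASSUMPTION: turning the key Off only counts toward recovery
        # once the log function has been executed.
        if self.locked and self.log_executed:
            self.key_cycled_off = True

    def turn_key_on(self):
        """Return True if power-on is allowed, False if still locked out."""
        if not self.locked:
            return True
        if self.log_executed and self.key_cycled_off:
            self.locked = False
            return True
        return False
```

In this model, turning the key off and on without first executing the log function leaves the lockout in place, which mirrors the stated purpose of preventing repeated power-on attempts while an overvoltage condition exists.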

The Power-On Process

You can monitor the boot process when you power on the system by watching the System Controller. Turning the key switch on the System Controller front panel to the On (middle) position enables voltage to flow to the system backplane. The green power-on LED lights up, and immediately after that the yellow fault LED comes on. The System Controller initializes and performs its internal startup diagnostics. If no problems are found, the yellow fault LED shuts off.


Note: If the yellow fault LED stays on for more than a few seconds, a fault message should appear. If it stays on and no message appears on the display, you may have a faulty LCD screen or a problem with the System Controller. Contact your system administrator or service provider.

The following steps are similar to what you should see when you bring up the system:

  1. When the System Controller completes its internal checks and the system begins to come up, two boot messages appear:

    BOOT ARBITRATION IN PROGRESS
    BOOT ARBITRATION COMPLETE SLOT 0xY PROC 0xZ
    

  2. The screen clears and the message STARTING SYSTEM should appear.

  3. A series of status messages scrolls by. Most pass by so quickly that they are unreadable. These messages indicate the beginning or completion of a subsystem test.

  4. After all the system checks are complete, you receive a status message that looks similar to:

    PROCESSOR STATUS
    B+++
    

The B+++ shown in step 4 indicates that the bootmaster microprocessor is active along with three other functioning microprocessors on the CPU board.

If your CPU has only two microprocessors on board, you should see:

    PROCESSOR STATUS
    B+

If you receive a processor status message followed by B+DD, you have a CPU with two of its microprocessors disabled. Contact your system administrator to determine why this was done.

If you receive a processor status message like B+-- or B+XX, the CPU has defective microprocessors on board. Make a note of the exact message and contact your service provider for help.
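The status string can be read one character per microprocessor. The following Python sketch decodes it using only the meanings given above (B for the bootmaster, + for a functioning microprocessor, D for disabled, and - or X for defective). The function names are hypothetical, and any other character is reported as unknown rather than guessed.

```python
# Meanings taken from the text above; nothing else is assumed.
STATUS_CODES = {
    "B": "bootmaster (active)",
    "+": "functioning",
    "D": "disabled",
    "-": "defective",
    "X": "defective",
}


def decode_processor_status(status):
    """Return one description per character of a status string such as 'B+DD'."""
    return [STATUS_CODES.get(ch, "unknown") for ch in status.strip()]


def needs_service(status):
    """True if any microprocessor is reported as defective."""
    return any(ch in "-X" for ch in status.strip())
```

For example, decode_processor_status("B+DD") reports a bootmaster, one functioning microprocessor, and two disabled ones, while needs_service("B+XX") is true.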

If the System Hangs

If the system does not make it through step 3 in the power-on process, an error message will appear and stay on the System Controller's LCD screen. A message like PD CACHE FAILED! indicates that a serious problem exists. Make a note of the final message the system displays and contact your service provider.

The message displayed on the System Controller LCD screen when a power-on hang occurs can give your service provider valuable information.

System Controller On Functions

Located just above the drive rack, the System Controller LCD and front panel provide users with information regarding any planned or unplanned shutdown of the system.

The System Controller monitors incoming air temperature and adjusts fan speed to compensate. It also monitors system voltages and the backplane clock. If an unacceptable temperature or voltage condition occurs, the System Controller will shut down the system.

Another major area the System Controller watches is the boot process. In the event of an unsuccessful boot, the controller's LCD panel indicates the general nature of the failure. A real-time clock resides on the System Controller, and the exact date and time of any shutdown is recorded.

When the System Controller detects a fault condition, it turns off power to the system boards and peripherals. The 48 VDC supplied to the system backplane stays on unless the shutdown was caused by an over-limit temperature condition or other situation that would be harmful to the system. The System Controller LCD screen displays a fault message, and the yellow fault LED near the top of the panel comes on. Fault LEDs are also positioned on other parts of the chassis to indicate a localized fault. Your service provider should check for these conditions before shutting down the system.

The front panel of the System Controller has two indicator LEDs and four control buttons in addition to the LCD screen. See Figure 5-1 for the location of the indicators and controls.

In the case of a forced shutdown, an error message is written into an event history file. This file can contain up to 10 error messages and can be viewed on the System Controller screen.


Note: If you wish to examine the error(s) recorded on the System Controller that caused a shutdown, do not reboot the system immediately.

When the system is rebooted, the System Controller transmits the errors it has logged in non-volatile random access memory (NVRAM) to the master CPU. They are then placed in /var/adm/SYSLOG, and the error log in the System Controller is cleared.

As shown in Figure 5-1, the key switch has three positions:

  • The Off position (with the key turned to the left) shuts down all voltages to the system boards and peripherals.

  • The On position (with the key in the center) enables the system and allows monitoring of menu functions.

  • The Manager position (with the key turned to the right) enables access to additional technical information used by service personnel.

As seen in Figure 5-1, there are four control buttons located on the System Controller front panel.

This list describes the buttons in order, from left to right:

  • Press the Menu button to place the display in the menu mode.

  • Press the Scroll up button to move up one message in the menu.

  • Press the Scroll down button to move down one message in the menu.

  • Press the Execute button to execute a displayed function or to enter a second-level menu.

The green power-on LED stays lit as long as 48 VDC voltage is being supplied to the system backplane. The yellow fault LED comes on whenever the System Controller detects a fault.

Figure 5-1. System Controller Front Panel Components


Four information options are available to the user when the key is in the On (middle) position:

  • the Master CPU Selection menu

  • the Event History Log menu

  • the Boot Status menu

  • the CPU Activity Display

The information displays are further described in the following sections, table, and figure.

The Master CPU Selection Menu

This menu monitors the current state of the system in the boot arbitration process. Table 5-1 shows the messages that may appear during and after the boot process.

The Event History Log Menu

The System Controller uses space in NVRAM to store up to 10 messages. All events logged by the System Controller are stored in the NVRAM log file. After the system successfully boots, the contents of the System Controller log file are transferred to /var/adm/SYSLOG by the master CPU.
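The log's life cycle (record events, hold at most 10, hand them to the master CPU after a successful boot, then clear) can be sketched in Python as follows. This is an illustrative model only. The class name is hypothetical, and the text does not say what happens when an eleventh message arrives, so the sketch assumes the oldest entry is dropped.

```python
from collections import deque

LOG_CAPACITY = 10  # from the text: the log stores up to 10 messages


class EventHistoryLog:
    """Illustrative model of the System Controller's NVRAM event log."""

    def __init__(self):
        # ASSUMPTION: when the log is full, the oldest entry is discarded.
        self._entries = deque(maxlen=LOG_CAPACITY)

    def record(self, message):
        self._entries.append(message)

    def entries(self):
        return list(self._entries)

    def transfer_to_syslog(self, syslog_lines):
        """Append all logged messages to syslog_lines, then clear the log.

        syslog_lines stands in for /var/adm/SYSLOG in this sketch.
        """
        syslog_lines.extend(self._entries)
        self._entries.clear()
```

Because the transfer clears the log, the model shows why errors should be examined on the System Controller screen before the system is rebooted.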

Three basic types of system occurrence are logged in the history menu:

  • System error messages are issued in response to a system-threatening event, and the controller shuts down the system immediately after flagging the master CPU.

  • System events that need attention are immediately transmitted to the master CPU; however, no shutdown is implemented by the System Controller.

  • System Controller internal errors are monitored and logged in the menu just like system errors and events. They are transferred to /var/adm/SYSLOG by the CPU just like other errors. If the System Controller internal error is significant, an internal reinitialization will take place. An internal System Controller error never causes the Challenge deskside system to shut down.

Whenever possible, the System Controller alerts the master CPU that a system-threatening error situation exists and a shutdown is about to happen. The System Controller then waits for a brief period for the CPU to perform an internal shutdown procedure. The controller waits for a “Set System Off” command to come back from the master CPU before commencing shutdown. If the command does not come back from the CPU before a specified time-out period, the System Controller proceeds with the shutdown anyway.
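The alert-and-wait sequence in the preceding paragraph is a handshake with a time-out. A minimal Python sketch follows. The function name and the use of a queue to stand in for the link to the master CPU are assumptions, and the time-out length is a parameter because the text does not give a value.

```python
import queue
import time


def wait_for_cpu_ack(cpu_replies, timeout_s):
    """Wait up to timeout_s seconds for "Set System Off" from the master CPU.

    cpu_replies is a queue.Queue standing in for the CPU link.
    Returns True if the command arrived in time, False on time-out.
    In both cases the caller proceeds with the shutdown.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False  # time-out: shut down anyway
        try:
            reply = cpu_replies.get(timeout=remaining)
        except queue.Empty:
            return False  # nothing arrived before the deadline
        if reply == "Set System Off":
            return True   # CPU finished its internal shutdown procedure
        # Any other message is ignored while waiting.
```

The return value only records whether the CPU had time to finish its internal shutdown procedure; the shutdown itself is never cancelled.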

Detection of a system event monitored by the System Controller automatically sends a message to the master CPU. The warning message is recorded in the event history log, and the CPU is expected to take corrective action, if applicable. No system shutdown is implemented.

See Appendix C for a complete list of messages that can appear in the event history log.

Boot Status Menu

The Boot Status menu supplies the last message sent by the master CPU after the master CPU selection process is concluded. A total of five status messages can appear under this menu selection. The messages are listed in Table 5-1, along with a brief explanation of their context and meaning.

Table 5-1. System Controller Master CPU Status Messages

Master CPU Status Message                        Context and Meaning of Message

BOOT ARBITRATION NOT STARTED                     The system CPU board(s) has not begun the
                                                 arbitration process.

BOOT ARBITRATION IN PROGRESS                     The system CPU boards are communicating to
                                                 decide which one will be the system master
                                                 CPU.

BOOT ARBITRATION IS COMPLETE SLOT #0X PROC #0X   The chosen CPU master has identified itself
                                                 to the System Controller and communication
                                                 is fully established.

BOOT ARBITRATION INCOMPLETE FAULT NO MASTER      The system was unable to assign a system
                                                 master CPU.

BOOT ARBITRATION ABORTED                         An operator pushed one of the front panel
                                                 buttons while the System Controller was
                                                 searching for the system master CPU.


The CPU Activity Display

The activity display is a graph function that provides a series of moving bars placed next to each other on the System Controller's screen. Each of the vertically moving bars on the screen represents the activity of one of the microprocessors in the Challenge deskside server.

The activity display is the default menu that appears if the key is in the On position and no keypad selections have been made within the last 60 seconds.

The activity graph is replaced by any detected fault message until the key is turned to the Off position.

The activity graph (also known as a histogram) indicates the processor activity level of each microprocessor within the system. This display is similar to the bar graph display of volume levels on modern stereo receivers. Each bar gives a running account of the volume of processes taking place in a particular microprocessor.


Note: Figure 5-2 shows a total of 12 microprocessor histogram bars. Your system may have as few as one, depending on the number of CPU boards installed and the on-board microprocessors that they host.

Figure 5-2. Challenge CPU Board Microprocessor Activity Graph (Histogram)


Recovering From a System Crash

To minimize data loss from a system crash, back up your system daily and verify the backups. Often a graceful recovery from a crash depends upon good backups.

Your system may have crashed if it fails to boot or respond normally to input devices such as the keyboard. The most common form of system crash is terminal lockup—your system fails to accept any commands from the keyboard. Sometimes when a system crashes, data is damaged or lost.


Note: Before going through a crash recovery process, check your terminal configuration and cable connections. If everything is in order, try accessing the system through the system console (if present) or remotely from another terminal.

If none of the solutions in the previous paragraphs is successful, you can fix most problems that occur when a system crashes by using the methods described in the following paragraphs. You can prevent additional problems by recovering your system properly after a crash.

The following list presents several ways to recover your system from a crash. The simplest method, rebooting the system, is presented first. If that fails, go on to the next method, and so on. Here is an overview of the different crash recovery methods:

  • rebooting the system

    Rebooting usually fixes problems associated with a simple system crash.

  • restoring system software

    If you do not find a simple hardware connection problem and you cannot reboot the system, a system file might be damaged or missing. In this case, you need to copy system files from the installation tapes to your hard disk. Some site-specific information might be lost.

  • restoring from backup tapes

    If restoring system software fails to recover your system fully, you must restore from backup tapes. Complete and recent backup tapes contain copies of important files. Some user- and site-specific information might be lost. Read the following section for information on file restoration.

Refer to the IRIX Admin: Backup, Security, and Accounting manual for the instructions used to perform each of the recovery methods listed above.

If your system continues to fail, you most likely have a serious software problem, and you must restore the system software and files using the procedures described in the Personal System Administration Guide and the IRIX Admin: Backup, Security, and Accounting manual. If the system fails to respond at all, call your service organization for assistance.