Chapter 5. Having Trouble?

This chapter contains hardware-specific information that can be helpful if you are having trouble with your Onyx deskside workstation.

Maintaining Your Hardware and Software

This section gives you some basic guidelines to help keep your hardware and the software that runs on it in good working order.

Hardware Do's and Don'ts

To keep your system in good running order, follow these guidelines:

  • Do not enclose the system in a small, poorly ventilated area (such as a closet), crowd other large objects around it, or drape anything (such as a jacket or blanket) over the system.

  • Do not connect cables or add other hardware components while the system is turned on.

  • Always remove the key from the front panel switch before shutting the drive door; otherwise, minor damage may result.

  • Do not leave the front panel key switch in the Manager position.

  • Do not lay the system on its side.

  • Do not power off the system frequently; leave it running over nights and weekends, if possible. If a system console terminal is installed, it can be powered off when it is not being used.

  • Do not place liquids, food, or extremely heavy objects on the system or keyboard.

  • Ensure that all cables are plugged in completely.

  • Degauss the monitor every few days by pressing the degauss button on the front of the monitor.

  • Ensure that the system has power surge protection.

Software Do's and Don'ts

When your system is up and running, follow these guidelines:

  • Do not turn off power to a system that is currently started up and running software.

  • Do not use the root account unless you are performing administrative tasks.

  • Make regular backups (weekly for the whole system, nightly for individual users) of all information.

  • Keep two sets of backup tapes to ensure the integrity of one set while doing the next backup.

  • Protect the root account with a password:

    • Check for root UID = 0 accounts (for example, diag) and set passwords for these accounts.

    • Consider giving passwords to courtesy accounts such as guest and lp.

    • Look for empty password fields in the /etc/passwd file.
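The account checks above can be sketched as a small script. This is only an illustration: it assumes the standard colon-separated passwd layout (name, password, UID, and so on) and assumes passwords are kept in /etc/passwd itself rather than in a separate shadow file. The function name audit_passwd and the sample entries are made up for the example.

```python
# Hypothetical sample in passwd format; these entries are illustrative only.
SAMPLE_PASSWD = """\
root:x3kQ9:0:0:Super-User:/:/bin/csh
diag::0:996:Hardware Diagnostics:/usr/diags:/bin/csh
guest::998:998:Guest Account:/usr/people/guest:/bin/csh
lp::9:9:Print Spooler Owner:/var/spool/lp:/bin/sh
"""


def audit_passwd(text):
    """Return (uid0_accounts, empty_password_accounts) for passwd-format text."""
    uid0_accounts = []
    empty_password_accounts = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        if len(fields) < 3:
            continue  # malformed entry; skip it rather than guess
        name, password, uid = fields[0], fields[1], fields[2]
        if uid == "0":
            uid0_accounts.append(name)  # has root privileges
        if password == "":
            empty_password_accounts.append(name)  # no password set
    return uid0_accounts, empty_password_accounts


uid0, empty = audit_passwd(SAMPLE_PASSWD)
print("UID 0 accounts:", uid0)
print("Accounts with empty password fields:", empty)
```

To check a real system, pass the contents of /etc/passwd to audit_passwd instead of the sample text.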

System Behavior

The behavior of a system that is not working correctly falls into three broad categories:

Operational 

You can log in to the system, but it doesn't respond as usual. For example, the picture looks strange, or the window doesn't respond to input from the mouse or keyboard.

Marginal 

You cannot start up the system fully, but you can reach the System Maintenance menu or PROM Monitor.

Faulty 

The system has shut down and you cannot reach the System Maintenance menu or PROM Monitor.

If the behavior of your system is operational, marginal, or faulty, first do a physical inspection using the checklist below. If all of the connections seem solid, go on to the section “Using the System Controller” and try to isolate the problem. If the problem persists, run the diagnostic tests from the System Maintenance menu or PROM Monitor. See the IRIX Admin: System Configuration and Operation manual for more information about diagnostic tests.

If this does not help, contact your system administrator or service provider.
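The three categories can be told apart by two observations: whether you can log in, and whether you can reach the System Maintenance menu or PROM Monitor. The sketch below only restates that decision. The function and parameter names are hypothetical, and it assumes the system is already known to be misbehaving.

```python
def classify_behavior(can_log_in, can_reach_maintenance_menu):
    """Map two observations onto the three categories described above.

    Assumes the system is already misbehaving; a healthy system is not
    one of the outcomes.
    """
    if can_log_in:
        return "operational"  # logs in, but does not respond as usual
    if can_reach_maintenance_menu:
        return "marginal"  # cannot start fully, menu or PROM Monitor reachable
    return "faulty"  # shut down, menu and PROM Monitor unreachable


print(classify_behavior(can_log_in=False, can_reach_maintenance_menu=True))
```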

Physical Inspection Checklist

Check every item on this list:

  • The console terminal and main unit power switches are turned on.

  • The circuit breaker next to the main power cord is not tripped.

  • The fans are running and the fan inlets/outlets are not blocked.

  • The System Controller LCD screen is not displaying any fault messages or warnings.

Before you continue, shut down the system and turn off the power.

Check all of the following cable connections:

  • The terminal power cable is securely connected to the terminal at one end and the power source at the other end.

  • The Onyx deskside workstation power cable is securely connected to the main unit at one end and plugged into the proper AC outlet at the other end.

  • The Ethernet cable is connected to the 15-pin connector port labeled Ethernet (and secured with the slide latch).

  • Serial port cables are plugged in securely to their corresponding connectors.

  • All cable routing is safe from foot traffic.

If you find any problems with hardware connections, have them corrected and turn on the power to the main unit. Use the System Controller to determine if internal system problems exist.

Using the System Controller

The System Controller has three basic operating modes:

  • It acts as a control conduit when directed by an operator to power off or boot up the system. It actively displays a running account of the boot process and flags any errors encountered. It sends the master CPU a message when a system event such as power off or a reboot is initiated.

  • When operating conditions are within normal limits, the System Controller is a passive monitor. Its front panel LCD offers a running CPU activity graph that shows the level of each on-board microprocessor's activity. Previously logged errors are available for inspection using the front panel control buttons to select menus.

  • The System Controller can also act independently to shut down the system when it detects a threatening condition. It can also adjust electromechanical parameters (such as blower fan speed) to compensate for external change. Error information stored in the log is available in both the On and Manager positions. Service personnel can use the Manager key position functions to probe for system error information.

When a system fault occurs in the cardcage, ventilation system, or power boards, the System Controller turns off the power boards but leaves the 48 V and V5_AUX on. This allows the yellow fault LED to remain lit and the System Controller to continue functioning. If, for example, the System Controller displays the error message POKA FAIL, your service provider can do a visual inspection of POKA indicator LEDs throughout the system to locate the failed component.


Note: If an OVER TEMP condition occurs, the entire system shuts down. To find the fault, turn the key off and then on again. The LCD screen should show the OVER TEMP error; however, if the system has not had enough time to cool below the switch-off point, the System Controller will shut down again.

The System Controller also shuts down the entire system if a 48 V overvoltage fault occurs. If the System Controller removes power due to an overvoltage condition, the operator must execute the log function, turn the power off, and then turn it back on again. These steps are necessary to successfully power on the system. The purpose of this function is to prevent the operator from repeatedly applying power when an overvoltage condition exists.

The Power-On Process

You can monitor the boot process when you power on the system by watching the System Controller. When you turn the key switch to the On (middle) position on the System Controller front panel, it enables voltage to flow to the system backplane. The green power-on LED lights up, and immediately after that the yellow fault LED comes on. The System Controller initializes and performs its internal startup diagnostics. If no problems are found, the yellow fault LED shuts off.


Note: If the yellow fault LED stays on for more than a few seconds, a fault message should appear. If it stays on and no message appears on the display, you may have a faulty LCD screen or a problem with the System Controller. Contact your system administrator or service provider.

The following steps are similar to what you should see when you bring up the system:

  1. When the System Controller completes its internal checks and the system begins to come up, two boot messages appear:

    BOOT ARBITRATION IN PROGRESS
    BOOT ARBITRATION COMPLETE SLOT 0xY PROC 0xZ
    

  2. The screen clears and the message STARTING SYSTEM should appear.

  3. A series of status messages scrolls by. Most pass by so quickly that they are unreadable. These messages indicate the beginning or completion of a subsystem test.

  4. After all the system checks are complete, you receive a status message that looks similar to this:

    PROCESSOR STATUS
    B+++
    

The B+++ shown in step 4 indicates that the bootmaster microprocessor is active along with three other functioning microprocessors on the CPU board. If your CPU has only two microprocessors on board, you should see this:

    PROCESSOR STATUS
    B+

If you receive a processor status message followed by B+DD, you have a CPU with two of its microprocessors disabled. Contact your system administrator to determine why this was done.

If you receive a processor status message like B+-- or B+XX, the CPU has defective microprocessors on board. Make a note of the exact message and contact your service provider for help.
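The status characters described above can be decoded with a short lookup. This is a minimal sketch and not part of the System Controller firmware. The symbol meanings come from the preceding paragraphs. The function names, and the assumption that the characters correspond to processors in left-to-right order, are illustrative only.

```python
# Meanings taken from the text: B = bootmaster, + = functioning,
# D = disabled, - or X = defective.
STATUS_MEANINGS = {
    "B": "bootmaster",
    "+": "functioning",
    "D": "disabled",
    "-": "defective",
    "X": "defective",
}


def decode_processor_status(status):
    """Return a list of (position, symbol, meaning) for a status string."""
    decoded = []
    for position, symbol in enumerate(status.strip()):
        meaning = STATUS_MEANINGS.get(symbol, "unknown")
        decoded.append((position, symbol, meaning))
    return decoded


def needs_service(status):
    """True if any processor is reported as defective (- or X)."""
    return any(symbol in ("-", "X") for symbol in status.strip())


for position, symbol, meaning in decode_processor_status("B+DD"):
    print(position, symbol, meaning)
```

A status for which needs_service returns True corresponds to the case where you should note the exact message and contact your service provider.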

If the System Hangs

If the system does not make it through step 3 in the power-on process, an error message will appear and stay on the System Controller's LCD screen. A message like PD CACHE FAILED! indicates that a serious problem exists. Make a note of the final message the system displays and contact your service provider.

The message displayed on the System Controller LCD screen when a power-on hang occurs can give your service provider valuable information.

System Controller On Functions

Located just above the drive rack, the System Controller LCD and front panel provide users with information regarding any planned or unplanned shutdown of the system.

The System Controller monitors incoming air temperature and adjusts fan speed to compensate. It also monitors system voltages and the backplane clock. If an unacceptable temperature or voltage condition occurs, the System Controller shuts down the system.

Another major area the System Controller watches is the boot process. In the event of an unsuccessful boot, the controller's LCD panel indicates the general nature of the failure. A real-time clock resides on the System Controller, and the exact date and time of any shutdown is recorded.

When the System Controller detects a fault condition, it turns off power to the system boards and peripherals. The 48 VDC supplied to the system backplane stays on unless the shutdown was caused by an over-limit temperature condition or other situation that would be harmful to the system. The System Controller LCD screen displays a fault message, and the yellow fault LED near the top of the panel comes on. Fault LEDs are also positioned on other parts of the chassis to indicate a localized fault. Your service provider should check for these conditions before shutting down the system.

The front panel of the System Controller has two indicator LEDs and four control buttons in addition to the LCD screen. See Figure 5-1 for the location of the indicators and controls.

In the case of a forced shutdown, an error message is written into an event history file. This file can contain up to 10 error messages and can be viewed on the System Controller screen.


Note: If you wish to examine the error(s) recorded on the System Controller that caused a shutdown, do not reboot the system immediately.

When the system is rebooted, the System Controller transmits the errors it has logged in non-volatile random access memory (NVRAM) to the master CPU. They are then placed in /var/adm/SYSLOG, and the error log in the System Controller is cleared.
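The following sketch models why you should read the log before rebooting: the controller holds a limited number of entries, and the reboot hands them to the master CPU and empties the log. The ten-entry limit and the clear-on-reboot behavior come from the text. The class name, the method names, and the choice to drop the oldest entry when the log is full are assumptions; this guide does not say what the controller does on overflow. A plain Python list stands in for /var/adm/SYSLOG.

```python
from collections import deque


class EventHistoryLog:
    """Toy model of the System Controller event history log."""

    CAPACITY = 10  # the text states the log holds up to 10 messages

    def __init__(self):
        # Assumption: when full, the oldest entry is discarded.
        self._entries = deque(maxlen=self.CAPACITY)

    def record(self, message):
        self._entries.append(message)

    def view(self):
        """What an operator could read on the LCD before rebooting."""
        return list(self._entries)

    def transfer_on_reboot(self, syslog):
        """Copy all entries to the syslog stand-in, then clear the log."""
        syslog.extend(self._entries)
        self._entries.clear()


log = EventHistoryLog()
log.record("OVER TEMP")
syslog = []  # stand-in for /var/adm/SYSLOG
print("Before reboot:", log.view())
log.transfer_on_reboot(syslog)
print("After reboot:", log.view(), "syslog:", syslog)
```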

As shown in Figure 5-1, the key switch has three positions:

  • The Off position (with the key turned to the left) shuts down all voltages to the system boards and peripherals.

  • The On position (with the key in the center) enables the system and allows monitoring of menu functions.

  • The Manager position (with the key turned to the right) enables access to additional technical information used by service personnel.

As seen in Figure 5-1, there are four control buttons located on the System Controller front panel. This list describes the buttons in order, from left to right:

  • Press the Menu button to place the display in the menu mode.

  • Press the Scroll Up button to move up one message in the menu.

  • Press the Scroll Down button to move down one message in the menu.

  • Press the Execute button to execute a displayed function or to enter a second-level menu.

The green power-on LED stays lit as long as 48 VDC voltage is being supplied to the system backplane. The yellow fault LED comes on whenever the System Controller detects a fault.

Figure 5-1. System Controller Front Panel Components

Four information options are available to the user when the key is in the On (middle) position:

  • the Master CPU Selection menu

  • the Event History Log menu

  • the Boot Status menu

  • the CPU Activity Display

The information displays are further described in the following sections, table, and figure.

The Master CPU Selection Menu

This menu monitors the current state of the system in the boot arbitration process. Table 5-1 shows the messages that may appear during and after the boot process.

The Event History Log Menu

The System Controller uses space in NVRAM to store up to ten messages. All events logged by the System Controller are stored in the NVRAM log file. After the system successfully boots, the contents of the System Controller log file are transferred to /var/adm/SYSLOG by the master CPU.

Three basic types of system occurrence are logged in the history menu:

  • System error messages are issued in response to a system-threatening event, and the controller shuts down the system immediately after flagging the master CPU.

  • System events that need attention are immediately transmitted to the master CPU; however, no shutdown is implemented by the System Controller.

  • System Controller internal errors are monitored and logged in the menu just like system errors and events. They are transferred to /var/adm/SYSLOG by the CPU just like other errors. If the System Controller internal error is significant, an internal reinitialization takes place. An internal System Controller error never causes the Onyx deskside system to shut down.

Whenever possible, the System Controller alerts the master CPU that a system-threatening error situation exists and a shutdown is about to happen. The System Controller then waits for a brief period for the CPU to perform an internal shutdown procedure. The controller waits for a “Set System Off” command to come back from the master CPU before commencing shutdown. If the command does not come back from the CPU before a specified time-out period, the System Controller proceeds with the shutdown anyway.
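The wait-then-proceed behavior described above can be sketched as a wait with a deadline. This is an illustration of the pattern only, not the controller's actual firmware. The function name, the use of a queue to stand in for the link to the master CPU, and the time-out value are all assumptions; the text does not state the length of the time-out period.

```python
import queue
import time


def await_cpu_before_shutdown(cpu_replies, timeout_seconds):
    """Wait for the master CPU's "Set System Off" reply, up to a deadline.

    Returns "acknowledged" if the reply arrived in time, otherwise
    "time-out". In both cases the caller proceeds with the shutdown.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return "time-out"
        try:
            reply = cpu_replies.get(timeout=remaining)
        except queue.Empty:
            return "time-out"
        if reply == "Set System Off":
            return "acknowledged"
        # Any other message is ignored and the wait continues.


replies = queue.Queue()
replies.put("Set System Off")
outcome = await_cpu_before_shutdown(replies, timeout_seconds=1.0)
print("Shutting down, CPU response:", outcome)
```

The design point is that the reply only affects how soon the shutdown starts, never whether it happens.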

Detection of a system event monitored by the System Controller automatically sends a message to the master CPU. The warning message is recorded in the event history log, and the CPU is expected to take corrective action, if applicable. No system shutdown is implemented.

See Appendix C for a complete list of messages that can appear in the event history log.

Boot Status Menu

The Boot Status menu supplies the last message sent by the master CPU after the master CPU selection process is concluded. A total of five status messages can appear under this menu selection. The messages are listed in Table 5-1, along with a brief explanation of their context and meaning.

Table 5-1. System Controller Master CPU Status Messages

Master CPU Status Message
    Context and Meaning of Message

BOOT ARBITRATION NOT STARTED
    The system CPU board(s) has not begun the arbitration process.

BOOT ARBITRATION IN PROGRESS
    The system CPU boards are communicating to decide which one will be the system master CPU.

BOOT ARBITRATION IS COMPLETE SLOT #0X PROC #0X
    The chosen CPU master has identified itself to the System Controller and communication is fully established.

BOOT ARBITRATION INCOMPLETE FAULT NO MASTER
    The system was unable to assign a system master CPU.

BOOT ARBITRATION ABORTED
    An operator pushed one of the front panel buttons while the System Controller was searching for the system master CPU.


The CPU Activity Display

The activity display is a graph function that provides a series of moving bars placed next to each other on the System Controller's screen. Each of the vertically moving bars on the screen represents the activity of one of the microprocessors in the Onyx deskside workstation.

The activity display is the default menu that appears if the key is in the On position and no keypad selections have been made within the last 60 seconds.


Note: The activity graph is replaced by any detected fault message until the key is turned to the Off position.

The activity graph (also known as a histogram) indicates the processor activity level of each microprocessor within the system. This display is similar to the bar graph display of volume levels on modern stereo receivers. Each bar gives a running account of the volume of processes taking place in a particular microprocessor.


Note: Figure 5-2 shows a total of four microprocessor histogram bars. Your system may have as few as one.

Figure 5-2. Onyx CPU Board Microprocessor Activity Graph (Histogram)

If your system continues to fail, you most likely have a serious software problem, and you must restore the system software and files using the procedures described in the following sections. Refer to the Personal System Administration Guide and the IRIX Admin: Backup, Security, and Accounting manual for additional information. If the system fails to respond at all, call your service organization for assistance.

Recovering From a System Crash

To minimize data loss from a system crash, back up your system daily and verify the backups. Often a graceful recovery from a crash depends upon good backups.

Your system may have crashed if it fails to boot or respond normally to input devices such as the keyboard. The most common form of system crash is terminal lockup—your system fails to accept any commands from the keyboard. Sometimes when a system crashes, data is damaged or lost.

Before going through a crash recovery process, check your terminal configuration and cable connections. If everything is in order, try accessing the system remotely from another workstation or from the system console terminal (if present).

If this does not work, you may try to shut down the graphics interface (Xsgi) by using a simultaneous four-keystroke input called the “Vulcan Death Grip.” Simultaneously press <Shift>, <Ctrl>, <F12>, and the slash (/) key on the numeric keypad.

If none of the solutions in the previous paragraphs is successful, you can fix most problems that occur when a system crashes by using the methods described in the following paragraphs. You can prevent additional problems by recovering your system properly after a crash.

The following list presents several ways to recover your system from a crash. The simplest method, rebooting the system, is presented first. If that fails, go on to the next method, and so on. Here is an overview of the different crash recovery methods:

  • rebooting the system

    Rebooting usually fixes problems associated with a simple system crash.

  • restoring system software

    If you do not find a simple hardware connection problem and you cannot reboot the system, a system file might be damaged or missing. In this case, you need to copy system files from the installation tapes to your hard disk. Some site-specific information might be lost.

  • restoring from backup tapes

    If restoring system software fails to recover your system fully, you must restore from backup tapes. Complete and recent backup tapes contain copies of important files. Some user- and site-specific information might be lost. Read the following section for information on file restoration.

Restoring a Filesystem From the System Maintenance Menu

If your root filesystem is damaged and your system cannot boot, you can restore your system from the System Maintenance Menu. This is the menu that appears when you interrupt the boot sequence before the operating system takes over the system. To perform this recovery, you need two different tapes: your system backup tape and a bootable tape with the miniroot.

If a backup tape is to be used with the System Recovery option of the System Maintenance Menu, it must have been created with the System Manager or with the Backup(1) command, and must be a full system backup (beginning in the root directory (/) and containing all the files and directories on your system). Although the Backup command is a front-end interface to the bru(1) command, Backup also writes the disk volume header on the tape so that the “System Recovery” option can reconstruct the boot blocks, which are not written to the tape using other backup tools. For information on creating the system backup, see the IRIX Admin: Backup, Security, and Accounting manual.

If you do not have a full system backup made with the Backup command or System Manager, and your root or usr filesystems are so badly damaged that the operating system cannot boot, you have to reinstall your system.

If you need to reinstall the system to read your tapes, install a minimal system configuration and then read your full system backup (made with any backup tool you prefer) over the freshly installed software.

This procedure should restore your system to its former state.


Caution: Existing files of the same pathname on the disk are overwritten during a restore operation, even if they are more recent than the files on tape.


  1. When you first start up your machine, you see the following prompt:

    Starting up the system....
    To perform system maintenance instead, press <Esc>
    

  2. Press the <Esc> key. You see the following menu:

    System Maintenance Menu
    1   Start System
    2   Install System Software
    3   Run Diagnostics
    4   Recover System
    5   Enter Command Monitor
    

  3. Enter the numeral 4 and press <Return>. You see the message:

    System Recovery...
    Press Esc to return to the menu.
    

    After a few moments, you see a message:

    Insert the installation tape, then press <enter>: 
    

  4. Insert your bootable tape and press the <Enter> key. You see some messages while the miniroot is loaded. Next you see the message:

    Copying installation program to disk....
    

    Several lines of dots appear on your screen while this copy takes place.

  5. You see the message

    CRASH RECOVERY
    You may type sh to get a shell prompt at most questions.
    Remote or local restore: ([r]emote, [l]ocal): [l]
    

  6. Press <Enter> for a local restoration. If your tape drive is on another system accessible by the network, press r and then the <Enter> key. You are prompted for the name of the remote host and the name of the tape device on that host. If you press <Enter> to select a local restoration, you see the message

    Enter the name of the tape device: [/dev/tape] 
    

    You may need to enter the exact device name of the tape device on your system, since the miniroot may not recognize the link to the convenient /dev/tape filename. As an example, if your tape drive is drive #2 on your integral SCSI bus (bus 0), the most likely device name is /dev/rmt/tps0d2nr. If it is drive #3, the device is /dev/rmt/tps0d3nr.

  7. The system prompts you to insert the backup tape. When the tape has been read back onto your system disk, you are prompted to reboot your system.
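The two device names given in step 6 follow a visible pattern: the SCSI bus number, then the drive number. The helper below builds a name from that pattern. It is a sketch inferred only from those two examples; the function name is hypothetical, and other tape drive types or device variants may use different names, so verify the result against the /dev/rmt directory on your own system.

```python
def tape_device_name(bus, drive):
    """Build a tape device path following the pattern in the step 6 examples.

    Pattern inferred from /dev/rmt/tps0d2nr (bus 0, drive 2) and
    /dev/rmt/tps0d3nr (bus 0, drive 3); it is not a full naming rule.
    """
    if bus < 0 or drive < 0:
        raise ValueError("bus and drive numbers must be non-negative")
    return "/dev/rmt/tps{}d{}nr".format(bus, drive)


print(tape_device_name(bus=0, drive=2))
```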

Recovery After System Corruption

From time to time you may experience a system crash due to file corruption. Systems cease operating (“crash”) for a variety of reasons. Most common are software crashes, followed by power failures of some sort, and least common are actual hardware failures. Regardless of the type of system crash, if your system files are lost or corrupted, you may need to recover your system from backups to its pre-crash configuration.

Once you repair or replace any damaged hardware, you are ready to recover the system. Regardless of the nature of your crash, refer to the section “Restoring a Filesystem from the System Maintenance Menu” in the IRIX Admin: Backup, Security, and Accounting manual.

The System Maintenance Menu recovery command is designed for use as a full backup system recovery. After you have done a full restore from your last complete backup, you may restore newer files from incremental backups at your convenience. This command is designed to be used with archives made using the Backup(1) utility or through the System Manager. The System Manager is described in detail in the Personal System Administration Guide. System recovery from the System Maintenance Menu is not intended for use with the tar(1), cpio(1), dd(1), or dump(1) utilities. You can use these other utilities after you have recovered your system.

You may also be able to restore filesystems from the miniroot. For example, if your root filesystem has been corrupted, you may be able to boot the miniroot, unmount the root filesystem, and then use the miniroot version of restore, xfs_restore, bru, cpio, or tar to restore your root filesystem. Refer to the reference (man) pages on these commands for details on their application.

Refer to the IRIX Admin: System Configuration and Operation manual for instructions on good general system administration practices.