Chapter 5. Troubleshooting and Diagnostics

If you are experiencing problems with your Silicon Graphics Fuel workstation, contact your service provider.

This chapter includes the following sections:

  • Troubleshooting

  • Diagnostics

Troubleshooting

This section covers the following topics:

  • Environmental Fault Monitoring

  • LED Lightbar

Environmental Fault Monitoring

The workstation monitors its environment to ensure proper operation. It will automatically power off if any of the following faults are found:

  • Any fan spins at less than 80% of nominal speed.

  • Any temperature sensor registers 158 °F (70 °C) or above.

  • Any voltage deviates from its nominal value by 20% or more, in either direction.

If your workstation is powering off unexpectedly, check for these conditions.
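
The three shutdown conditions can be expressed as simple threshold checks. The following Python sketch is illustrative only and is not part of the workstation firmware or software. The function names and the form of the sensor readings are hypothetical, and the voltage rule assumes that "+/- 20% of nominal" means a deviation of 20% or more in either direction.

```python
# Illustrative threshold checks for the faults listed above.
# All names here are hypothetical; the workstation exposes no such API.

FAN_MIN_FRACTION = 0.80    # fault if a fan spins below 80% of nominal speed
TEMP_LIMIT_C = 70.0        # fault at 70 °C (158 °F) or above
VOLTAGE_TOLERANCE = 0.20   # assumption: fault at a 20% deviation or more


def fan_fault(rpm, nominal_rpm):
    """Return True if the fan speed is below 80% of nominal."""
    return rpm < FAN_MIN_FRACTION * nominal_rpm


def temperature_fault(temp_c):
    """Return True if a sensor reads 70 °C or above."""
    return temp_c >= TEMP_LIMIT_C


def voltage_fault(volts, nominal_volts):
    """Return True if the voltage is 20% or more away from nominal."""
    return abs(volts - nominal_volts) >= VOLTAGE_TOLERANCE * abs(nominal_volts)


def shutdown_reasons(fans, temps_c, voltages):
    """List every condition that would trigger an automatic power-off.

    fans and voltages are lists of (measured, nominal) pairs;
    temps_c is a list of temperature readings in degrees Celsius.
    """
    reasons = []
    for i, (rpm, nominal) in enumerate(fans):
        if fan_fault(rpm, nominal):
            reasons.append(f"fan {i}: {rpm:.0f} rpm is below 80% of {nominal:.0f} rpm")
    for i, temp in enumerate(temps_c):
        if temperature_fault(temp):
            reasons.append(f"sensor {i}: {temp:.1f} C is at or above {TEMP_LIMIT_C:.0f} C")
    for i, (volts, nominal) in enumerate(voltages):
        if voltage_fault(volts, nominal):
            reasons.append(f"rail {i}: {volts:.2f} V is 20% or more from {nominal:.2f} V")
    return reasons
```

For example, shutdown_reasons([(2000, 3000)], [45.0], [(5.0, 5.0)]) reports only the slow fan, because 2000 rpm is below 80% of 3000 rpm (2400 rpm).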

LED Lightbar

The LED lightbar on the workstation bezel can provide important troubleshooting information. Table 5-1 shows a list of LED signals and what they mean.

Table 5-1. LED Lightbar Signals

LED Lightbar Signal       Explanation
------------------------  ------------------------------------------------------
Blinking white            Power button pressed (On or Off)
Solid white               Successful PROM boot/OS running
Solid red                 System board failure (failed to read PROM at power on)
Blinking red              During boot sequence: memory error
                          While OS is running: kernel panic
Blinking red and white    Graphics configuration error

Diagnostics

The Silicon Graphics Fuel visual workstation is equipped with diagnostics to test the system hardware and diagnose part failures. These diagnostics are grouped into three categories:

  • Power-on diagnostics (POD)
    Power-on diagnostics are PROM-resident tests that run automatically when you power on the system. As the boot process discovers hardware components, it runs power-on diagnostics to verify that each component that is needed to boot the system is working correctly. Refer to “Power-on Diagnostics” for more information about POD.

  • Offline diagnostics
    Offline diagnostics use a standalone diagnostic environment to test the system hardware; the operating system cannot be running while you use offline diagnostics. Refer to “Offline Diagnostics” for more information.

  • Online diagnostics
    Online diagnostics are tests that verify system hardware while the operating system is running. To prevent data loss, you should use the online diagnostics only when the system is idle. Refer to “Online Diagnostics” for more information.


    Note: The diagnostics described in this document run only on Silicon Graphics Fuel visual workstations. They will not work on any other SGI systems.


Power-on Diagnostics

The power-on diagnostics run automatically when you power on or reset the system. As the boot process discovers hardware, it verifies that each component is functional enough to load the operating system.

The power-on diagnostics test the hardware in the following order:

  • CPU

  • Bedrock ASIC

  • PROM

  • Memory DIMMs

  • Secondary cache

  • Xbridge ASIC

  • PCI slots

  • Serial ports

  • SCSI controller

  • Keyboard and mouse

  • VPro graphics

  • Ethernet port

If the power-on diagnostics complete successfully, the System Maintenance menu appears or the system automatically boots, depending on how the system is configured.

If the power-on diagnostics detect errors, the diagnostics disable the failing hardware and continue testing. When testing completes, the system may or may not be able to boot, depending on the hardware that has been disabled. If the system does not boot, contact your service representative. For more information about product support, refer to “Product Support”.

Offline Diagnostics

Offline diagnostics run a sequence of tests on the system hardware under a standalone diagnostic environment; the operating system cannot be running while the offline diagnostics test the system.

The offline diagnostics include a “launcher” that automatically runs a sequence of tests. In most cases, you should run the offline diagnostics automatically with the launcher. Use the following procedure to run the launcher:

  1. Power on the system.

  2. Wait until the System Maintenance menu appears.


    Note: If the Autoload PROM variable is set to Yes, you must click on the Stop for Maintenance button to access the System Maintenance menu.


  3. Select the Run Diagnostics option.


    Note: You can also start the launcher by entering the following command at the command monitor (PROM) prompt (>>):
    boot -f dksc(0,1,0)/stand/smdk/smdk --a


The launcher automatically runs the offline diagnostics on system components in the following order:

  • CPU

  • Secondary cache

  • Memory DIMMs

  • Motherboard (including the USB ports, serial ports, Ethernet port, parallel port, mouse port, keyboard port, Xbridge ASIC, and PCI slots)


    Note: The offline diagnostics test the simpler components first and then proceed to the more complex components.


Table 5-2 shows the approximate time required (in minutes and seconds format) to automatically run the offline diagnostics on a workstation with a 500-MHz processor and 512 MB of memory. (Your testing time will vary, depending on your hardware configuration.)

Table 5-2. Time Required to Run Offline Diagnostics

Testing Progress                           Total Elapsed Time (min:sec)
-----------------------------------------  ----------------------------
The launcher boot-up sequence starts       0:00
The launcher boot-up sequence completes    0:10
PIMM testing completes                     0:40
Secondary cache testing completes          1:17
Memory DIMM testing completes              5:05
Motherboard testing completes              7:30
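
Table 5-2 lists cumulative times. To see how long each stage takes on its own, subtract each elapsed time from the one that follows it. The following Python sketch is illustrative only; the function names are hypothetical and the values are the approximate figures from Table 5-2.

```python
# Per-stage durations derived from the cumulative times in Table 5-2.
# Function names are hypothetical; the times are the table's approximate values.

CUMULATIVE = [
    ("Launcher boot-up", "0:10"),
    ("PIMM testing", "0:40"),
    ("Secondary cache testing", "1:17"),
    ("Memory DIMM testing", "5:05"),
    ("Motherboard testing", "7:30"),
]


def to_seconds(mmss):
    """Convert a 'minutes:seconds' string to a number of seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def to_mmss(total_seconds):
    """Convert a number of seconds to a 'minutes:seconds' string."""
    return f"{total_seconds // 60}:{total_seconds % 60:02d}"


def stage_durations(rows):
    """Return (stage, duration) pairs from (stage, cumulative time) pairs."""
    durations = []
    previous = 0  # the launcher boot-up sequence starts at 0:00
    for name, elapsed in rows:
        current = to_seconds(elapsed)
        durations.append((name, to_mmss(current - previous)))
        previous = current
    return durations


if __name__ == "__main__":
    for name, duration in stage_durations(CUMULATIVE):
        print(f"{name:<26}{duration}")
```

On the reference configuration, memory DIMM testing accounts for 3:48 of the 7:30 total, so systems with more memory can expect a noticeably longer run.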

The offline diagnostics display test status information as they run. If the diagnostics complete testing without detecting errors, the output is similar to the following example:

SMDK SGI Version 6.93 TEST built 10:20:12 AM Sep 21, 2001
smdk loading io discovery code...
smdk loading launcher code...
smdk>term none
Setting up diagnostics.....
Starting diagnostics.....
Testing  PIMM........   PASSED
Testing  CACHE................   PASSED
Testing  DIMM........................................................................................................................................................................................................................................................................................................   PASSED
Testing  Mother Board...

FINISHED
All diagnostics passed.
resetting the system...

If the launcher detects an error, it displays a FAILED status message for the hardware it is testing and stops testing. If any of the components do not pass the offline diagnostics, contact your service representative.
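
If you capture the launcher output (for example, from a serial console log), you can summarize it with a short script before you call your service representative. The following Python sketch is illustrative only. The function names are hypothetical, and the sketch assumes that a FAILED line has the same layout as the PASSED lines in the example above; this document does not show a FAILED line.

```python
# Summarize captured launcher output. Names are hypothetical.
# Assumption: a failing component prints "Testing  NAME....   FAILED".
import re

# "Testing", the component name, a run of progress dots, optional status.
RESULT_LINE = re.compile(r"^Testing\s+(.+?)\.{2,}\s*(PASSED|FAILED)?\s*$")


def parse_launcher_output(text):
    """Map each tested component to 'PASSED', 'FAILED', or None.

    None means the line carried no status, as with the
    'Testing  Mother Board...' line in the example output.
    """
    results = {}
    for line in text.splitlines():
        match = RESULT_LINE.match(line.strip())
        if match:
            results[match.group(1).strip()] = match.group(2)
    return results


def all_passed(text):
    """Return True if the launcher printed its overall success message."""
    return "All diagnostics passed." in text


def failed_components(text):
    """Return the names of the components reported as FAILED."""
    return [name for name, status in parse_launcher_output(text).items()
            if status == "FAILED"]
```

Because the launcher stops at the first error, failed_components normally returns at most one name.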

Online Diagnostics


Caution: The runalldiags script should be run while the system is idle. If you run the online diagnostics while the system is in use, data may be lost.

Online diagnostics are tests that verify system hardware while the operating system is running. When you run the online diagnostics from the IRIX operating system prompt, each diagnostic runs a set of tests for a certain number of loops. The online diagnostics test the following areas of the system:

  • CPU

  • Memory

  • I/O

  • Graphics

  • Storage devices

  • Network devices

The online diagnostics also run a system stress test, which tests all areas of the system under heavy load.

The Customer Diagnostics 1.0 CD, SGI part number 812-1122-001, includes the online diagnostics that are available for customer use. This CD ships with all Silicon Graphics Fuel visual workstations. You need to install files from the CD on a system before you can run the online diagnostics. The CD booklet includes installation procedures.

The runalldiags script automatically runs a sequence of online diagnostics. It runs in three modes:

  • Basic mode verifies memory and performs 30 minutes of stress testing. (If you want to perform regularly scheduled testing, use basic mode.)

  • Normal mode performs the same tests as basic mode and also performs I/O testing. (The I/O testing may disrupt the serial port and USB devices.)

  • Extensive mode performs more disruptive I/O testing. (Ethernet is unavailable, and USB operations are disrupted.) It also performs more intensive CPU, memory, and stress testing. Use this mode only if you suspect there is a problem with the system.

Follow these steps to run the runalldiags script:


Note: You must have root level access to the system to run online diagnostics.


  1. Enter the following command at the command prompt to change to the directory that contains the diagnostics:
    cd /usr/diags/bin

  2. Enter the following command to start the script:
    ./runalldiags [options]


    Note: When you run runalldiags in -normal or -extensive modes, you should run it from the console. The Ethernet testing that runalldiags performs in -normal and -extensive modes disrupts any telnet sessions on the system.


Refer to Table 5-3 for descriptions of the command-line options.

Table 5-3. runalldiags Command-line Options

Option            Description
----------------  ------------------------------------------------------------
-h | -help        Displays help information
-basic            Runs the script in basic mode
-normal           Runs the script in normal mode (default)
-extensive        Runs the script in extensive mode
-host <host>      Specifies a system to target for network tests
-d <directory>    Specifies the directory that contains the online diagnostics
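
If you start runalldiags from a wrapper script, build the command line only from the options in Table 5-3. The following Python sketch is illustrative only: it assembles an argument list and does not run anything. The function name and the order in which it places the options are assumptions.

```python
# Assemble a runalldiags command line from the options in Table 5-3.
# The function name and the option order are assumptions.

MODES = ("basic", "normal", "extensive")


def runalldiags_argv(mode="normal", host=None, diag_dir=None):
    """Return the argument list for one runalldiags invocation.

    mode     -- "basic", "normal" (the script's default), or "extensive"
    host     -- optional system to target for network tests (-host)
    diag_dir -- optional directory that contains the diagnostics (-d)
    """
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    argv = ["./runalldiags", f"-{mode}"]
    if host:
        argv += ["-host", host]
    if diag_dir:
        argv += ["-d", diag_dir]
    return argv
```

For example, runalldiags_argv("basic") returns ["./runalldiags", "-basic"], which matches the command shown in the basic-mode examples in this chapter. Run the resulting command as root from /usr/diags/bin.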

If a diagnostic fails, the script saves the output from the diagnostic in a file in the /tmp directory (for example, /tmp/diagTestOutput.1.olenet). Output from the script indicates the actual name of the file. When a diagnostic fails, the script continues to run the remaining diagnostics.


Note: If you have USB devices connected to your workstation, you must disconnect the USB cables from the rear of the enclosure after the online diagnostics have finished running. Then reconnect the cables to restore the USB devices.


Example Output

Online diagnostics display PASS(testname) when a test passes and FAIL(testname) when a test fails.

The following example shows output from running runalldiags in basic mode with no errors:

shad# ./runalldiags -basic
 
Running online diagnostics at Basic level

Time: Mon Oct  1 10:55:53 CDT 2001
System Information: IRIX64 shad 6.5-wolfi-root-SN1O 6.5.10m 07171440 IP35
Plan on running: olmem pandora

olmem - Online Memory Diagnostic    (Check /var/adm/SYSLOG for error message)
/usr/diags/bin/olmem
PASS(olmem)
pandora - System Stress Test
/usr/diags/bin/pandora -runtime 30
PASS(pandora)

Finished running at Mon Oct  1 11:35:38 CDT 2001
Ran: 2 Failed: 0

The following example shows output from running runalldiags in basic mode with one error:

shad# ./runalldiags -basic
 
Running online diagnostics at Basic level

Time: Mon Oct  1 10:55:53 CDT 2001
System Information: IRIX64 shad 6.5-wolfi-root-SN1O 6.5.10m 07171440 IP35
Plan on running: olmem pandora

olmem - Online Memory Diagnostic    (Check /var/adm/SYSLOG for error message)
/usr/diags/bin/olmem
PASS(olmem)
pandora - System Stress Test
/usr/diags/bin/pandora -runtime 30
FAIL(pandora): see /tmp/diagFailure.0.pandora
Finished running at Mon Oct 1 11:35:38 CDT 2001
Ran: 1 Failed: 1

If any of the components do not pass the online diagnostics, contact your service representative.
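
If you save the runalldiags output to a file, a short script can pull out the failed tests and the files that hold their output. The following Python sketch is illustrative only. The function names are hypothetical, and the patterns are based solely on the two examples above; other modes or versions of the script may print different lines.

```python
# Extract results from saved runalldiags output. Names are hypothetical.
# The patterns follow the example output only.
import re

# PASS(test) or FAIL(test), optionally followed by ": see <file>".
RESULT = re.compile(r"^(PASS|FAIL)\(([^)]+)\)(?::\s*see\s+(\S+))?")
SUMMARY = re.compile(r"^Ran:\s*(\d+)\s+Failed:\s*(\d+)")


def parse_runalldiags(text):
    """Return (results, summary).

    results is a list of (status, test name, output file or None);
    summary is the (ran, failed) pair from the last line, or None.
    """
    results = []
    summary = None
    for line in text.splitlines():
        line = line.strip()
        match = RESULT.match(line)
        if match:
            results.append(match.groups())
            continue
        match = SUMMARY.match(line)
        if match:
            summary = (int(match.group(1)), int(match.group(2)))
    return results, summary


def failures(text):
    """Return (test name, output file) for each failed diagnostic."""
    results, _ = parse_runalldiags(text)
    return [(name, path) for status, name, path in results if status == "FAIL"]
```

Applied to the second example above, failures returns [("pandora", "/tmp/diagFailure.0.pandora")]. Have that file available when you contact your service representative.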