Chapter 5. Troubleshooting

This chapter contains the following sections to help you troubleshoot your system: Troubleshooting Chart, L1 Controller Error Messages, and SGI Electronic Support.

Troubleshooting Chart

Table 5-1 lists recommended actions for problems that can occur on your system. For problems that are not listed in this table, use the SGI Electronic Support system to help solve your problem or contact your SGI system support engineer (SSE). More information about the SGI Electronic Support system is provided in this chapter.

Table 5-1. Troubleshooting Chart

Problem Description

Recommended Action

The system will not power on.

Ensure that the power cord of the PDU is seated properly in the power receptacle.

Ensure that the PDU circuit breaker is on.

If the power cord is plugged in and the circuit breaker is on, contact your SSE.

An individual module will not power on.

Ensure that the power switch at the rear of the module is on (1 position).

View the L1 display; refer to Table 5-2 if an error message is present.

If the L1 controller is not running, contact your SSE.

Check the connection between the module and its power source.

The system will not boot the operating system.

Contact your SSE.

The Service Required LED illuminates on an Origin 300 base, NUMAlink, or a PCI expansion module.

View the L1 display of the failing module; refer to Table 5-2 for a description of the error message.

The Failure LED illuminates on an Origin 300 base, NUMAlink, or a PCI expansion module.

View the L1 display of the failing module; refer to Table 5-2 for a description of the error message.

The green or yellow LED of a NUMAlink port (rear of NUMAlink module) is not illuminated.

Ensure that the NUMAlink cable is seated properly on the NUMAlink module and the destination module.

The PWR LED of a populated PCI slot is not illuminated.

Reseat the PCI card.

The Fault LED of a populated PCI slot is illuminated (on).

Reseat the PCI card. If the Fault LED remains on, replace the PCI card.

The System Status LED of the TP900 is amber.

Contact your SSE.

The Power Status LED of the TP900 is amber.

Contact your SSE to replace the power supply module. The power supply module also has an amber LED that indicates a fault.

The Cooling Status LED of the TP900 is amber.

Contact your SSE to replace the cooling module. The cooling module also has an amber LED that indicates a fault.

The amber LED of a disk drive is on.

Replace the disk drive.


L1 Controller Error Messages

Table 5-2 lists error messages that the L1 controller generates and displays on the L1 display. This display is located on the front of the Origin 300 base modules, the NUMAlink module, and the PCI expansion modules.


Note: In Table 5-2, a voltage warning occurs when a supplied voltage level is 10 percent below or above the nominal (normal) voltage. A voltage fault occurs when a supplied voltage level is 20 percent below or above the nominal voltage.
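
The warning and fault rule in the note can be illustrated with a short Python sketch. This is not part of the L1 controller software; the function name classify_voltage and its return strings are hypothetical, and the sketch assumes that a deviation of exactly 10 or 20 percent already counts as a warning or a fault, which the note does not state explicitly.

```python
def classify_voltage(nominal, measured):
    """Classify a measured supply voltage against its nominal value.

    Illustrative only: applies the 10 percent (warning) and 20 percent
    (fault) deviation limits from the note. Treating the limits as
    inclusive is an assumption.
    """
    if nominal <= 0:
        raise ValueError("nominal voltage must be positive")

    # Relative deviation from nominal, regardless of direction.
    deviation = abs(measured - nominal) / nominal
    direction = "high" if measured > nominal else "low"

    if deviation >= 0.20:
        return direction + " fault"
    if deviation >= 0.10:
        return direction + " warning"
    return "ok"


if __name__ == "__main__":
    # Example readings for a 3.3 V rail (values chosen for illustration).
    for reading in (3.35, 3.7, 2.9, 4.0, 2.5):
        print(reading, classify_voltage(3.3, reading))
```

A "high fault" or "low fault" result corresponds to the fault limit messages in Table 5-2, after which the module begins its power-off sequence.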


Table 5-2. L1 Controller Messages

L1 System Controller Message

Message Meaning and Action Needed

Internal voltage messages:

 

ATTN: x.xV high fault limit reached @ x.xxV

The module begins a 30-second power-off sequence.

ATTN: x.xV low fault limit reached @ x.xxV

The module begins a 30-second power-off sequence.

ATTN: x.xV high warning limit reached @ x.xxV

A higher than nominal voltage condition is detected.

ATTN: x.xV low warning limit reached @ x.xxV

A lower than nominal voltage condition is detected.

ATTN: x.xV level stabilized @ x.xV

A monitored voltage level has returned to within acceptable limits.

Fan messages:

 

ATTN: FAN # x fault limit reached @ xx RPM

A fan has reached its maximum RPM level. The ambient temperature may be too high. Check to see if a fan has failed.

ATTN: FAN # x warning limit reached @ xx RPM

A fan has increased its RPM level. Check the ambient temperature. Check to see if the fan stabilizes.

ATTN: FAN # x stabilized @ xx RPM

An increased fan RPM level has returned to normal.

Temperature messages: low alt.

ATTN: TEMP # advisory temperature reached @ xxC xxF

The ambient temperature at the module's air inlet has exceeded 30 ˚C.

ATTN: TEMP # critical temperature reached @ xxC xxF

The ambient temperature at the module's air inlet has exceeded 35 ˚C.

ATTN: TEMP # fault temperature reached @ xxC xxF

The ambient temperature at the module's air inlet has exceeded 40 ˚C.

Temperature messages: high alt.

 

ATTN: TEMP # advisory temperature reached @ xxC xxF

The ambient temperature at the module's air inlet has exceeded 27 ˚C.

ATTN: TEMP # critical temperature reached @ xxC xxF

The ambient temperature at the module's air inlet has exceeded 31 ˚C.

ATTN: TEMP # fault temperature reached @ xxC xxF

The ambient temperature at the module's air inlet has exceeded 35 ˚C.

Temperature stable message:

 

ATTN: TEMP # stabilized @ xxC/xxF

The ambient temperature at the module's air inlet has returned to an acceptable level.

Power off messages:

 

Auto power down in xx seconds

The L1 controller has registered a fault and is shutting down. The message displays every 5 seconds until shutdown.

Base module appears to have been powered down

The L1 controller has registered a fault and has shut down.
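
The inlet temperature limits in Table 5-2 can be summarized in a short Python sketch. It is illustrative only and is not how the L1 controller is implemented; the names LOW_ALTITUDE_LIMITS, HIGH_ALTITUDE_LIMITS, and classify_inlet_temperature are hypothetical. The sketch assumes that the first group of temperature messages (30, 35, and 40 ˚C) applies at low altitude, because the second group is labeled "high alt."; the table does not say at what altitude the lower limits take effect, so the caller must supply that choice.

```python
# Limits in degrees Celsius, taken from Table 5-2: (advisory, critical, fault).
# Assumption: the first group of temperature messages is the low-altitude set.
LOW_ALTITUDE_LIMITS = (30, 35, 40)
HIGH_ALTITUDE_LIMITS = (27, 31, 35)


def classify_inlet_temperature(temp_c, high_altitude=False):
    """Return 'normal', 'advisory', 'critical', or 'fault'.

    A level is reached only when the temperature exceeds its limit,
    matching the wording "has exceeded" in the table.
    """
    limits = HIGH_ALTITUDE_LIMITS if high_altitude else LOW_ALTITUDE_LIMITS
    advisory, critical, fault = limits

    # Check the most severe level first.
    if temp_c > fault:
        return "fault"
    if temp_c > critical:
        return "critical"
    if temp_c > advisory:
        return "advisory"
    return "normal"


if __name__ == "__main__":
    # The same reading can fall into different levels depending on altitude.
    for temp in (26, 29, 33, 36, 41):
        print(temp,
              classify_inlet_temperature(temp),
              classify_inlet_temperature(temp, high_altitude=True))
```

For example, an inlet temperature of 33 ˚C is only an advisory condition under the low-altitude limits but a critical condition under the high-altitude limits.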


SGI Electronic Support

SGI Electronic Support provides system support and problem-solving services that function automatically, which helps resolve problems before they can affect system availability or develop into actual failures. SGI Electronic Support integrates several services so they work together to monitor your system, notify you if a problem exists, and search for solutions to the problem.

Figure 5-1 shows the sequence of events that occurs if you use all of the SGI Electronic Support capabilities.

Figure 5-1. Full Support Sequence


The sequence of events can be described as follows:

  1. Embedded Support Partner (ESP) monitors your system 24 hours a day.

  2. When a specified system event is detected, ESP notifies SGI via e-mail (plain text or encrypted).

  3. Applications that are running at SGI analyze the information, determine whether a support case should be opened, and open a case if necessary. You and SGI support engineers are contacted (via pager or e-mail) with the case ID and problem description.

  4. SGI Knowledgebase searches thousands of tested solutions for possible fixes to the problem. Solutions that are located in SGI Knowledgebase are attached to the service case.

  5. You and the SGI support engineers can use Supportfolio Online to view and manage the case, search for additional solutions, or schedule maintenance.

  6. You implement the solution.

Most of these actions occur automatically, and you may receive solutions to problems before they affect system availability. You also may be able to return your system to service sooner if it is out of service.

In addition to the event monitoring and problem reporting, SGI Electronic Support monitors both system configuration (to help with asset management) and system availability and performance (to help with capacity planning).

The following three components compose the integrated SGI Electronic Support system:

SGI Embedded Support Partner (ESP) is a set of tools and utilities that are embedded in the IRIX operating system. ESP can monitor a single system or group of systems for system events, software and hardware failures, availability, performance, and configuration changes, and then perform actions based on those events. ESP can detect system conditions that indicate potential problems, and then alert appropriate personnel by pager, console messages, or e-mail (plain text or encrypted). You also can configure ESP to notify an SGI call center about problems; ESP then sends e-mail to SGI with information about the event.

SGI Knowledgebase is a database of solutions to problems and answers to questions that can be searched by sophisticated knowledge management tools. You can log on to SGI Knowledgebase at any time to describe a problem or ask a question. Knowledgebase searches thousands of possible causes, problem descriptions, fixes, and how-to instructions for the solutions that best match your description or question.

Supportfolio Online is a customer support resource that includes the latest information about patch sets, bug reports, and software releases.

The complete SGI Electronic Support services are available to customers who have a valid SGI Warranty, FullCare, FullExpress, or Mission-Critical support contract. To purchase a support contract that allows you to use the complete SGI Electronic Support services, contact your SGI sales representative. For more information about the various support contracts, refer to the following Web page:

http://www.sgi.com/support/customerservice.html

For more information about SGI Electronic Support, refer to the following Web page:

http://www.sgi.com/support/es