Index

64-bit address space

Selecting an ABI and ISA

adi2 example program

Program adi2

aliasing models

Understanding Aliasing Models

Amdahl's law

Understanding Parallel Speedup and Amdahl's Law

awk script for

Awk Script for Amdahl's Law Estimation

execution time given n and p

Predicting Execution Time with n CPUs

parallel fraction p

Understanding Amdahl's Law

parallel fraction p given speedup(n)

Calculating the Parallel Fraction of a Program

speedup(n) given p

Understanding Amdahl's Law

superlinear speedup

Understanding Superlinear Speedup

application binary interface (ABI)

Selecting an ABI and ISA

64-bit

64-Bit ABI

new 32-bit

New 32-Bit ABI

old 32-bit

Old 32-Bit ABI

arithmetic error

Understanding Arithmetic Standards

array padding

Using Array Padding to Prevent Thrashing
Diagnosing and Eliminating Cache Thrashing
Using Array Padding

auto-parallelizing

Compiling Serial Code for Parallel Execution

Bentley, Jon

Bentley's Rules Updated

cache

and hardware event counter

Primary Cache Use

blocking

Understanding Cache Blocking
Controlling Cache Blocking

cache miss

Understanding Level-One and Level-Two Cache Use

coherent

Understanding Cache Coherency
Cache Coherency Events

compiler's model of

Adjusting the Optimizer's Cache Model

contention in

Diagnosing Cache Problems

correcting

Correcting Cache Contention in General

event 31 reveals

Diagnosing Cache Problems
Identifying False Sharing

diagnosing problems in

Identifying Cache Problems with Perfex and SpeedShop
Diagnosing Cache Problems

directory-based

Memory Overhead Bits
Understanding Directory-Based Coherency

false sharing of

Identifying False Sharing

L1

Level-1 Cache
Understanding Level-One and Level-Two Cache Use
Primary Cache Use

L2

Level-Two Cache
Understanding Level-One and Level-Two Cache Use
Secondary Cache Use

line size

Understanding Level-One and Level-Two Cache Use

data structure blocking for

Data Structure Augmentation

on-chip

Cache Architecture

operation of

Understanding Cache Coherency
Understanding Directory-Based Coherency
Understanding Level-One and Level-Two Cache Use

principles of use

Principles of Good Cache Use

proper use of

Principles of Good Cache Use
Using Other Cache Techniques

array padding

Using Array Padding

blocking data for

Understanding Cache Blocking
Controlling Cache Blocking

grouping related data for

Grouping Data Used at the Same Time

loop fusion for

Understanding Loop Fusion

parallel execution issues

Diagnosing Cache Problems

stride-one access for

Using Stride-One Access

transposition for

Understanding Transpositions

set-associative

Understanding Level-One and Level-Two Cache Use

thrashing in

Understanding Cache Thrashing

snoopy

Coherency Methods

thrashing

Understanding Cache Thrashing
Diagnosing and Eliminating Cache Thrashing

cache coherence

and hardware event counter

Cache Coherency Events

cache coherency

Understanding Cache Coherency

cache line

Understanding Level-One and Level-Two Cache Use

call hierarchy profile

Profiling the Call Hierarchy

compiler directive

See directive

compiler feedback file

Creating a Compiler Feedback File

compiler flag

See compiler option

compiler option

-32

Old 32-Bit ABI

-64

64-Bit ABI

recommended

Understanding Compiler Options

-apo

Compiling an Auto-Parallel Version of a Program

-check_bounds

Computational Differences
Using Array Padding

-clist

Reading the Transformation File

default

Understanding Compiler Options

-fb

Creating a Compiler Feedback File
Passing a Feedback File

-flist

Reading the Transformation File

for cache model

Adjusting the Optimizer's Cache Model

IEEE_arithmetic

Exploit Algebraic Identities

-INLINE

Using Manual Inlining
Using Automatic Inlining

-IPA

Requesting IPA

forcedepth

Using Automatic Inlining

inline

Using Automatic Inlining

space

Using Automatic Inlining

-LNO

Using Loop Nest Optimization

blocking

Adjusting Cache Blocking Block Sizes

fission

Controlling Fission and Fusion

gather_scatter

Understanding Gather-Scatter

ignore_pragmas

Requesting LNO

interchange=off

Using Loop Interchange

outer_unroll

Controlling Loop Unrolling

prefetch

Controlling Prefetching

vintr

Vector Intrinsics

-mips3

New 32-Bit ABI

-mips4

New 32-Bit ABI
Recommended Starting Options

-n32

New 32-Bit ABI
Recommended Starting Options

-On

Setting Optimization Level with -On

-O2

Recommended Starting Options

-O3

for SWP

Enabling Software Pipelining with -O3

-Ofast

versus -O3

Compile -O3 or -Ofast for Critical Modules

-Olimit

Using Automatic Inlining

-OPT

alias

Understanding Aliasing Models

cray_ivdep

Breaking Other Dependencies

IEEE_arithmetic

Recommended Starting Options
IEEE Conformance

IEEE_NaN_inf

IEEE Conformance

liberal_ivdep

Breaking Other Dependencies

reorg_common

Using Array Padding

roundoff

Roundoff Control

-r10000

Standard Math Library
Setting Target System with -TARG

-r5000

Standard Math Library
Setting Target System with -TARG

-r8000

Standard Math Library
Setting Target System with -TARG

roundoff

Exploit Algebraic Identities

-S

Reading Software Pipelining Messages

-static

Uninitialized Variables

-TARG

Setting Target System with -TARG

-TENV

Profiling Exception Frequency

X

Controlling the Level of Speculation

copying

to reduce TLB thrashing

Using Copying to Circumvent TLB Thrashing

correctness

Getting the Right Answers

CPU

See MIPS CPU

CrayLink

Hub and NUMAlink

data distribution

Using Data Distribution Directives

and dplace

Using _DSM_VERBOSE

directives for

Understanding Directive Syntax

Distribute directive

Using Distribute for Loop Parallelization

mapping types

Understanding Distribution Mapping Options

ONTO clause

Understanding the ONTO Clause

page placement

Using the Page_Place Directive for Custom Mappings

redistribution

Understanding the Redistribution Directives

reshaped

Using Reshaped Distribution Directives

restrictions

Restrictions of Reshaped Distribution

data placement

Scalability and Data Placement

for libmp programs

Tuning Data Placement for MP Library Programs

modifying code for

Modifying the Code to Tune Data Placement

DAXPY

Understanding Software Pipelining

and alias model

Understanding Aliasing Models

loop fusion of

Understanding Loop Fusion

with indirection

Breaking Other Dependencies

debugging

possible with -O2

Start with -O2 for All Modules

use -O0 for

Use -O0 for Debugging

dependency

Breaking Other Dependencies

directive

blocking size

Adjusting Cache Blocking Block Sizes

for data distribution

Fortran Source with Directives
Using Data Distribution Directives

Distribute

Using Distribute for Loop Parallelization

page place

Using the Page_Place Directive for Custom Mappings

syntax

Understanding Directive Syntax

for loop interchange

Using Loop Interchange

for loop nest optimizer

Requesting LNO

for loop unrolling

Controlling Loop Unrolling

for parallel execution

Fortran Source with Directives

affinity clause

Using Parallel Do with Distributed Data
Understanding the AFFINITY Clause for Data
Understanding the AFFINITY Clause for Threads

nest clause

Understanding the NEST Clause

for prefetching

Controlling Prefetching

ivdep

Breaking Other Dependencies

OpenMP

Fortran Source with Directives

dlook

Applying dlook

dplace

Non-MP Library Programs and Dplace

disables data distribution directives

Using _DSM_VERBOSE

enable migration with

Enabling Page Migration

library interface to

Using the dplace Library for Dynamic Placement

not for use with libmp

Non-MP Library Programs and Dplace

placement file

Placement File Syntax

distribute statement

Assigning Threads to Memories

memories statement

Using the memories Statement

threads statement

Using the threads Statement

set page size with

Changing the Page Size
Using Larger Page Sizes to Reduce TLB Misses

specify topology with

Specifying the Topology

with MPI

Using dplace with MPI 3.1

dprof

Applying dprof

dynamic page migration

Dynamic Page Migration
Enabling Page Migration
Trying Dynamic Page Migration

administration

Trying Dynamic Page Migration

enabling

Trying Dynamic Page Migration

environment variable

_DSM_MIGRATION

Trying Dynamic Page Migration
Experimenting with Migration Levels

_DSM_PPM

Advanced Options

_DSM_ROUND_ROBIN

Trying Round-Robin Placement

_DSM_VERBOSE

Using _DSM_VERBOSE

for SpeedShop

Identifying False Sharing

in dplace placement file

Using Environment Variables in Placement Files

MP_SET_NUMTHREADS

Controlling a Parallelized Program at Run Time

MPI_DSM_OFF

Using dplace with MPI 3.1

PAGESIZE_*

Using Larger Page Sizes to Reduce TLB Misses

SGI_ABI

Specifying the ABI

SpeedShop use of

Sampling Through Other Hardware Counters

TRAP_FPE

Understanding Treatment of Underflow Exceptions

event counter

See hardware event counter

exception

event counter overflow

R10000 Counter Event Types

from speculative execution

Permitting Speculative Execution

handling

Using Exception Profiling

profiling occurrence of

Using Exception Profiling

TLB miss

Understanding TLB and Virtual Memory Use

underflow

Understanding Treatment of Underflow Exceptions

exception profile

Using Exception Profiling

false sharing

Memory Contention
Identifying False Sharing

fast Fourier transform (FFT)

Understanding Transpositions

data placement for

First-Touch Placement with Multiple Data Distributions

feedback file

Creating a Compiler Feedback File

use of

Passing a Feedback File

FFT

See fast Fourier transform (FFT)

first-touch placement

Using First-Touch Placement
Programming For First-Touch Placement

floating-point exception

See exception

floating-point status register (FSR)

Understanding Treatment of Underflow Exceptions

graduated instruction

Graduated Instructions

hardware event counter

R10000 Counter Event Types

branch instructions

Branching Instructions

cache coherency

Cache Coherency Events

cache use

Primary Cache Use

clock cycles

Clock Cycles

event 21

Displaying Operation Counts
Finding and Removing Memory Access Problems

event 31

Sampling Through Other Hardware Counters
Finding and Removing Memory Access Problems
Diagnosing Cache Problems
Identifying False Sharing

event 4

Finding and Removing Memory Access Problems

instruction counts

Instructions Issued and Done

lock instructions

Lock-Handling Instructions

profiling from

Sampling through Hardware Event Counters
Sampling Through Other Hardware Counters

TLB miss

Virtual Memory Use

hardware graph

Indicating Resource Affinity

hardware trap

See exception, page fault, TLB

hub

SN0 Organization
Hub and NUMAlink

cache coherency support

Understanding Directory-Based Coherency

hypercube

SN0 Organization
SN0 Memory Distribution

ideal time profile

Using Ideal Time Profiling

IEEE 754

Understanding Arithmetic Standards

versus optimization

IEEE Conformance

IEEE arithmetic

Understanding Arithmetic Standards

inlining

Understanding Inlining

automatic versus manual

Understanding Inlining

manual with -INLINE

Using Manual Inlining

instruction scheduling

Setting Target System with -TARG
Understanding Software Pipelining

instruction set architecture (ISA)

MIPS I

Old 32-Bit ABI

MIPS II

Old 32-Bit ABI

MIPS III

Old 32-Bit ABI
New 32-Bit ABI

MIPS IV

MIPS IV Instruction Set Architecture
New 32-Bit ABI

interprocedural analysis (IPA)

Exploiting Interprocedural Analysis

applied during link step

Compiling and Linking with IPA

features of

Exploiting Interprocedural Analysis

requesting

Requesting IPA

-IPA

See compiler option, -IPA

IRIX

memory management in

SN0 Memory Management

porting to

Dealing with Porting Issues

lazy evaluation

Lazy Evaluation

ld

performs IPA

Compiling and Linking with IPA

library

BLAS

CHALLENGEcomplib Library
SCSL Library

CHALLENGEcomplib

Exploiting Existing Tuned Code
CHALLENGEcomplib Library

EISPACK

CHALLENGEcomplib Library

LAPACK

CHALLENGEcomplib Library
SCSL Library

libc

Standard Math Library

libfastm

Exploiting Existing Tuned Code
libfastm Library
Recommended Starting Options

libfpe

Using Exception Profiling
Understanding Treatment of Underflow Exceptions

libmp

Controlling a Parallelized Program at Run Time

conflicts with dplace

Non-MP Library Programs and Dplace

data placement with

Tuning Data Placement for MP Library Programs

page migration with

Trying Dynamic Page Migration
Experimenting with Migration Levels

page size control

Using Larger Page Sizes to Reduce TLB Misses

round-robin placement with

Trying Round-Robin Placement

LINPACK

CHALLENGEcomplib Library

SCSL

Exploiting Existing Tuned Code
SCSL Library

library routine

bzero

Initializing to Zero

calloc

Initializing to Zero

dplace_file

Using the dplace Library for Dynamic Placement

dplace_line

Using the dplace Library for Dynamic Placement

dsm_home_threadnum

Using Dynamic Placement Information

handle_sigfpes

Using Exception Profiling

sasum

Using Reshaped Distribution Directives

sscal

Using Reshaped Distribution Directives

-LNO

See loop nest optimizer (LNO) and compiler option -LNO

loop fission

Using Loop Fission

loop fusion

by LNO

Using Loop Fusion

manual

Understanding Loop Fusion

loop interchange

Using Loop Interchange

disabling

Using Loop Interchange

loop nest optimizer (LNO)

Using Loop Nest Optimization

cache blocking by

Controlling Cache Blocking

controlling

Adjusting Cache Blocking Block Sizes

disable loop transformation

Requesting LNO

gather-scatter by

Understanding Gather-Scatter

loop fission by

Using Loop Fission

loop fusion by

Using Loop Fusion

loop interchange

Using Loop Interchange

loop unrolling

Using Outer Loop Unrolling

prefetching by

Prefetch Overhead and Unrolling

requesting

Requesting LNO

transformed source file

Reading the Transformation File

vector intrinsic transformation

Vector Intrinsics

loop peeling

Using Loop Fusion

loop unrolling

and roundoff

Roundoff Control

and SWP

Using Outer Loop Unrolling

by loop nest optimizer (LNO)

Using Outer Loop Unrolling

with loop interchange

Combining Loop Interchange and Loop Unrolling

makefile

example

Basic Makefile

use of

Using a Makefile

math libraries

Exploiting Existing Tuned Code

vector intrinsics

Standard Math Library

matrix multiply

loop unrolling of

Using Outer Loop Unrolling

memory use in

Understanding Cache Blocking

performance of

Understanding Cache Blocking

matrix multiply

cache blocking of

Controlling Cache Blocking

memory

64-bit addressing

Selecting an ABI and ISA

administrator setup

Using Larger Page Sizes to Reduce TLB Misses
Trying Dynamic Page Migration

bus-based

Memory for Multiprocessors
Scalability in Multiprocessors

cache directory bits

Memory Overhead Bits

contention for

Memory Contention

distributed versus shared

Shared Memory Multiprocessing

error correction bits

Memory Overhead Bits

hierarchy

Understanding the Levels of the Memory Hierarchy

latency of

SN0 Latencies and Bandwidths
Degrees of Latency

locality management

Memory Locality Management

management by IRIX

SN0 Memory Management

page fault

Understanding TLB and Virtual Memory Use

paged virtual

Understanding TLB and Virtual Memory Use

parallel execution tuning

Finding and Removing Memory Access Problems

physical address display

Page Address Routine va2pa()

placement

first-touch

Using First-Touch Placement
Programming For First-Touch Placement

round-robin

Using Round-Robin Placement
Trying Round-Robin Placement

prefetching

Understanding Prefetching
Using Prefetching

stride

Using Stride-One Access

virtual

Understanding Level-One and Level-Two Cache Use

See also page

memory locality domain (MLD)

Memory Locality Management
Memory Locality Domain Use

memory locality domain set (MLDS)

Memory Locality Domain Use

Message-Passing Interface (MPI)

Message-Passing Models MPI and PVM

dplace with

Using dplace with MPI 3.1

perfex with

Using perfex with MPI

MIPS CPU

architecture of

Understanding MIPS R10000 Architecture
Understanding Prefetching

event counters in

R10000 Counter Event Types

issued versus graduated instruction

Graduated Instructions

off-chip cache

Level-Two Cache

on-chip cache

Cache Architecture

out-of-order execution

Executing Out of Order

R10000

speculative execution

Hardware Speculative Execution

underflow control

Understanding Treatment of Underflow Exceptions

R4000

Specifying the ABI

R8000

Specifying the ABI
Software Speculative Execution
Dealing with Software Pipelining Failures

underflow ignored on

Understanding Treatment of Underflow Exceptions

specify to compiler

Standard Math Library

speculative execution

Speculative Execution

superscalar features

Superscalar CPU Features

See also hardware event counter

MIPS IV ISA

MIPS IV Instruction Set Architecture

and IEEE 754

IEEE Conformance

prefetch in

Understanding Prefetching

MP library

See library, libmp

MPI

See Message-Passing Interface (MPI)

mpirun

with perfex

Using perfex with MPI

node

SN0 Organization
SN0 Node Board

CPU in

CPUs and Memory

nonuniform memory access (NUMA)

SN0 Memory Distribution
Dealing With Nonuniform Access Time

and parallel program

Parallel Programs under NUMA

and single-threaded program

Single-Threaded Programs under NUMA

numeric error

Understanding Arithmetic Standards

OpenMP directives

Fortran Source with Directives

C pragmas for

C and C++ Source with Pragmas

-OPT

See compiler option, -OPT

optimization level

Setting Optimization Level with -On

out-of-order execution

Executing Out of Order

packing

Packing

page

Understanding TLB and Virtual Memory Use

migration of

Dynamic Page Migration
Enabling Page Migration
Trying Dynamic Page Migration

size of

Dynamic Page Migration
Policy Modules
Single-Threaded Programs under NUMA
Using Larger Page Sizes to Reduce TLB Misses

set with dplace

Changing the Page Size

valid sizes

Using Larger Page Sizes to Reduce TLB Misses

page fault

Understanding TLB and Virtual Memory Use

parallel execution

affinity clause

Using Parallel Do with Distributed Data
Understanding the AFFINITY Clause for Data
Understanding the AFFINITY Clause for Threads

Amdahl's law

Understanding Parallel Speedup and Amdahl's Law

auto-parallelizing

Compiling Serial Code for Parallel Execution

data placement for

Scalability and Data Placement

memory access tuning for

Finding and Removing Memory Access Problems

nest clause

Understanding the NEST Clause

parallel fraction p

Understanding Amdahl's Law
Ensuring That the Program Is Properly Parallelized

programming models for

Explicit Models of Parallel Computation

scalability of

Scalability in Multiprocessors
Scalability and Data Placement

topology

Specifying the Topology

tuning SN0 for

Tuning Parallel Code for SN0

perfex

Analyzing Performance with perfex

absolute event counts

Taking Absolute Counts of One or Two Events

analytic output

Getting Analytic Output with the -y Option

awk script to parse

Awk Script for Perfex Output

cache use analysis

Identifying Cache Problems with Perfex and SpeedShop

library interface

Collecting Data over Part of a Run

statistical counts

Taking Statistical Counts of All Events

performance

aphorisms about

Bentley's Rules Updated

of matrix multiply

Understanding Cache Blocking

of parallel program

Parallel Programs under NUMA

of single-threaded program

Single-Threaded Programs under NUMA

performance techniques

algebraic identities

Exploit Algebraic Identities
Exploit Algebraic Identities

array padding

Using Array Padding to Prevent Thrashing
Using Array Padding

avoiding tests

Combining Tests

cache blocking

Understanding Cache Blocking
Controlling Cache Blocking
Controlling Cache Blocking

caching

Principles of Good Cache Use

code motion

Code Motion Out of Loops

combining related functions

Combine Paired Computation

common block padding

Exploiting Interprocedural Analysis

common subexpressions

Eliminate Common Subexpressions

constant propagation

Exploiting Interprocedural Analysis

copying

Using Copying to Circumvent TLB Thrashing

coroutines

Use Coroutines

data structure augmentation

Data Structure Augmentation

dead function elimination

Exploiting Interprocedural Analysis

dead variable elimination

Exploiting Interprocedural Analysis

gather-scatter

Understanding Gather-Scatter

inlining

Exploiting Interprocedural Analysis
Collapse Procedure Hierarchies

interpreters

Interpreters

lazy evaluation

Lazy Evaluation

loop fission

Using Loop Fission

loop fusion

Understanding Loop Fusion
Using Loop Fusion
Loop Fusion

loop interchange

Using Loop Interchange

loop unrolling

Using Outer Loop Unrolling
Loop Unrolling

packing

Packing

precomputation

Store Precomputed Results
Precompute Logical Functions

prefetching

Using Prefetching

recursion elimination

Transform Recursive Procedures

short-circuiting

Short-Circuit Monotone Functions

software pipelining

Understanding Software Pipelining

speculative execution

Permitting Speculative Execution

transposition

Understanding Transpositions

policy module (PM)

Memory Locality Management
Policy Modules

Parallel Virtual Machine (PVM)

Message-Passing Models MPI and PVM

POSIX threads

C Source Using POSIX Threads

pragma

See directive

precomputation

Store Precomputed Results

prefetching

Understanding Prefetching
Using Prefetching

controlling

Controlling Prefetching

manual

Using Manual Prefetching

overhead of

Prefetch Overhead and Unrolling

pseudo

Using Pseudo-Prefetching

prof

default report

Displaying Profile Reports from Sampling

feedback file

Creating a Compiler Feedback File

ideal time report

Default Ideal Time Profile

line numbers off with opt

Including Line-Level Detail

option -archinfo

Displaying Operation Counts

option -butterfly

Displaying Ideal Time Call Hierarchy

option -feedback

Creating a Compiler Feedback File
Passing a Feedback File

option -heavy

Displaying Profile Reports from Sampling
Including Line-Level Detail

option -lines

Including Line-Level Detail

simplifying report

Removing Clutter from the Report

profiling

address space usage

Using Address Space Profiling

cache usage

Identifying Cache Problems with Perfex and SpeedShop

call hierarchy

Profiling the Call Hierarchy

ideal time for

Using Ideal Time Profiling
Identifying Cache Problems with Perfex and SpeedShop

opcode counts

Displaying Operation Counts

sampling for

Understanding Sample Time Bases
Identifying Cache Problems with Perfex and SpeedShop

tools for

Profiling Tools

program correctness

Getting the Right Answers

R4000

See MIPS CPU

R8000

See MIPS CPU

R10000

See MIPS CPU

roundoff

Roundoff Control

round-robin placement

Using Round-Robin Placement
Trying Round-Robin Placement

scalability

Scalability in Multiprocessors

and bus architecture

Scalability in Multiprocessors

and data placement

Scalability and Data Placement

and shared memory

Scalability and Shared, Distributed Memory

smake

Using a Makefile

SN0

CrayLink

Hub and NUMAlink

hub

SN0 Organization
Hub and NUMAlink

Input/Output

SN0 Input/Output

latencies

SN0 Latencies and Bandwidths

node

SN0 Organization
SN0 Node Board

router

SN0 Organization

XIO

SN0 Organization
XIO Connection

SN0 architecture

Understanding SN0 Architecture

building blocks of

SN0 Organization

hypercube

SN0 Organization
SN0 Memory Distribution

nonuniform memory access (NUMA)

SN0 Memory Distribution

snoopy cache

Coherency Methods

software pipelining (SWP)

Exploiting Software Pipelining

compiler report in .s

script to extract

Software Pipeline Script swplist

compiler report in .s

Reading Software Pipelining Messages
Using Outer Loop Unrolling

dereferenced pointer defeats

Improving C Loops

effect of alias model

Understanding Aliasing Models

enable with -O3

Enabling Software Pipelining with -O3

failure cause

Dealing with Software Pipelining Failures

global variables defeat

Improving C Loops

loop unrolling with

Using Outer Loop Unrolling

of DAXPY loop

Pipelining the DAXPY Loop

speculative execution

Speculative Execution
Permitting Speculative Execution

hardware-driven

Hardware Speculative Execution

software-driven

Software Speculative Execution

speedshop

Using SpeedShop

sample time bases

Understanding Sample Time Bases

See also prof, ssrun

ssrun

exception trace

Profiling Exception Frequency

experiment types

Understanding Sample Time Bases

ideal time trace

Capturing an Ideal Time Trace
Passing a Feedback File

output filename format

Performing ssrun Experiments

shell script to run

Shell Script ssruno

usertime experiment

Displaying Usertime Call Hierarchy

using

Performing ssrun Experiments

stride

Using Stride-One Access

superlinear speedup

Understanding Superlinear Speedup

superscalar

Superscalar CPU Features

-SWP

See compiler option, -SWP

swplist shell script

Reading Software Pipelining Messages

system routine

mmap

C and C++ Source Using UNIX Processes
Initializing to Zero

sproc

C and C++ Source Using UNIX Processes

sysmp

Advanced Options

syssgi

Using Dynamic Placement Information

thread

C Source Using POSIX Threads

TLB

See translation lookaside buffer (TLB)

translation lookaside buffer (TLB)

Understanding TLB and Virtual Memory Use

miss

Understanding TLB and Virtual Memory Use

hardware counter

Virtual Memory Use

thrashing elimination

Diagnosing and Eliminating TLB Thrashing

copying

Using Copying to Circumvent TLB Thrashing

larger page size

Using Larger Page Sizes to Reduce TLB Misses

transposition

Understanding Transpositions

trap

See exception

uninitialized variable, avoiding

Uninitialized Variables

vector intrinsic function

Standard Math Library

and LNO

Vector Intrinsics

virtual memory

Understanding TLB and Virtual Memory Use

XIO

SN0 Organization
XIO Connection

zero-fill

Initializing to Zero