Chapter 3. Setting Up GRIO

This chapter discusses the following:

  • Installation Requirements

  • Deployment Considerations for Cluster Volumes

  • Data Layout

  • Choosing a Qualified Bandwidth

  • Local Volumes and Cluster Volumes

  • Local Volume Domain Configuration

  • Cluster Volume Domain Configuration

  • Licensing

  • ggd2.options File

  • Starting GRIO on IRIX Servers

Installation Requirements

In a cluster deployment, every node must be GRIO-enabled. As of CXFS 3.4, every supported platform is GRIO-enabled. For more information, see the CXFS Administration Guide for SGI InfiniteStorage and the CXFS MultiOS Client-Only Guide for SGI InfiniteStorage.

IRIX Installation Requirements

To operate in the local volume domain on an IRIX node, you must install the eoe.sw.grio2 product.

To enable clustered GRIO support on IRIX, you must install both the eoe.sw.grio2 and the cxfs.sw.grio2_cell products.
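
For example, you can confirm that the required products are installed with the versions(1M) command (a minimal check; the exact output format depends on your IRIX release):

    # versions eoe.sw.grio2 cxfs.sw.grio2_cell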

SGI ProPack Installation Requirements

Install the following RPM on all SGI ProPack nodes in the cluster:

grio2-cmds-version.ia64.rpm

Install the following additional RPM on SGI ProPack server-capable nodes:

grio2-server-version.ia64.rpm
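
For example, on a server-capable node you might install both RPMs with rpm(8), substituting the actual release for version (this assumes the RPM files are in the current directory):

    # rpm -Uvh grio2-cmds-version.ia64.rpm grio2-server-version.ia64.rpm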

Deployment Considerations for Cluster Volumes

You must observe the following constraints when setting up GRIO filesystems in a cluster:

  • If any of the logical units (LUNs) on a particular device will be managed as GRIO filesystems, then all of the LUNs should be managed as GRIO filesystems. Typically, there will be hardware contention between separate LUNs, both in the storage area network (SAN) and within the storage device. If only a subset of the LUNs are managed, I/O to the unmanaged LUNs could still cause oversubscription of the device and could in turn violate guarantees on the managed filesystems.

  • A storage device containing GRIO-managed filesystems should not be shared between clusters. The GRIO daemons running within different clusters are not coordinated, and unmanaged I/O from one cluster can cause guarantees in the other cluster to be violated.

Data Layout

To set up a filesystem on a RAID device such that you achieve correct filesystem device alignment and maximize I/O performance, remember to do the following:

  • Ensure that each data partition is correctly aligned with the internal disk layout of its LUN

  • Set XVM stripe parameters correctly

  • Pass correct volume geometry (stripe unit and width) to mkfs_xfs(1)
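
As an illustrative sketch of the last step, the following hypothetical mkfs_xfs invocation assumes a cluster XVM volume named /dev/cxvm/grio_vol striped across 8 LUNs with a 512-KB stripe unit; the su and sw values must match the geometry actually configured in XVM and on the RAID (see mkfs_xfs(1) for the exact option syntax):

    # mkfs_xfs -d su=512k,sw=8 /dev/cxvm/grio_vol   # hypothetical volume and geometry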

For more information, see the grio2(5) man page.

Choosing a Qualified Bandwidth

You can adjust the qualified bandwidth to reflect the specific trade-off between delivered QoS and utilization of the storage infrastructure for your situation.

The following factors affect the qualified bandwidth you choose:

  • The hardware configuration

  • The application work flow and I/O load

  • The specific QoS requirements of applications and users

Typically, the first concern is whether the storage system can deliver the required bandwidth; the second is the service time observed for individual I/Os.

Determining qualified bandwidth is an iterative process. There are several strategies you can use to determine and fine-tune the qualified bandwidth for a filesystem. For example:

  • Establish a given bandwidth and then adjust so that the QoS requirements are met. Do the following:

    1. Make an initial estimate of the qualified bandwidth. You can use the fixed storage architecture parameters (RAID performance, number of HBAs, and so on) to estimate the anticipated peak bandwidth that can be delivered. The qualified bandwidth is then determined as an appropriate fraction of this peak; a worked example follows this list.

    2. Configure ggd2 appropriately using the griotab file or the cxfs_admin(1M) or cmgr(1M) command.

    3. Run a test workload.

    4. Monitor the delivered performance.

    5. Refine the estimate as needed.

  • Establish that QoS requirements are satisfied and then adjust to maximize throughput. To do this, increase the load until the storage system can no longer meet the application QoS requirements; the qualified bandwidth must be lower than this value.

  • Explore the space of possible workloads and test whether a given workload satisfies both bandwidth and application QoS requirements.
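
For example (with purely illustrative numbers), a configuration with two host bus adapters, each able to sustain roughly 400 MB/s, has an anticipated peak of about 800 MB/s; taking 70% of that peak as the initial estimate gives a qualified bandwidth of roughly 560 MB/s, which you would then refine by running a test workload and monitoring the delivered QoS.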

Although the hardware configuration provides a basis for calculating an estimate, remember that the qualified bandwidth is also affected by the particular work-flow issues and the QoS requirements of individual applications. For example, an application that has large tunable buffers (such as a flipbook application that does aggressive RAM caching) can tolerate a greater variation in I/O service time than can a media broadcast system that must cue and play a sequence of very short clips. In the first example, the qualified bandwidth would be configured as a larger proportion of the sustained maximum. In the second example, the qualified bandwidth might be reduced to minimize the system utilization levels and improve I/O service times.

A high qualified bandwidth will generally achieve the greatest overall throughput but with the consequence that individual applications may intermittently experience longer service times for some I/Os. This variation in individual service times is referred to as jitter; as the storage system approaches saturation, service-time jitter will typically increase. A lower qualified bandwidth means that total throughput will be reduced, but because the I/O infrastructure is under less stress, individual requests will typically be processed with less variation in individual I/O service times. Figure 3-1 illustrates these basic ideas. The precise relationship between load on the storage system and variation in I/O service time is highly dependent on your storage hardware.

Figure 3-1. Tradeoff Between Throughput and Variation in I/O Service Time (Jitter)


Some storage devices (particularly those with real-time schedulers) can provide a fixed bound on I/O service time even at utilization levels close to their maximum. In this case, the qualified bandwidth can be set higher even where applications have tight QoS requirements. The user-adjustable qualified bandwidth provides the flexibility required for GRIO to work with both dedicated real-time devices and more common off-the-shelf storage systems.


Note: In all cases, you must verify the chosen qualified bandwidth by testing the storage system under a realistic workload.

You can use the grioqos(1M) tool to measure the delivered QoS performance. This tool extracts QoS performance for an active stream without disturbing the application or the kernel scheduler. GRIO maintains very detailed performance metrics for each active stream. Using the grioqos command while running a workload test lets you answer questions such as the following for every active stream in the system:

  • What has been the worst observed bandwidth over a 1-second period?

  • What is the worst observed average I/O service time for a sequence of 10 I/Os?

For more information about GRIO tools and the mechanisms for accessing QoS data within the kernel, see Chapter 5, “Monitoring Quality of Service”, and the grioqos(1M) man page.

Local Volumes and Cluster Volumes

A managed volume can be one of the following:

  • A local volume is attached to the node in question. This volume is in the local volume domain.

    Local volumes are always managed by the instance of the ggd2 daemon running on the node to which they are attached.

  • A cluster volume is used with CXFS filesystems and is shared among nodes in a cluster. This volume is in the cluster volume domain.

    All cluster volumes are managed by a single instance of the ggd2 daemon running on one of the CXFS administration nodes in the cluster; this node is referred to as the GRIO server. There is one GRIO server per cluster.

    The GRIO server is elected automatically. You can relocate it by using the grioadmin(1M) command. The GRIO server must be a CXFS administration node. Client-only nodes will never be elected as GRIO servers. For more information, see Chapter 4, “Administering GRIO”, and the grioadmin(1M) man page.

    If a given CXFS administration node has locally attached volumes and has also been selected as the GRIO server, then the ggd2 running on that node will serve dual-duty and will manage both its own local volume domain and the cluster volume domain.

For more information about CXFS, see “Cluster Volume Domain Configuration”, CXFS Administration Guide for SGI InfiniteStorage, and CXFS MultiOS Client-Only Guide for SGI InfiniteStorage.

Local Volume Domain Configuration

To configure GRIO for local volume domains, you must provide information in the /etc/griotab file.

The /etc/griotab file lists the volumes that should be managed by GRIO and the maximum qualified bandwidth they can deliver. This file is read at startup and whenever ggd2 receives a SIGHUP signal (such as when you issue a killall -HUP ggd2 command). See the ggd2(1M) and griotab(4) man pages for more information.
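
For illustration only, a griotab entry generally associates a managed volume with its qualified bandwidth. The line below is a hypothetical sketch (the volume name and bandwidth value are invented); consult griotab(4) for the authoritative syntax:

    # Hypothetical entry: local XVM volume and its qualified bandwidth
    /dev/lxvm/grio_vol  200M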

Cluster Volume Domain Configuration

You must use the cxfs_admin(1M) or cmgr(1M) command to configure cluster volumes for GRIO.

A prompting mode is also available for cxfs_admin(1M) or cmgr. For more information, see the CXFS Administration Guide for SGI InfiniteStorage.

If you have installed the cxfs.sw.grio2_cell subsystem and turned on GRIO, the ggd2 daemon will automatically query the cluster configuration database for GRIO volume configuration information. ggd2 dynamically tracks updates to the cluster database.

cxfs_admin Configuration Examples

There are two GRIO attributes associated with filesystems:

  • grio_managed, which specifies whether a filesystem is managed by GRIOv2 (true) or not (false). The default is false. Setting grio_managed to false disables GRIO management for the specified filesystem, but it does not reset the grio_qual_bandwidth value. In this case, grio_qual_bandwidth is left unmodified in the cluster database and ignored.

  • grio_qual_bandwidth, which specifies a filesystem's qualified bandwidth in bytes (B suffix), kilobytes (KB), megabytes (MB), or gigabytes (GB), where the units are multiples of 1024. If you do not specify a unit, a value of 4000 or less is interpreted as MB and a value of 4001 or greater is interpreted as B. If the filesystem is GRIO-managed, you must specify a qualified bandwidth with this attribute. You can modify the qualified bandwidth for a mounted filesystem without taking it offline.


    Note: These are advanced-mode attributes. When configuring GRIO with cxfs_admin, you should use set mode=advanced.

    For example, any one of the following commands sets a filesystem's qualified bandwidth to 1.2 GB/s:

    cxfs_admin:cluster> modify filesystem grio_qual_bandwidth=1288500000
    cxfs_admin:cluster> modify filesystem grio_qual_bandwidth=1258300KB
    cxfs_admin:cluster> modify filesystem grio_qual_bandwidth=1228.8MB

cmgr Configuration Examples

To mark a filesystem as GRIO-managed and set its qualified bandwidth, use the following commands:

# /usr/cluster/bin/cmgr
Welcome to SGI Cluster Manager Command-Line Interface

cmgr> modify cxfs_filesystem filesystem in cluster cluster
cmgr> set grio_managed to true
cmgr> set grio_qualified_bandwidth to qualified_bandwidth
cmgr> done

The value for qualified_bandwidth is specified in bytes per second. For example, the following sets the qualified bandwidth to 200 MB/s (200*1024*1024):

cmgr> set grio_qualified_bandwidth to 209715200

To show the current status of a shared filesystem:

cmgr> show cxfs_filesystem filesystem in cluster cluster
...
               GRIO Managed Filesystem: true
               GRIO Managed Bandwidth: qualified_bandwidth
...


Note: In cmgr, you must unmount a filesystem before you can modify it.


Licensing

The GRIO FLEXlm licensing regime controls a number of configuration parameters including the total number of active streams and the total aggregate qualified bandwidth of filesystems under management. Separate license types are provided for the local and cluster volume domains, and license constraints are enforced for each volume domain separately.

The ggd2 daemon checks the license at startup, whenever it detects a configuration change, or when it receives a SIGHUP signal.

The license enforcement policy for streams is straightforward. The license for a given volume domain specifies a maximum number of active streams. All reservation requests above this limit are denied.

In the case of bandwidth, a license specifies the maximum total aggregate qualified bandwidths for all volumes within the volume domain. The ggd2 daemon validates the configuration at startup and whenever the configuration is changed:

  • For the local domain, ggd2 tracks changes to /etc/griotab (ggd2 is notified of changes with a SIGHUP)

  • For the cluster volume domain, ggd2 tracks the relevant cluster database entries for cluster volume qualified bandwidth

If the configuration of a volume domain is altered and becomes unlicensed, ggd2 enters a passive mode in which all further requests pertaining to that domain, with the exception of release requests, are denied. A message is sent to the system log and that volume domain will remain deactivated until the configuration returns to a licensed state, at which time another message will be logged indicating the domain is again active.

For more information, see the license.dat(5) man page.

ggd2.options File

The ggd2.options file contains the command line options for ggd2 when launched at startup.

The location of the file differs by operating system:

  • IRIX: /etc/config/ggd2.options

  • SGI ProPack: /etc/cluster/config/ggd2.options

You can uncomment and edit lines as required. The arguments are as follows:

-d level

Sets the maximum debug level and logs the specified level of messages to both the system log and an additional log file called /var/tmp/ggd2log/PID, where PID is the process ID of the ggd2 daemon.

level is an integer in the range 0 through 4 (the higher the level number, the more debug information is output). The levels are as follows:

  • 0 logs critical system resource errors

  • 1 logs the above plus ggd2-specific error and warning messages

  • 2 logs the above plus important events or state changes

  • 3 logs the above plus infrequent, less-important events

  • 4 logs the above plus debug-level messages

By default, ggd2 logs just level 0 critical system resource errors to the system log only.

-f

Runs ggd2 in the foreground. By default, ggd2 is started as a daemon.

-m bw

Specifies the minimum bandwidth bw, in KB/s by default, that ggd2 will allocate for non-guaranteed user and system I/O per GRIO-managed volume. All nodes issuing non-GRIO I/O will receive a fair share of this minimum bandwidth. A node is allocated the larger of the values specified by -m and -s.

For example, -m2048 causes ggd2 to allocate a minimum of 2048 KB/s to each GRIO-managed volume. This bandwidth becomes permanently allocated to non-GRIO I/O and cannot be reserved for GRIO I/O. Use the suffix K or M to explicitly specify bandwidth in KB/s or MB/s. For example, -m3M causes ggd2 to allocate a minimum of 3 MB/s to each GRIO-managed volume.

-r percent

Reserves a percentage of each volume's available qualified bandwidth for GRIO I/O. Reservation requests are then serviced directly from this pool of cached free bandwidth without blocking. percent is the percentage of each volume's qualified bandwidth that ggd2 attempts to keep unallocated, expressed as an integer in the range 0 through 100. You should choose this value based on the following:

  • Expected I/O utilization levels

  • Importance of minimizing the stream creation latency

  • Expected rate at which reservation requests will be made

By default, ggd2 allows unreserved bandwidth to be allocated for servicing non-GRIO I/O. This maximizes the total throughput of the system. However, as ggd2 only makes adjustments to these allocations periodically, a new reservation may block until ggd2 can reclaim the requested bandwidth.


Note: Using the -r option causes a proportion of the unreserved I/O capacity to remain unused and reduces the total throughput and efficiency of the system for non-GRIO I/O. You should only use this option if minimizing reservation latency is a priority.

For example, given a volume with a qualified bandwidth of 200 MB/s, -r 20 will instruct ggd2 to try to keep up to 20% (40 MB/s) of any remaining unreserved bandwidth cached and available for servicing reservation requests directly. ggd2 adjusts this cache of free bandwidth every time the distributed bandwidth allocator (DBA) runs, which defaults to once every 2 seconds (see -u). With these settings, ggd2 will be able to grant an additional 40 MB/s every 2 seconds without blocking any reservation requests.

-s bw

Specifies the minimum bandwidth bw, in KB/s by default, that ggd2 will allocate for non-GRIO user and system I/O per node. A node is allocated the larger of the values specified by -s and -m.

For example, -s2048 causes ggd2 to allocate a minimum of 2048 KB/s to each node accessing a GRIO-managed volume. This bandwidth becomes permanently allocated to non-GRIO I/O and cannot be reserved for GRIO I/O. Use the suffix K or M to explicitly specify bandwidth in KB/s or MB/s. For example, -s3M causes ggd2 to allocate a minimum of 3 MB/s to each node accessing a GRIO-managed volume.

-u ms

Specifies the distributed bandwidth allocator (DBA) update interval in milliseconds (ms), where ms is a value in the range 250 through 100000. The default is 2000 (2 seconds). For more information about DBA, see “Managing Bandwidth: Encapsulation and Distributed Bandwidth Allocator” in Chapter 2.


Note: The rate at which the DBA runs affects the delay that an application or node without a GRIO reservation might experience when it starts doing I/O. The longer the interval, the longer a node may have to wait (with its I/O paused) before ggd2 increases its allocation.


For example:

# command line options for ggd2 when launched from /etc/init.d/grio2
# uncomment/edit lines as required
#

# Minimum non-GRIO bandwidth per node. Units are Mbytes/sec
# -s 1M

# debug level, in the range 0 to 4
# -d 1

# minimum reserve bandwidth to accommodate short-latency reservation
# demands, expressed as a percentage of the total qualified bandwidth
# -r 30

# Distributed Bandwidth Allocator (DBA) update interval
# value in milliseconds
# -u 2000

For changes to this file to take effect, do one of the following on the GRIO server, which will cause ggd2 to reread its options file:

  • Stop and restart the grio2 service:

    # /etc/init.d/grio2 stop
    # /etc/init.d/grio2 start

  • Run the following:

    # killall -HUP ggd2

To determine the active GRIO server, use the grioadmin -sv command.


Note: In the event of a GRIO server relocation or recovery, you must perform the above steps on each GRIO server-capable node in the cluster.


Starting GRIO on IRIX Servers

On IRIX, it is possible to have both the GRIOv1 and GRIOv2 subsystems installed. However, only one of these subsystems can be active. The subsystem that is turned on in chkconfig is started by default at boot time and remains in effect until the chkconfig setting is changed and the machine is rebooted.

Starting GRIOv2 when GRIOv1 is Active

Suppose you were running GRIOv1 and wanted to switch to GRIOv2. After performing the configuration tasks discussed in this guide, you would do the following:

  1. Turn off GRIOv1 (grio) and turn on GRIOv2 (grio2):

    # chkconfig grio off
    # chkconfig grio2 on

  2. Reboot the system to allow the kernel to be reinitialized with the GRIOv2 scheduler.

You do not need to manually start GRIOv2 because the daemon is automatically started upon reboot when the chkconfig setting is on.


Note: If GRIOv1 is still enabled when you perform a GRIOv2 library call, the return will be ENOSYS. If you do not have either the GRIOv1 or GRIOv2 kernel initialized, the return will be EAGAIN, indicating that the subsystem has not yet initialized and the application should retry the request.


Starting GRIOv2 when GRIOv1 is Not Active

If you have not run GRIOv1 during the current boot session, you can start GRIOv2 by doing the following:

  1. Turn on GRIOv2:

    # chkconfig grio2 on

  2. Start GRIOv2:

    # /etc/init.d/grio2 start

You must perform the manual start only once. When the machine is rebooted, GRIOv2 will be restarted automatically as long as its chkconfig setting remains on.