Chapter 2. How GRIO Works

This chapter discusses the following:

  • “Traffic Control”

  • “Stream Use and Real-Time File Setup”

  • “Software Components”

  • “Qualified Bandwidth”

  • “Managing Bandwidth: Encapsulation and Distributed Bandwidth Allocator”

  • “GRIO Server Relocation and Recovery”

Traffic Control

GRIO is a component of the XFS and CXFS I/O path that runs on every node with access to a GRIO-managed filesystem. When GRIO is active, all I/O on the node, and therefore throughout the cluster, is controlled by the GRIO scheduler.

I/O falls into the following categories:

  • GRIO I/O: direct (non-buffered) I/O for applications that have made an explicit GRIO reservation

  • Non-GRIO I/O: all other buffered and system I/O

GRIO ensures that applications with reserved bandwidth receive data at the requested rate, regardless of other I/O activity on the node and elsewhere within the cluster. GRIO will throttle an application if it attempts to use more bandwidth than it has reserved.
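
The throttling of an application that exceeds its reservation can be pictured with a simple token-bucket model. This is a conceptual sketch with invented names and numbers, not the actual GRIO scheduler:

```python
class TokenBucket:
    """Toy model of per-stream rate limiting: an application that has
    reserved rate_mb_s of bandwidth may issue I/O only as fast as
    tokens accumulate, so it can never exceed its reservation."""

    def __init__(self, rate_mb_s):
        self.rate = rate_mb_s          # reserved bandwidth, MB/s
        self.tokens = 0.0              # MB of I/O currently permitted
        self.last = 0.0                # virtual clock, seconds

    def request(self, now, size_mb):
        """Return the delay (seconds) before an I/O of size_mb may issue."""
        self.tokens += (now - self.last) * self.rate
        self.tokens = min(self.tokens, self.rate)  # cap bursts at 1 second
        self.last = now
        if self.tokens >= size_mb:
            self.tokens -= size_mb
            return 0.0                 # within reservation: no delay
        deficit = size_mb - self.tokens
        self.tokens = 0.0
        return deficit / self.rate     # throttled: wait for tokens

stream = TokenBucket(rate_mb_s=10)
print(stream.request(now=1.0, size_mb=5))   # within rate: no delay (0.0)
print(stream.request(now=1.0, size_mb=10))  # over rate: delayed (0.5)
```

An application that stays within its reserved rate is never delayed; one that issues I/O faster than its reservation accumulates a deficit and is paused until its rate falls back within the guarantee.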

Stream Use and Real-Time File Setup

To use an application-level GRIO reservation, a file must be read or written using direct (non-buffered) I/O requests. The open(2) man page describes the use and buffer-alignment restrictions of the direct I/O interface. A GRIO reservation can be made for any regular file within an XFS or CXFS filesystem created on an XVM volume.

Both XFS and CXFS provide a dedicated real-time subvolume that keeps user data separate from filesystem metadata. To allocate a file on the real-time subvolume of an XFS or CXFS filesystem, you must use the fcntl(2) F_FSSETXATTR command to set the XFS_XFLAG_REALTIME flag. You can issue this command only on a newly created file; it is not possible to mark a file as real-time after non-real-time data blocks have been allocated to it.

Software Components

GRIO functionality is divided between the GRIO kernel scheduler and the following user-level component:

ggd2 Daemon

The ggd2(1M) daemon is a user-level process started at system boot time that manages the I/O bandwidth for a collection of XVM volumes. It does the following:

  • Activates and deactivates the GRIO kernel scheduler

  • Processes client requests to reserve and release bandwidth

  • Tracks bandwidth utilization

  • Manages unallocated bandwidth

  • Prevents oversubscription

  • Enforces GRIO software licenses

  • Broadcasts to the relevant kernels the amount of bandwidth per filesystem that may be used for non-GRIO I/O
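
The reservation-tracking and oversubscription-prevention duties above can be sketched as a small accounting model. The names and figures here are invented for illustration; this is not ggd2's implementation:

```python
class BandwidthManager:
    """Toy model of ggd2's accounting: reservations are admitted only
    while the sum of reserved bandwidth stays within the qualified
    bandwidth; whatever remains can be lent out for non-GRIO I/O."""

    def __init__(self, qualified_mb_s):
        self.qualified = qualified_mb_s
        self.reserved = {}             # stream id -> reserved MB/s

    def reserve(self, stream_id, mb_s):
        if self.unallocated() < mb_s:
            return False               # would oversubscribe: reject
        self.reserved[stream_id] = mb_s
        return True

    def release(self, stream_id):
        self.reserved.pop(stream_id, None)

    def unallocated(self):
        """Bandwidth available to lend to the non-GRIO streams."""
        return self.qualified - sum(self.reserved.values())

mgr = BandwidthManager(qualified_mb_s=200)
print(mgr.reserve("app1", 120))   # True: admitted
print(mgr.reserve("app2", 100))   # False: 120 + 100 > 200
print(mgr.unallocated())          # 80 remains for non-GRIO I/O
```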

Qualified Bandwidth

The qualified bandwidth for a filesystem is the maximum I/O load that it can sustain while still satisfying requirements for delivered bandwidth and I/O service time.

You must determine a specific qualified bandwidth for each GRIO-managed filesystem. The qualified bandwidth is specified in the griotab file for local IRIX XFS filesystems or by using the cxfs_admin(1M) or cmgr(1M) command for shared CXFS filesystems.

The ggd2 daemon is responsible for managing the allocation of available bandwidth between different applications and nodes.

You can adjust the qualified bandwidth as needed to make the best use of your system, taking into account the tradeoff between resource utilization and delivered I/O performance. For more information, see “Choosing a Qualified Bandwidth” in Chapter 3.

Managing Bandwidth: Encapsulation and Distributed Bandwidth Allocator

The ggd2 daemon tracks the total qualified bandwidth and ensures that the total workload never exceeds the qualified bandwidth.

When ggd2 begins managing a filesystem, every node with access to that filesystem is notified. Each node in turn creates a dedicated system stream for that filesystem, which is called the non-GRIO stream. From that point on, all user and system I/O that does not have an explicit GRIO reservation is encapsulated by this stream and then managed by the GRIO scheduler. For a local IRIX XFS filesystem, there is a single non-GRIO stream. For a shared CXFS filesystem, there is a non-GRIO stream on each node with access to the filesystem.


Note: All non-GRIO I/O within a GRIO-managed filesystem, whether issued by applications or by system services, is scheduled on a first-come, first-served basis.

GRIO supports application-level reservations (created by GRIO-enabled applications using the libgrio2 interfaces) and node-level bandwidth allocations (configured using the GRIO administration interfaces). The libgrio2 interfaces permit an application to do the following:

  • Reserve bandwidth

  • Dynamically bind and unbind the resulting GRIO stream to any number of open file descriptors

  • Modify its reservation

  • Release its bandwidth back to the system

GRIO ensures that the requested guarantee is met for the aggregate I/O performed across the bound file descriptors.
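
The application-level workflow above can be pictured with a toy stream object. This is illustrative only; the real interfaces are C functions in libgrio2, documented in the grio_reserve(3X) man page:

```python
class GrioStream:
    """Conceptual model of an application-level reservation: one stream,
    bindable to several file descriptors, with the guarantee applying
    to the aggregate I/O across all of them."""

    def __init__(self, mb_s):
        self.mb_s = mb_s               # reserved bandwidth, MB/s
        self.bound_fds = set()

    def bind(self, fd):
        self.bound_fds.add(fd)         # guarantee now covers this fd too

    def unbind(self, fd):
        self.bound_fds.discard(fd)

    def modify(self, mb_s):
        self.mb_s = mb_s               # renegotiate the reservation

    def release(self):
        self.mb_s = 0                  # bandwidth returns to the system
        self.bound_fds.clear()

stream = GrioStream(mb_s=40)
stream.bind(3)
stream.bind(4)             # the 40 MB/s guarantee covers fds 3 and 4 together
stream.modify(mb_s=60)
stream.release()
```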

However, applications are often not GRIO-enabled. For these applications, GRIO allows an administrator to configure a node-level bandwidth allocation. From CXFS 4.0 onwards, GRIO supports two types of node-level allocations:

  • Floor allocations (-F), for which GRIO ensures that the node receives at least the configured bandwidth. While any bandwidth remains unallocated and the node is issuing I/O at a rate greater than its initial allocation, ggd2 temporarily increases the node's allocation to help service the additional demand.

  • Ceiling allocations (-C), for which the node receives at most the configured bandwidth. That is, the configured bandwidth acts as a cap on the amount of I/O that the node is permitted to issue. Ceiling allocations are the default.
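
The difference between the two allocation types can be sketched as follows. This is a simplified model with invented figures; the real ggd2 logic is driven by dynamically monitored metrics:

```python
def node_allocation(kind, configured, demand, unallocated):
    """Toy model of the two node-level allocation types: a ceiling
    (-C, the default) caps the node at its configured bandwidth,
    while a floor (-F) guarantees the configured bandwidth and is
    temporarily topped up from unallocated bandwidth while the node
    demands more."""
    if kind == "ceiling":
        return configured              # hard cap, regardless of demand
    # floor: at least `configured`, plus a temporary loan from any
    # unallocated bandwidth while demand exceeds the initial allocation
    extra = min(max(demand - configured, 0), unallocated)
    return configured + extra

print(node_allocation("ceiling", 100, demand=150, unallocated=50))  # 100
print(node_allocation("floor", 100, demand=150, unallocated=50))    # 150
print(node_allocation("floor", 100, demand=150, unallocated=20))    # 120
```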

To keep the total throughput of the filesystem high even when there are active GRIO streams or node-level allocations, ggd2 attempts to allocate any unreserved portion of the qualified bandwidth for use by non-GRIO applications. This bandwidth is effectively lent for short periods of time until ggd2 receives a new request for guaranteed-rate bandwidth, at which point it is reclaimed. GRIO applications have priority over non-GRIO applications.

The ggd2 daemon periodically adjusts the amount of bandwidth allocated to the individual non-GRIO streams for its managed filesystems. This functionality is referred to as the distributed bandwidth allocator (DBA). The DBA determines how unreserved bandwidth is distributed among the nodes with access to the filesystem. By default, the DBA runs every two seconds, allocating free bandwidth to nodes based on a range of dynamically monitored demand and utilization metrics. (To change the DBA interval, edit the ggd2.options file on each of the GRIO server-capable nodes. For more information, see “ggd2.options File” in Chapter 3.)
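
One DBA pass can be pictured as dividing the unreserved bandwidth among the per-node non-GRIO streams. Proportional sharing by recent demand is used here as an invented stand-in for the real demand and utilization metrics:

```python
def dba_distribute(unreserved, demands):
    """Toy model of one DBA pass: split the unreserved bandwidth
    among the per-node non-GRIO streams in proportion to each
    node's recent demand (MB/s)."""
    total = sum(demands.values())
    if total == 0:
        # no demand anywhere: split evenly so an idle node that
        # starts doing I/O is not left with a zero allocation
        share = unreserved / len(demands)
        return {node: share for node in demands}
    return {node: unreserved * d / total for node, d in demands.items()}

print(dba_distribute(90, {"node1": 60, "node2": 30, "node3": 0}))
# node1 gets twice node2's share; node3 gets nothing this pass
```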


Note: The rate at which the DBA runs affects the delay that an application or node that does not have a GRIO reservation might experience when it starts doing I/O. The longer the interval, the longer a node may have to wait (with its I/O paused) before ggd2 will increase its allocation.


Figure 2-1. Qualified Bandwidth

A user process can request a reservation using the grio_reserve() and grio_reserve_fd() library calls. Requests are forwarded to the ggd2 that is actively managing the target volume domain. Requests to filesystems in the local domain are sent to the local instance of ggd2 directly. Requests to filesystems in the cluster domain are forwarded to ggd2 on the GRIO server, which may be running on a different node in the cluster.
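
The routing of a reservation request can be summarized as follows (a minimal sketch; the node names are hypothetical):

```python
def route_reservation(filesystem_domain, local_node, grio_server):
    """Sketch of request routing: reservations against filesystems in
    the local domain are handled by the local ggd2 instance, while
    reservations against cluster-domain filesystems are forwarded to
    the ggd2 on the elected GRIO server, which may be another node."""
    if filesystem_domain == "local":
        return local_node          # handled by the local ggd2
    return grio_server             # forwarded to the cluster GRIO server

print(route_reservation("local", "nodeA", grio_server="nodeB"))    # nodeA
print(route_reservation("cluster", "nodeA", grio_server="nodeB"))  # nodeB
```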

For more information, see the grio_reserve(3X) man page.

GRIO Server Relocation and Recovery

Each instance of ggd2 maintains reservation and bandwidth state that must be kept consistent with one or more kernels.

If ggd2 fails, a new ggd2 instance will receive from the local kernel all of the information necessary to reestablish the following:

  • Local volume reservations

  • Cluster volume reservations (if the ggd2 that failed was the GRIO server for the cluster)

  • All of the DBA state for non-GRIO I/O

If a GRIO server node fails, a new GRIO server is automatically elected, and all of the cluster volume reservations and DBA state are reestablished by that instance of ggd2.

You can also choose to manually migrate the GRIO server to another CXFS server-capable administration node in the cluster. For more information, see Chapter 4, “Administering GRIO” and the grioadmin(1M) man page.

To determine the active GRIO server, use the grioadmin -sv command.