Chapter 2. How GRIO Works

This chapter discusses the following:

Traffic Control

GRIO is a component on the XFS and CXFS I/O path that runs in every node with access to a GRIO-managed filesystem. When GRIO is active, the GRIO scheduler controls all I/O to the managed filesystems, both on the local machine and across the cluster.

I/O falls into the following categories:

  • GRIO I/O: I/O for applications that have made an explicit GRIO reservation

  • Non-GRIO I/O: all other buffered and system I/O

GRIO ensures that applications with reserved bandwidth receive data at the requested rate, regardless of other I/O activity on the node and elsewhere within the cluster. GRIO will throttle an application if it attempts to use more bandwidth than it has reserved.
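The throttling behavior described above can be illustrated with a generic token bucket. This is only a sketch of the general technique, not the GRIO scheduler itself, whose internals are not described in this chapter. The class name, the burst parameter, and the caller-supplied clock are all assumptions made for illustration.

```python
class ReservationThrottle:
    """Hypothetical token bucket; tokens are bytes the stream may still move."""

    def __init__(self, rate_bytes_per_sec, burst_bytes):
        self.rate = rate_bytes_per_sec   # reserved rate (assumed unit: bytes/s)
        self.capacity = burst_bytes      # largest burst allowed (assumption)
        self.tokens = burst_bytes
        self.last = 0.0                  # time of the last refill, in seconds

    def delay_for(self, nbytes, now):
        """Return how long an I/O of nbytes issued at time `now` must wait."""
        # Refill tokens for the elapsed time, capped at the burst capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        # Charge the request; a negative balance is a debt repaid by waiting.
        self.tokens -= nbytes
        if self.tokens >= 0:
            return 0.0
        return -self.tokens / self.rate
```

A stream that stays within its reserved rate never sees a delay in this model. A stream that exceeds it is held back just long enough for its average rate to fall back to the reservation.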

Stream Use and Real-Time File Setup

In order to use a GRIO reservation, a file must be read or written using direct, synchronous I/O requests. The open(2) man page describes the use and buffer alignment restrictions of the direct I/O interface. A GRIO reservation can be made for any file within an XFS or CXFS filesystem created on an XVM volume. However, for optimal performance, files should be created on a dedicated real-time subvolume.
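As a rough illustration of the buffer alignment restrictions mentioned above, the following sketch performs one aligned, blocking direct read. It assumes a Linux-style O_DIRECT flag and a 4096-byte alignment. Both are assumptions: the actual limits for a given system are those documented in open(2) and may differ. The function names are hypothetical.

```python
import mmap
import os


def round_up(nbytes, alignment):
    """Round nbytes up to the next multiple of alignment."""
    return -(-nbytes // alignment) * alignment


def direct_read(path, nbytes, alignment=4096):
    """Read from the start of path with O_DIRECT (assumed Linux behavior)."""
    length = round_up(nbytes, alignment)
    # An anonymous mmap is page-aligned, which normally satisfies the
    # buffer alignment rule as long as alignment does not exceed the page size.
    buf = mmap.mmap(-1, length)
    fd = os.open(path, os.O_RDONLY | os.O_DIRECT)
    try:
        got = os.readv(fd, [buf])   # blocking (synchronous) read into buf
    finally:
        os.close(fd)
    data = buf[:got]
    buf.close()
    return data
```

The request length is rounded up because direct I/O typically requires the transfer size, as well as the buffer address and file offset, to be a multiple of the alignment.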

To allocate a file on the real-time subvolume of an XFS or CXFS filesystem, you must use the fcntl(2) F_FSSETXATTR command to set the XFS_XFLAG_REALTIME flag. You can only issue this command on a newly created file. It is not possible to mark a file as real-time after non-real-time data blocks have been allocated to it.

Software Components

GRIO functionality is distributed between the following main components:

ggd2 Daemon

The ggd2(1M) daemon is a user-level process started at system boot time that manages the I/O bandwidth for a collection of XVM volumes. It does the following:

  • Activates/deactivates the GRIO kernel scheduler

  • Processes client requests to reserve and release bandwidth

  • Tracks bandwidth utilization

  • Manages unallocated bandwidth

  • Prevents oversubscription

  • Enforces GRIO software licenses

  • Broadcasts to the relevant kernels the amount of bandwidth per filesystem that may be used for non-GRIO I/O

Qualified Bandwidth

The qualified bandwidth for a filesystem is the maximum I/O load that it can sustain while still satisfying requirements for delivered bandwidth and I/O service time.

You must determine a specific qualified bandwidth for each GRIO-managed filesystem.

The qualified bandwidth is specified in the griotab file for local IRIX filesystems or in the cluster database for shared filesystems.

The ggd2 daemon is responsible for managing the allocation of available bandwidth between different applications.

You can adjust the qualified bandwidth as needed to make the best use of your system, taking into account the tradeoff between resource utilization and delivered I/O performance. For more information, see “Choosing a Qualified Bandwidth” in Chapter 3.

Managing Bandwidth: Encapsulation and Distributed Bandwidth Allocator

The ggd2 daemon tracks the total qualified bandwidth and ensures that the total workload never exceeds the qualified bandwidth.
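The admission rule stated above can be sketched as a simple ledger. This is a minimal model for illustration, not the ggd2 implementation. The class and method names are hypothetical, and bandwidth is treated as a plain number in whatever unit the qualified bandwidth uses.

```python
class BandwidthLedger:
    """Hypothetical per-filesystem record of guaranteed-rate reservations."""

    def __init__(self, qualified_bw):
        self.qualified_bw = qualified_bw
        self.reservations = {}           # stream id -> reserved bandwidth

    def reserved(self):
        return sum(self.reservations.values())

    def unreserved(self):
        return self.qualified_bw - self.reserved()

    def reserve(self, stream_id, bw):
        """Admit the request only if the total stays within the qualified bandwidth."""
        if bw <= 0 or stream_id in self.reservations:
            return False
        if self.reserved() + bw > self.qualified_bw:
            return False                 # would oversubscribe the filesystem
        self.reservations[stream_id] = bw
        return True

    def release(self, stream_id):
        """Drop a reservation and return the bandwidth it held (0 if unknown)."""
        return self.reservations.pop(stream_id, 0)
```

For example, with a qualified bandwidth of 200, a reservation of 120 is admitted, a second request for 100 is refused, and a request for 80 succeeds because it exactly fills the remaining capacity.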

When ggd2 begins managing a filesystem, every node with access to that filesystem is notified. Each node in turn creates a dedicated system stream for that filesystem, which is called the non-GRIO stream. From that point on, all user and system I/O that does not have an explicit GRIO reservation is encapsulated by this stream and then managed by the GRIO scheduler. For a local filesystem, there is a single non-GRIO stream. For a shared filesystem, there is a non-GRIO stream on each node with access to the filesystem.


Note: The scheduling for all non-GRIO I/O within a GRIO-managed filesystem received from different applications and system services is on a first-come-first-served basis.

To keep the total throughput of the filesystem high even when there are active GRIO streams, ggd2 attempts to allocate the unreserved portion of the qualified bandwidth for use by non-GRIO applications. This bandwidth is effectively lent for short periods of time until ggd2 receives a new request for guaranteed-rate bandwidth, at which point it is reclaimed. GRIO applications have priority over non-GRIO applications.

The ggd2 daemon periodically adjusts the amount of bandwidth allocated to the individual non-GRIO streams for its managed filesystems. This functionality is referred to as the distributed bandwidth allocator (DBA). The DBA is responsible for determining how unreserved bandwidth is distributed between the nodes with access to the filesystem. By default, the DBA runs every two seconds, constantly allocating free bandwidth to nodes based on a range of dynamically monitored demand and utilization metrics.
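One DBA cycle can be pictured as a function that divides the unreserved bandwidth among the nodes. The sketch below uses a simple proportional-to-demand policy. That policy is an assumption chosen for illustration: the DBA's actual metrics and weighting are not described here. The function name and the shape of its input are hypothetical.

```python
def allocate_non_grio(unreserved_bw, demand_by_node):
    """Split unreserved bandwidth across nodes in proportion to demand.

    demand_by_node maps a node name to its recently observed non-GRIO
    demand (hypothetical metric, same unit as unreserved_bw).
    """
    if not demand_by_node:
        return {}
    total = sum(demand_by_node.values())
    if total == 0:
        # No node is asking for bandwidth, so share it evenly.
        share = unreserved_bw / len(demand_by_node)
        return {node: share for node in demand_by_node}
    return {
        node: unreserved_bw * demand / total
        for node, demand in demand_by_node.items()
    }
```

Running such a function on every cycle lets the allocation follow shifts in load: a node that becomes idle loses its share to busier nodes at the next cycle.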

Calls to reserve bandwidth may block until the next DBA cycle. Therefore, applications must be prepared for delays when setting up guaranteed-rate streams. Refer to grio_reserve(3X) for more information. To help manage this, you can use the -r option to cause ggd2 to keep a cache (or reserve) of bandwidth that is unavailable for non-GRIO use, from which new GRIO reservations can be processed directly. Any additional bandwidth that is available is free to be used by either future GRIO or DBA streams. Figure 2-1 represents these concepts.
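The division of the qualified bandwidth described above can be worked through as simple arithmetic. The function below is a sketch under stated assumptions: the name and parameters are hypothetical, and cache_bw stands for the amount held back by the reserve cache, expressed in the same unit as the qualified bandwidth.

```python
def partition_bandwidth(qualified_bw, grio_reserved, cache_bw):
    """Return (held_for_new_grio, available_to_dba) for one filesystem."""
    free = qualified_bw - grio_reserved
    if free < 0:
        raise ValueError("reservations exceed the qualified bandwidth")
    held = min(cache_bw, free)     # kept back for new GRIO reservations
    return held, free - held       # the remainder may be lent to non-GRIO streams
```

With a qualified bandwidth of 200, existing reservations of 120, and a cache of 20, this model holds 20 for new reservations and leaves 60 for the DBA to distribute. If reservations grow to 190, only 10 remains for the cache and nothing is left for non-GRIO streams.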

Figure 2-1. Qualified Bandwidth

A user process can request a reservation using the grio_reserve() and grio_reserve_fd() library calls. Requests are forwarded to the ggd2 that is actively managing the target volume domain. Requests to filesystems in the local domain are immediately sent to the local instance of ggd2. Requests to filesystems in the cluster domain are forwarded to the GRIO server, which may be running on a different node in the cluster.

GRIO Server Relocation and Recovery

Each instance of ggd2 maintains reservation and bandwidth state that must be kept consistent with one or more kernels.

If ggd2 fails, a new ggd2 instance will receive from the local kernel all of the information necessary to reestablish the following:

  • Local volume reservations

  • Cluster volume reservations (if the ggd2 that failed was the GRIO server for the cluster)

  • All of the DBA state for non-GRIO I/O

If the GRIO server node fails, a new GRIO server is automatically elected, and all of the cluster volume reservations and DBA state are reestablished by that instance of ggd2.

You can also choose to manually migrate the GRIO server to another CXFS administration node in the cluster.