Chapter 9. High-Performance File I/O

This chapter describes special modes of disk I/O: synchronous output, direct I/O, a delayed system buffer flush, and guaranteed-rate I/O.

Using Synchronous Output

You use synchronous disk output to prevent the IRIX kernel from deferring disk output.

About Buffered Output

When you open a disk file and do not specify the O_SYNC flag (see the open(2) reference page), a call to write() for that file descriptor returns as soon as the data has been copied to a buffer in the kernel address space.

The actual disk write may not take place until considerable time has passed. A common pool of disk buffers is used for all disk files. (The size of the pool is set by the nbuf system configuration variable, and defaults to approximately 2.5% of all physical memory.) Disk buffer management is integrated with the virtual memory paging mechanism. A daemon executes periodically and initiates output of buffered blocks according to the age of the data and the needs of the system.

The default management of disk output improves performance for the system in general but has three drawbacks:

  • All output data must be copied from the buffer in process address space to a buffer in the kernel address space. For small or infrequent writes, the copy time is negligible, but for large quantities of data it adds up.

  • You do not know when the written data is actually safe on disk. A system crash could prevent the output of a large amount of buffered data.

  • When the system does decide to flush output buffers to disk, it can generate a large quantity of I/O that monopolizes the disk channel for a long time, delaying other I/O operations.

You can force the writing of all pending output for a file by calling fsync() (see the fsync(2) reference page). This gives you a way of creating a known checkpoint of a file. However, fsync() blocks until all buffered writes are complete, which can take a long time. When using asynchronous I/O, you can make file synchronization asynchronous also (see “Assuring Data Integrity”).
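The checkpoint technique can be sketched as follows. This is a minimal illustration rather than part of the original examples: the function name write_with_checkpoints() and its parameters are hypothetical, and it uses only the standard open(), write(), fsync(), and close() calls. It writes a series of records and calls fsync() after every few of them, so that at most one interval of data is at risk in a crash.

```c
#define _POSIX_C_SOURCE 200809L /* request POSIX declarations from strict compilers */
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
|| Hypothetical helper: write `count` records of `recsize` bytes taken
|| from `rec` to `path`, calling fsync() after every `every` records.
|| Returns the number of checkpoints taken, or -1 on any error.
*/
int write_with_checkpoints(const char *path, const void *rec,
                           size_t recsize, int count, int every)
{
    int fd, i, checkpoints = 0;

    assert(every > 0);
    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (-1 == fd)
        return -1;
    for (i = 1; i <= count; ++i)
    {
        /* returns once the data is in a kernel buffer, not on disk */
        if (write(fd, rec, recsize) != (ssize_t)recsize)
        {
            close(fd);
            return -1;
        }
        if (0 == (i % every))
        {
            /* blocks until everything written so far is on disk */
            if (-1 == fsync(fd))
            {
                close(fd);
                return -1;
            }
            ++checkpoints;
        }
    }
    if (-1 == close(fd))
        return -1;
    return checkpoints;
}
```

Writing 1000 records with a checkpoint interval of 100 takes ten checkpoints. A larger interval reduces the time spent blocked in fsync() but increases the amount of data that a crash could lose.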

Requesting Synchronous Output

When you open a disk file specifying O_SYNC, each call to write() blocks until the data has been written to disk. This gives you a way of ensuring that all output is complete as it is created. If you combine O_SYNC access with asynchronous I/O, you can let the asynchronous process suffer the delay (see “About Asynchronous I/O”).

Synchronous output is still buffered output—data is copied to a kernel buffer before writing. The meaning of O_SYNC is that the file data is all present even if the system crashes. For this reason, each write to an O_SYNC file can cause a write of file metadata as well as the file data itself. (Specifying O_DSYNC causes a write of file data without waiting for the file metadata.) These extra writes can make synchronous output quite slow.

The O_SYNC option takes effect even when the amount of data you write is less than the physical blocksize of the disk, or when the output does not align with the physical boundaries of disk blocks. In order to guarantee writing of misaligned data, the kernel has to read disk blocks, update them, and write them back. As a result, synchronous output is slower when you write incomplete disk blocks (less than 512 bytes) or blocks that do not start on block boundaries.
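The following sketch shows one way to request synchronous output while avoiding the read-update-write penalty. It is an illustration under stated assumptions: the function name sync_write_blocks() is hypothetical, and the 512-byte block size is the figure quoted above, which may not match every device. The file is opened with O_SYNC or O_DSYNC, and every write() is a whole block that starts on a block boundary.

```c
#define _POSIX_C_SOURCE 200809L /* request POSIX declarations from strict compilers */
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define DISK_BLOCK 512  /* assumed physical block size, as quoted in the text */

/*
|| Hypothetical helper: write `nblocks` zero-filled whole blocks to `path`.
|| If data_only is nonzero the file is opened O_DSYNC (wait for file data
|| only), otherwise O_SYNC (wait for file data and metadata).
|| Returns the number of bytes written, or -1 on error.
*/
long sync_write_blocks(const char *path, int nblocks, int data_only)
{
    char block[DISK_BLOCK];
    int flags = O_WRONLY | O_CREAT | O_TRUNC | (data_only ? O_DSYNC : O_SYNC);
    long total = 0;
    int fd, i;

    assert(nblocks >= 0);
    memset(block, 0, sizeof block);
    fd = open(path, flags, 0644);
    if (-1 == fd)
        return -1;
    for (i = 0; i < nblocks; ++i)
    {
        /* each write() returns only after the data has reached the disk */
        if (write(fd, block, sizeof block) != (ssize_t)sizeof block)
        {
            close(fd);
            return -1;
        }
        total += (long)sizeof block;
    }
    if (-1 == close(fd))
        return -1;
    return total;
}
```

Because the file is truncated on open and each write is exactly one block, every transfer begins at a multiple of the block size. Timing the two modes against each other on the same file is a simple way to see the cost of the extra metadata writes.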

Using Direct I/O

You can bypass the kernel's buffer cache completely by using the option O_DIRECT. Under this option, writes to the file take place directly from your program's buffer to the device—the data is not copied to a buffer in the kernel first. In order to use O_DIRECT you are required to transfer data in quantities that are multiples of the disk blocksize, aligned on blocksize boundaries. (The requirements for O_DIRECT use are documented in the open(2) and fcntl(2) reference pages.)
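The constraints on a direct I/O transfer can be checked before the call is made. The sketch below is illustrative only: the structure dio_limits and the function dio_check() are hypothetical names, modeled on the three values that Example 9-1 reads with fcntl(F_DIOINFO). It also assumes that the minimum transfer size is the unit on which file offsets must be aligned; consult the open(2) and fcntl(2) reference pages for the authoritative rules.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical copy of the limits reported for a direct I/O file. */
struct dio_limits
{
    size_t mem;      /* required alignment of the buffer address, bytes */
    size_t miniosz;  /* minimum transfer size, bytes */
    size_t maxiosz;  /* maximum transfer size, bytes */
};

/*
|| Returns 0 when a transfer of `nbytes` from `buf` at file position
|| `offset` satisfies every limit, otherwise a negative code that
|| identifies the first rule broken.
*/
int dio_check(const struct dio_limits *lim, const void *buf,
              size_t nbytes, long long offset)
{
    assert(lim->mem > 0 && lim->miniosz > 0);
    if ((uintptr_t)buf % lim->mem)
        return -1;                      /* buffer address misaligned */
    if (nbytes < lim->miniosz)
        return -2;                      /* transfer too small */
    if (nbytes % lim->miniosz)
        return -3;                      /* not a multiple of the minimum */
    if (nbytes > lim->maxiosz)
        return -4;                      /* transfer too large */
    if (offset < 0 || (unsigned long long)offset % lim->miniosz)
        return -5;                      /* file offset misaligned (assumed rule) */
    return 0;
}
```

A program would fill in the limits once after opening the file, then call the check before each read() or write() while it is being debugged. The size tests return the same codes as the corresponding tests in Example 9-1.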

An O_DIRECT read() or write() is synchronous—control does not return until the disk operation is complete. Also, an O_DIRECT read() call always causes disk input—there is no input cache. However, you can open a file O_DIRECT and use the file descriptor for asynchronous I/O, so that the delays are taken by an asynchronous thread (see “About Asynchronous I/O”).

Direct I/O is required when you use guaranteed-rate I/O (see “Using Guaranteed-Rate I/O”).

Direct I/O Example

The program in Example 9-1 allows you to experiment with and compare buffered output, synchronous output, and direct output. An example of using it might resemble this:

> timex dirio -o /var/tmp/dout -m b -b 4096 -n 100
real        0.10
user        0.01
sys         0.02
> timex dirio -o /var/tmp/dout -m d -b 4096 -n 100
real        1.35
user        0.01
sys         0.06
> timex dirio -o /var/tmp/dout -m s -b 4096 -n 100
real        3.43
user        0.01
sys         0.09

 

Example 9-1. Source of Direct I/O Example

/*
|| dirio: program to test and demonstrate direct I/O.
||
|| dirio  [-o outfile] [ -m {b|s|d} ] [ -b bsize ] [ -n recs ] [ -i ]
||
||  -o outfile      output file pathname, default $TMPDIR/dirio.out
||
||  -m {b|s|d}      file mode: buffered (default), synchronous, or direct
||
||  -b bsize        blocksize for each write, default 512
||
||  -n recs         how many writes to do, default 1000
||
||  -i              display info from fcntl(F_DIOINFO)
||
*/
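#include <errno.h>      /* for errno, set by failing system calls */
#include <stdio.h>      /* for printf(), fprintf(), perror() */
#include <stdlib.h>     /* for getenv(), strtol(), malloc(3c) */
#include <string.h>     /* for strcpy(), strcat() */
#include <strings.h>    /* for bzero() */
#include <sys/types.h>  /* required by open() */
#include <unistd.h>     /* getopt(), write(), close(), fdatasync() */
#include <sys/stat.h>   /* required by open() */
#include <fcntl.h>      /* open() and fcntl() */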

int main(int argc, char **argv)
{
    char*       tmpdir;         /* ->name string of temp dir */
    char*       ofile = NULL; /* argument name of file path */
    int         oflag = 0;      /* -m b/s/d result */
    size_t      bsize = 512;    /* blocksize */
    void*       buffer;         /* aligned buffer */
    int         nwrites = 1000; /* number of writes */
    int         ofd;            /* file descriptor from open() */
    int         info = 0;       /* -i option default 0 */
    int         c;              /* scratch var for getopt */
    char        outpath[128];   /* build area for output pathname */    
    struct dioattr dio;
 
    /*
    || Get the options
    */
    while ( -1 != (c = getopt(argc,argv,"o:m:b:n:i")) )
    {
        switch (c)
        {
        case 'o': /* -o outfile */
        {
            ofile = optarg;
            break;
        }
        case 'm': /* -m mode */
        {
            switch (*optarg)
            {
            case 'b' : /* -m b buffered i.e. normal */
                oflag = 0;
                break;
            case 's' : /* -m s synchronous (but not direct) */
                oflag = O_SYNC;
                break;
            case 'd' : /* -m d direct */
                oflag = O_DIRECT;
                break;
            default:
                fprintf(stderr,"? -m %c\n",*optarg);
                return -1;          
            }
            break;
        }
        case 'b' : /* blocksize */
        {
            bsize = strtol(optarg, NULL, 0);
            break;
        }
        case 'n' : /* number of writes */
        {
            nwrites = strtol(optarg, NULL, 0);
            break;
        }
        case 'i' : /* -i */
        {
            info = 1;
            break;
        }
        default:
            return -1;
        } /* end switch */
    } /* end while */
    /*
    || Ensure a file path
    */
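    if (ofile)
        c = snprintf(outpath,sizeof(outpath),"%s",ofile);
    else
    {
        tmpdir = getenv("TMPDIR");
        if (!tmpdir)
            tmpdir = "/var/tmp";
        c = snprintf(outpath,sizeof(outpath),"%s/dirio.out",tmpdir);
    }
    if (c < 0 || c >= (int)sizeof(outpath))
    { /* the pathname did not fit in the build area */
        fprintf(stderr,"output pathname too long\n");
        return -1;
    }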
    /*
    || Open the file for output, truncating or creating it
    */
    oflag |= O_WRONLY | O_CREAT | O_TRUNC;
    ofd = open(outpath,oflag,0644);
    if (-1 == ofd)
    {
        char msg[256];
        sprintf(msg,"open(%s,0x%x,0644)",outpath,oflag);
        perror(msg);
        return -1;
    }
    /*
    || If applicable (-m d) get the DIOINFO for the file and display.
    */
    if (oflag & O_DIRECT)
    {
        if (-1 == fcntl(ofd,F_DIOINFO,&dio))
        {
            perror("fcntl(F_DIOINFO)");
            return -1;
        }
        if (info)
        {
        printf("dioattr.d_mem    : %8d (0x%08x)\n",dio.d_mem,dio.d_mem);
        printf("dioattr.d_miniosz: %8d (0x%08x)\n",dio.d_miniosz,dio.d_miniosz);
        printf("dioattr.d_maxiosz: %8d (0x%08x)\n",dio.d_maxiosz,dio.d_maxiosz);
        }
        if (bsize < dio.d_miniosz)
        {
            fprintf(stderr,"bsize %lu too small\n",(unsigned long)bsize);
            return -2;
        }
        if (bsize % dio.d_miniosz)
        {
            fprintf(stderr,"bsize %lu is not a miniosz-multiple\n",(unsigned long)bsize);
            return -3;
        }
        if (bsize > dio.d_maxiosz)
        {
            fprintf(stderr,"bsize %lu too large\n",(unsigned long)bsize);
            return -4;
        }
    }
    else
    { /* set a default alignment rule */
        dio.d_mem = 8;
    }
    /*
    || Get a buffer aligned the way we need it.
    */
    buffer = memalign(dio.d_mem,bsize);
    if (!buffer)
    {
        fprintf(stderr,"could not allocate buffer\n");
        return -5;
    }
    bzero(buffer,bsize);
    /*
    || Write the number of records requested as fast as we can.
    */
    for(c=0; c<nwrites; ++c)
    {
        if ( (ssize_t)bsize != write(ofd,buffer,bsize) )
        {
            char msg[80];
            sprintf(msg,"%d th write(%d,buffer,%lu)",c+1,ofd,(unsigned long)bsize);
            perror(msg);
            break;
        }
    }
    /*
    || To level the playing field, sync the file if not sync'd already.
    */
    if (0==(oflag & (O_DIRECT|O_SYNC)))
        fdatasync(ofd);
 
    close(ofd);
    return 0;
}


Using a Delayed System Buffer Flush

When your application has both clearly defined times when all unplanned disk activity should be prevented, and clearly defined times when disk activity can be tolerated, you can use the syssgi() function to control the kernel's automatic disk writes.

Prior to a critical section of length s seconds that must not be interrupted by unplanned disk writes, use syssgi() as follows:

syssgi(SGI_BDFLUSHCNT,s);

The kernel will not initiate any deferred disk writes for s seconds. At the start of a period when disk activity can be tolerated, initiate a flush of the kernel's buffered writes with syssgi() as follows:

syssgi(SGI_SSYNC);


Note: This technique is meant for use in a uniprocessor. Code executing in an isolated CPU of a multiprocessor is not affected by kernel disk writes (unless a large buffer flush monopolizes a needed bus or disk controller).


Using Guaranteed-Rate I/O

Under specific conditions, your program can demand a guaranteed rate of data transfer. You would use this feature, for example, to ensure input of data for a real-time video display, or to ensure adequate disk bandwidth for high-speed telemetry capture.

About Guaranteed-Rate I/O

Guaranteed-rate I/O (GRIO) allows a program to request a specific data bandwidth to or from a filesystem. The GRIO subsystem grants the request if the requested bandwidth is available from the hardware. For the duration of the grant, the application is assured of being able to move the requested amount of data per second. Assurance of this kind is essential to real-time data capture and digital media programming.

GRIO is a feature of the XFS filesystem support; EFS, the older IRIX file system, does not support GRIO. In addition, the optional subsystem eoe.sw.xfsrt must be installed. With IRIX 6.5, GRIO is supported on XLV volumes over disks or RAID systems.

GRIO is available only to programs that use direct I/O (see “Using Direct I/O”).

The concepts of GRIO are covered in the following sources, which you should examine:

IRIX Admin: Disks and Filesystems

Documents the administration of XFS and XLV in general, and GRIO volumes in particular.

grio(5)

Reference page giving an overview of GRIO use.

grio(1M)

Reference page for the administrator command for querying the state of the GRIO system.

ggd(1M)

Reference page for the GRIO grant daemon.

grio_disks(4)

Reference page for the configuration files prepared by the administrator to name GRIO devices.


About Types of Guarantees

GRIO offers two types of guarantee: a real-time (sometimes called “hard”) guarantee, and a non-real-time (or “soft”) guarantee. The real-time guarantee promises to subordinate every other consideration, including especially data integrity, to on-time delivery.

The two types of guarantee are effectively the same as long as no I/O read errors occur. When a read error occurs under a real-time guarantee, no error recovery is attempted; the read() function simply returns an error indication. Under a non-real-time guarantee, I/O error recovery is attempted, and this can cause a temporary failure to keep up with the guaranteed bandwidth.

You can qualify either type of guarantee as being Rotor scheduling, also known as Video On Demand (VOD). This indicates a particular, special use of a striped volume. These types of guarantee, and several other options, are described in detail in IRIX Admin: Disks and Filesystems and in the grio(5) reference page.

About Device Configuration

GRIO is permitted on a device managed by XFS. A real-time guarantee can only be supported on the real-time subvolume of a logical volume created by XLV. The real-time subvolume differs from the more common data subvolume in that it contains only data, no file system management data such as directories or inodes. The real-time subvolume of an XLV volume can span multiple disk partitions, and can be striped.

In addition, the predictive failure analysis feature and the thermal recalibration feature of the drive firmware must be disabled, as these can make device access times unpredictable. For other requirements, see IRIX Admin: Disks and Filesystems and the grio(5) reference page.

Creating a Real-time File

You can only request a hard guaranteed rate against a real-time disk file. A real-time disk file is identified by the fact that it is stored within the real-time subvolume of an XFS logical volume.

The file management information for all files in a volume (the directories as well as XFS management records) is stored in the data subvolume. A real-time subvolume contains only the data of real-time files. A real-time subvolume comprises an entire disk device or partition and uses a separate SCSI controller from the data subvolume. Because of these constraints, the GRIO facility can predict the data rate at which it can transfer the data of a real-time file.

You create a real-time file in the following steps, which are illustrated in Example 9-2.

  1. Open the file with the options O_CREAT, O_EXCL, and O_DIRECT. That is, the file must not exist at this point, and must be opened for direct I/O (see “Using Direct I/O”).

  2. Modify the file descriptor to set its extent size, which is the minimum amount by which the file will be extended when new space is allocated to it, and also to establish that the new file is a real-time file. This is done using fcntl() with the F_FSSETXATTR command. Check the value returned by fcntl(), as several errors can be detected at this point.

    The extent size must be chosen to match the characteristics of the disk; for example it might be the “stripe width” of a striped disk.

  3. Write any amount of data to the new file. Space will be allocated in the real-time subvolume instead of the data subvolume because of step (2). Check the result of the first write() call carefully, since this is another point at which errors could be detected.

Once a real-time file has been created, you can read and write it the same as any other file, except that it must always be opened with O_DIRECT. You can use a real-time file with asynchronous I/O, provided it is created with the PROC_SHARE_GUAR option.

Example 9-2. Function to Create a Real-time File

#include <stdio.h>              /* perror() */
#include <strings.h>            /* bzero() */
#include <unistd.h>             /* close() */
#include <sys/fcntl.h>
#include <sys/fs/xfs_itable.h>
int createRealTimeFile(char *path, __uint32_t esize)
{
   struct fsxattr attr;
   int rtfd;
   bzero((void*)&attr,sizeof(attr));
   attr.fsx_xflags = XFS_XFLAG_REALTIME;
   attr.fsx_extsize = esize;
   /* O_CREAT needs a mode argument; O_RDWR lets the caller write */
   rtfd = open(path, O_RDWR | O_CREAT | O_EXCL | O_DIRECT, 0644);
   if (-1 == rtfd)
      {perror("open new file"); return -1; }
   if (-1 == fcntl(rtfd, F_FSSETXATTR, &attr) )
      {perror("fcntl set rt & extent"); close(rtfd); return -1; }
   return rtfd; /* first write allocates real-time subvolume space */
}


Requesting a Guarantee

To obtain a guaranteed rate, a program places a reservation for a specified part of the I/O capacity of a file or a filesystem. In the request, the program specifies

  • the file or filesystem to be used

  • the start time and duration of the reservation

  • the time unit of interest, typically 1 second

  • the amount of data required in any one unit of time

For example, a reservation might specify: starting now, for 90 minutes, 1 megabyte per second. A process places a reservation by calling either grio_request_file() or grio_request_fs() (refer to the grio_request_file(3X) and grio_request_fs(3X) reference pages).

The GRIO daemon ggd keeps information on the transfer capacity of all XFS volumes, as well as the capacity of the controllers and busses to which they are attached. When you request a reservation, XFS tests whether it is possible to transfer data at that rate, from that file, during that time period.

This test considers the capacity of the hardware as well as any other reservations that apply during the same time period to the same subvolume, drives, or controllers. Each reservation consumes some part of the total capacity.

When XFS predicts that the guaranteed rate can be met, it accepts the reservation. Over the reservation period, the available bandwidth from the disk is reduced by the promised rate. Other processes can place reservations against any capacity that remains.

If XFS predicts that the guaranteed rate cannot be met at some time in the reservation period, XFS returns the maximum data rate it could supply. The program can reissue the request for that available rate. However, this is a new request that is evaluated afresh.
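The accept-or-counteroffer behavior can be pictured with a toy model. The sketch below is not the algorithm used by ggd or XFS, and the structure resv and the function admissible_rate() are hypothetical names, not part of the GRIO API. It assumes a single device with one fixed capacity, and treats each reservation as a constant rate over an interval of whole seconds. A request is granted in full only if the capacity left over at the busiest moment of its interval is sufficient; otherwise the model reports the largest rate that could be sustained for the whole interval.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record of one reservation; not the GRIO API structure. */
struct resv
{
    long start;     /* first second covered by the reservation */
    long duration;  /* length of the reservation in seconds */
    long rate;      /* bytes per second promised */
};

/* Total rate already promised at instant t. */
static long load_at(const struct resv *r, size_t n, long t)
{
    long sum = 0;
    size_t i;
    for (i = 0; i < n; ++i)
        if (t >= r[i].start && t < r[i].start + r[i].duration)
            sum += r[i].rate;
    return sum;
}

/*
|| Returns the rate that can be sustained for the whole of `req`:
|| req->rate when it fits, otherwise the smaller maximum available.
*/
long admissible_rate(long capacity, const struct resv *r, size_t n,
                     const struct resv *req)
{
    long end = req->start + req->duration;
    long peak = load_at(r, n, req->start);
    long avail;
    size_t i;

    assert(capacity >= 0 && req->duration > 0);
    /* the load can rise only where an existing reservation starts */
    for (i = 0; i < n; ++i)
        if (r[i].start > req->start && r[i].start < end)
        {
            long l = load_at(r, n, r[i].start);
            if (l > peak)
                peak = l;
        }
    avail = capacity - peak;
    if (avail < 0)
        avail = 0;
    return (req->rate <= avail) ? req->rate : avail;
}
```

In this model a caller compares the returned rate with the rate it asked for. When the two differ, the caller can submit a fresh request for the smaller rate, and that request is evaluated against whatever reservations exist at that time.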

During the reservation period, the process can use read() and write() to transfer up to the guaranteed number of bytes in each time unit. XFS raises the priority of requests as needed in order to ensure that the transfers take place. However, a request that would transfer more than the promised number of bytes within one time unit is blocked until the start of the next time unit.
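The effect of the per-unit limit on a stream of requests can be worked through with a small model. This sketch is illustrative only: schedule_units() is a hypothetical function, and it assumes that requests are issued back to back starting in time unit 0, that a blocked request is carried out at the start of the next unit, and that a single request larger than the guaranteed amount can never be satisfied. The text does not state how such a request is actually handled.

```c
#include <assert.h>
#include <stddef.h>

/*
|| Toy model of the per-unit limit. unit_out[i] receives the time unit
|| in which request i is transferred. Returns the last unit used, or -1
|| if a single request is larger than the guaranteed amount per unit.
*/
long schedule_units(long quota, const long *sizes, size_t n, long *unit_out)
{
    long unit = 0;  /* current time unit */
    long used = 0;  /* bytes already transferred in this unit */
    size_t i;

    assert(quota > 0);
    for (i = 0; i < n; ++i)
    {
        if (sizes[i] > quota)
            return -1;
        if (used + sizes[i] > quota)
        {   /* would exceed the promise: wait for the next unit */
            ++unit;
            used = 0;
        }
        used += sizes[i];
        unit_out[i] = unit;
    }
    return unit;
}
```

With a guarantee of 1000 bytes per unit and requests of 400, 400, 400, 1000, and 100 bytes, the model places the requests in units 0, 0, 1, 2, and 3. A program that sizes its requests to divide the guaranteed amount evenly avoids leaving part of each unit unused.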

Releasing a Guarantee

A guarantee ends under any of three circumstances:

  • when the process calls grio_unreserve_bw() (see the grio_unreserve_bw(3X) reference page)

  • when the requested duration expires

  • when all file descriptors held by the requesting process that refer to the guaranteed file are closed (an exception is discussed in the next topic)

When a guarantee ends, the guaranteed transfer capacity becomes available for other processes to reserve. When a guarantee expires but the file is not closed, the file remains usable for ordinary I/O, with no guarantee of rate.