Chapter 8. Device Driver/Kernel Interface

The programming interface between a device driver and the IRIX kernel is founded on the AT&T System V Release 4 DDI/DKI, and it remains true that a working device driver for an SVR4 system can be ported to IRIX with relatively little difficulty. However, as both SGI hardware and the IRIX kernel have evolved into far greater complexity and sophistication, the driver interface has been extended. A driver can now call upon nearly as many IRIX extended kernel functions as it can SVR4-compatible ones.

The function prototypes and detailed operation of all kernel functions are documented in the reference pages in volume “D.” The aim of this chapter is to provide background, context, and an overview of the interface under the following headings:

Important Data Types

In order to understand the driver/kernel interface, you need first of all to understand the data types with which it deals.

Hardware Graph Types

As discussed under “Hardware Graph Features” in Chapter 2, the hwgraph is composed of vertexes connected by labelled edges. The functions for working with the hwgraph are discussed under “Hardware Graph Management”.

Vertex Handle Type

There is no data type associated with the edge as such. The data type of a graph vertex is the vertex_hdl_t, an opaque, 32-bit number. When you create a vertex, a vertex_hdl_t is returned. When you store data in a vertex, or get data from one, you pass a vertex_hdl_t as the argument.

Vertex Handle and dev_t

The device number type, dev_t, is an important type in classical driver design (see “Device Number Types”). In IRIX 6.4, the dev_t and the vertex_hdl_t are identical. That is, when a driver is called to open or operate a device that is represented as a vertex in the hardware graph, the value passed to identify the device is simply the handle to the hwgraph vertex for that device.

When a driver is called to open a device that is only represented as a special file in /dev (as in IRIX 6.3 and earlier—there are no such devices supported by IRIX 6.4, but such support is provided for third-party drivers in IRIX 6.5), the identifying value is an o_dev_t, containing major and minor numbers and identical to the traditional dev_t.

Graph Error Numbers

Most hwgraph functions have graph error codes as their explicit result type. The graph_error_t is an enumeration declared in sys/graph.h (included by sys/hwgraph.h) having these values:

GRAPH_SUCCESS

Operation successful. This success value is 0, as is conventional in C programming.

GRAPH_DUP

Data to be added already exists.

GRAPH_NOT_FOUND

Data requested does not exist.

GRAPH_BAD_PARAM

Typically a null value where an address is required, or other unusable function parameter.

GRAPH_HIT_LIMIT

Arbitrary limit on, for example, number of edges.

GRAPH_CANNOT_ALLOC

Unable to allocate memory to expand a vertex or other data structure, possibly because "no sleep" was specified.

GRAPH_ILLEGAL_REQUEST

Improper or impossible request.

GRAPH_IN_USE

Cannot deallocate vertex because there are references to it.


Address Types

Device drivers deal with addresses in different address spaces. When you store individual addresses, it is a good idea to use a data type specific to the address space. The following types are declared in sys/types.h to use for pointer variables:

caddr_t 

Any memory (“core”) address in user or kernel space.

daddr_t 

A disk offset or address (64 bits).

paddr_t 

A physical memory address.

iopaddr_t 

An address in some I/O bus address space.

It is a very good idea to always store a pointer in a variable with the correct type. It makes the intentions of the program more understandable, and helps you think about the complexities of address translation.

Address/Length Lists

An address/length list, or alenlist, is a software object you use to store the address and size of each segment of a buffer. An alenlist is a list in which each list item is a pair composed of an address and a related length. All the addresses in the list refer to the same address space, whether that is a user virtual space, the kernel virtual space, physical memory space, or the address space of some I/O bus. An alenlist cursor is a pointer that ranges over the list, selecting one pair after another.

Figure 8-1. Address/Length List Concepts

Address/Length List Concepts

The conceptual relationship between an alenlist and a buffer is illustrated in Figure 8-1. A buffer area that is a single contiguous segment in virtual memory may consist of scattered page frames in physical memory. The alenlist_t data type is a pointer to an alenlist.

The kernel provides a variety of functions for creating alenlists, for loading them with addresses and lengths, and for translating the addresses (see “Using Address/Length Lists”). These functions and the alenlist_t data type are declared in sys/alenlist.h.

Structure uio_t

The uio_t structure describes data transfer for a character device:

  • The pfxread() and pfxwrite() entry points receive a uio_t that describes the buffer of data.

  • Within a pfxioctl() entry point, you might construct a uio_t to represent data transfer for control purposes.

  • In a hybrid character/block driver, the physiock() function translates a uio_t into a buf_t for use by the pfxstrategy() entry point.

The fields and values in a uio_t are declared in sys/uio.h, which is included by sys/ddi.h. For a detailed discussion, see the uio(D4) reference page. Typically the contents of the uio_t reflect the buffer areas that were passed to a read(), readv(), write(), or writev() call (see the read(2) and write(2) reference pages).

Data Location and the iovec_t

One uio_t describes data transfer to or from a single address space, either the address space of a user process or the kernel address space. The address space is indicated by a flag value, either UIO_USERSPACE or UIO_SYSSPACE, in the uio_segflg field.

The total number of bytes remaining to be transferred is given in field uio_resid. Initially this is the total requested transfer size.

Although the transfer is to a single address space, it can be directed to multiple segments of data within the address space. Each segment of data is described by a structure of type iovec_t. An iovec_t contains the virtual address and length of one segment of memory.

The number of segments is given in field uio_iovcnt. The field uio_iov points to the first iovec_t in an array of iovec_t structures, each describing one segment of data. The total size in uio_resid is the sum of the segment sizes.

For a simple data transfer, uio_iovcnt contains 1, and uio_iov points to a single iovec_t describing a buffer of 1 or more bytes. For a complicated transfer, the uio_t might describe a number of scattered segments of data. Such transfers can arise in a network driver where multiple layers of message header data are added to a message at different levels of the software.

Use of the uio_t

In the pfxread() and pfxwrite() entry points, you can test uio_segflg to see if the data is destined for user space or kernel space, and you can save the initial value of uio_resid as the requested length of the transfer.

In a character driver, you fetch or store data using functions that both use and modify the uio_t. These functions are listed under “Transferring Data Through a uio_t Object”. When data is not immediately available, you should test for the FNDELAY or FNONBLOCK flags in uio_fmode, and return when either is set rather than sleeping.
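
For example, the opening lines of a read entry point might apply these tests as in the following sketch. The hypo_ prefix and the hypo_data_ready() helper are hypothetical; only the uio_t fields and flag names are taken from this discussion.

#include <sys/types.h>
#include <sys/errno.h>
#include <sys/file.h>
#include <sys/uio.h>
#include <sys/cred.h>
#include <sys/ddi.h>

extern int hypo_data_ready(dev_t dev);   /* hypothetical device-specific test */

int hyporead(dev_t dev, uio_t *uiop, cred_t *crp)
{
   size_t requested = uiop->uio_resid;                  /* total bytes the caller asked for */
   int to_user = (uiop->uio_segflg == UIO_USERSPACE);   /* destination is user space? */

   /* honor non-blocking opens: return instead of sleeping for data */
   if (!hypo_data_ready(dev) && (uiop->uio_fmode & (FNDELAY | FNONBLOCK)))
      return EAGAIN;

   /* ... wait for data, then transfer up to "requested" bytes to the user
      or kernel destination, using the functions described under
      "Transferring Data Through a uio_t Object" ... */
   return 0;
}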

Structure buf_t

The buf_t structure describes a block data transfer. It is designed to represent the transfer (in or out) of a sequence of adjacent, fixed-size blocks between a random-access device and a block of contiguous memory. The size of one device block is NBPSCTR, declared in sys/param.h. For a detailed discussion of the buf_t, see the buf(D4) reference page.

The buf_t is used internally in IRIX by the paging I/O system to manage queues of physical pages, and by filesystems to manage queues of pages of file data. The paging system and filesystems are the primary clients of the pfxstrategy() entry point to a block device driver, so it is only natural that a buf_t pointer is the input argument to pfxstrategy().


Tip: The idbg kernel debugging tool has several functions related to displaying the contents of buf_t objects. See “Commands to Display buf_t Objects” in Chapter 10.


Fields of buf_t

The fields of the buf_t are declared in sys/buf.h, which is included by sys/ddi.h. This header file also declares the names of many kernel functions that operate on buf_t objects. (Many of those functions are not supported as part of the DDI/DKI. You should only use kernel functions that have reference pages.)

Because buf_t is used by so many software components, it has many fields that are not relevant to device driver needs, as well as some fields that have multiple uses. The relevant fields are summarized in Table 8-1.

Table 8-1. Accessible Fields of buf_t Objects

Field Name

Access

Purpose and Contents

b_edev 

read-only

dev_t giving device major and minor numbers.

b_flags 

read-only

Operational flags; for a detailed list see buf(D4) .

b_forw, b_back, av_forw, av_back 

read-write

Queuing pointers, available for driver use within the pfxstrategy() routine.

b_un.b_addr 

read-only

Kernel virtual address of the buffer, valid when the BP_ISMAPPED() test of b_flags is true (see "Buffer Location and b_flags").

b_bcount 

read-only

Number of bytes to transfer.

b_blkno 

read-only

Starting logical block number on device (for a disk, relative to the partition that the device represents).

b_iodone 

read-write

Address of a driver internal function to be called on I/O completion.

b_resid 

read-write

Number of bytes not transferred, set at completion to 0 unless an error occurs.

b_error 

read-write

Error code, set at completion of I/O.

No other fields of the buf_t are designed for use by a driver. In Table 8-1, “read-only” access means that the driver should never change this field in a buf_t that is owned by the kernel. When the driver is working with a buf_t that the driver has allocated (see “Allocating buf_t Objects and Buffers”) the driver can do what it likes.

Using the Logical Block Number

The logical block number is the number of the 512-byte block in the device. The “device” is encoded by the minor device number that you can extract from b_edev. It might be a complete device surface, or it might be a partition within a larger device (for example, the IRIX disk device drivers support different minor device numbers for different disk partitions).

The pfxstrategy() routine may have to translate the logical block number based on the driver's information about device partitioning and device geometry (sector size, sectors per track, tracks per cylinder).

Buffer Location and b_flags

The data buffer represented by a buf_t can be in one of two places, depending on bits in b_flags.

When the macro BP_ISMAPPED(buf_t-address) returns true, the buffer is in kernel virtual memory and its virtual address is in b_un.b_addr.

When BP_ISMAPPED(buf_t-address) returns false, the buffer is described by a chain of pfdat structures (declared in sys/pfdat.h, but containing no fields of any use to a device driver). In this case, b_un.b_addr contains only an offset into the first page frame of the chain. See “Managing Buffer Virtual Addresses” for a method of mapping an unmapped buffer.

Lock and Semaphore Types

The header files sys/sema.h and sys/types.h declare the data types for locks and synchronization objects, including the following:

lock_t 

Basic lock, or spin-lock, used with LOCK() and related functions.

mutex_t 

Sleeping lock, used for mutual exclusion between upper-half instances.

sema_t 

Semaphore object, used for general locking.

mrlock_t 

Reader-writer lock, used with RW_RDLOCK() and related functions.

sv_t 

Synchronization variable, used with SV_WAIT and related functions.

These lock types should be treated as opaque objects because their contents can change from release to release (and in fact their contents are different in IRIX 6.2 from previous releases).

The families of locking and synchronization functions contain functions for allocating, initializing, and freeing each type of lock. See “Waiting and Mutual Exclusion”.

Device Number Types

In the /dev filesystem (but not in the /hw filesystem), two numbers are carried in the inode of a device special file: a major device number of up to 9 bits, and a minor device number of up to 18 bits. The numbers are assigned when the device special file is created, either by the /dev/MAKEDEV script or by the system administrator. The contents and meaning of device numbers are discussed under "Devices as Files" in Chapter 2.

In traditional UNIX practice, the dev_t has been an unsigned integer containing the values of the major and minor numbers for the device that is to be used. When a device is represented in IRIX only as a device special file in /dev, this is still the case.

When a device is represented by a vertex of the hwgraph, visible as a name in the /hw filesystem, the major number is always 0 and the minor number is arbitrary. When a device is opened as a special file in /hw, the dev_t received by the driver is composed of major 0 and an arbitrary minor number. In fact, the dev_t is a vertex_hdl_t, a handle to the hwgraph vertex that represents the device.

Historical Use of the Device Numbers

Historically, a driver used the major device number to learn which device driver had been called. This was important only when the driver supported multiple interfaces, for example both character and block access to the same hardware.

Also historically, the driver used the minor device number to distinguish one hardware unit from another. A typical scheme was to have an array of device-information structures indexed by the minor number. In addition, mode of use options were encoded in the minor number, as described under “Minor Device Number” in Chapter 2.

You can still use major and minor numbers the same way, but only when the device is represented by a device special file that is created with the mknod command, so that it contains meaningful major and minor numbers. The kernel functions related to dev_t use are summarized in Table 8-2.

Table 8-2. Functions to Manipulate Device Numbers

Function

Header Files

Purpose

etoimajor(D3)  

ddi.h

Convert external to internal major device number.

getemajor(D3)  

ddi.h

Get external major device number.

geteminor(D3)  

ddi.h

Get external minor device number.

getmajor(D3)  

ddi.h

Get internal major device number.

getminor(D3)  

ddi.h

Get internal minor device number.

itoemajor(D3)  

ddi.h

Convert internal to external major device number.

makedevice(D3)  

ddi.h

 Make device number from major and minor numbers.

The most important of the functions in Table 8-2 are

  • getemajor(), which extracts the major number from a dev_t and returns it as a major_t

  • geteminor(), which extracts the minor number from a dev_t and returns it as a minor_t

The makedevice() function, which combines a major_t and a minor_t to form a traditional dev_t, is useful only when creating a “clone” driver (see “Support for CLONE Drivers” in Chapter 22).

Contemporary Device Number Use

When the device is represented as a hwgraph vertex, the driver does not receive useful major and minor numbers. Instead, the driver uses the device-unique information that the driver itself has stored in the hwgraph vertex.

An historical driver makes only historical use of the dev_t, using the functions listed in the preceding topic. Such a driver makes no use of the hwgraph, and can only manage devices that are opened as device special files in /dev.

A contemporary driver creates hwgraph vertexes to represent its devices (see “Extending the hwgraph”); makes no use of the major and minor device numbers; and uses the dev_t as a handle to a hwgraph vertex. Such a driver can only manage devices that are opened as device special files in /hw, or devices that are opened through symbolic links in /dev that refer to /hw.

It might be necessary to merge the two approaches. This can be done as follows. In each upper-half entry point, apply getemajor() to the dev_t. When the result is nonzero, the dev_t is conventional and geteminor() will return a useful minor number. Use it to locate the device-specific information.

When getemajor() returns 0, the dev_t is a vertex handle. Use device_info_get() to retrieve the address of device-specific information.
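
The following sketch shows one way to code that test, reusing the devInfo_t structure of Example 8-1; the hypo_devtab table indexed by minor number is hypothetical.

#include <sys/types.h>
#include <sys/ddi.h>
#include <sys/hwgraph.h>

typedef struct devInfo_s devInfo_t;        /* as defined in Example 8-1 */
extern devInfo_t *hypo_devtab[];           /* hypothetical table indexed by minor number */

devInfo_t *hypo_find_devinfo(dev_t dev)
{
   if (getemajor(dev) != 0) {
      /* conventional dev_t from a /dev special file: index by minor number */
      return hypo_devtab[geteminor(dev)];
   }
   /* major 0: the dev_t is a hwgraph vertex handle */
   return (devInfo_t *)device_info_get(dev);
}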

Important Header Files

The header files that are frequently needed in device driver source modules are summarized in Table 8-3.

Table 8-3. Header Files Often Used in Device Drivers

Header File

Reason for Including

sys/alenlist.h 

The address/length list type and related functions.

sys/buf.h 

The buf_t structure and related constants and functions (included by sys/ddi.h).

sys/cmn_err.h 

The cmn_err() function.

sys/conf.h 

The constants used in the pfxdevflags global.

sys/ddi.h 

Many kernel functions declared. Also includes sys/types.h, sys/uio.h, and sys/buf.h.

sys/debug.h 

Defines the ASSERT macro and others.

sys/dmamap.h 

Data types and kernel functions related to DMA mapping.

sys/edt.h 

Declares the edt_t type passed to pfxedtinit().

sys/eisa.h 

EISA-bus hardware constants and EISA kernel functions.

sys/errno.h 

Names for all system error codes.

sys/file.h 

Names for file mode flags passed to driver entry points.

sys/hwgraph.h 

Hardware graph objects and related functions.

sys/immu.h 

Types and macros used to manage virtual memory and some kernel functions.

sys/kmem.h 

Constants like KM_SLEEP used with some kernel functions.

sys/ksynch.h 

Functions used for sleep-locks.

sys/log.h 

Types and functions for using the system log.

sys/major.h 

Names for assigned major device numbers.

sys/mman.h 

Constants and flags used with mmap() and the pfxmmap() entry point.

sys/param.h 

Constants like PZERO used with some kernel functions.

sys/PCI/pciio.h 

PCI bus interface functions and constants.

sys/pio.h 

VME PIO functions.

sys/poll.h 

Types and functions for pollhead allocation and poll callback.

sys/scsi.h 

Types and functions used to call the inner SCSI driver.

sys/sema.h 

Types and functions related to semaphores, mutex locks, and basic locks.

sys/stream.h 

STREAMS standard functions and data types.

sys/strmp.h 

STREAMS multiprocessor functions.

sys/sysmacros.h 

Macros for conversion between bytes and pages, and similar values.

sys/systm.h 

Kernel functions related to system operations.

sys/types.h 

Common data types and types of system objects (included by sys/ddi.h).

sys/uio.h 

The uio_t structure and related functions (included by sys/ddi.h).

sys/vmereg.h 

VME bus hardware constants and VME-related functions.


Kernel Memory Allocation

A device or STREAMS driver can allocate memory statically, as global variables in the driver module, and this is a good way to allocate any object that is always needed and has a fixed size.

When the number or size of an object can vary, but can be determined at initialization time, the driver can allocate memory in the pfxinit(), pfxedtinit(), pfxattach(), or pfxstart() entry point.

You can allocate memory dynamically in any upper-half entry point. When this is necessary, it should be done in an entry point that is called infrequently, such as pfxopen(). The reason is that memory allocation is subject to unpredictable delays. As a general rule, you should avoid the need to allocate memory in an interrupt handler.

General-Purpose Allocation

General-purpose allocation uses the kmem_alloc() function and associated functions summarized in Table 8-4.

Table 8-4. Functions for Kernel Virtual Memory

Function Name

Header Files

Purpose

kmem_alloc(D3)  

kmem.h & types.h

Allocate space from kernel free memory.

kmem_free(D3)  

kmem.h & types.h

Free previously allocated kernel memory.

kmem_zalloc(D3)  

kmem.h & types.h

Allocate and clear space from kernel free memory.

The most important of these functions is kmem_alloc(). You use it to allocate blocks of virtual memory at any time. It offers these important options, controlled by a flag argument:

  • Sleeping or not sleeping when space is not available. You specify not-sleeping when holding a basic lock, but you must be prepared to deal with a return value of NULL.

  • Physically-contiguous memory. The memory allocated is virtual, and when it spans multiple pages, the pages are not necessarily adjacent in physical memory. You need physically contiguous pages when doing DMA with a device that cannot do scatter/gather. However, contiguous memory is harder to get as the system runs, so it is best to obtain it in an initialization routine.

  • Cache-aligned memory. By requesting memory that is a multiple of a cache line in size, and aligned on a cache-line boundary, you ensure that DMA operations will affect the fewest cache lines (see “Setting Up a DMA Transfer”).

The kmem_zalloc() function takes the same options, but offers the additional service of zero-filling the allocated memory.
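
A short sketch of these calls follows, under the hypothetical assumption of a per-device structure plus a separately allocated work area; only the KM_SLEEP and KM_NOSLEEP flags described above are used.

#include <sys/types.h>
#include <sys/errno.h>
#include <sys/kmem.h>

struct hypo_state {          /* hypothetical per-device structure */
   int   unit;
   char *workarea;
};

int hypo_setup(int unit, size_t worksize, struct hypo_state **out)
{
   struct hypo_state *sp;

   /* zero-filled allocation; with KM_SLEEP the call may delay but never fails */
   sp = (struct hypo_state *)kmem_zalloc(sizeof(*sp), KM_SLEEP);
   sp->unit = unit;

   /* a caller that cannot sleep passes KM_NOSLEEP and must check for NULL */
   sp->workarea = (char *)kmem_alloc(worksize, KM_NOSLEEP);
   if (sp->workarea == NULL) {
      kmem_free(sp, sizeof(*sp));
      return ENOMEM;
   }
   *out = sp;
   return 0;
}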

In porting an old driver you may find use of allocation calls beginning with “kern.” Calls to the “kern” group of functions should be upgraded as follows:

kern_malloc(n)

Change to kmem_alloc(n,KM_SLEEP).

kern_calloc(n,s)

Change to kmem_zalloc(n*s,KM_SLEEP)

kern_free(p)

Change to kmem_free(p)


Allocating Memory in Specific Nodes of an Origin2000 System

In the nonuniform memory of an Origin2000 system, there is a time penalty for access to memory that is physically located in a node different from the node where the code is executing. However, kmem_alloc() attempts to allocate memory in the same node where the caller is executing. The pfxedtinit() and pfxattach() entry points execute in the node that is closest to the hardware device. If you allocate per-device structures in these entry points using kmem_alloc(), the structures will normally be in memory on the same node as the device. This provides the best performance for the interrupt handler, which also executes in the closest node to the device.

Other upper-half entry points execute in the node used by the process that called the entry point. If you allocate memory in the pfxopen() entry point, for example, that memory will be close to the user process.

When it is essential to allocate memory in a specific node and to fail if memory in that node is not available, you can use one of the functions summarized in Table 8-5.

Table 8-5. Functions for Kernel Memory In Specific Nodes

Function Name

Header Files

Purpose

kmem_alloc_node()

kmem.h & types.h

Allocate space from kernel free memory in specific node.

kmem_zalloc_node()

kmem.h & types.h

Allocate and clear space from kernel free memory in specific node.

These functions are available in all systems. In systems with a uniform memory, they behave the same as the normal kernel allocation functions.

Allocating Objects of Specific Kinds

The kernel provides a number of functions with the purpose of allocating and freeing objects of specific kinds. Many of these are variants of kmem_alloc() and kmem_free(), but others use special techniques suited to the type of object.

Allocating pollhead Objects

Table 8-6 summarizes the functions you use to allocate and free the pollhead structure that is used within the pfxpoll() entry point (see “Entry Point poll()” in Chapter 7). Typically you would call phalloc() while initializing each minor device, and call phfree() in the pfxunload() entry point.

Table 8-6. Functions for Allocating pollhead Structures

Function Name

Header Files

Purpose

phalloc(D3)  

ddi.h & kmem.h & poll.h

Allocate and initialize a pollhead structure.

phfree(D3)  

ddi.h & poll.h

Free a pollhead structure.
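
The allocation and release might be coded as in the following sketch; the unit count and array are hypothetical, and the exact calling sequence of phalloc() is given in the phalloc(D3) reference page.

#include <sys/types.h>
#include <sys/kmem.h>
#include <sys/poll.h>
#include <sys/ddi.h>

#define HYPO_UNITS 4                         /* hypothetical number of minor devices */
static struct pollhead *hypo_ph[HYPO_UNITS];

void hypo_init_poll(void)                    /* called while initializing each device */
{
   int u;
   for (u = 0; u < HYPO_UNITS; u++)
      hypo_ph[u] = phalloc(KM_SLEEP);        /* one pollhead per minor device */
}

int hypounload(void)                         /* pfxunload() releases them */
{
   int u;
   for (u = 0; u < HYPO_UNITS; u++)
      if (hypo_ph[u])
         phfree(hypo_ph[u]);
   return 0;
}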


Allocating Semaphores and Locks

There are symmetrical pairs of functions to allocate and free all types of lock and synchronization objects. These functions are summarized together with the other locking functions under “Waiting and Mutual Exclusion”.

Allocating buf_t Objects and Buffers

The argument to the pfxstrategy() entry point is a buf_t structure that describes a buffer (see “Entry Point strategy()” in Chapter 7 and “Structure buf_t”).

Ordinarily, both the buf_t and the buffer are allocated and initialized by the kernel or the filesystem that calls pfxstrategy(). However, some drivers need to create a buf_t and associated buffer for special uses. The functions summarized in Table 8-7 are used for this.

Table 8-7. Functions for Allocating buf_t Objects and Buffers

Function Name

Header Files

Purpose

geteblk(D3)  

ddi.h

Allocate a buf_t and a buffer of 1024 bytes.

ngeteblk(D3)  

ddi.h

Allocate a buf_t and a buffer of specified size.

brelse(D3)  

ddi.h

Return a buffer header and buffer to the system.

getrbuf(D3)  

ddi.h

Allocate a buf_t with no buffer.

freerbuf(D3)  

ddi.h

Free a buf_t with no buffer.

To allocate a buf_t and its associated buffer in kernel virtual memory, use either geteblk() or ngeteblk(). Free this pair of objects using brelse(), or by calling biodone().

You can allocate a buf_t to describe an existing buffer—one in user space, statically allocated in the driver, or allocated with kmem_alloc()—using getrbuf(). Free such a buf_t using freerbuf().
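
The second case might look like the following sketch, which wraps an existing kernel buffer in a buf_t for a private call to the driver's own strategy routine; hypostrategy() and the flag settings shown are illustrative, and biowait() is used to wait for the completion signaled by biodone().

#include <sys/types.h>
#include <sys/buf.h>
#include <sys/kmem.h>
#include <sys/ddi.h>

extern int hypostrategy(struct buf *bp);     /* the driver's own strategy entry point */

void hypo_private_io(caddr_t kbuf, size_t len, dev_t dev, daddr_t blkno)
{
   buf_t *bp = getrbuf(KM_SLEEP);            /* a buf_t with no buffer attached */

   bp->b_un.b_addr = kbuf;                   /* describe the existing kernel buffer */
   bp->b_bcount    = len;
   bp->b_edev      = dev;
   bp->b_blkno     = blkno;
   bp->b_flags     = B_READ;                 /* direction; other flags may apply, see buf(D4) */

   (void) hypostrategy(bp);
   biowait(bp);                              /* wait until the driver calls biodone(bp) */

   freerbuf(bp);
}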

Transferring Data

The device driver executes in the kernel virtual address space, but it must transfer data to and from the address space of a user process. The kernel supplies two kinds of functions for this purpose:

  • functions that transfer data between driver variables and the address space of the current process

  • functions that transfer data between driver variables and the buffer described by a uio_t object


    Warning: The use of an invalid address in kernel space with any of these functions causes a kernel panic.


All functions that reference an address in user process space can sleep, because the page of process space might not be resident in memory. As a result, such functions cannot be used while holding a basic lock, and should be avoided in an interrupt handler.

General Data Transfer

The kernel supplies functions for clearing and copying memory within the kernel virtual address space, and between the kernel address space and the address space of the user process that is the current context. These general-purpose functions are summarized in Table 8-8.

Table 8-8. Functions for General Data Transfer

Function Name

Header Files

Purpose

bcopy(D3)  

ddi.h

Copy data between address locations in the kernel.

bzero(D3)  

ddi.h

Clear memory for a given number of bytes.

copyin(D3)  

ddi.h

Copy data from a user buffer to a driver buffer.

copyout(D3)  

ddi.h

Copy data from a driver buffer to a user buffer.

fubyte(D3)  

systm.h & types.h

Load a byte from user space.

fuword(D3)  

systm.h & types.h

Load a word from user space.

hwcpin(D3)  

systm.h & types.h

Copy data from device registers to kernel memory.

hwcpout(D3)  

systm.h & types.h

Copy data from kernel memory to device registers.

subyte(D3)  

systm.h & types.h

Store a byte to user space.

suword(D3)  

systm.h & types.h

Store a word to user space.


Block Copy Functions

The bcopy() and bzero() functions are used to copy and clear data areas within the kernel address space, for example driver buffers or work areas. These are optimized routines that take advantage of available hardware features.

The bcopy() function is not appropriate for copying data between a buffer and a device; that is, for copying between virtual memory and the physical memory addresses that represent a range of device registers (or indeed any uncached memory). The reason is that bcopy() uses doubleword moves and any other special hardware features available, and devices may not be able to accept data in these units. The hwcpin() and hwcpout() functions copy data in 16-bit units; use them to transfer bulk data between device space and memory. (Use simple assignment to move single words or bytes.)

The copyin() and copyout() functions take a kernel virtual address, a process virtual address, and a length. They copy the specified number of bytes between the kernel space and the user space. They select the best algorithm for copying, and take advantage of memory alignment and other hardware features.

If there is no current context, or if the address in user space is invalid, or if the address plus length is not contained in the user space, the functions return -1. This indicates an error in the request passed to the driver entry point, and the driver normally returns an EFAULT error.
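
For example, a pfxioctl() entry point that copies a parameter block in and back out might be coded as in this sketch; the command decoding is omitted and the hypo_params structure is hypothetical.

#include <sys/types.h>
#include <sys/errno.h>
#include <sys/cred.h>
#include <sys/ddi.h>

struct hypo_params { int rate; int depth; };   /* hypothetical parameter block */

int hypoioctl(dev_t dev, int cmd, void *arg, int mode, cred_t *crp, int *rvalp)
{
   struct hypo_params p;

   /* copy the caller's structure into a driver variable */
   if (copyin(arg, (caddr_t)&p, sizeof(p)) == -1)
      return EFAULT;                           /* bad address in the request */

   /* ... validate the request and operate on the device here ... */

   /* return the (possibly updated) structure to the caller */
   if (copyout((caddr_t)&p, arg, sizeof(p)) == -1)
      return EFAULT;
   return 0;
}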

Byte and Word Functions

The functions fubyte(), subyte(), fuword(), and suword() are used to move single items to or from user space. When only a single byte or word is needed, these functions have less overhead than the corresponding copyin() or copyout() call. For example you could use fuword() to pick up a parameter using an address passed to the pfxioctl() entry point. When transferring more than a few bytes, a block move is more efficient.

Transferring Data Through a uio_t Object

A uio_t object defines a list of one or more segments in the address space of the kernel or a user process (see “Structure uio_t”). The kernel supplies three functions for transferring data based on a uio_t, and these are summarized in Table 8-9.

Table 8-9. Functions Moving Data Using uio_t

Function

Header Files

Purpose

uiomove(D3)  

ddi.h

Copy data using uio_t.

ureadc(D3)  

ddi.h

Copy a character to space described by uio_t. 

uwritec(D3)  

ddi.h

Return a character from space described by uio_t. 

The uiomove() function moves multiple bytes between a buffer in kernel virtual space—typically, a buffer owned by the driver—and the space or spaces described by a uio_t. The function takes a byte count and a direction flag as arguments, and uses the most efficient mechanism for copying.

The ureadc() and uwritec() functions transfer only a single byte. You would use them when transferring data a byte at a time by PIO. When moving more than a few bytes, uiomove() is faster.

All of these functions modify the uio_t to reflect the transfer of data:

  • uio_resid is decremented by the amount moved

  • In the iovec_t for the current segment, iov_base is incremented and iov_len is decremented

  • As segments are used up, uio_iov is incremented and uio_iovcnt is decremented

The result is that the state of the uio_t always reflects the number of bytes remaining to transfer. When the pfxread() or pfxwrite() entry point returns, the kernel uses the final value of uio_resid to compute the count returned to the read() or write() function call.
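
Putting these pieces together, a minimal read entry point that copies from a driver-owned buffer might look like the following sketch; the samp_ prefix, samp_data, and samp_data_len are hypothetical, and error handling for the device itself is omitted.

#include <sys/types.h>
#include <sys/uio.h>
#include <sys/cred.h>
#include <sys/ddi.h>

extern caddr_t samp_data;       /* hypothetical driver buffer */
extern size_t  samp_data_len;   /* bytes currently available in it */

int sampread(dev_t dev, uio_t *uiop, cred_t *crp)
{
   int error;
   size_t count;

   /* copy no more than the caller asked for (uio_resid) */
   count = (uiop->uio_resid < samp_data_len) ? uiop->uio_resid : samp_data_len;

   /* uiomove() advances uio_iov and decrements uio_resid as it copies */
   error = uiomove(samp_data, count, UIO_READ, uiop);

   /* the kernel computes the byte count returned to read() from the final uio_resid */
   return error;
}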

Managing Virtual and Physical Addresses

The kernel supplies functions for querying the address of hardware registers and for performing memory mapping. The most helpful of these functions involve the use of address/length lists.

Managing Mapped Memory

The pfxmap() and pfxunmap() entry points receive a vhandl_t object that describes the region of user process space to be mapped. The functions summarized in Table 8-10 are used to manipulate that object.

Table 8-10. Functions to Manipulate a vhandl_t Object

Function Name

Header Files

Purpose

v_getaddr(D3)  

ddmap.h & types.h

Get the user virtual address associated with a vhandl_t.

v_gethandle(D3)  

ddmap.h & types.h

Get a unique identifier associated with a vhandl_t.

v_getlen(D3)  

ddmap.h & types.h

Get the length of user address space associated with a vhandl_t.

v_mapphys(D3)  

ddmap.h & types.h

Map kernel address space into user address space.

The v_mapphys() function actually performs a mapping between a kernel address and a segment described by a vhandl_t (see “Entry Point map()” in Chapter 7).

The v_getaddr() function has hardly any use except for logging and debugging. The address in user space is normally undefined and unusable when the pfxmap() entry point is called, and mapped to kernel space when pfxunmap() is called. The driver has no practical use for this value.

The v_getlen() function is useful only in the pfxunmap() entry point—the pfxmap() entry point receives a length argument specifying the desired region size.

The v_gethandle() function returns a number that is unique to this mapping (actually, the address of a page table entry). You use this as a key to identify multiple mappings, so that the pfxunmap() entry point can properly clean up.


Caution: Be careful when mapping device registers to a user process. Memory protection is available only on page boundaries, so configure the addresses of I/O cards so that each device is on a separate page or pages. When multiple devices are on the same page, a user process that maps one device can access all on that page. This can cause system security problems or other problems that are hard to diagnose.



Note: In previous releases of IRIX, the header file sys/region.h contained these functions. As of IRIX 6.5, the header file sys/region.h is removed and these same functions are declared in ksys/ddmap.h.


Working With Page and Sector Units

In a 32-bit kernel, the page size for memory and I/O is 4 KB. In a 64-bit kernel, the memory page size is typically 16 KB, but can vary. Also, the size of “page” used for I/O operations can be different from the size of page used for virtual memory. Because of hardware constraints in Challenge and Onyx systems, a 4 KB page is used for I/O operations in these machines.

The header files sys/immu.h and sys/sysmacros.h contain constants and macros for working with page units. Some of the most useful are listed in Table 8-11.

Table 8-11. Constants and Macros for Page and Sector values

Function Name

Header File

Purpose

BBSIZE

param.h 

Size of a “basic block,” the assumed disk sector size (512).

BTOBB(bytes)

param.h 

Converts byte count to basic block count, rounding up.

BTOBBT(bytes)

param.h 

Converts byte count to basic block count, truncating.

OFFTOBB(bytes)

param.h 

Converts off_t count to basic blocks, rounding.

OFFTOBBT(bytes)

param.h 

Converts off_t count to basic blocks, truncating.

BBTOOFF(bbs)

param.h 

Converts count of basic blocks to an off_t byte count.

NBPP  

immu.h 

Number of bytes in a virtual memory page (defined from _PAGESZ; see "Compiler Variables" in Chapter 9).

IO_NBPP

immu.h 

Number of bytes in an I/O page, can differ from NBPP.

io_numpages(addr, len)

sysmacros.h 

Number of I/O pages that span a given address for a length.

io_ctob(x)

sysmacros.h 

Return number of bytes in x I/O pages (rounded up).

io_ctobt(x)

sysmacros.h 

Return number of bytes in x I/O pages (truncated).

The names listed in Table 8-11 are defined at compile-time. If you use them, the binary object file is dependent on the compile-time variables for the chosen platform, and may not run on a different platform.

The operations summarized in Table 8-12 are provided as functions. Use of them does not commit your driver to a particular platform.

Table 8-12. Functions to Convert Bytes to Sectors or Pages

Function Name

Header Files

Purpose

btop(D3)  

ddi.h

Return number of virtual pages in a byte count (truncate).

btopr(D3)  

ddi.h

Return number of virtual pages in a byte count (round up).

ptob(D3)  

ddi.h

Convert size in virtual pages to size in bytes.

When examining an existing driver, be alert for any assumption that a virtual memory page has a particular size, or that an I/O page is the same size as a memory page.

Using Address/Length Lists

The concepts behind alenlists are described under “Address/Length Lists” and in more detail in the reference page alenlist(d4x).

You can use alenlists to unify the handling of buffer addresses of all kinds. In general you use an alenlist as follows:

  • Create the alenlist object, either with an explicit function call or implicitly as part of filling the list.

  • Fill the list with addresses and lengths to describe a buffer in some address space.

  • Apply a translation function to translate all the addresses into the address space of an I/O bus.

  • Use an alenlist cursor to read out the translated address/length pairs, and program them into a device so it can do DMA.

Creating Alenlists

The functions summarized in Table 8-13 are used to explicitly create and manage alenlists. For details see reference page alenlist_ops(d3x).

Table 8-13. Functions to Explicitly Manage Alenlists

Function Name

Header Files

Purpose

alenlist_create()

alenlist.h

Create an empty alenlist.

alenlist_destroy()

alenlist.h

Release memory of an alenlist.

alenlist_clear()

alenlist.h

Empty an alenlist.

Typically you create an alenlist implicitly, as a side-effect of loading it (see next topic). However you can use alenlist_create() to create an alenlist. Then you can be sure that there will never be an unplanned delay for memory allocation while using the list.

Whenever the driver is finished with an alenlist, release it using alenlist_destroy().

Loading Alenlists

The functions summarized in Table 8-14 are used to populate an alenlist with one or more address/length pairs to describe memory.

Table 8-14. Functions to Populate Alenlists

Function Name

Header Files

Purpose

buf_to_alenlist()

alenlist.h

Fill an alenlist with entries that describe the buffer controlled by a buf_t object.

kvaddr_to_alenlist()

alenlist.h

Fill an alenlist with entries that describe a buffer in kernel virtual address space.

uvaddr_to_alenlist()

alenlist.h

Fill an alenlist with entries that describe a buffer in a user virtual address space.

alenlist_append()

alenlist.h

Add a specified address and length as an item to an existing alenlist.

Each of the functions buf_to_alenlist(), kvaddr_to_alenlist(), and uvaddr_to_alenlist() takes an alenlist address as its first argument. If this address is NULL, they create a new list and use it. If the input list is too small, any of the functions in Table 8-14 can allocate a new list with more entries. Either of these allocations may sleep. In order to avoid an unplanned delay, you can create an alenlist in advance, fill it to a planned size with null items, and clear it.

The functions buf_to_alenlist(), kvaddr_to_alenlist(), and uvaddr_to_alenlist() add entries to an alenlist to describe the physical address of a buffer. Before using uvaddr_to_alenlist() you must be sure that the pages of the user buffer are locked into memory (see “Converting Virtual Addresses to Physical”).

Translating Alenlists

The kernel support for the PCI bus includes functions that translate an entire alenlist from physical memory addresses to corresponding addresses in the address space of the target bus. For PCI functions see “Mapping an Address/Length List” in Chapter 21.

Using Alenlist Cursors

You use a cursor to read out the address/length pairs from an alenlist. The cursor management functions are summarized in Table 8-15 and detailed in reference page alenlist_ops(d3x).

Table 8-15. Functions to Manage Alenlist Cursors

Function Name

Header Files

Purpose

alenlist_cursor_create()

alenlist.h

Create an alenlist cursor and associate it with a specified list.

alenlist_cursor_init()

alenlist.h

Set a cursor to point at a specified list item.

alenlist_cursor_destroy()

alenlist.h

Release memory of a cursor.

Each alenlist includes a built-in cursor. If you know that only one process or thread is using the alenlist, you can use this built-in cursor. When more than one process or thread might use the alenlist, each must create an explicit cursor. A cursor is associated with one alenlist and must always be used with that alenlist.

The functions that retrieve data based on a cursor are summarized in Table 8-16.

Table 8-16. Functions to Use an Alenlist Based on a Cursor

Function Name

Header Files

Purpose

alenlist_get()

alenlist.h

Retrieve the next sequential address and length from a list.

alenlist_cursor_offset(D3)

alenlist.h

Query the effective byte offset of a cursor in the buffer described by its list.

The alenlist_get() function is the key function for extracting data from an alenlist. Each call returns one address and its associated length. However, these address/length pairs are not required to match exactly to the items in the list. You can extract address/length pairs in smaller units. For example, suppose the list contains address/length pairs that describe 4 KB pages. You can read out sequential address/length pairs with maximum lengths of 512 bytes, or any other smaller length. The cursor remembers the position in the list to the byte level.

You pass to alenlist_get() a maximum length to return. When that is 0 or large, the function returns exactly the address/length pairs in the list. When the maximum length is smaller than the current address/length pair, the function returns the address and length of the next sequential segment not exceeding the maximum. In addition, when the maximum length is an integral power of two, the function restricts the returned length so that the returned segment does not cross an address boundary of the maximum length.

These features allow you to read out units of 512 bytes (for example), never crossing a 512-byte boundary, from a list that contains address/length pairs in other lengths. The alenlist_cursor_offset() function returns the byte-level offset between the first address in the list and the next address the cursor will return. 

Setting Up a DMA Transfer

A DMA transfer is performed by a programmable I/O device, usually called a bus master (see "Direct Memory Access" in Chapter 1). The driver programs the device with the length of data to transfer, and with a starting address. Some devices can be programmed with a list of addresses and lengths; these devices are said to have scatter/gather capability.

There are two issues in preparing a DMA transfer:

  • Calculating the addresses to be programmed into the device registers. These addresses are the bus addresses that will properly target the memory buffers.

  • In a uniprocessor, ensuring cache coherency. A multiprocessor handles cache coherency automatically.

The most effective tool for creation of target addresses is the address/length list (see “Using Address/Length Lists”, the preceding topic):

  1. You collect the addresses and lengths of the parts of the target buffer in an alenlist.

  2. You apply a single translation function to replace that alenlist with one whose contents are based on bus virtual addresses.

  3. You use an alenlist cursor to read out addresses and lengths in unit sizes appropriate to the device, and program these into the device using PIO.

The functions you use to translate the addresses in an alenlist are different for different bus adapters, and are discussed in the chapters that cover those buses.

DMA Buffer Alignment

In some systems, the buffers used for DMA must be aligned on a boundary the size of a cache line in the current CPU. Although not all system architectures require cache alignment, it does no harm to use cache-aligned buffers in all cases. The size of a cache line varies among CPU models, but if you obtain a DMA buffer using the KMEM_CACHEALIGN flag of kmem_alloc(), the buffer is properly aligned. The buffer returned by geteblk() (see “Allocating buf_t Objects and Buffers”) is cache-aligned.

Why is cache alignment necessary? Suppose you have a variable, X, adjacent to a buffer you are going to use for DMA write. If you invalidate the buffer prior to the DMA write, but then reference the variable X, the resulting cache miss brings part of the buffer back into the cache. When the DMA write completes, the cache is stale with respect to memory. If, however, you invalidate the cache after the DMA write completes, you destroy the value of the variable X.

Maximum DMA Transfer Size

The maximum size for a single DMA transfer can be set by the system tuning variable maxdmasz, using the systune command (see the systune(1) reference page). A single I/O operation larger than this produces the error ENOMEM.

The unit of measure for maxdmasz is the page, which varies with the kernel. Under IRIX 6.2, a 32-bit kernel uses 4 KB pages while a 64-bit kernel uses 16 KB pages. In both systems, maxdmasz is shipped with the value 1024 decimal, equivalent to 4 MB in a 32-bit kernel and 16 MB in a 64-bit kernel.

Converting Virtual Addresses to Physical

There are no legitimate reasons for a device driver to convert a kernel virtual memory address to a physical address in IRIX 6.5. This translation is fraught with complexity and strongly dependent on the hardware of the system. For these and other reasons, the kernel provides a wide variety of address-translation functions that perform the kinds of translations that a driver requires.

In the simpler hardware architectures of past systems, there was a straightforward mapping between the addresses used by software and the addresses used by a bus master for DMA. This is no longer the case. Some of the complexities are sketched under the topic “PIO Addresses and DMA Addresses” in Chapter 1. In the Origin2000 architecture, the address used by a bus master can undergo two or three different translations on its way from the device to memory. There is no way in which a device driver can get the information to prepare the translated address for the device to use.

Instead, the driver uses translations based on opaque software objects such as PIO maps, DMA maps, and alenlists. Translations are bus-specific, and the functions for them are presented in the chapters on those buses.

You can load an alenlist with physical address/length pairs based on a kernel virtual address using kvaddr_to_alenlist() (see "Loading Alenlists"). Some older drivers might still contain use of the kvtophys() function, which takes a kernel virtual address and returns the corresponding system bus physical address. This function is still supported (see the kvtophys(D3) reference page). However, you should be aware that the physical address returned is useless for programming an I/O device.

Managing Buffer Virtual Addresses

Block device drivers operate upon data buffers described by buf_t objects (see “Structure buf_t”). Kernel functions to manipulate buffer page mappings are summarized in Table 8-17.

Table 8-17. Functions to Map Buffer Pages

Function Name

Header Files

Purpose

bp_mapin(D3)  

buf.h

Map buffer pages into kernel virtual address space, ensuring the pages are in memory and pinned.

bp_mapout(D3)  

buf.h

Release mapping of buffer pages.

clrbuf(D3)  

buf.h

Clear the memory described by a mapped-in buf_t.

buf_to_alenlist(D3)  

alenlist.h

Fill an alenlist with entries that describe the buffer controlled by a buf_t object.

undma(D3)  

ddi.h

Unlock physical memory after I/O complete.

userdma(D3)  

ddi.h

Lock physical memory in user space.

bptophys(D3)  

ddi.h

Get physical address of buffer data.

getnextpg(D3)  

buf.h

Return pfdat structure for next page.

pptophys(D3)  

buf.h

Return the physical address of a page described by a pfdat structure.

When a pfxstrategy() routine receives a buf_t that is not mapped into memory (see “Buffer Location and b_flags”), it must make sure that the pages of the buffer space are in memory, and it must obtain valid kernel virtual addresses to describe the pages. The simplest way is to apply the bp_mapin() function to the buf_t. This function allocates a contiguous range of page table entries in the kernel address space to describe the buffer, creating a mapping of the buffer pages to a contiguous range of kernel virtual addresses. It sets the virtual address of the first data byte in b_un.b_addr, and sets the flags so that BP_ISMAPPED() returns true—thus converting an unmapped buffer to a mapped case.
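
A pfxstrategy() routine might apply bp_mapin() as in the following sketch; the hypo_start_io() helper that programs the device is hypothetical.

#include <sys/types.h>
#include <sys/buf.h>
#include <sys/ddi.h>

extern void hypo_start_io(caddr_t kvaddr, size_t len, struct buf *bp);  /* hypothetical */

int hypostrategy(struct buf *bp)
{
   /* make sure the buffer pages are pinned and mapped to kernel addresses */
   if (!BP_ISMAPPED(bp))
      bp_mapin(bp);

   /* b_un.b_addr is now a usable kernel virtual address */
   hypo_start_io(bp->b_un.b_addr, bp->b_bcount, bp);

   /* on completion the interrupt side sets b_resid and b_error,
      calls bp_mapout(bp) if the mapping is no longer needed,
      and then calls biodone(bp) */
   return 0;
}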


Note: The reference page for the userdma() function is out of date as shipped in IRIX 6.4. The correct prototype for this function, as coded in sys/buf.h, is


int userdma(void *usr_v_addr, size_t num_bytes, int rw, void *MBZ);

The fourth argument must be zero. The return value also differs from the reference page: the function returns 0 for success and a standard error code for failure.

Managing Memory for Cache Coherency

Some kernel functions used for ensuring cache coherency are summarized in Table 8-18.

Table 8-18. Functions Related to Cache Coherency

Function Name

Header Files

Purpose

dki_dcache_inval(D3)  

systm.h & types.h

Invalidate the data cache for a given range of virtual addresses.

dki_dcache_wb(D3)  

systm.h & types.h

Write back the data cache for a given range of virtual addresses.

dki_dcache_wbinval(D3)  

systm.h & types.h

Write back and invalidate the data cache for a given range of virtual addresses.

flushbus(D3)  

systm.h & types.h

Make sure contents of the write buffer are flushed to the system bus.

The functions for cache invalidation are essential when doing DMA on a uniprocessor. They cost very little to use in a multiprocessor, so it does no harm to call them in every system. You call them as follows:

  • Call dki_dcache_inval() prior to doing DMA input. This ensures that when you refer to the received data, it will be loaded from real memory.

  • Call dki_dcache_wb() prior to doing DMA output. This ensures that the latest contents of cache memory are in system memory for the device to load.

  • Call dki_dcache_wbinval() prior to a device operation that samples memory and then stores new data.

In the IP28 CPU you must invalidate the cache both before and after a DMA input; see “Uncached Memory Access in the IP26 and IP28” in Chapter 1.

The flushbus() function is needed because in some systems the hardware collects output data and writes it to the bus in blocks. When you write a small amount of data to a device through PIO, delay, then write again, the writes could be batched and sent to the device in quick succession. Use flushbus() after PIO output when it is followed by PIO input from the same device. Use it also between any two PIO outputs when the device is supposed to see a delay between outputs. 
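
The cache calls might be placed around hypothetical DMA operations as in this sketch; the two helper routines that program the device are placeholders, and the argument types are those given in the dki_dcache_inval(D3) and related reference pages.

#include <sys/types.h>
#include <sys/systm.h>

extern void hypo_dma_into_memory(caddr_t buf, size_t len);   /* hypothetical: device writes buf */
extern void hypo_dma_from_memory(caddr_t buf, size_t len);   /* hypothetical: device reads buf */

void hypo_dma_example(caddr_t buf, size_t len)
{
   /* before DMA input: discard any stale cache lines for the buffer */
   dki_dcache_inval(buf, len);
   hypo_dma_into_memory(buf, len);

   /* before DMA output: push dirty cache lines back to system memory */
   dki_dcache_wb(buf, len);
   hypo_dma_from_memory(buf, len);
}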

Testing Device Physical Addresses

A family of functions, summarized in Table 8-19, is used to test a physical address to find out if it represents a usable device register.

Table 8-19. Functions to Test Physical Addresses

Function Name

Header Files

Purpose

badaddr(D3)  

systm.h

Test physical address for input.

badaddr_val(D3)  

systm.h

Test physical address for input and return the input value received.

wbadaddr(D3)  

systm.h

Test physical address for output.

wbadaddr_val(D3)  

systm.h

Test physical address for output of specific value.

pio_badaddr(D3)  

pio.h & types.h

Test physical address for input through a map.

pio_badaddr_val(D3)  

pio.h & types.h

Test physical address for input through a map and return the input value received.

pio_wbadaddr(D3)  

pio.h & types.h

Test physical address through a map for output.

pio_wbadaddr_val(D3)  

pio.h & types.h

Test physical address through a map for output of specific value.

The functions return a nonzero value when the address is bad, that is, unusable. These functions are normally used in the pfxedtinit() entry point to verify the bus address values passed in from a VECTOR statement. They are only usable with VME devices.

Hardware Graph Management

A driver is concerned about the hardware graph in two different contexts:

  • When called at an operational entry point such as pfxopen(), pfxwrite(), or pfxmap(), the driver gets information about the device from the hwgraph.

  • When called to initialize a device at pfxedtinit() or pfxattach(), the driver extends the hwgraph with vertexes to represent the device, and stores device and inventory information in the hwgraph.

The hwgraph concepts and terms are covered under “Hardware Graph Features” in Chapter 2. You should also read the hwgraph(4) and hwgraph_intro(d4x) reference pages.

Interrogating the hwgraph

When a driver is called at an operational entry point, the first argument is always a dev_t. This value stands for the specific device on which the driver should work. In older versions of IRIX, the dev_t was an integer encoding the major and minor device numbers. In current IRIX, the device is opened through a path in /hw (or a symbolic link to /hw), and the dev_t is a handle to a vertex of the hwgraph—usually a vertex created by the device driver. The dev_t is used as input to the functions summarized in Table 8-20.

Table 8-20. Functions to Query the Hardware Graph

Function Name

Header Files

Purpose

device_info_get() (hwgraph.dev(d3x) )

hwgraph.h

Return device info pointer stored in vertex.

device_inventory_get_next() (hwgraph.inv(d3x) )

hwgraph.h

Retrieve inventory_t structures that have been attached to a vertex.

device_controller_num_get() (hwgraph.inv(d3x) )

hwgraph.h

Retrieve the Controller field of the first or only inventory_t structure in a vertex.

hwgraph_edge_get() (hwgraph.edge(d3x) )

hwgraph.h

Follow an edge by name to a destination vertex.

hwgraph_traverse()

hwgraph.h

Follow a path from a starting vertex to its destination.

When initializing the device, the driver stores the address of a device information structure in the vertex using device_info_set() (see “Allocating Storage for Device Information” in Chapter 7). This address can be retrieved using device_info_get(). Typical code at the beginning of any entry point resembles Example 8-1.

Example 8-1. Typical Code to Get Device Info

typedef struct devInfo_s {
... fields of data unique to one device ...
} devInfo_t;
pfx_entry(dev_t dev,...)
   devInfo_t *pdi = device_info_get(dev);
   if (!pdi) return ENODEV;
   MUTEX_LOCK(pdi->devLock); /* get exclusive use */
...

When the driver creates the vertexes for a device, the driver can attach inventory information. This can be read out later using device_inventory_get_next().

Extending the hwgraph

When a driver is called at the pfxattach() entry point, it receives a vertex handle for the point at which its device is connected to the system—for example, a vertex that represents a bus slot. When a driver is called at the pfxedtinit() entry point, it receives an edt_t from which it can extract a vertex handle that again represents the point at which this device is attached to the system (refer to “VME Device Naming” in Chapter 12, “Entry Point attach()” in Chapter 7 and “Entry Point edtinit()” in Chapter 7).

At these times, the driver has the responsibility of extending the hwgraph with at least one edge and vertex to provide access to this device. The label of the edge supplies a visible name that a user process can open. The vertex contains the inventory data and the driver's own device information. Often the driver needs to add multiple vertexes and edges. (For an example of how an SGI driver extends the hwgraph, see "SCSI Devices in the hwgraph" in Chapter 16.)

Construction Functions

The basic functions for constructing edges and vertexes are summarized in Table 8-21. The most commonly used are hwgraph_char_device_add() and hwgraph_block_device_add(), which create the leaf vertexes that users can open.

Table 8-21. Functions to Construct Edges and Vertexes

Function Name

Header Files

Purpose

device_info_set() (hwgraph.dev(d3x) )

hwgraph.h

Store the address of device information in a vertex.

device_inventory_add() (hwgraph.inv(d3x) )

invent.h

Add hardware inventory data to a vertex.

hwgraph_char_device_add() (hwgraph.dev(d3x) )

hwgraph.h

Create a character device special file under a specified vertex.

hwgraph_block_device_add() (hwgraph.dev(d3x) )

hwgraph.h

Create block device special file under a specified vertex.

hwgraph_vertex_create() (hwgraph.vertex(d3x) )

hwgraph.h

Create a new, empty vertex, and return its handle.

hwgraph_edge_add() (hwgraph.edge(d3x) )

hwgraph.h

Add a labelled edge between two vertexes.

hwgraph_edge_remove() (hwgraph.edge(d3x) )

hwgraph.h

Remove an edge by name from a vertex.


Extending the Graph With a Single Vertex

Suppose the kernel is probing a PCI bus and finds a veeble device plugged into slot 2. The kernel knows that a driver with the prefix veeble_ has registered to handle this type of device. The kernel calls veeble_attach(), passing the handle of the vertex that represents the point of attachment, which might be /hw/module/1/io/pci/slot/2.

Suppose that a veeble device permits only character-mode access and there are no optional modes of use. In this simple case, the driver needs to add only one vertex, a device special file connected by one edge having a label such as “veeble.” The result will be that the device can be opened under the pathname /hw/module/1/io/pci/slot/2/veeble. Parts of the code in veeble_attach() would resemble Example 8-2.

Example 8-2. Hypothetical Code for a Single Vertex

int veeble_attach(vertex_hdl_t vh)
{
   VeebleDevInfoStruct_t * vdis;
   vertex_hdl_t vv;
   graph_error_t ret;
   /* allocate memory for per-device structure */
   vdis = kmem_zalloc(sizeof(*vdis),KM_SLEEP);  
   if (!vdis) return ENOMEM;
   /* create device vertex below connect-point */ 
   ret = hwgraph_char_device_add(vh, "veeble", "veeble_", &vv);
   if (ret != GRAPH_SUCCESS)
      { kmem_free(vdis, sizeof(*vdis)); return ret; }
   /* here initialize contents of vdis->information struct */
   /* here initialize the device itself */
   /* set info struct in the device vertex */
   device_info_set(vv,vdis); 
   return 0;
} 

In Example 8-2, the important variables are:

vh 

Handle of the connection-point vertex passed to the function as a parameter.

vdis 

Pointer to a structure of type VeebleDevInfoStruct_t, defined by the writer of this device driver to suit the application.

vv 

Handle of the device vertex created by the function.

 

The steps performed are:

  • Allocate memory for a device information structure, and terminate with the standard ENOMEM return code if allocation is not possible.

  • Create a character device vertex, connected to vertex vh by an edge labelled “veeble,” storing the handle of the new vertex in vv. If this fails, free the info structure memory and return the same error.

  • Initialize the contents of the information structure: for example, initialize locks and flag values, and create PIO and/or DMA maps.

  • Initialize the device itself. Possibly set up an interrupt handler and an error handler (these operations are specific to the bus and the device).

  • Set the address of the initialized device information structure into the device vertex.

An additional step not shown is the storing of hardware inventory information that can be reported by hinv using device_inventory_add().

A point to note is that in a multiprocessor system, a user process could try to open the new “veeble” vertex as soon as (or even before) hwgraph_char_device_add() returns. This would result in an entry to the veeble_open() entry point of the driver, concurrent with the continued execution of the veeble_attach() entry point. However, note the two statements in Example 8-1:

   devInfo_t *pdi = device_info_get(dev);
   if (!pdi) return ENODEV;

At any time before veeble_attach() executes its call to device_info_set(), a call to veeble_open() for this vertex returns ENODEV. Needless to say, all the hwgraph functions are multiprocessor-aware and use locking as necessary to avoid race conditions.

Extending the Graph With Multiple Vertexes

In a more complicated case, a vooble device permits access as a block device or as a character device. The device should be accessible under names vooble/char and vooble/block. In this case the driver proceeds as follows:

  1. Create a vertex to be the primary representation of the device using hwgraph_vertex_create().

  2. Connect this primary vertex to the point of attachment with an edge named “vooble” using hwgraph_edge_add().

  3. Add new vertexes, connected by edges “block” and “char” to the primary vertex using hwgraph_block_device_add() and hwgraph_char_device_add().

The subordinate block and character vertexes are device special files that can be opened by user code. Handles to these vertexes will be passed in to other driver entry points. There are a variety of ways to store device information in the three vertexes:

  • Store a pointer to a single information structure in both leaf vertexes.

  • Create separate “block” and “char” information structures and store one in each leaf vertex. Perhaps create a separate structure of information that is common to both block and character access, and point to it from both block and char structures.

As you plan this arrangement of data structures, bear in mind that the pfxopen() entry point receives a flag telling it whether the open is for block or character access (see “Entry Point open()” in Chapter 7); and that other entry points are called only for block, or only for character, devices.
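
Putting the three steps together with the first of these information-structure arrangements, a minimal sketch (hypothetical names, error handling omitted; compare Example 8-3) might look like this:

#include <sys/types.h>
#include <sys/kmem.h>
#include <sys/ksynch.h>
#include <sys/hwgraph.h>
typedef struct voobleInfo_s {
   mutex_t devLock;   /* plus whatever else the device needs */
} voobleInfo_t;
int vooble_attach(vertex_hdl_t connv)
{
   vertex_hdl_t masterv, blockv, charv;
   voobleInfo_t *pvi = kmem_zalloc(sizeof(*pvi), KM_SLEEP);
   hwgraph_vertex_create(&masterv);               /* 1: primary vertex */
   hwgraph_edge_add(connv, masterv, "vooble");    /* 2: edge from the attach point */
   hwgraph_block_device_add(masterv, "block", "vooble_", &blockv);  /* 3 */
   hwgraph_char_device_add(masterv, "char", "vooble_", &charv);
   device_info_set(blockv, pvi);   /* one information structure in both leaves */
   device_info_set(charv, pvi);
   return 0;
}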

Vertexes for Modes of Use

Possibly the device has multiple modes of use, as for example a tape device has byte-swapped and non-swapped access, fixed-block and variable-block access, and so on. Traditionally these modes of access were encoded in the device minor number as well as in the device name (see “Creating Conventional Device Names” in Chapter 2). Current practice is to create a separate vertex for each mode of use (see “Multiple Device Names” in Chapter 2).

When using the hwgraph, you represent each mode of access as a separate name in the /hw filesystem. Suppose that a PCI device of type flipper supports two modes of use, “flipped” and “flopped.” It is the job of the flipper_attach() entry point to set up hwgraph vertexes so that one device can be opened under different pathnames such as /hw/module/1/io/pci/slot/2/flipper/flipped and /hw/module/1/io/pci/slot/2/flipper/flopped. The problem is very similar to creating separate block and character vertexes for one device, with the additional problem that the device information stored in each vertex should reflect the desired mode of use, flipped or flopped. The code might resemble in part that shown in Example 8-3.

Example 8-3. Hypothetical Code for Multiple Vertexes

typedef struct flipperDope_s {
   vertex_hdl_t floppedMode; /* vertex for flopped */
   ...many other fields for management of one flipper dev...
} flipperDope_t;
int flipper_attach(vertex_hdl_t connv)
{
   flipperDope_t *pfd = NULL;
   vertex_hdl_t masterv = GRAPH_VERTEX_NONE;
   vertex_hdl_t flippedv = GRAPH_VERTEX_NONE;
   vertex_hdl_t floppedv = GRAPH_VERTEX_NONE;
   graph_error_t ret = 0;
   if (!(pfd = kmem_zalloc(sizeof(*pfd), KM_SLEEP)))
   { ret = ENOMEM; goto done; }
   ret = hwgraph_vertex_create(&masterv);  
   if (ret != GRAPH_SUCCESS) goto done;
   ret = hwgraph_edge_add(connv,masterv,"flipper");  
   if (ret != GRAPH_SUCCESS) goto done;
   ret = hwgraph_char_device_add(masterv, "flipped", "flipper_", &flippedv);  
   if (ret != GRAPH_SUCCESS) goto done;
   ret = hwgraph_char_device_add(masterv, "flopped", "flipper_", &floppedv);
   if (ret != GRAPH_SUCCESS) goto done;
   pfd->floppedMode = floppedv; /* note which vertex is "flopped" */
...here initialize other fields of the flipperDope_t structure...
   device_info_set(flippedv,pfd);
   device_info_set(floppedv,pfd);
done: /* If any error, undo all partial work */
   if (ret)
   {
      if (floppedv != GRAPH_VERTEX_NONE) hwgraph_vertex_destroy(floppedv);
      if (flippedv != GRAPH_VERTEX_NONE) hwgraph_vertex_destroy(flippedv);
      if (masterv != GRAPH_VERTEX_NONE)
      {
         hwgraph_edge_remove(connv, "flipper", NULL);
         hwgraph_vertex_destroy(masterv); 
      }
      if (pfd) kmem_free(pfd, sizeof(*pfd));
   }
   return ret;
}

After successful completion of flipper_attach() there are two character special devices with paths /hw/.../flipper/flipped and /hw/.../flipper/flopped. A pointer to a single device information structure (a flipperDope_t object) is stored in both vertexes. However, the vertex handle of the flopped vertex is saved in the floppedMode field of the structure. Whenever the device driver is entered, it can retrieve the device information with a statement such as the following:

flipperDope_t *pfd = device_info_get(dev);

Whenever the driver needs to distinguish between “flipped” and “flopped” modes of access, it can do so with code such as the following:

if (dev == pfd->floppedMode)
{ ...this is flopped-mode...}
else
{ ...this is flipped-mode...}

Vertexes for User Convenience

The driver is allowed to create vertexes and attach them anywhere in the hwgraph. The connection point of a device is often at the end of a long path that is hard for a human to read or type. The driver can use hwgraph_vertex_create() and hwgraph_edge_add() to create a shorter, more readable path to any of the leaf vertexes it creates. For example, the hypothetical veeble_ driver of Example 8-2 might like to make the devices it attaches available via paths like /hw/veebles/1 and /hw/veebles/2.

At the time a driver is called to attach a device, the driver has no way to tell how many of these devices exist in the system. Also, recall that the pfxattach() entry point can be called concurrently on multiple CPUs to attach devices in different slots on different buses. The attach code has no basis on which to assign ordinal numbers to devices; that is, no way to know that a particular device is device 1, and another is device 2. These questions cannot be answered until the entire hardware complement has been found and attached.

The purpose of the ioconfig command is to call drivers one more time, before user processes start but after the hwgraph is complete, so they can create convenience vertexes. This use of ioconfig is described under “Device Management File” in Chapter 2. You direct ioconfig to assign controller numbers to your devices. After it does so, it opens each device (resulting in the first entry to pfxopen() for that device vertex), and optionally issues an ioctl against the open device passing a command number you specify. Either at the first open of a device or in pfxioctl(), you can create convenience vertexes whose names include the assigned controller number of the device, making the names unique.

The assigned controller numbers are stable from one boot time to the next, so you can also create symbolic links in /dev naming them.
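
For example, the hypothetical veeble_ driver of Example 8-2 might respond to the ioctl issued by ioconfig with code like the following sketch. The parent vertex veeble_dir (standing for /hw/veebles) and the single-digit name formatting are invented for illustration; device_controller_number_get() is described under “Attaching Inventory Information”.

extern vertex_hdl_t veeble_dir;    /* created once, for example when the driver loads */
int veeble_make_alias(vertex_hdl_t devv)
{
   char name[2];
   int ctlr = device_controller_number_get(devv);
   /* form the edge label from the assigned controller number (toy single-digit case) */
   name[0] = '0' + (ctlr % 10);
   name[1] = '\0';
   /* add an edge such as /hw/veebles/1 pointing at this device vertex */
   return hwgraph_edge_add(veeble_dir, devv, name);
}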

Attaching Information to Vertexes

The driver can attach several kinds of information to any vertex it creates:

  • Device information defined by the driver itself.

  • Hardware inventory information to be used by hinv.

  • Labelled attribute values.

The driver can also retrieve information that was set in the hwgraph by the administrator.

Attaching Device Information

The use of device_info_set() is discussed under two other topics: “Allocating Storage for Device Information” in Chapter 7 and “Extending the Graph With a Single Vertex”. Every device needs such an information structure—if for no other reason than to contain a lock used to ensure that each upper-half entry point has exclusive use of the device.

When the driver creates multiple vertexes for a particular device, the driver can store the same address in every vertex (as shown in Example 8-2 and Example 8-3). Yet another design option is to have each vertex contain the address of a small structure containing optional information unique to that view of the device, and a pointer to a single common structure for the device.

Attaching Inventory Information

The device_inventory_add() function stores the fields of one inventory_t record in a vertex. The driver can store multiple inventory_t records in a single vertex, but it is customary to store only one. There is no facility to delete an inventory record from a vertex.

The device_inventory_get_next() function is used to read out each of the inventory_t structures in turn. Normally the driver has no reason to inspect these records. Note that the function does not return a copy of the structure; it returns the address of the actual structure in the vertex, so the driver can modify the structure's fields in place.

One field of the inventory_t is particularly important: the controller number is conventionally used to provide ordinal numbering of similar devices. The device_controller_number_get() function returns the controller number from the first (and usually the only) inventory_t structure in a vertex. It fails if there is no inventory data in the vertex.

When the driver can assign an ordinal numbering to multiple devices, it should record that numbering by setting unique controller numbers in each master vertex for the similar devices. This can be done most easily by calling device_controller_number_set(). Typically this would be done in an ioctl call from the application that has determined a stable, global numbering of devices (see “Device Management File” in Chapter 2).
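
In outline, and assuming the ioctl has already received the globally assigned number from that application, the calls are simply:

void veeble_record_number(vertex_hdl_t masterv, int assigned)
{
   device_controller_number_set(masterv, assigned);
   /* any entry point can later recover it with
      device_controller_number_get(masterv) */
}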

Attaching Attributes

A file attribute is an arbitrary block of information associated with a file inode. Attributes were introduced with the XFS filesystem (see the attr(1) and attr_get(2) reference pages), but the /hw filesystem also supports them. You can store file attributes in hwgraph vertexes, and they can be retrieved by user processes.

The functions that a driver uses to manage attributes are summarized in Table 8-22 (all are detailed in the reference page hwgraph.lblinfo(d3x)).

Table 8-22. Functions to Manage Attributes

Function Name

Header Files

Purpose

hwgraph_info_add_LBL()

hwgraph.h

Attach a labelled attribute to a vertex.

hwgraph_info_get_LBL()

hwgraph.h

Retrieve an attribute by name.

hwgraph_info_replace_LBL()

hwgraph.h

Replace the value of an attribute by name.

hwgraph_info_remove_LBL()

hwgraph.h

Remove an attribute from a vertex.

hwgraph_info_export_LBL()

hwgraph.h

Make an attribute visible to user code.

hwgraph_info_unexport_LBL()

hwgraph.h

Make an attribute invisible.

An attribute consists of a name (a character string), a pointer-sized integer, and a length. When the length is zero, the attribute is “unexported,” that is, not visible either to the attr command or to the attr_get() function. All attributes are initially unexported. An unexported attribute can be retrieved by a driver, but not by a user process.

The value of an attribute is just a pointer; it can be an integer, a vertex handle, or an address of any kind of information. You can use attributes to hold any kind of information you want to associate with a vertex. (For one example, you could use an attribute to contain mode-bits that determine how a device should be treated.)

Attribute storage is not sophisticated. Attribute names are stored sequentially in a string table that is part of the vertex, and looked up in a sequential search. The attribute scheme is meant for convenient storage of a few attributes per vertex, each having a short name.

When you export an attribute, you assert that the value of the attribute is a valid address in kernel virtual memory, and the export length is its correct length. The attr_get() function relies on these points. A user process can retrieve a copy of an attribute by calling attr_get(). The attribute value is copied from the kernel address space to the user address space. This is a convenient route by which you can export driver internal data to user processes, without the complexity of memory mapping or ioctl calls.
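
The following sketch exports a small statistics structure as an attribute named “stats,” which a user process could then copy out with attr_get(). The structure, the names, and the arbitrary_info_t cast are shown here only for illustration; check hwgraph.lblinfo(d3x) for the exact prototypes and value type.

typedef struct veeble_stats_s {
   int opens;
   int errors;
} veeble_stats_t;
static veeble_stats_t vstats;
void veeble_export_stats(vertex_hdl_t devv)
{
   /* attach the attribute "stats" to the device vertex */
   hwgraph_info_add_LBL(devv, "stats", (arbitrary_info_t)&vstats);
   /* make it visible to attr_get(), giving its true length */
   hwgraph_info_export_LBL(devv, "stats", sizeof(vstats));
}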

Retrieving Administrator Attributes

The system administrator can use the DEVICE_ADMIN statement to attach a labelled attribute to any device special file in the hwgraph, and can use DRIVER_ADMIN to store a labelled attribute for the driver (see “Storing Device and Driver Attributes” in Chapter 2).

These statements are processed at boot time. At this time, the driver might not be loaded, and the device special file might not have been created in the hwgraph. However, the attributes are saved. When a driver creates a hwgraph vertex that is the target of a DEVICE_ADMIN statement, the labelled attributes are attached to the vertex automatically.

Your driver can request an administrator attribute for a specific device using hwgraph_info_get_LBL() directly, as described above under “Attaching Attributes”. Or you can call device_admin_info_get() (see the reference page hwgraph.admin(d3x) ). The returned value is the address of a read-only copy of the value string.

Your driver can request an attribute that was addressed to the driver with DRIVER_ADMIN using device_driver_admin_info_get(). The returned value is the address of a read-only copy of the value string from the DRIVER_ADMIN statement.

User Process Administration

The kernel supplies a small group of functions, summarized in Table 8-23, that help a driver upper-half routine learn about the current user process.

Table 8-23. Functions for User Process Management

Function Name

Header Files

Purpose

drv_getparm(D3)  

ddi.h

Retrieve kernel state information.

drv_priv(D3)  

ddi.h

Test for privileged user.

drv_setparm(D3)  

ddi.h

Set kernel state information.

proc_ref(D3)  

ddi.h

Obtain a reference to a process for signaling.

proc_signal(D3)  

ddi.h & signal.h

Send a signal to a process.

proc_unref(D3)  

ddi.h

Release a reference to a process.



Note: When porting an older driver, you may find direct reference to a user structure. That is no longer available. Any reference to a user structure should be eliminated or replaced by one of the functions in Table 8-23.

Use drv_getparm() to retrieve certain miscellaneous bits of information including the process ID of the current process. In a character device driver, the current process is the user process that caused entry to the driver, for example by calling the open(), ioctl(), or read() system functions. In a block device driver, the current process has no direct relationship to any particular user; it is usually a daemon process of some kind.

The drv_setparm() function is primarily of use to terminal drivers.

The drv_priv() function tests a cred_t object to see if it represents a privileged user. A cred_t object is passed in to several driver entry points, and the address of the current one can be retrieved with drv_getparm().
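
For example, an entry point can verify the caller's privilege and note its process ID with code like the sketch below. The PPID selector for drv_getparm() and the zero return of drv_priv() for a privileged caller follow the SVR4 DDI/DKI descriptions of those functions; the routine name is invented.

#include <sys/types.h>
#include <sys/errno.h>
#include <sys/cred.h>
#include <sys/cmn_err.h>
#include <sys/ddi.h>
int hypo_check_caller(cred_t *crp)
{
   unsigned long pid;
   if (drv_getparm(PPID, &pid) == 0)
      cmn_err(CE_CONT, "called by process %lu\n", pid);
   if (drv_priv(crp) != 0)
      return EPERM;                /* the caller is not privileged */
   return 0;
}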

Sending a Process Signal

In traditional UNIX kernels, a device driver identified the current user process by the address of the proc_t structure that the kernel uses to represent a process. Direct use of the proc_t is no longer supported by IRIX. The reason is that the contents of the proc_t change from release to release, and also differ between 64-bit and 32-bit kernels.

The most common use of the proc_t by a driver was to send a signal to the process. This capability is still supported. To do it, take three steps:

  1. Call proc_ref() to get a process handle, a number unique to the current process. The returned value must be treated as an arbitrary number (in some releases of IRIX it was the proc_t address, but this is not the defined behavior of the function).

  2. Use the process handle as an argument to proc_signal(), sending the signal to the process.

  3. Release the process handle by calling proc_unref().

The third step is important. In order to keep the process handle valid, IRIX retains information about the process to which it is related. That process could terminate (possibly as a result of the signal the driver sends), but until the driver announces that it is done with the handle, the kernel must continue to retain the process information.

It is especially important to release any process handles before unloading a loadable driver (see “Entry Point unload()” in Chapter 7).
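
Spread across hypothetical driver routines, the three steps might look like the following sketch; the signal number and the place where the handle is kept are illustrative only.

#include <sys/types.h>
#include <sys/signal.h>
#include <sys/ddi.h>
static void *user_handle;          /* process handle saved at open time */
void hypo_note_opener(void)
{
   user_handle = proc_ref();       /* step 1: remember the current process */
}
void hypo_notify_user(void)
{
   if (user_handle)
      proc_signal(user_handle, SIGUSR1);   /* step 2: send the signal */
}
void hypo_forget_opener(void)      /* at close time, or before unloading */
{
   if (user_handle) {
      proc_unref(user_handle);     /* step 3: release the handle */
      user_handle = NULL;
   }
}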

Waiting and Mutual Exclusion

The kernel supplies a rich variety of functions for waiting and for mutual exclusion. In order to use these features well, you must understand the different purposes for which they are designed. In particular, you must clearly understand the distinction between waiting and mutual exclusion (or locking).

Mutual Exclusion Compared to Waiting

Mutual exclusion allows one entity to have exclusive use of a global resource, temporarily denying use of the resource to other entities. Mutual exclusion normally does not require waiting when software is carefully designed—the resource is normally free when it is requested. A driver that calls a mutual exclusion function expects to proceed without delay—although there is a chance that the resource is in use, and the driver will have to wait.

The kernel offers an array of functions for mutual exclusion, and the choice among them can be critical to performance. These functions are reviewed in the topics that follow.

Waiting allows a driver to coordinate its actions with a specific event or action that occurs asynchronously. A driver can wait for a specified amount of time to pass, wait for an I/O action to complete, and so on. When a driver calls a waiting function, it expects to wait for something to happen—although there is a chance that the expected event has already happened, and the driver will be able to continue at once.

The kernel offers several functions that allow you to wait for specific events, and also offers functions for general synchronization. These are covered in the topics that follow.

The most general facility, the semaphore, can be used for synchronization and for locking. This topic is covered under “Semaphores”.

Basic Locks

IRIX supports basic locks using functions compatible with SVR4. These functions are summarized in Table 8-24.

Table 8-24. Functions for Basic Locks

Function Name

Header Files

Purpose

LOCK(D3)  

ksynch.h & types.h

Acquire a basic lock, waiting if necessary.

LOCK_ALLOC(D3)  

ksynch.h, kmem.h & types.h

Allocate and initialize a basic lock.

LOCK_DEALLOC(D3)  

ksynch.h & types.h

Deallocate an instance of a basic lock.

LOCK_INIT(D3)  

ksynch.h & types.h

Initialize a basic lock that was allocated statically, or reinitialize an allocated lock.

LOCK_DESTROY(D3)  

ksynch.h & types.h

Uninitialize a basic lock that was allocated statically.

TRYLOCK(D3)  

types.h & ksynch.h

Try to acquire a basic lock, returning a code if the lock is not currently free.

UNLOCK(D3)  

types.h & ksynch.h

Release a basic lock.

Basic locks are objects of type lock_t. Although functions are provided for allocating and freeing them, a basic lock is a very small object. Locks are typically allocated as fields of structures or as global variables.

Call LOCK() to seize a lock and gain possession of the resource for which it stands. Release the lock with UNLOCK(). These functions are optimized for mutual exclusion in the available hardware, and may be implemented differently in uniprocessors and multiprocessors. However, the programming and binary interface is the same in all systems.

Basic locks are implemented as spinning locks in multiprocessors. In releases before IRIX 6.4, the basic lock was the only kind of lock that you could use for mutual exclusion between the upper half of a driver and its interrupt handler (because the interrupt handler could not sleep). Now, interrupt handlers run as threads and can sleep, so you have a choice between basic locks and mutex locks for this purpose.

The code in Example 8-4 illustrates the use of LOCK and UNLOCK in implementing a simple last-in-first-out (LIFO) queueing package. In these functions, the time between locking a queue head and releasing it is only a few microseconds.

Example 8-4. LIFO Queue Using Basic Locks

typedef struct qitem {
   struct qitem *next; ...other fields...
} qitem_t;
typedef struct lifo {
   qitem_t *latest;
   lock_t grab;
} lifo_t;
void putlifo(lifo_t *q, qitem_t *i)
{
   int lockpl = LOCK(&q->grab,plhi);
   i->next = q->latest;
   q->latest = i;
   UNLOCK(&q->grab,lockpl);
}
qitem_t *poplifo(lifo_t *q)
{
   int lockpl = LOCK(&q->grab,plhi);
   qitem_t *ret = q->latest;
   if (ret) q->latest = ret->next; /* guard against an empty queue */
   UNLOCK(&q->grab,lockpl);
   return ret;
}

This is a typical use of basic locks: to ensure that for a brief period, only one thread in the system can update a queue. Basic locks are optimized for such uses. If they are used in situations where they can be held for significant lengths of time (100 microseconds or longer), system performance can suffer, because one or more CPUs can be “spinning” on the locks and this can delay useful processing.

Long-Term Locks

IRIX provides three types of locks that can suspend the caller when the lock is claimed: mutex locks, sleep locks, and reader-writer locks. Of these, mutex locks are preferred.

Using Mutex Locks

As their name suggests, mutex locks are designed for mutual exclusion. The IRIX implementation of mutex locks is compatible with the kmutex_t lock type of SunOS, but optimized for use in SGI hardware systems. The mutex functions are summarized in Table 8-25.

Table 8-25. Functions for Mutex Locks

Function Name

Header Files

Purpose

MUTEX_ALLOC(D3)  

types.h & kmem.h & ksynch.h

Allocate and initialize a mutex lock.

MUTEX_INIT(D3)  

types.h & ksynch.h

Initialize an existing mutex lock.

MUTEX_DESTROY(D3)  

types.h & ksynch.h

Deinitialize a mutex lock.

MUTEX_DEALLOC(D3)  

types.h & ksynch.h

Deinitialize and free a dynamically allocated mutex lock.

MUTEX_LOCK(D3)  

types.h & kmem.h & ksynch.h

Claim a mutex lock.

MUTEX_TRYLOCK(D3)  

types.h & ksynch.h

Conditionally claim a mutex lock.

MUTEX_UNLOCK(D3)  

types.h & ksynch.h

Release a mutex lock.

MUTEX_OWNED(D3)

types.h & ksynch.h

Query if a mutual exclusion lock is available.

MUTEX_MINE(D3)  

types.h & ksynch.h

Test if a mutex lock is owned by this process.

Although allocation and deallocation functions are supplied, a mutex_t type is a small object that is normally allocated as a static variable or as a field of a structure. The MUTEX_INIT() operation prepares a statically-allocated mutex_t for use.

Once initialized, a mutex lock is used to gain exclusive use of the resource with which you have associated it. The mutex lock has the following important advantages over a basic lock:

  • The mutex lock can safely be held over a call to a function that sleeps.

  • The mutex lock supports inquiry functions such as MUTEX_OWNED or MUTEX_MINE.

  • When a debugging kernel is used (see “Including Lock Metering in the Kernel Image” in Chapter 10) a mutex lock can be instrumented to keep statistics of its use.

The mutex lock implementation provides priority inheritance. When a low-priority process (or kernel thread) owns a mutex lock and a high-priority process or thread attempts to seize the lock and is blocked, the process holding the lock is temporarily given the higher priority of the blocked process. This hastens the time when the lock can be released, so that a low-priority process does not needlessly impede a higher-priority process.

In order to implement priority inheritance and retain high performance, the mutex lock is subject to the restriction that it must be unlocked by the same process or thread that locked it. It cannot be locked in one process or thread identity and unlocked in another.

You can use mutex locks to coordinate the use of global variables between upper-half entry points of a driver, and between the upper-half code and the interrupt handler. You should prefer a mutex lock to a basic lock in any case where the worst-case program path could hold the lock for a time of 100 microseconds or more.

Mutex locks become inefficient when there is high contention for the lock (that is, when the probability of having to wait is high), because when a process has to wait for a lock, a thread switch takes place. When there is high contention for a lock, it is usually better to use a basic lock, because waiting threads simply spin; they do not execute a context switch.
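
A sketch of the usual pattern follows, reusing the hypothetical device-information structure of Example 8-1. The MUTEX_INIT() arguments shown (a MUTEX_DEFAULT type and a metering name) and the ignored second argument of MUTEX_LOCK() should be checked against the MUTEX_INIT(D3) and MUTEX_LOCK(D3) reference pages.

#include <sys/types.h>
#include <sys/ksynch.h>
typedef struct devInfo_s {
   mutex_t devLock;
   /* ...other per-device fields... */
} devInfo_t;
void hypo_attach_locks(devInfo_t *pdi)
{
   MUTEX_INIT(&pdi->devLock, MUTEX_DEFAULT, "veeble");
}
int hypo_entry(devInfo_t *pdi)
{
   MUTEX_LOCK(&pdi->devLock, -1);   /* second argument is historical and ignored */
   /* ...use the device and its shared fields, possibly sleeping... */
   MUTEX_UNLOCK(&pdi->devLock);
   return 0;
}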

Using Sleep Locks

IRIX supports sleep lock functions that are compatible with SVR4. These functions are summarized in Table 8-26.

Table 8-26. Functions for Sleep Locks

Function Name

Header Files

Purpose

SLEEP_ALLOC(D3)  

types.h & kmem.h & ksynch.h

Allocate and initialize a sleep lock.

SLEEP_DEALLOC(D3)  

types.h & ksynch.h

Deinitialize and deallocate a dynamically allocated sleep lock.

SLEEP_INIT(D3)  

types.h & ksynch.h

Initialize an existing sleep lock.

SLEEP_DESTROY(D3)  

types.h & ksynch.h

Deinitialize a sleep lock.

SLEEP_LOCK(D3)  

types.h & ksynch.h & param.h

Acquire a sleep lock, waiting if necessary until the lock is free.

SLEEP_LOCKAVAIL(D3)  

types.h & ksynch.h

Query whether a sleep lock is available.

SLEEP_LOCK_SIG(D3)  

types.h & ksynch.h & param.h

Acquire a sleep lock, waiting if necessary until the lock is free or a signal is received.

SLEEP_TRYLOCK(D3)  

types.h & ksynch.h

Try to acquire a sleep lock, returning a code if it is not free.

SLEEP_UNLOCK(D3)  

types.h & ksynch.h

Release a sleep lock.

Although allocation and deallocation functions are supplied, a sleep_t type is a small object that is normally allocated as a static variable or as a field of a structure. The SLEEP_INIT() operation prepares a statically-allocated sleep_t for use. (In IRIX 6.2, a sleep_t is identical to a sema_t, but this situation could change in a future release.)

A sleep lock is similar to a mutex lock in that it is used for mutual exclusion between processes, and can be held across a function call that sleeps. A sleep lock does not have either the advantages or the restrictions of a mutex lock:

  • A sleep lock can be seized by one process and released by another.

  • A sleep lock can be set in an upper-half entry point and released in an interrupt routine.

  • A sleep lock does not provide priority inheritance. When a low-priority process holds a sleep lock, a higher-priority process can be blocked, causing a priority inversion.

  • A sleep lock does not support the instrumentation or the query functions supported for mutex locks.

Reader/Writer Locks

Reader/writer locks are similar to sleep locks in that they are designed for mutually exclusive control of resources for relatively long periods of time. However, reader/writer locks are optimized for the case in which the resource is often used by processes that only interrogate it (readers), but only rarely used by processes that modify it (writers).

Reader/writer locks compatible with SVR4 are introduced in IRIX 6.2. The functions are summarized in Table 8-27.

Table 8-27. Functions for Reader/Writer Locks

Function Name

Header Files

Purpose

RW_ALLOC(D3)  

types.h & kmem.h & ksynch.h

Allocate and initialize a reader/writer lock.

RW_DEALLOC(D3)  

types.h & ksynch.h

Deallocate a reader/writer lock.

RW_INIT(D3)  

types.h & ksynch.h

Initialize an existing reader/writer lock.

RW_DESTROY(D3)  

types.h & ksynch.h

Deinitialize an existing reader/writer lock.

RW_RDLOCK(D3)  

types.h & ksynch.h & param.h

Acquire a reader/writer lock as reader, waiting if necessary.

RW_TRYRDLOCK(D3)  

types.h & ksynch.h

Try to acquire a reader/writer lock as reader, returning a code if it is not free.

RW_TRYWRLOCK(D3)  

types.h & ksynch.h

Try to acquire a reader/writer lock as writer, returning a code if it is not free.

RW_UNLOCK(D3)  

types.h & ksynch.h

Release a reader/writer lock as reader or writer.

RW_WRLOCK(D3)  

types.h & ksynch.h & param.h

Acquire a reader/writer lock as writer, waiting if necessary.

Although allocation and deallocation functions are supplied, an mrlock_t is a small object that is normally allocated as a static variable or as a field of a structure. The RW_INIT() operation prepares a statically-allocated mrlock_t for use.

A process that intends to modify a resource uses RW_WRLOCK to claim it. This process waits until the resource is not in use by any process, then it gains exclusive access. Only one process is allowed to hold a reader/writer lock as a writer. All other processes, readers or writers, wait until the writer releases the lock.

A process that intends only to interrogate a resource uses RW_RDLOCK to gain access. If a writer holds the lock, the process waits. When the lock is free, or is held only by other readers, the process continues. More than one reader can hold a reader/writer lock at one time. It is also valid for a reader to “double-trip” a reader/writer lock; that is, claim it two or more times. The reader must release the lock as many times as it claimed the lock.

A reader/writer lock serves the same basic purpose as a sleep lock, but it is more efficient in a multiprocessor when there are frequent, read-only uses of a resource.

Priority Level Functions

In traditional UNIX systems, one set of functions served all purposes of synchronization and locking: the set-priority-level, or spl, functions. These functions are still available in IRIX, and are summarized in Table 8-28.

Table 8-28. Functions to Set Interrupt Levels

Function Name

Header Files

Purpose

splbase(D3)  

ddi.h

Block no interrupts.

splhi(D3)  

ddi.h

Block all I/O interrupts.

splx(D3)  

ddi.h

Restore previous interrupt level.

Calls to these functions are commonly found in device drivers being ported from uniprocessors. Such drivers rely on the use of splhi() to guarantee exclusive use of global resources.

The spl functions listed in Table 8-28 are supported by IRIX, but you are strongly advised not to use them. In a multiprocessor, the functions affect only the interrupt handling of the current CPU. Other CPUs in the system continue to handle interrupts, including interrupts initiated by the driver that called splhi().

A driver should use locks, synchronization variables, and other tools to control access to resources. Such a driver never needs an spl function. This improves performance in a multiprocessor, does not harm performance in a uniprocessor, and reduces the latency of all interrupts.

Waiting for Time to Pass

The kernel offers functions for timed delays, as summarized in Table 8-29.

Table 8-29. Functions for Timed Delays

Function Name

Header Files

Purpose

delay(D3)  

ddi.h

Delay for a specified number of clock ticks.

drv_hztousec(D3)  

ddi.h

Convert clock ticks to microseconds.

drv_usectohz(D3)  

ddi.h

Convert microseconds to clock ticks.

drv_usecwait(D3)  

ddi.h

Busy-wait for a specified interval.

dtimeout(D3)  

ddi.h & ksynch.h

Schedule a function to execute on a specified processor after a specified length of time.

itimeout(D3)  

ddi.h & ksynch.h

Schedule a function to be executed after a specified number of clock ticks.

fast_itimeout()  

ddi.h & ksynch.h

Same as itimeout() but takes an interval in “fast ticks.”

fasthzto()  

types.h & time.h

Returns the value of a struct timeval as a count of “fast ticks.”

timeout(D3)  

ddi.h & ksynch.h

Schedule a function to be executed after a specified number of clock ticks.

untimeout(D3)  

ddi.h

Cancel a previous itimeout or fast_itimeout request.

untimeout_func(D3)  

ddi.h

Cancel a previous itimeout or fast_itimeout request by function name.


Time Units

The basic time unit is the “tick.” Its value can differ between hardware platforms and between versions of IRIX. The drv_hztousec() and drv_usectohz() functions convert between ticks and microseconds in the current system. Use them in order to schedule a delay in a portable manner. (However, the timer function precision is the tick, not the microsecond.)

The “fast tick” is a fraction of a tick. Like the tick, the fast tick's value can differ between systems. Use fasthzto() to convert a time value (a struct timeval) to a count of fast ticks.

Timer Support

Timer support is based on the idea of a “callback” function. You specify the following to dtimeout(), itimeout(), timeout() or fast_itimeout():

  • an interval in clock ticks or fast ticks

  • a function to be called at the expiration of the interval

  • one or more arguments to be passed to the function

  • a priority (interrupt) level at which the function should run

After a delay of at least the length requested, the function is called. The function is entered asynchronously. On a uniprocessor, it can interrupt execution of an upper-half routine. On a multiprocessor, it can execute concurrently with an upper-half routine or with an interrupt handler or a different timeout function. (Use locks or mutexes for mutual exclusion.)

The difference between itimeout() and timeout() is that the latter takes no argument values to be passed to the function when it is called. In order to get a repeated series of timer events, start a new timeout from the callback function.

The untimeout() and untimeout_func() functions cancel a pending timeout. In a loadable driver that has a pfxunload() entry point, cancel any pending timeouts before unloading.
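
The sketch below polls a hypothetical device by re-arming itself from its own callback, and cancels the pending timeout before unload. The interval, the structure, and the names are invented; the identifier type toid_t, the priority argument (plbase is only a placeholder), and the exact itimeout() prototype should be checked against itimeout(D3).

#include <sys/types.h>
#include <sys/ddi.h>
#define POLL_USEC 500000                  /* poll every half second (hypothetical) */
typedef struct hypo_poll_s {
   toid_t poll_id;                        /* pending timeout, for untimeout() */
   int    stop_polling;                   /* set before unloading the driver */
} hypo_poll_t;
static void hypo_poll(void *arg)
{
   hypo_poll_t *pp = arg;
   /* ...inspect the device and process any queued work... */
   if (!pp->stop_polling)                 /* re-arm to get a repeated series of events */
      pp->poll_id = itimeout(hypo_poll, pp, drv_usectohz(POLL_USEC), plbase);
}
void hypo_stop_polling(hypo_poll_t *pp)   /* call from pfxunload() */
{
   pp->stop_polling = 1;
   untimeout(pp->poll_id);                /* cancel any pending request */
}

The itimeout() call that starts the series would be issued from an entry point such as pfxopen().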

The STREAMS_TIMOUT macro supplies similar timeout capability for a STREAMS driver (see “Special Considerations for Multiprocessing ” in Chapter 22).

Short-Term Delay Support

In rare circumstances, a driver needs to pause briefly between two hardware operations. For example, the SGI support for external interrupts in the Challenge and Onyx computers sometimes needs to set a high output level, wait for a brief, precise interval, then set a low output level.

The drv_usecwait() function supports this type of very short, precisely-timed delay. It “spins” for a specified number of microseconds, then returns to the caller. The CPU does nothing else during this period, so clearly a delay of more than a few microseconds can interfere with other work. Furthermore, if interrupts are disabled during the wait, the response to another interrupt is delayed also—the delay contributes directly to the “latency” of interrupt handling.
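
A minimal sketch of that sequence, with an invented control register:

volatile unsigned int *ctl_reg;   /* PIO-mapped control register (hypothetical) */
void pulse_output(void)
{
   *ctl_reg = 1;          /* drive the external output high */
   drv_usecwait(5);       /* spin for at least 5 microseconds */
   *ctl_reg = 0;          /* drive it low again */
}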

Waiting for Memory to Become Available

Whenever you request memory of any kind, you must allow for the possibility that the memory will not be available. When you allocate memory in bulk (see “General-Purpose Allocation”) using kmem_alloc() you have the option of receiving a null response, or of waiting for the memory to be available.

When you request memory for specific object types (see “Allocating Objects of Specific Kinds”) there is usually no choice; the functions sleep until they can acquire an object of the requested type.

Within a STREAMS driver you have the ability to schedule a callback function to be entered when memory for a message buffer becomes available (see the bufcall(D3) reference page).

Waiting for Block I/O to Complete

The pfxstrategy() routine initiates the I/O operation to fill a buffer based on a buf_t structure. Then it has to wait for the I/O to complete. The functions for managing this synchronization are summarized in Table 8-30.

Table 8-30. Functions for Synchronizing Block I/O

Function Name

Header Files

Purpose

biodone(D3)  

ddi.h

Release buffer after I/O and wake up waiting process.

bioerror(D3)  

ddi.h

Manipulate error fields in a buf_t.

biowait(D3)  

ddi.h

Suspend process pending completion of I/O.

geterror(D3)  

ddi.h

Retrieve error number from a buf_t.

physiock(D3)  

ddi.h

Validate a raw I/O request and pass to a strategy function.

uiophysio(D3)

ddi.h

Validate a raw I/O request and pass to a strategy function.

undma(D3)  

ddi.h

Unlock physical memory after I/O complete.

userdma(D3)  

ddi.h

Lock physical memory in user space.


How the strategy() Entry Point Is Called

The pfxstrategy() entry point is called directly from the filesystem or virtual memory management, or it can be called indirectly from a pfxread() or pfxwrite() entry point (see “Calling Entry Point strategy() From Entry Point read() or write()” in Chapter 7).

Strategies of the strategy() Entry Point

Typically the pfxstrategy() routine must interact with its interrupt handler. The pfxstrategy() routine can be designed in either of two ways, synchronous or asynchronous.

The synchronous pfxstrategy() routine initiates every I/O operation. Its interrupt handler is responsible only for detecting and signalling the completion of one I/O. The pfxstrategy() routine proceeds as follows:

  1. Lock the data buffer in memory using userdma().

  2. Place the address of the buf_t where the pfxintr() entry point can find it.

  3. Program the device (see “Setting Up a DMA Transfer”) and initiate the I/O activity.

  4. Call biowait().

When the interrupt handler is entered, the handler uses bioerror() if necessary, and biodone() to signal the completion of the I/O. Then it exits. The strategy code, which is waiting in the call to biowait(), regains control following the call to biodone(), and can use geterror() to check the results.
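
The following sketch shows the synchronous pattern with hypothetical names. Locking the user buffer and programming the device are indicated only by comments, because their details depend on the bus and the device (see userdma(D3) and “Setting Up a DMA Transfer”).

#include <sys/types.h>
#include <sys/buf.h>
#include <sys/errno.h>
#include <sys/ddi.h>
static struct buf *active_bp;       /* buf_t shared with the interrupt handler */
int hypo_strategy(struct buf *bp)
{
   /* lock the data buffer in memory here (userdma()) */
   active_bp = bp;                  /* let hypo_intr() find the buf_t */
   /* program the device and start the transfer here */
   biowait(bp);                     /* sleep until hypo_intr() calls biodone() */
   return geterror(bp);             /* 0, or the value set with bioerror() */
}
void hypo_intr(int device_error)    /* device_error is hypothetical */
{
   struct buf *bp = active_bp;
   if (device_error)
      bioerror(bp, EIO);            /* record the failure in the buf_t */
   biodone(bp);                     /* release the thread waiting in biowait() */
}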

The asynchronous pfxstrategy() routine only initiates the first I/O operation of a series, and never waits. It proceeds as follows:

  1. Lock the data buffer in memory using userdma().

  2. Append the address of the buf_t to a queue shared with the interrupt handler.

  3. If the queue was empty, no I/O is in progress. Call a subroutine that programs the device and initiates the I/O.

  4. Return to the caller. The caller (a filesystem or paging system or uiophysio()) waits using biowait().

When the interrupt occurs, the handler proceeds as follows:

  1. The first queued buf_t has completed. Remove it from the queue.

  2. Apply bioerror() if necessary, and biodone() to the buf_t. This releases the caller of the strategy routine from biowait().

  3. If any operations remain in the queue, call a subroutine to program and initiate the next one.

Waiting for a General Event

There are causes for synchronization other than time, block I/O, and memory allocation. For example, there is no defined interface comparable to biowait()/biodone() to mediate between an interrupt handler and the pfxread() or pfxwrite() entry points. You must design a mechanism of your own, using either a synchronization variable or the sleep()/wakeup() function pair.

Using sleep() and wakeup()

The sleep() and wakeup() functions are the simplest, oldest, and least efficient of the general synchronization mechanisms. They are summarized in Table 8-31.

Table 8-31. Functions for Synchronization: sleep/wakeup

Function Name

Header Files

Purpose

sleep(D3)  

ddi.h & param.h

Suspend execution pending an event.

wakeup(D3)  

ddi.h

Waken a process waiting for an event.

Used carefully, these functions are suitable for simple character device drivers. However, when you are writing new code or converting a driver to multiprocessing you should avoid them and use synchronization variables instead (see “Using Synchronization Variables”).

The basic concept is that the upper-layer routine calls sleep(n) in order to wait for an event that is keyed to an arbitrary address n. Typically n is a pointer to a data structure related to an I/O operation. The interrupt handler executes wakeup(n) to cause the sleeping process to resume execution.
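
In outline, with hypothetical names and PRIBIO as a conventional sleep priority, the pattern looks like this:

#include <sys/types.h>
#include <sys/param.h>
#include <sys/ddi.h>
typedef struct hypo_op_s {
   int done;                        /* set by the interrupt handler */
   /* ...other per-operation state... */
} hypo_op_t;
void hypo_start_and_wait(hypo_op_t *op)
{
   op->done = 0;
   /* start the device operation here */
   while (!op->done)
      sleep((caddr_t)op, PRIBIO);   /* wait, keyed to the operation's address */
}
void hypo_intr_handler(hypo_op_t *op)
{
   op->done = 1;
   wakeup((caddr_t)op);             /* resume the sleeping process */
}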

The main reason to avoid sleep() is that, in a multiprocessor system, it is hard to ensure that sleeping always begins before wakeup() is called. The usual intended sequence of events is as follows:

  1. Upper-half routine initiates a device operation that will lead to an interrupt.

  2. Upper-half routine executes sleep(n).

  3. Interrupt occurs, and handler executes wakeup(n).

In a multiprocessor-aware driver (one with D_MP in its pfxdevflag constant; see “Driver Flag Constant” in Chapter 7), there is a small chance that the interrupt can occur, calling wakeup(n), before the sleep(n) call has been completed. Because sleep() has not been called, the wakeup() is lost. When the sleep() call completes, the process sleeps forever. Synchronization variables are designed to handle this case.

Using Synchronization Variables

Synchronization variables, a feature of UNIX SVR4, are supported by IRIX beginning with release 6.2. These functions are summarized in Table 8-32.

Table 8-32. Functions for Synchronization: Synchronization Variables

Function Name

Header Files

Purpose

SV_ALLOC(D3)  

types.h & sema.h

Allocate and initialize a synchronization variable.

SV_DEALLOC(D3)  

types.h & sema.h

Deinitialize and deallocate a synchronization variable.

SV_INIT(D3)  

types.h & sema.h

Initialize an existing synchronization variable.

SV_DESTROY(D3)  

types.h & sema.h

Deinitialize a synchronization variable.

SV_BROADCAST(D3)  

types.h & sema.h

Wake all processes sleeping on a synchronization variable.

SV_SIGNAL(D3)  

types.h & sema.h

Wake one process sleeping on a synchronization variable.

SV_WAIT(D3)  

types.h & sema.h

Sleep until a synchronization variable is signalled.

SV_WAIT_SIG(D3)  

types.h & sema.h

Sleep until a synchronization variable is signalled or a signal is received.

A synchronization variable is a memory object of type sv_t, representing the occurrence of an event. You can allocate objects of this type dynamically, or declare them as static variables or as fields of structures.

One or more processes may wait for an event using SV_WAIT(). An interrupt handler or timer callback function can signal the occurrence of an event using SV_SIGNAL (to wake up only one waiting process) or SV_BROADCAST (to wake up all of them).

SV_WAIT is specifically designed to handle the difficult case that arises when the driver needs to initiate an I/O operation and then sleep, and do these things in such a way that it always begins to sleep before the SV_SIGNAL can possibly be issued. The procedure is done as follows:

  1. The driver seizes a basic lock (see “Basic Locks”) or a mutex lock (see “Using Mutex Locks”) that is also used by the interrupt handler.

    A LOCK() call returns an integer that is needed later.

  2. The driver initiates an I/O operation that can lead to an interrupt.

  3. The driver calls SV_WAIT, passing the lock it holds and an integer, either the value returned by LOCK() or a zero if the lock is a mutex lock.

  4. In one indivisible operation, SV_WAIT releases the lock and begins waiting on the synchronization variable.

  5. The interrupt handler or other process is entered, and seizes the lock.

    This step ensures that, if the interrupt handler or other process is entered preceding the SV_WAIT call, it will not proceed until SV_WAIT has completed.

  6. The interrupt handler or other process does its work and calls SV_SIGNAL to release the waiting driver.

This process is sketched in Example 8-5.

Example 8-5. Skeleton Code for Use of SV_WAIT

lock_t seize_it;
sv_t wait_on_it;
initiator(...)
{
   int lock_cookie;
   for( as often as necessary )
   {
      lock_cookie = LOCK(&seize_it,PL_ZERO);
      [do something that causes a later interrupt]
      SV_WAIT(&wait_on_it, 0, &seize_it, lock_cookie);
      [interrupt has been handled]
   }
}
 
void handler(...)
{
   int lock_cookie = LOCK(&seize_it,PL_ZERO);
   [handle the interrupt]
   SV_SIGNAL(&wait_on_it);
   UNLOCK(&seize_it);
}

If it is necessary to use a semaphore as the lock, the header file sys/sema.h declares versions of SV_WAIT that accept a semaphore and a synchronization variable. The combination of a mutual exclusion object and a synchronization variable ensures that even in a multiprocessor, the interrupt handler cannot exit before the driver has entered a predictable wait state.


Tip: When a debugging kernel is used, you can display statistics about the use of a given synchronization variable. See “Including Lock Metering in the Kernel Image” in Chapter 10.


Semaphores

The semaphore is a generalized tool that can be used for both mutual exclusion and for waiting. The IRIX kernel support for semaphores is summarized in Table 8-33.

Table 8-33. Functions for Semaphores

Function Name

Header Files

Purpose

cpsema(D3)  

sema.h & types.h

Conditionally perform a “P” or wait semaphore operation.

cvsema(D3)  

sema.h & types.h

Conditionally perform a “V” or release semaphore operation.

freesema(D3)  

sema.h & types.h

Free the resources associated with a semaphore.

initnsema(D3)  

sema.h & types.h

Initialize a semaphore to a given value.

initnsema_mutex(D3)  

sema.h & types.h

Initialize a semaphore to a value of 1.

psema(D3)  

sema.h & types.h & param.h

Perform a “P” or wait semaphore operation.

valusema(D3)  

sema.h & types.h

Return the value associated with a semaphore.

vsema(D3)  

sema.h & types.h

Perform a “V” or signal semaphore operation.

Conceptually, a semaphore contains an integer. The “P” operation claims the semaphore, decrementing its count by 1 (mnemonic: dePlete). If the count is 0 or less, the process waits until the count is greater than 0 before it decrements the semaphore and returns.

The “V” operation increments the semaphore count (mnemonic: reViVe) and wakens any process that is waiting.


Tip: When a debugging kernel is used, you can display statistics about the use of a given semaphore. See “Including Lock Metering in the Kernel Image” in Chapter 10.



Note: In releases before IRIX 6.2, initnsema_mutex() was used to initialize a semaphore in a special way that got the performance of a basic lock in a multiprocessor. Since IRIX 6.2, this function is simply a macro that initializes the semaphore to a count of 1.


Using a Semaphore for Mutual Exclusion

To use a semaphore for locking, initialize it to 1. (This reflects the idea that a process calling a locking function expects to continue.) When you require exclusive use of the associated resource, call psema(). Typically this finds a semaphore count of 1, reduces it to 0, and returns.

When you are finished with the resource, call vsema() to increment the semaphore count, and release any process that is blocked in a psema() call for the same semaphore.

For locking, a semaphore is comparable to a sleep lock. In some systems, the performance of semaphore operations may not be as good as the performance of a mutex lock. In other systems, mutex locks may be implemented using semaphores.

Using a Semaphore for Waiting

To use a semaphore for waiting, initialize it to 0. Then call psema(). Because the semaphore count is 0, the process waits. When the desired event occurs, typically in the interrupt handler, call vsema() to release the waiting process.

This synchronization method is as reliable as a synchronization variable, but it has slightly different behavior. When a synchronization variable is used correctly (see “Using Synchronization Variables”), if the interrupt handler is entered before the SV_WAIT call completes, the interrupt handler waits on a LOCK call.

When a semaphore is used, if the interrupt handler is entered before the psema() call completes, the vsema() operation is done immediately and the interrupt handler continues without waiting. The fact that vsema() was called is stored as a count within the semaphore, where psema() will find it. Because the semaphore can contain this state information, the interrupt handler does not have to be synchronized in time using a lock.
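
A sketch of this use follows; the names are invented, and the initnsema() arguments (initial count plus a name used for metering) and the psema() priority should be checked against the reference pages.

#include <sys/types.h>
#include <sys/param.h>
#include <sys/sema.h>
static sema_t io_done;
void hypo_setup(void)
{
   initnsema(&io_done, 0, "hypio");   /* count 0: the first psema() will wait */
}
void hypo_wait_for_device(void)
{
   /* start the device operation here */
   psema(&io_done, PRIBIO);           /* wait until the handler calls vsema() */
}
void hypo_intr(void)
{
   vsema(&io_done);                   /* release (or pre-release) the waiter */
}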


Note: In releases before IRIX 6.2, the vpsema() function was used in much the way synchronization variables are used: to release one semaphore and wait on another in an atomic operation. This function is no longer supported; replace it with a synchronization variable.


Using Kernel Threads

This section describes kernel system threads and their configuration.

Kernel System Threads

IRIX uses interrupt threads to handle most of its physical interrupts. The section titled “Interrupt Entry Point and Handler” in Chapter 7 describes how to create a kernel thread and how to link it to a physical interrupt in one action. Some drivers perform background processing of events and queues that are not tied to particular physical interrupts. User-mode programs that do this are typically called daemons.

For systems running IRIX 6.5.17 and later, drivers can create kernel threads not associated with particular interrupts. These "system" threads can take all types of locks, block on events or resources, and do anything else that interrupt threads can do. For details on their creation and destruction, see the drv_thread_create(D3) and drv_thread_exit(D3) reference pages.

Unlike interrupt handlers, most system threads should not return from their starting function until they are ready to destroy their thread. Most threads should use some form of loop, alternating between processing data and waiting for more data from user programs or from interrupt threads. The following example illustrates the creation and operation of a typical system thread:

Example 8-6. Creation and Operation of a Typical System Thread

...

#include <sys/cmn_err.h>
#include <sys/ddi.h>
...

void
example_system_thread(void * arg0,
                      void * arg1,
                      void * arg2,
                      void * arg3)
{
        /*
         * Loop processing events and sleeping waiting for more
         */
        while (1) {
              /*
               * Wait for more events to occur
               */

              /*
               * Do background work
               */
        }

        /*
         * If we need to exit this thread for some reason
         * we call the below.  This is equivalent to just
         * calling return() from the base function.
         */
        drv_thread_exit();
}

void
example_init(void)
{
        int error;
        void * myarg0, * myarg1;
...
        /*
         * Create a system thread to do background work
         */
         error = drv_thread_create("MyThread", 0, 0, 0,
                                   example_system_thread,
                                   myarg0, myarg1, NULL, NULL);

         if (error) {
            cmn_err(CE_WARN, "Creation of MyThread failed\n");
         }
...
}


Custom Configurations for Kernel Threads

When the irix.sm DEVICE_ADMIN INTR_TARGET directive is used to direct a physical interrupt, it also binds its interrupt handlers to the target CPU. In some situations, such as when running with the SGI Frame Rate Scheduler (FRS), it is desirable to put interrupt handler threads on CPUs in locations other than where their physical interrupts are directed. Systems running IRIX 6.5.16 or later can use the XThread Control Interface (XTCI) to control special behaviors such as this. Users can add XTHREAD entries in the /var/sysgen/system/irix.sm file. Kernel threads not given entries operate with default behavior. After irix.sm is modified, you should run lboot to reconfigure the system.

To preserve compatibility, if conflicting entries are found, XTCI entries defer to the legacy /var/sysgen/master.d/sgi interface. As with the master.d/sgi interface, system threads can be given entries but may later change their own behavior, whereas interrupt threads must adhere to their entries throughout their lifetime.

Specific interface entries are of the following format:

XTHREAD: name[*] [BOOT] [FLOAT] [STACK s] [PRI p] [CPU m...n]

Entry descriptions are as follows:

XTHREAD:  

Indicates that any line beginning with XTHREAD: controls kernel threads. All of the information must be on the same line.

name[*]  

Indicates that any thread with a name equal to name is affected by the directives that follow it. If * follows, any thread whose name begins with name is affected.

BOOT 

Indicates that the thread stays within the boot cpuset, if one exists.

FLOAT  

Indicates that the thread will never be bound to a CPU.

STACK s 

Specifies the thread stack size.

PRI p 

Specifies the starting thread CPU scheduling priority.

CPU m...n  

Specifies a list of CPUs on which to attempt to place the thread, if possible. Threads that cannot be placed on their CPU list will be considered FLOAT. This is comparable to the sysmp() MP_MUSTRUN command for user threads. You can list up to four processors.


Note: At boot time the XTCI mechanism is enabled before IRIX enables its device drivers but after some of the core IRIX services are initialized. Therefore some kernel threads, such as the timeout threads, are not affected by XTCI entries for them.


The following examples illustrate the use of XTHREAD entries:

Example 8-7. XTHREAD FLOAT Entry

XTHREAD: ioc3* FLOAT

On SGI Origin series systems, this entry prevents all of the interrupt handler threads for the IOC3 hardware (including the mouse and keyboard handlers) from being bound to a CPU. This entry is useful for the previously described situation of routing the physical interrupt for the external interrupt (and thus, also the keyboard and mouse) to a CPU running the FRS. Because the FRS controls the CPU, it will not allow mouse and keyboard handlers to run. The FLOAT directive allows them to run on a different CPU.

Example 8-8. XTHREAD CPU Entry

XTHREAD: vme_intrd0 CPU 2

This example forces the kernel interrupt thread for level 0 VME interrupts to run on processor 2.