Chapter 7. Structure of a Kernel-Level Driver

A kernel-level device driver consists of a module of subroutines that supply services to the kernel. The subroutines are public entry points in the driver. When an event occurs, the kernel calls one of these entry points. The driver takes action and returns a result code.

This chapter discusses when the driver entry points are called, what parameters they receive, and what actions they are expected to take. For a conceptual overview of the kernel and drivers, see “Kernel-Level Device Control” in Chapter 3. For details on how a driver is compiled, linked, and added to IRIX, see Chapter 9, “Building and Installing a Driver”.

Note: This chapter concentrates on device drivers. Entry points unique to STREAMS drivers are covered in Chapter 22, “STREAMS Drivers”.

The primary topics covered in this chapter are:

“Summary of Driver Structure” summarizes the entry points and how they are made known to the kernel.
“Driver Flag Constant” describes the public constant that documents the driver type for lboot and mload.
“Initialization Entry Points” discusses the entry points at which a driver initializes its own data and its devices.
“Attach and Detach Entry Points” discusses the entry points that handle dynamic attachment of Peripheral Component Interconnect (PCI) devices.
“Open and Close Entry Points” discusses the entry points called by the open() and close() kernel functions.
“Control Entry Point” documents the entry point called by the ioctl() kernel function.
“Data Transfer Entry Points” documents the entry points called by the read() and write() kernel functions.
“Poll Entry Point” documents the entry point called by the poll() kernel function.
“Memory Map Entry Points” tells how a driver supports memory mapping of devices and buffers.
“Interrupt Entry Point and Handler” discusses the design and operation of interrupt handlers.
“Support Entry Points” describes several entry points that support kernel operations.
“Handling 32-Bit and 64-Bit Execution Models” covers the techniques of supporting user processes that have different execution models.
“Designing for Multiprocessor Use” covers the techniques of making a driver work in a multiprocessor, multithreading environment.

Summary of Driver Structure

A driver consists of a binary object module in ELF format stored in the /var/sysgen/boot directory. As a program, the driver consists of a set of functional entry points that supply services to the IRIX kernel. There is a large set of entry points to cover different situations. Some entry points are historical relics, while others were first defined in IRIX 6.4. No single driver supports all possible entry points.

The entry points that a driver supports must be named according to a specified convention. The lboot command uses entry point names to build tables used by the kernel.

Entry Point Naming and lboot

The device driver makes known which entry points it supports by giving them public names in its object module. The lboot command links together the object modules of drivers and other kernel modules to make a bootable kernel. lboot recognizes the entry points by the form of their names. (See the lboot(1M) and autoconfig(1M) reference pages.)

Driver Name Prefix

A device driver must be described by a file in the /var/sysgen/master.d directory (see “Master Configuration Database” in Chapter 2). In that configuration file you specify the driver prefix, a string of 1 to 14 characters that is unique to that driver. For example, the prefix of the SCSI driver is scsi_.

The prefix string is defined in the /var/sysgen/master.d file only. The string does not have to appear as a constant in the driver, and the name of the driver object file does not have to correspond to the prefix (although the object module typically has a related name).

The lboot command recognizes driver entry points by searching the driver object module for public names that begin with the prefix string. For example, the entry point for the open() operation must have a name that consists of the prefix string followed by the letters “open.”

In this book, entry point names are written as follows: pfxopen(), where pfx stands for the driver's prefix string.

Driver Name Prefix as a Compiler Constant

The driver prefix string appears as part of the name of each public entry point. In addition, you sometimes need the driver prefix string as a character string literal, for example in a PCI driver as an argument to pciio_driver_register(). You would like to define the prefix string in one place and then generate it automatically where needed in the code. The C macro code in Example 7-1 accomplishes this goal.

Example 7-1. Compiling Driver Prefix as a Macro

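#define PREFIX_NAME(name) sample_ ## name
/* -----  driver prefix:  ^^^^^^^ defined there only */
#define PREFIX_ONLY PREFIX_NAME( )
#define STRINGIZER(x) # x
/* Extra level so PREFIX_ONLY is macro-expanded before "#" is applied; */
/* without it, standard C yields "PREFIX_ONLY" instead of "sample_".   */
#define EXPAND_AND_STRINGIZE(x) STRINGIZER(x)
#define PREFIX_STRING EXPAND_AND_STRINGIZE(PREFIX_ONLY)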

A macro call to PREFIX_STRING generates a string literal (“sample_” in this case). You can use this macro wherever a string literal is allowed, for example, as a function argument. The “##” operator is ANSI C syntax for token concatenation (token pasting): it joins two preprocessor tokens into a single token.

Further down, in the STRINGIZER macro, the “#” operator is ANSI C syntax for stringizing: it converts the macro argument into a double-quoted string literal.

A call to PREFIX_NAME(name) generates an identifier composed of the prefix concatenated to name. You can define the init entry point as follows:

void PREFIX_NAME(init)(void)
{ ... }

However, this can be confusing to read. You can also define one macro for each entry point, as shown in Example 7-2.

Example 7-2. Entry Point Name Macros

#define PFX_INIT      PREFIX_NAME(init)
#define PFX_START     PREFIX_NAME(start)

Using macros such as these you can define an entry point as follows:

void PFX_INIT(void)
{ ... }
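
The following self-contained sketch pulls the macros of Example 7-1 and Example 7-2 together so that you can check them with an ordinary C compiler. The prefix sample_ comes from Example 7-1. The helper macro EXPAND_AND_STRINGIZE, the counter init_calls, and the function prefix_string() are illustrative assumptions, not names from the IRIX headers. The extra macro level is used because standard C does not macro-expand an argument that is the operand of “#”.

```c
#define PREFIX_NAME(name) sample_ ## name        /* prefix defined here only */
#define PREFIX_ONLY PREFIX_NAME( )               /* empty argument: C99 or later */
#define STRINGIZER(x) # x
#define EXPAND_AND_STRINGIZE(x) STRINGIZER(x)    /* hypothetical helper: expands x first */
#define PREFIX_STRING EXPAND_AND_STRINGIZE(PREFIX_ONLY)

#define PFX_INIT  PREFIX_NAME(init)
#define PFX_START PREFIX_NAME(start)

static int init_calls = 0;   /* illustrative counter, not part of any driver API */

/* Expands to: void sample_init(void) */
void PFX_INIT(void)
{
    init_calls++;
}

/* Expands to: void sample_start(void) */
void PFX_START(void)
{
}

/* Returns the prefix as a string, as it might be passed to a registration call. */
const char *prefix_string(void)
{
    return PREFIX_STRING;
}
```

Because the prefix appears only in the definition of PREFIX_NAME, changing it there renames every entry point and the string together.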

Kernel Switch Tables

The IRIX kernel maintains tables that allow it to dispatch calls to device drivers quickly. These tables are built by lboot based on the names of the driver entry points. The tables are named as follows:

bdevsw	Table of block device drivers
cdevsw	Table of character device drivers
fmodsw	Table of STREAMS drivers
vfssw	Table of filesystem modules (not related to device drivers)

Conceptually, the tables for block and character drivers have one row for each driver, and one column for each possible driver entry point. (Historically, the major device number was the driver's row number in the switch table. This simple data structure is no longer used.)

As lboot loads a driver, it fills in that driver's row of a switch table with the addresses of the driver's entry points. Where an entry point is not defined in the driver object file, lboot leaves the address of a null routine that returns the ENODEV error code. Thus no driver needs to define all entry points—only the ones it can support in a useful way.
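
The idea of a switch-table row can be sketched in a few lines of portable C. This is a conceptual model only, not the kernel's actual data structure: the type sim_cdevsw_row, the functions sim_nodev() and sim_fill_row(), and the reduction to two entry points are all assumptions made for illustration.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical, much-reduced entry point type: one integer argument. */
typedef int (*entry_fn)(int dev);

/* Hypothetical switch-table row with only two columns. */
struct sim_cdevsw_row {
    entry_fn d_open;
    entry_fn d_close;
};

/* Null routine left in every slot that the driver does not define. */
static int sim_nodev(int dev)
{
    (void)dev;
    return ENODEV;
}

/* A driver that defines an open entry point but no close entry point. */
static int sample_open(int dev)
{
    (void)dev;
    return 0;
}

/* Fill one row: the driver's own address where defined, else the null routine. */
void sim_fill_row(struct sim_cdevsw_row *row, entry_fn open_fn, entry_fn close_fn)
{
    row->d_open  = open_fn  ? open_fn  : sim_nodev;
    row->d_close = close_fn ? close_fn : sim_nodev;
}
```

With a row filled this way, the caller never has to test for a missing entry point; it always calls through the pointer and receives ENODEV when the driver offers nothing.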

The sizes of the switch tables are fixed at boot time in order to minimize kernel data space. The table sizes are tunable parameters that can be set with systune (see the systune(1) reference page).

When a driver is loaded dynamically (see “Configuring a Loadable Driver” in Chapter 9), the associated row of the switch table is not filled at link time but rather is filled when the driver is loaded. When you add new, loadable drivers, you might need to specify a larger switch table. The book IRIX Admin: System Configuration and Operation documents these tunable parameters.

Entry Point Summary

The names of all possible driver entry points and their purposes are summarized in Table 7-1. The entry point names are in alphabetic order, not logical order. Device driver entry points are discussed in this chapter. Entry points to STREAMS drivers are discussed in Chapter 22, “STREAMS Drivers”.

Table 7-1. Entry Points in Alphabetic Order

Entry Point	Purpose	Discussion	Reference Page
pfxattach	Attach a new device to the system.	“Entry Point attach()”
pfxclose	Note the device is not in use.	“Entry Point close()”	close(D3)
pfxdevflag	Constant flag bits for driver features.	“Driver Flag Constant”	devflag(D1)
pfxdetach	Detach a device from the system.	“Entry Point detach()”
pfxedtinit	Initialize EISA or VME driver from VECTOR statement.	“Entry Point edtinit()”	edtinit(D2)
pfxhalt	Prepare for system shutdown.	“Entry Point halt()”	halt(D2)
pfxinit	Initialize driver globals at load or boot time.	“Entry Point init()”	init(D2)
pfxintr	Handle device interrupt (not used).	“Interrupt Entry Point and Handler”	intr(D2)
pfxioctl	Implement control operations.	“Control Entry Point”	ioctl(D2)
pfxmap	Implement memory-mapping (IRIX).	“Entry Point map()”	map(D2)
pfxmmap	Implement memory-mapping (SVR4).	“Entry Point mmap()”	mmap(D2)
pfxopen	Connect a process to a device. Connect a stream module.	“Entry Point open()” “Entry Point open()” in Chapter 22	open(D2)
pfxpoll	Implement device event test.	“Entry Point poll()”	poll(D2)
pfxprint	Display diagnostic about block device.	“Entry Point print()”	print(D2)
pfxread	Character-mode input.	“Entry Points read() and write()”	read(D2)
pfxreg	Register a driver at load or boot time.	“Entry Point reg()”
pfxrput	STREAMS message on read queue.	“Put Functions wput() and rput()” in Chapter 22	put(D2)
pfxsize	Return logical size of block device.	“Entry Point size()”	size(D2)
pfxsrv	STREAMS service queued messages.	“Service Functions rsrv() and wsrv()” in Chapter 22	srv(D2)
pfxstart	Initialize driver at load or boot time.	“Entry Point start()”	start(D2)
pfxstrategy	Block-mode input and output.	“Entry Point strategy()”	strategy(D2)
pfxunload	Prepare loadable module for unloading.	“Entry Point unload()”	unload(D2)
pfxunmap	Note the end of a memory mapping.	“Entry Point unmap()”	unmap(D2)
pfxunreg	Undo driver registration prior to unloading.	“Entry Point unreg()”
pfxwput	STREAMS message on write queue.	“Put Functions wput() and rput()” in Chapter 22	put(D2)
pfxwrite	Character-mode output.	“Entry Points read() and write()”	write(D2)

Entry Point Usage

No driver supports all entry points. Typical entry point usage is as follows:

A minimal driver for a character device supports pfxinit(), pfxopen(), pfxread(), pfxwrite(), and pfxclose(). The pfxioctl() and pfxpoll() entry points are optional.
A minimal block device driver supports pfxopen(), pfxsize(), pfxstrategy(), and pfxclose().
A minimal pseudo-device driver supports pfxstart(), pfxopen(), pfxmap(), pfxunmap(), and pfxclose() (the latter two possibly as mere stubs).

In addition:

All drivers need a pfxdevflag constant.
Loadable drivers may support pfxunreg() and pfxunload().
A block or character driver for a PCI device should support pfxattach(), pfxdetach(), and pfxreg(). The pfxenable(), pfxdisable(), and pfxerror() entry points are optional.
A block or character driver for a VME, EISA, or GIO device should support pfxedtinit().

Entry Point Calling Sequence

Entry points of a nonloadable driver are called as follows.

The first call is to pfxinit() if it exists.
A driver for a VME, EISA, or GIO bus device is then called at its pfxedtinit() entry point once for each VECTOR line that specifies that driver.
The pfxstart() entry point is called, if it exists.
The pfxreg() entry point is called, if it exists.
A driver for a PCI device is called at its pfxattach() entry point once for each device that it supports, as the kernel discovers the devices.
The pfxopen() entry point is called whenever any process opens a device controlled by this driver.
The pfxread(), pfxwrite(), pfxstrategy(), pfxmap(), pfxpoll() and pfxioctl() calls are exercised as long as any device is open.
The pfxunmap() entry point is called when all processes have unmapped a given segment of memory.
The pfxclose() entry point is called when the last process closes a device, so the device is known to be no longer in use.
The pfxdetach() entry point can be called only when a device has been closed.

The sequence of entry points called for a loadable driver is similar, with additional calls that are discussed under “Entry Point unreg()” and “Entry Point unload()”.

Driver Flag Constant

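Any device driver or STREAMS module must define a public name pfxdevflag as a statically allocated integer with external linkage (that is, a global variable that is not declared with the C static keyword, so that the name is visible to lboot). This integer contains a bitmask with one or more of the following flags, which are declared in sys/conf.h: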

D_MP	The driver is prepared for multiprocessor systems.
D_MT	The driver is prepared for a multithreaded kernel.
D_PCI_HOT_PLUG_ATTACH	The driver supports the PCI Hot Plug insertion of its devices.
D_PCI_HOT_PLUG_DETACH	The driver supports the PCI Hot Plug removal of its devices.
D_WBACK	The driver handles its own cache-writeback operations.

A typical definition would resemble the following:

int testdrive_devflag = D_MP+D_MT;

A STREAMS module should also provide this flag, but the only relevant bit value for a STREAMS driver is D_MP (see “Driver Flag Constant” in Chapter 22).

The flag value is saved in the kernel switch table with the driver's entry points (see “Kernel Switch Tables”).

When a driver (or STREAMS module) does not define a pfxdevflag, or defines one containing 0, lboot refuses to load it as part of the kernel.
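
The rule above can be modeled in a few lines. This is a sketch under stated assumptions: the SIM_ bit values are placeholders chosen for illustration (the real values are declared in sys/conf.h), and sim_devflag_ok() is a hypothetical stand-in for the check that lboot performs, not its actual code.

```c
#include <stddef.h>

/* Placeholder bit values for illustration only; the real ones are in sys/conf.h. */
#define SIM_D_MP    0x01
#define SIM_D_WBACK 0x02
#define SIM_D_MT    0x04

/* What a driver with the prefix testdrive_ might declare: global, not static. */
int testdrive_devflag = SIM_D_MP | SIM_D_MT;

/*
 * Hypothetical model of the load-time check. devflag points to the driver's
 * flag word, or is NULL when the driver defines none.
 * Returns 1 when the driver is acceptable, 0 when it is refused.
 */
int sim_devflag_ok(const int *devflag)
{
    return devflag != NULL && *devflag != 0;
}
```

The sketch combines the flags with the bitwise OR operator. For distinct bits this gives the same value as addition, but it states the intent (a bitmask) more directly.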

Flag D_MP

You specify D_MP in pfxdevflag to tell lboot that your driver is designed to operate in a multiprocessor system. The top half of the driver is designed to cope with multiple concurrent entries in multiple CPUs. The top and bottom halves synchronize through the use of semaphores or locks and do not rely on interrupt masking for critical sections. These issues are discussed further under “Designing for Multiprocessor Use”.

All drivers must be designed in this fashion and confirm it with D_MP, even drivers written for uniprocessor workstations.

Flag D_MT

Driver interrupt routines execute as independent, preemptable threads of control within the kernel address space (see “Interrupts as Threads”). D_MT indicates that this driver understands that it can be run as one or more cooperating threads, and uses kernel synchronization primitives to serialize access to driver common data structures.

In IRIX 6.4, D_MT does not commit a driver to anything beyond the meaning of D_MP.

Flag D_PCI_HOT_PLUG_ATTACH

A driver sets this flag to indicate that it supports PCI Hot Plug insertion of its devices by providing an attach() function that initializes the device hardware and software from a powered-down state while the system is running. A driver can support Hot Plug insertion, Hot Plug removal, or both. This flag has meaning only on SGI Origin 3000 series servers and is ignored for non-PCI drivers.

Flag D_PCI_HOT_PLUG_DETACH

A driver sets this flag to indicate that it supports PCI Hot Plug removal of its devices by providing a detach() function that terminates operation of the device hardware and releases all software resources so the device can be powered down while the system is running. A driver can support Hot Plug insertion, Hot Plug removal, or both. This flag has meaning only on SGI Origin 3000 series servers and is ignored for non-PCI drivers.

Flag D_WBACK

You specify D_WBACK in pfxdevflag to tell lboot that a block driver performs any necessary cache write-back operations through explicit calls to dki_dcache_wb() and related functions (see the dki_dcache_wb(D3) reference page).

When D_WBACK is not present in pfxdevflag, the physiock() function ensures that all cached data related to buf_t structures is written back to main memory before it enters the driver's strategy routine. (See the physiock(D3) reference page and “Entry Point strategy()”.)

Flag D_OLD Not Supported

In IRIX versions before IRIX 6.4, a driver was allowed to have no pfxdevflag, or to have one containing only a flag named D_OLD. This flag, or the absence of a flag, requested compatibility handling for an obsolete driver interface. Support for this interface has been withdrawn effective with IRIX 6.4.

Initialization Entry Points

The kernel calls a driver to initialize itself at four different entry points, as follows:

pfxinit	Initialize self-defining hardware or a pseudo-device.
pfxedtinit	Initialize a hardware device based on VECTOR data.
pfxstart	General initialization.
pfxreg	For a driver that supports the pfxattach() entry point, register the driver as ready to attach devices.

Historically, these calls were made at different times in the boot process and the driver had different abilities at each time. Now they are all called at nearly the same time. A driver may define any combination of these entry points. Typically a PCI driver will define pfxinit() and pfxreg(), while a VME or EISA driver will define pfxinit() and pfxedtinit().

When Initialization Is Performed

The initialization entry points of ordinary (nonloadable) drivers are called during system startup, after interrupts have been enabled and before the message “The system is coming up” is displayed on the console. In all cases, interrupts are enabled and basic kernel services are available at this time. However, other loadable or optional kernel modules might not have been initialized, depending on the sequence of statements in the files in /var/sysgen/system.

Whenever a driver is initialized, the entry points are called in the following sequence:

pfxinit() is called.
pfxedtinit() is called once for each VECTOR statement that names the driver, in the reverse of the order in which the VECTOR statements appear in the /var/sysgen/system files.
pfxstart() is called.
pfxreg() is called.
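
The sequence above can be sketched as a small simulation that records the order of calls. Everything here is an assumption made for illustration: the sample_ prefix, the sim_ helper names, and the use of a simple index in place of the edt_t pointer that the real pfxedtinit() receives.

```c
#include <stdio.h>
#include <string.h>

#define SIM_LOG_MAX 128
static char sim_log[SIM_LOG_MAX];      /* trace of calls, in the order made */

/* Append one entry point name, followed by a space, to the trace. */
static void sim_note(const char *name)
{
    size_t used = strlen(sim_log);
    snprintf(sim_log + used, SIM_LOG_MAX - used, "%s ", name);
}

/* Stub entry points for a hypothetical driver with the prefix sample_. */
static void sample_init(void)  { sim_note("init"); }
static void sample_start(void) { sim_note("start"); }
static void sample_reg(void)   { sim_note("reg"); }

/* The real pfxedtinit() receives an edt_t pointer; an index stands in for it. */
static void sample_edtinit(int vector_index)
{
    char name[32];
    snprintf(name, sizeof name, "edtinit%d", vector_index);
    sim_note(name);
}

/* Model of the documented order; nvectors is the number of VECTOR statements. */
void sim_initialize_driver(int nvectors)
{
    int i;

    sim_log[0] = '\0';
    sample_init();
    for (i = nvectors - 1; i >= 0; i--)   /* reverse of the order coded */
        sample_edtinit(i);
    sample_start();
    sample_reg();
}

/* Return the recorded trace. */
const char *sim_trace(void)
{
    return sim_log;
}
```

For a driver named in two VECTOR statements, the trace shows init first, then edtinit for the second statement, then edtinit for the first, then start and reg.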

Initialization of Loadable Drivers

A loadable driver (see “Loadable Drivers” in Chapter 3) is initialized any time it is loaded. This can occur more than once, if the driver is loaded, unloaded, and reloaded. When a loadable driver is configured for autoregister, it is loaded with other drivers during system startup. (For more information on autoregister, see “Configuring a Loadable Driver” in Chapter 9.) Such a driver is initialized at system startup time along with the nonloadable drivers.

Entry Point init()

The pfxinit() entry point is called once during system startup or when a loadable driver is loaded. It receives no input arguments; its prototype is simply:

void pfxinit(void);

You can use this entry point for any of the following purposes:

To initialize global data used by more than one entry point or with more than one device.
To initialize a hardware device that is self-defining; that is, all the information the driver needs is either coded into the driver or can be obtained by probing the device itself.
To initialize a pseudo-device driver; that is, a driver that does not have real hardware attached.

A driver that is brought into the system by a USE or INCLUDE line in a system configuration file (see “Configuring a Kernel” in Chapter 9) typically initializes in the pfxinit() entry point.

Entry Point edtinit()

The pfxedtinit() entry is designed to initialize devices that are configured using the VECTOR statement in the system configuration file (see “Kernel Configuration Files” in Chapter 2). This includes GIO, EISA, and VME devices. The entry point name is a contraction of “early device table initialization.”

The VECTOR statement specifies hardware details about a device on the VME, GIO, or EISA bus, including such items as iospace addresses, interrupt level, bus number, and a driver-defined integer value referred to as the controller number. The VECTOR statement also specifies the driver that is to manage the device, and it can specify probe operations that let the kernel test for the existence of the device.

When the kernel processes a VECTOR statement during bootstrap, it executes the probe, if one is specified. When the probe is successful (or no probe is given), the kernel makes sure that the specified driver is loaded. Then it stores the hardware parameters from the VECTOR statement in a structure of type edt_t. (This structure is declared in sys/edt.h.)

The kernel calls the specified driver's pfxedtinit() entry one time for each VECTOR statement that named that driver and had a successful probe (or had no probe). VECTOR statements are processed in reverse sequence to the order in which they are coded in /var/sysgen/system files.

The prototype of the pfxedtinit() entry is:

void pfxedtinit(edt_t *e);

The edt_t contains at least the following fields (see the system(4) reference page for the corresponding VECTOR parameters):

e_bus_type	Integer specifying the bus type; constant values are declared in sys/edt.h, for example ADAP_VME, ADAP_GIO, or ADAP_EISA.
e_adap	For EISA or VME, an integer specifying the adapter (bus) number.
e_ctlr	Value from the VECTOR ctlr= parameter; typically a device number used to distinguish one device from another.
e_space	Array of up to three I/O space structures of type iospace_t.
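
The following sketch shows how a pfxedtinit() entry point might use these fields to record one register base per controller. It is not taken from a real driver. The types sim_edt_t and sim_iospace_t are simplified stand-ins for the declarations in sys/edt.h and contain only the fields this sketch needs. The field names inside sim_iospace_t, the limit SIM_MAX_CTLR, and the helper sample_regs_for() are assumptions.

```c
#include <stddef.h>

#define SIM_NBASE    3      /* "up to three I/O space structures" */
#define SIM_MAX_CTLR 4      /* hypothetical limit chosen by this driver */

/* Simplified stand-ins; the real types are declared in sys/edt.h. */
typedef struct {
    unsigned long ios_size;   /* assumed field: size of the space */
    void         *ios_vaddr;  /* assumed field: kernel virtual address */
} sim_iospace_t;

typedef struct {
    int           e_bus_type;
    int           e_adap;
    int           e_ctlr;
    sim_iospace_t e_space[SIM_NBASE];
} sim_edt_t;

static void *sample_regs[SIM_MAX_CTLR];   /* register base, per controller */

/* Called once per VECTOR statement; records where the device registers are. */
void sample_edtinit(sim_edt_t *e)
{
    if (e == NULL || e->e_ctlr < 0 || e->e_ctlr >= SIM_MAX_CTLR)
        return;                           /* controller number out of range */
    if (e->e_space[0].ios_vaddr == NULL)
        return;                           /* no mapped I/O space to use */
    sample_regs[e->e_ctlr] = e->e_space[0].ios_vaddr;
}

/* Look up the register base saved for a controller, or NULL if none. */
void *sample_regs_for(int ctlr)
{
    return (ctlr >= 0 && ctlr < SIM_MAX_CTLR) ? sample_regs[ctlr] : NULL;
}
```

The entry point returns nothing, so a driver that cannot use a VECTOR entry simply leaves the corresponding controller slot empty and rejects later attempts to open it.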

The VME form of the VECTOR statement for IRIX 6.4 is discussed at length under “Defining VME Devices with the VECTOR Statement” in Chapter 12. The operation of the pfxedtinit() entry for VME is discussed under “Initializing a VME Device” in Chapter 13.

Entry Point start()

The pfxstart() entry point is called at system startup, and whenever a loadable driver is loaded. It is called after pfxedtinit() and pfxinit(), but before any other entry point such as pfxopen(). The pfxstart() entry point receives no arguments; its prototype is simply:

void pfxstart(void);

The pfxstart() entry point is a suitable place to allocate a poll-head structure using phalloc(), as discussed in “Use and Operation of poll(2)”.

Entry Point reg()

The pfxreg() entry point is specifically intended to allow a driver that supports the pfxattach() entry point (see “Entry Point attach()”) to register with the kernel. At present, the only buses that support device attachment and registration (accessible to OEMs) are the PCI and SCSI buses. The functions used to register as a PCI driver are discussed in “Configuration Register Initialization” in Chapter 20.

Attach and Detach Entry Points

First defined in IRIX 6.3, the pfxattach() entry point informs the driver that the kernel has found a device that matches the driver. This is the time at which the driver initializes data that is unique to one instance of a device. The pfxdetach() entry point informs the driver that the device has been removed from the system. The driver undoes whatever pfxattach() did for that device instance.

Entry Point attach()

The pfxattach() entry point is called to notify the driver that the PCI bus adapter has located a device that has a vendor and device ID for which the driver has registered (see “Entry Point reg()”).

This entry point is typically called during bootstrap, while the kernel is probing the PCI bus. However, for a PCI Hot Plug insert operation it can occur at a later time, if the device is physically plugged in or activated after the system has initialized. In an Origin2000 system, the entry point is executed in the hardware node closest to the device being attached. (See “Allocating Memory in Specific Nodes of a Origin2000 System” in Chapter 8.)

The purpose of the entry point is to make the device usable, including making it visible in the hwgraph by creating vertexes and edges to represent it.

Matching a Device to a Driver

When the system boots up, the kernel probes the PCI bus configuration space and takes a census of active devices. For each device it notes:

Vendor and device ID numbers
Requested size of memory space
Requested size of I/O space

The kernel assigns starting bus addresses for memory and I/O space and sets these addresses in the Base Address Registers (BARs) in the device. Then the kernel looks for a driver that has registered a matching set of vendor and device IDs using pciio_driver_register() (for discussion, see “Configuration Register Initialization” in Chapter 20).

If no matching driver has registered, the device remains inactive. For example, the driver might be a loadable driver that has not been loaded as yet. When the driver is loaded and registers, the kernel will match it to any unattached devices.

When the kernel matches a device to its registered driver, the kernel calls the driver's pfxattach() entry point. It passes one argument, a handle to the hwgraph vertex representing the hardware connection point for the device. This handle is used to:

Request PIO and DMA maps on the device
Register an interrupt handler for the device
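
The matching step can be pictured with the following simulation. It is a conceptual model, not kernel code: sim_driver_register() is a hypothetical stand-in for registration (a real PCI driver calls pciio_driver_register(), whose arguments are not modeled here), the slot number stands in for the hwgraph vertex handle, and the ID values used with it are arbitrary. The loop offers the device to each matching driver in turn until one accepts, as described under “Return Value from Attach”.

```c
#include <stddef.h>

#define SIM_MAX_DRIVERS 4

/* Hypothetical attach callback: returns 0 to accept the device. */
typedef int (*sim_attach_fn)(int slot);

typedef struct {
    unsigned short vendor_id;
    unsigned short device_id;
    sim_attach_fn  attach;
} sim_registration_t;

static sim_registration_t sim_drivers[SIM_MAX_DRIVERS];
static int sim_ndrivers = 0;

/* Model of registration: remember which IDs a driver claims. */
int sim_driver_register(unsigned short vendor, unsigned short device,
                        sim_attach_fn attach)
{
    if (sim_ndrivers == SIM_MAX_DRIVERS || attach == NULL)
        return -1;
    sim_drivers[sim_ndrivers].vendor_id = vendor;
    sim_drivers[sim_ndrivers].device_id = device;
    sim_drivers[sim_ndrivers].attach    = attach;
    sim_ndrivers++;
    return 0;
}

/*
 * Offer a device to each matching driver until one accepts.
 * Returns the index of the accepting driver, or -1 when no driver
 * accepts and the device remains inactive.
 */
int sim_match_device(unsigned short vendor, unsigned short device, int slot)
{
    int i;

    for (i = 0; i < sim_ndrivers; i++) {
        if (sim_drivers[i].vendor_id == vendor &&
            sim_drivers[i].device_id == device &&
            sim_drivers[i].attach(slot) == 0)
            return i;
    }
    return -1;
}

/* Two hypothetical attach routines: one refuses, one accepts. */
static int sample_refuse(int slot) { (void)slot; return 1; }
static int sample_accept(int slot) { (void)slot; return 0; }
```

A device probed before any driver has registered for its IDs gets no match, which corresponds to the inactive state described above.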

Completing the hwgraph

The handle passed to pfxattach() addresses the hwgraph vertex that represents a slot on a bus. This is not informative to users, because a card can be plugged into any slot. Nor is this a reliable target for a symbolic link from /dev. In any case, the driver cannot store information in this vertex. At attach time the driver needs to create at least one additional hwgraph vertex in order to:

Create a device vertex for use by user programs.
Provide a vertex to hold the device information.
Establish well-known, convenient names high up in the /hw filesystem.
Provide extra device names that represent different aspects of the same device (for example, different partitions), or different access modes to the device (a character device and a block device), or different treatments of the device (for example, byte-swapped and nonswapped).
Establish predictable names that satisfy symbolic links that exist in /dev.

Each leaf vertex you create in the hwgraph is a device special file the user can open. You create a leaf vertex by calling hwgraph_block_device_add() or hwgraph_char_device_add(). You can make each leaf vertex distinct by attaching distinct information to it using device_info_set().

You create additional vertexes and edges using the functions discussed under “Hardware Graph Management” in Chapter 8.

Allocating Storage for Device Information

A driver needs to save information about each device, usually in a structure. Fields in a typical structure might include:

Locks or semaphores used for mutual exclusion among upper-half entry points and between them and the interrupt handler.
Addresses of allocated PIO and DMA maps for this device (see “PIO Address Mapping” in Chapter 20 and “DMA Address Mapping” in Chapter 20).
Address of an interrupt connection object for the device (see “Interrupt Signal Distribution” in Chapter 20).
In a block driver, anchors for a queue of buf_t objects being filled or emptied.
Device status flags.

A problem is that at initialization time a driver does not know how many devices it will be asked to manage. In the past this problem has been handled by allocating an array of a fixed number of information structures, indexed by the device minor number.

In a PCI driver, you dynamically allocate memory for an information structure to hold information about the one device being attached. (See “General-Purpose Allocation” in Chapter 8.) You save the address of the structure in the leaf vertex you create, using the device_info_set() function, which associates an arbitrary pointer with a vertex_hdl_t (see hwgraph(d3x) and “Extending the hwgraph” in Chapter 8).

The information structure can easily be recovered in any top-half routine; see “Interrogating the hwgraph” in Chapter 8.
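
The allocate-and-attach pattern can be sketched as follows. This is a user-level model under stated assumptions: sim_vertex_hdl_t, sim_device_info_set(), and sim_device_info_get() are stand-ins that imitate associating a pointer with a vertex, calloc() stands in for the kernel allocator, and the structure sample_dev_t is hypothetical.

```c
#include <errno.h>
#include <stdlib.h>

typedef int sim_vertex_hdl_t;            /* stand-in for vertex_hdl_t */
#define SIM_MAX_VERTEX 8

/* Stand-ins for the hwgraph calls: one arbitrary pointer per vertex. */
static void *sim_info[SIM_MAX_VERTEX];

static void sim_device_info_set(sim_vertex_hdl_t v, void *info)
{
    sim_info[v] = info;
}

static void *sim_device_info_get(sim_vertex_hdl_t v)
{
    return sim_info[v];
}

/* Hypothetical per-device information structure. */
typedef struct {
    int      status;        /* device status flags */
    unsigned open_count;    /* number of current opens */
} sample_dev_t;

/* At attach time: allocate one structure for the one device being attached. */
int sample_attach_info(sim_vertex_hdl_t leaf)
{
    sample_dev_t *dev;

    if (leaf < 0 || leaf >= SIM_MAX_VERTEX)
        return EINVAL;
    dev = calloc(1, sizeof *dev);        /* zero-filled, like a fresh device */
    if (dev == NULL)
        return ENOMEM;
    sim_device_info_set(leaf, dev);
    return 0;
}

/* In a top-half routine: recover the structure from the vertex handle. */
sample_dev_t *sample_dev_from_vertex(sim_vertex_hdl_t leaf)
{
    if (leaf < 0 || leaf >= SIM_MAX_VERTEX)
        return NULL;
    return sim_device_info_get(leaf);
}
```

Because each structure is allocated when its device is attached, the driver needs no fixed-size array and no compile-time limit on the number of devices.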

Inserting Hardware Inventory Data

You attach the hardware inventory data for the attached device to the hwgraph vertex passed to the pfxattach() entry point—see “Creating an Inventory Entry” in Chapter 2.

Return Value from Attach

The return code from pfxattach() is tested by the kernel. The driver can reject an attachment. When your driver cannot allocate memory, or fails due to another problem, it should:

Use cmn_err() to document the problem (see “Using cmn_err” in Chapter 10).
Release any objects such as PIO and DMA maps that were created.
Release any space allocated to the device such as a device information structure.
Return an informative return code which might be meaningful in future releases.

A loadable driver's reg() entry point will be called after a driver has been loaded into memory, but before the load process is considered successful. In its reg() function, a typical driver will register itself as supporting a specific device type; for PCI devices this registration is made by a call to pciio_driver_register(). The driver registration results in the driver's attach() entry point being immediately called for any installed matching device type. If a driver's attach() function returns an error code for any device, the driver remains registered and the load process continues without error.

More than one driver can register to support the same vendor ID and device ID. When the first driver fails to complete the attachment, the kernel continues on to test the next, until all have refused or one accepts. The pfxdetach() entry point can only be called if the pfxattach() entry point returns success (0).
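
One common way to meet these cleanup requirements is a single failure path that releases everything acquired so far. The sketch below models that pattern at user level. All names are assumptions: sim_alloc() and sim_free() stand in for the kernel calls that create and release maps and memory, sim_fail_at is a test knob that forces a chosen allocation to fail, and sample_res_t is a hypothetical device structure.

```c
#include <errno.h>
#include <stdlib.h>

static int sim_live_objects = 0;   /* objects currently allocated */
static int sim_alloc_count  = 0;   /* allocations attempted so far */
static int sim_fail_at      = 0;   /* 1-based allocation to fail; 0 = never */

/* Stand-in allocator for maps and structures; fails on request. */
static void *sim_alloc(size_t n)
{
    void *p;

    if (++sim_alloc_count == sim_fail_at)
        return NULL;
    p = malloc(n);
    if (p != NULL)
        sim_live_objects++;
    return p;
}

/* Stand-in release routine; ignores NULL so cleanup can be unconditional. */
static void sim_free(void *p)
{
    if (p != NULL) {
        free(p);
        sim_live_objects--;
    }
}

typedef struct {
    void *pio_map;
    void *dma_map;
} sample_res_t;

/* Undo a successful attach, releasing objects in reverse order. */
void sample_detach(sample_res_t *dev)
{
    if (dev == NULL)
        return;
    sim_free(dev->dma_map);
    sim_free(dev->pio_map);
    sim_free(dev);
}

/* Returns 0 and stores the device in *out, or an errno value with nothing left allocated. */
int sample_attach(sample_res_t **out)
{
    sample_res_t *dev = sim_alloc(sizeof *dev);

    if (dev == NULL)
        return ENOMEM;
    dev->pio_map = NULL;
    dev->dma_map = NULL;

    dev->pio_map = sim_alloc(32);
    if (dev->pio_map == NULL)
        goto fail;
    dev->dma_map = sim_alloc(32);
    if (dev->dma_map == NULL)
        goto fail;

    *out = dev;
    return 0;

fail:
    /* A real driver would also report the problem with cmn_err() here. */
    sample_detach(dev);
    return ENOMEM;
}
```

Initializing every pointer to NULL before the first allocation lets the failure path call the same release routine regardless of how far the attach progressed.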

PCI Hot Plug Insert Operation

A PCI Hot Plug insert operation calls the device driver attach() function registered for the device being inserted. That driver must provide a complete attach() function that can initialize the device from a powered-down state while the system is running. A driver must indicate that it supports the PCI Hot Plug insertion by setting the D_PCI_HOT_PLUG_ATTACH flag in its pfxdevflag constant. Only drivers that indicate that they support the Hot Plug insert will have their attach() function called for a Hot Plug insert operation that targets one of their devices.

The device initialization process includes the device hardware configuration and the allocation of software resources. The resources that are normally available at system startup, such as memory on a specific node, may not be available once the system is running. An attach() function that uses Hot Plug must plan for and handle this possible failure scenario. If a Hot Plug insert fails, the driver must clean up and return all resources that were allocated as part of the failed insert operation; the kernel will not try to recover from a failed Hot Plug insert operation.

The attach() function returns a status code that indicates if the attach was successful or not. A nonzero code from sys/errno.h indicates the specific error and the device is marked as having an incomplete startup. An incomplete startup (Hot Plug insert) operation can be retried, so the driver should leave the device and its software resources in a state where a subsequent attempt to insert (startup) the device can succeed.

Entry Point detach()

The pfxdetach() entry point is called when the kernel decides to detach a device. As of IRIX 6.4 this is only done for PCI devices. The need to detach can be created by a hardware failure or a PCI Hot Plug removal operation. If the entry point is not defined, the device cannot be detached.

In general, the detach entry point must undo as much as possible of the work done by the pfxattach() entry point (see “Entry Point attach()”). This includes such actions as:

Disconnect a registered interrupt handler.
If any I/O operations are pending on the device, cancel them. If any top-half entry points are waiting on the completion of these operations, wake them up.
Release all software objects allocated, such as PIO maps, DMA maps, and interrupt objects.
Release any allocated kernel memory used for buffers or for a device information structure.
Detach and release any edges and vertexes in the hwgraph created at attach time.

The state of the device itself is not known. If the detach code attempts to reset the device or put it in a quiescent state, the code should be prepared for errors to occur.

PCI Hot Plug Detach Operation

A PCI Hot Plug removal operation calls the device driver detach() function registered for the device being removed. That driver must provide a complete detach() function that can terminate the device while the system is running. A device driver must indicate that it supports the PCI Hot Plug removal by setting the D_PCI_HOT_PLUG_DETACH flag in its pfxdevflag constant. Only drivers that indicate that they support Hot Plug removal will have their detach() function called when a Hot Plug removal operation targets one of their devices.

The device termination process includes releasing any software resources that are allocated to the device and setting the device hardware to a state where the device can be powered down. If a Hot Plug removal fails, the driver must leave the device and its software resources in a stable state; the kernel will not try to recover from a failed Hot Plug removal operation.

The detach() function returns a status code that indicates if it was successful or not. A nonzero code from sys/errno.h indicates the specific error and the device is marked as having an incomplete shutdown. An incomplete shutdown (Hot Plug removal) operation can be retried, so the driver should leave the device and its software resources in a state where a subsequent attempt to remove (shutdown) the device can succeed.
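
The retry requirement can be illustrated with a small user-space sketch. Everything in it is hypothetical: the hypo_dev_t structure, its fields, and hypo_quiesce() are stand-ins invented for illustration and are not IRIX kernel interfaces. The sketch assumes a driver that tries the step most likely to fail first and releases resources only after that step succeeds, so that a failed detach leaves the attach-time state untouched.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical per-device state; the field names are assumptions. */
typedef struct hypo_dev {
    int   intr_connected;  /* interrupt handler registered at attach */
    int   io_pending;      /* count of outstanding I/O requests */
    void *dma_buf;         /* buffer allocated at attach */
    int   hw_busy;         /* models hardware that refuses to quiesce */
} hypo_dev_t;

/* Stand-in for quiescing the hardware; it may fail. */
static int hypo_quiesce(hypo_dev_t *d)
{
    return d->hw_busy ? EBUSY : 0;
}

/*
 * Undo the attach work. On failure nothing has been released,
 * so a later retry starts from a consistent state.
 */
int hypo_detach(hypo_dev_t *d)
{
    int err = hypo_quiesce(d);
    if (err != 0)
        return err;            /* state unchanged: retry is safe */

    d->intr_connected = 0;     /* disconnect the interrupt handler */
    d->io_pending = 0;         /* cancel pending I/O, wake any waiters */
    free(d->dma_buf);          /* release buffers and maps */
    d->dma_buf = NULL;         /* makes a repeated detach harmless */
    return 0;
}
```

A real driver would replace each assignment with the corresponding kernel call, for example disconnecting the interrupt object and freeing PIO and DMA maps.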

Open and Close Entry Points

The pfxopen() and pfxclose() entries for block and character devices are called when a device comes into use and when use of it is finished. For a conceptual overview of the open() process, see “Overview of Device Open” in Chapter 3.

Entry Point open()

The kernel calls a device driver's pfxopen() entry when a process executes the open() system call on any device special file (see the open(2) reference page). It is also called when a process executes the mount() system call on a block device (see the mount(2) reference page). (For the pfxopen() entry point of a STREAMS driver, see “Entry Point open()” in Chapter 22.)

The prototype of pfxopen() is as follows:

int pfxopen(dev_t *devp, int oflag, int otyp, cred_t *crp);

The argument values are

*devp	Pointer to a dev_t value, actually a handle to a leaf vertex in the hwgraph.
oflag	Flag bits specifying user mode options on the open() call.
otyp	An integer flag specifying the source of the call: a user process opening a character device or block device, or another driver.
crp	A cred_t object—an opaque structure for use in authentication. Standard access privileges to the special device file have already been verified.

Note: In releases before IRIX 6.4, a driver's pfxdevflag constant could contain D_OLD. In that case, the first argument to pfxopen() was a dev_t value, not a pointer to a dev_t value. However, this compatibility mode is no longer supported. The first argument to pfxopen() is always a pointer to a dev_t.

The open(D2) reference page discusses the kind of work the pfxopen() entry point can do. In general, the driver is expected to verify that this user process is permitted access in the way specified in oflag (reading, writing, or both) for the device specified in *devp. If access is not allowable, the driver returns a nonzero error code from sys/errno.h, for example ENOMEM or EBUSY.

Use of the Device Handle

The dev_t value input to pfxopen() and all other top-half entry points is the key parameter that specifies the device. You use the dev_t to locate the hwgraph vertex that is being opened. From that vertex you extract the address of the device information structure that was stored when the device was attached (see “Allocating Storage for Device Information”). In pfxopen() or any other top-half entry point, the driver retrieves the device information by applying device_info_get() to the dev_t value (see “Interrogating the hwgraph” in Chapter 8).

Use of the Open Type

The otyp flag distinguishes between the following possible sources of this call to pfxopen() (the constants are defined in sys/open.h).

a call to open a character device (OTYP_CHR)
a call to open a block device (OTYP_BLK)
a call to mount a block device as a filesystem (OTYP_MNT)
a call to open a block device as a swapping device (OTYP_SWP)
a call direct from a device driver at a higher level (OTYP_LYR)

Typically a driver is written only to be a character driver or a block driver, and can be called only through the switch table for that type of device. When this is the case, the otyp value has little use.

It is possible to have the same driver treated as both block and character, in which case the driver needs to know whether the open() call addressed a block or character special device. It is possible for a block device to support different partitions with different uses, in which case the driver might need to record the fact that a device has been mounted, or opened as a swap device.

With all open types except OTYP_LYR, pfxopen() is called for every open or mount operation, but pfxclose() is called only when the last close or unmount occurs. The OTYP_LYR feature is used almost exclusively by drivers distributed with IRIX, like the host adapter SCSI driver (see “Host Adapter Concepts” in Chapter 16). For each open of this type, there is one call to pfxclose().

Use of the Open Flag

The interpretation of the open mode flags is up to the designer of the driver. Four modes can be requested (declared in sys/file.h):

FREAD	Input access wanted.
FWRITE	Output access wanted (both FREAD and FWRITE may be set, corresponding to O_RDWR mode).
FNDELAY or FNONBLOCK	Return at once, do not sleep if the open cannot be done immediately.
FEXCL	Request exclusive use of the device.

You decide which of the flags have meaning with respect to the abilities of this device. You can return an EINVAL error when an unsupported mode is requested.

A key decision is whether the device can be opened only by one process at a time, or by multiple processes. If multiple opens are supported, a process can still request exclusive access with the FEXCL mode.

When the device can be used by only one process, or when FEXCL access is supported, the driver must keep track of the fact that the device is open. When the device is busy, the driver can test the FNDELAY and FNONBLOCK flags; if either is set, it can return EBUSY. Otherwise, the driver should sleep until the device is free; this requires coordination with the pfxclose() entry point.
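
The decision logic described above can be sketched in portable C. This is only an illustration under stated assumptions: the HYPO_ flag constants are stand-ins for the sys/file.h mode bits with arbitrary values, the hypo_open_state_t structure is hypothetical, and the HYPO_MUST_WAIT result merely marks the point where a real driver would sleep until pfxclose() wakes it.

```c
#include <errno.h>

/* Stand-ins for the sys/file.h mode bits; the values are arbitrary. */
#define HYPO_FREAD     0x01
#define HYPO_FWRITE    0x02
#define HYPO_FNDELAY   0x04
#define HYPO_FNONBLOCK 0x08
#define HYPO_FEXCL     0x10

/* Marks the point where a real driver would sleep (not an errno). */
#define HYPO_MUST_WAIT (-1)

/* Hypothetical per-device open state. */
typedef struct hypo_open_state {
    int is_open;     /* nonzero between first open and last close */
    int exclusive;   /* nonzero while an exclusive open is in force */
    int read_only;   /* nonzero if the device cannot accept output */
} hypo_open_state_t;

int hypo_open_check(hypo_open_state_t *s, int oflag)
{
    int busy;

    if ((oflag & HYPO_FWRITE) && s->read_only)
        return EINVAL;              /* unsupported mode requested */

    /* Busy if someone holds the device exclusively, or if this
     * caller wants exclusivity while the device is already open. */
    busy = s->exclusive || ((oflag & HYPO_FEXCL) && s->is_open);
    if (busy) {
        if (oflag & (HYPO_FNDELAY | HYPO_FNONBLOCK))
            return EBUSY;           /* caller asked not to sleep */
        return HYPO_MUST_WAIT;      /* real driver would sleep here */
    }

    s->is_open = 1;
    if (oflag & HYPO_FEXCL)
        s->exclusive = 1;
    return 0;
}
```

Because pfxclose() runs only at the last close, the matching close routine in this sketch would simply clear both is_open and exclusive and then wake any waiting opener.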

Use of the cred_t Object

The cred_t object passed to pfxopen(), pfxclose(), and pfxioctl() can be used with the drv_priv() function to find out if the effective calling user ID is privileged or not (see the drv_priv(D3) reference page). Do not examine the object in detail, since its contents are subject to change from release to release.

Saving the Size of a Block Device

In a block device driver, the pfxsize() entry point will be called soon after pfxopen() (see “Entry Point size()”). It is typically best to calculate or read the device capacity at open time, and save it to be reported from pfxsize().

Completing the hwgraph

Some device drivers distributed with IRIX test, at open time, to see if this is the first open since the attachment of the specified device. For these devices, the first open() call is guaranteed to come from the ioconfig program after it has assigned a stable controller number (see “Using ioconfig for Global Controller Numbers” in Chapter 2). When these drivers detect the first open for a device, they retrieve the assigned controller number from the device vertex using device_controller_num_get() (see hwgraph.inv(d3x)), and possibly add convenience vertexes to the hwgraph.

Entry Point close()

The kernel calls the pfxclose() entry when the last process calls close() or umount() for the device special file. It is important to know that when the device can be opened by multiple processes, pfxclose() is not called for every close() function, but only when the last remaining process closes the device and no other processes have it open. The function prototype and arguments of pfxclose() are

int pfxclose(dev_t dev, int flag, int otyp, cred_t *crp);

The arguments are the same as were passed to pfxopen(). However, the flag argument is not necessarily the same as at any particular call to open().

It is up to you to design the meaning of “close” for this type of device. The close(D2) reference page discusses some of the actions the driver can do. Some considerations are:

If the device is opened and closed frequently, you may decide to retain dynamic data structures.
If the device can perform an action such as “rewind” or “eject,” you decide whether that action should be done upon close. Possibly the choice of acting or not acting can be set by an ioctl() call; or possibly the choice can be encoded into the device minor number—for example, the no-rewind-on-close option is encoded in certain tape minor device numbers.
If the pfxopen() entry point supports exclusive access, and it can be waiting for the device to be free, pfxclose() must release the wait.

When a device can do DMA, the pfxclose() entry point is the appropriate place to make sure that all I/O has terminated. Since all processes have closed the device, there is no reason for it to continue transmitting data into memory; and if it does continue, it might corrupt memory.

The pfxclose() entry can detect an error and report it with a return code. However, the file is closed or unmounted regardless.

Control Entry Point

The pfxioctl() entry point is called by the kernel when a user process executes the ioctl() system call (see the ioctl(2) reference page). This entry point is allowed in character drivers only. Block device drivers do not support it. STREAMS drivers pass control information as messages.

For an overview of the relationship between the user process, kernel, and the control entry point, see “Overview of Device Control” in Chapter 3.

The prototype of the entry point is

int pfxioctl(dev_t dev, int cmd, void *arg,
            int mode, cred_t *crp, int *rvalp);

The argument values are

dev	A dev_t value from which you can extract the major and minor device numbers, or the device information from the hwgraph vertex.
cmd	The request value specified in the ioctl() call.
arg	The optional argument value specified in the ioctl() call, or NULL if none was specified.
mode	Flag bits specifying the open() mode, as associated with the file descriptor passed to the ioctl() system function.
crp	A cred_t object—an opaque structure for use in authentication, describing the process that is in-context. Standard access privileges to the special device file have already been verified.
*rvalp	The integer result to be returned to the user process.

It is up to the device driver to interpret the cmd and arg values in the light of the mode and other arguments. When the arg value is a pointer to data in the process address space, the driver uses the copyin() kernel function to copy the data into kernel space, and the copyout() function to return updated values. (See the copyin(D3) and copyout(D3) reference pages, and also “Transferring Data” in Chapter 8.)
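
The copy-in, validate, copy-out pattern can be sketched as follows. This is a user-space model, not driver code: hypo_copyin() and hypo_copyout() are stand-ins built on memcpy() that only mimic the convention of returning 0 on success and -1 on a bad address (modeled here by a NULL pointer), the command numbers and the hypo_speed_t structure are invented for illustration, and the entry point signature is reduced to the arguments the sketch uses.

```c
#include <errno.h>
#include <string.h>

/* Hypothetical command numbers and argument layout. */
#define HYPO_GET_SPEED 1
#define HYPO_SET_SPEED 2

typedef struct hypo_speed { int baud; } hypo_speed_t;

/* Stand-ins for copyin()/copyout(): 0 on success, -1 on fault.
 * A NULL pointer models an invalid user address. */
static int hypo_copyin(const void *usr, void *krn, size_t n)
{
    if (usr == NULL) return -1;
    memcpy(krn, usr, n);
    return 0;
}

static int hypo_copyout(const void *krn, void *usr, size_t n)
{
    if (usr == NULL) return -1;
    memcpy(usr, krn, n);
    return 0;
}

static int hypo_current_baud = 9600;   /* stand-in for device state */

int hypo_ioctl(int cmd, void *arg, int *rvalp)
{
    hypo_speed_t sp;

    switch (cmd) {
    case HYPO_GET_SPEED:
        sp.baud = hypo_current_baud;
        if (hypo_copyout(&sp, arg, sizeof sp) != 0)
            return EFAULT;
        break;
    case HYPO_SET_SPEED:
        /* Never dereference arg directly; copy it in first. */
        if (hypo_copyin(arg, &sp, sizeof sp) != 0)
            return EFAULT;
        if (sp.baud <= 0)
            return EINVAL;         /* validate before acting */
        hypo_current_baud = sp.baud;
        break;
    default:
        return EINVAL;             /* unknown command */
    }
    *rvalp = 0;
    return 0;
}
```

The point of the design is that the driver works only on its own kernel-space copy of the argument, so a user pointer that is invalid or changes during the call cannot corrupt driver state.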

Choosing the Command Numbers

The command numbers supported by pfxioctl() are arbitrary; but the recommended practice is to make sure that they are different from those of any other driver. One method to achieve this is suggested in the ioctl(D2) reference page.

Supporting 32-Bit and 64-Bit Callers

The ioctl() entry point may need to interpret a structure prepared in the user process. In a 64-bit system, the user process can be either a 32-bit or a 64-bit program. For discussion of this issue, see “Handling 32-Bit and 64-Bit Execution Models”.

User Return Value

The kernel returns 0 to the ioctl() system function unless the pfxioctl() function returns an error code. In the event of an error, the kernel may also return the code the driver places in *rvalp, if any, or -1. To ensure that the user process sees a specific error code, it is a good idea to set the code in *rvalp, and return that value. If your device driver does not define a pfxdevflag or sets it to D_OLD, see “Driver Flag Constant”.

Data Transfer Entry Points

The pfxread() and pfxwrite() entry points are supported by character device drivers and pseudo-device drivers that allow reading and writing. They are called by the kernel when the user process calls the read(), readv(), write(), or writev() system function.

The pfxstrategy() entry point is required of block device drivers. It is called by the kernel when either a filesystem or the paging subsystem needs to transfer a block of data.

Entry Points read() and write()

The pfxread() and pfxwrite() entry points are similar to each other—only the direction of data transfer differs. The prototypes of the functions are

int pfxread (dev_t dev, uio_t *uiop, cred_t *crp);
int pfxwrite(dev_t dev, uio_t *uiop, cred_t *crp);

The arguments are

dev	A dev_t value from which you can extract the major and minor device numbers, or the device information from the hwgraph vertex.
*uiop	A uio_t object—a structure that defines the user's buffer memory areas.
crp	A cred_t object—an opaque structure for use in authentication. Standard access privileges to the special device file have already been verified.

Data Transfer for a PIO Device

A character device driver using PIO transfers data in the following steps:

If there is a possibility of a timeout, start a timeout delay (see “Waiting for Time to Pass” in Chapter 8).
Initiate the device operation as required.
Transfer data between the device and the buffer represented by the uio_t (see “Transferring Data Through a uio_t Object” in Chapter 8).
If it is necessary to wait for an interrupt, put the process to sleep (see “Waiting and Mutual Exclusion” in Chapter 8).
When data transfer is complete, or when an error occurs, clear any pending timeout and return the final status of the operation. If the return code is 0, the final state of the uio_t determines the byte count returned by the read() or write() call.
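
The relationship between the transfer step and the final byte count can be modeled in a few lines of portable C. The sketch below is an assumption-laden simplification: hypo_uio_t stands in for a single-segment uio_t, hypo_uiomove() stands in for the kernel's uio transfer function, and hypo_fifo_t simulates a device that has some bytes ready. Timeout handling and sleeping are omitted.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for a single-segment uio_t. */
typedef struct hypo_uio {
    char  *base;    /* next free byte of the caller's buffer */
    size_t resid;   /* bytes the caller still wants */
} hypo_uio_t;

/* Simulated device with some data ready to be read. */
typedef struct hypo_fifo {
    const char *data;   /* next byte the device will deliver */
    size_t      avail;  /* bytes the device has ready */
    int         error;  /* nonzero simulates a hardware error */
} hypo_fifo_t;

/* Stand-in for the kernel transfer call: copy n bytes to the
 * caller's buffer and advance the uio to reflect the progress. */
static void hypo_uiomove(const char *src, size_t n, hypo_uio_t *u)
{
    memcpy(u->base, src, n);
    u->base  += n;
    u->resid -= n;
}

/* Returns 0 or an errno value; on success the caller learns the
 * byte count from how far resid has dropped. */
int hypo_pio_read(hypo_fifo_t *f, hypo_uio_t *u)
{
    size_t n;

    if (f->error)
        return EIO;
    n = (u->resid < f->avail) ? u->resid : f->avail;
    hypo_uiomove(f->data, n, u);
    f->data  += n;
    f->avail -= n;
    return 0;
}
```

If the caller asked for 8 bytes and the device had 5 ready, resid ends at 3, and the difference (5) is what read() would report.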

Calling Entry Point strategy() From Entry Point read() or write()

A device driver that supports both character and block interfaces must have a pfxstrategy() routine in which it performs the actual I/O.

For example, the IRIX disk drivers support both character and block driver interfaces, and perform all I/O operations in the pfxstrategy() function. However, the pfxread(), pfxwrite() and pfxioctl() entries supported for character-type access also need to perform I/O operations. They do this by calling the pfxstrategy() routine indirectly, using the kernel function physiock() or uiophysio() (see the physiock(D3) and uiophysio(D3) reference pages, and see “Waiting for Block I/O to Complete” in Chapter 8).

Both the physiock() and uiophysio() functions take care of the housekeeping needed to interface to the pfxstrategy() entry, including the work of allocating a buffer and a buf_t structure, locking buffer pages in memory, and waiting for I/O completion. Both routines require the uio_t to describe only a single segment of data (uio_iovcnt of 1). Although they are very similar, the two functions differ in the following ways:

physiock() returns EINVAL if the initial offset is not a multiple of 512 bytes. If this is a requirement of your pfxstrategy() routine, use physiock(); if not, use uiophysio().
physiock() is compatible with SVR4, while uiophysio() is unique to IRIX.

Example 7-3 shows the skeleton of a hypothetical driver in which the pfxread() entry does its work through the pfxstrategy() entry.

Example 7-3. Hypothetical pfxread() entry in a Character/Block Driver

int hypo_read (dev_t dev, uio_t *uiop, cred_t *crp)
{
   /* ...validate the operation... */
   return physiock(hypo_strategy, /* our strategy entry */
                  0,   /* allocate temp buffer & buf_t */
                  dev, /* dev_t arg for strategy */
                  B_READ, /* direction flag for buf_t */
                  uiop);
}

The pfxwrite() entry would be identical except for passing B_WRITE instead of B_READ.

This dual-entry strategy is required only in a driver that supports both character and block access.

Entry Point strategy()

A block device driver does not directly support system calls by user processes. Instead, it provides services to a filesystem such as XFS, or to the memory paging subsystem of IRIX. These subsystems call the pfxstrategy() entry point to read and write data in whole blocks.

Calls to pfxstrategy() are not directly related in time to system functions called by a user process. For example, a filesystem may buffer many blocks of data in memory, so that the user process may execute dozens or hundreds of write() calls without causing an entry to the device driver. When the user function closes the file or calls fsync()—or when for unrelated reasons the filesystem needs to free some buffers—the filesystem calls pfxstrategy() to write numerous blocks of data.

In a driver that supports the character interface as well, the pfxstrategy() entry can be called indirectly from the pfxread(), pfxwrite() and pfxioctl() entries, as described under “Calling Entry Point strategy() From Entry Point read() or write()”.

The prototype of the pfxstrategy() entry point is

int pfxstrategy(struct buf *bp);

The argument is the address of a buf_t structure, which gives the strategy routine the information it needs to perform the I/O:

The dev_t, from which the driver can get major and minor device numbers or the device information from the hwgraph vertex
The direction of the transfer (read or write)
The location of the buffer in kernel memory
The amount of data to transfer
The starting block number on the device

For more on the contents of the buf_t structure, see “Structure buf_t” in Chapter 8 and the buf(D4) reference page.

The driver uses the information in the buf_t to validate the data transfer and programs the device to start the transfer. Then it stores the address of the buf_t where the interrupt handler can find it (see “Interrupt Entry Point and Handler”) and calls biowait() to wait for the operation to complete. For the next step, see “Completing Block I/O” (see also the biowait(D3) reference page).
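
The validation step can be sketched as a stand-alone function. The hypo_buf_t structure below is a simplified, hypothetical model of the fields a strategy routine consults; it is not the real buf_t layout, and the 512-byte block size and the choice of EINVAL are assumptions made for the illustration.

```c
#include <errno.h>

#define HYPO_BLOCK_SIZE 512L   /* assumed device block size */
#define HYPO_B_ERROR    0x1    /* stand-in for an error flag bit */

/* Hypothetical model of the request fields a driver checks. */
typedef struct hypo_buf {
    long blkno;    /* starting block number on the device */
    long bcount;   /* number of bytes to transfer */
    long resid;    /* bytes not transferred */
    int  flags;    /* status flags */
    int  error;    /* errno value when HYPO_B_ERROR is set */
} hypo_buf_t;

/*
 * Returns 0 if the transfer may be started. Otherwise marks the
 * request as failed, with nothing transferred, and returns the
 * error code.
 */
int hypo_validate(hypo_buf_t *bp, long dev_blocks)
{
    long nblocks = bp->bcount / HYPO_BLOCK_SIZE;

    if (bp->bcount <= 0 ||
        bp->bcount % HYPO_BLOCK_SIZE != 0 ||   /* whole blocks only */
        bp->blkno < 0 || bp->blkno > dev_blocks ||
        nblocks > dev_blocks - bp->blkno) {    /* runs past the end */
        bp->flags |= HYPO_B_ERROR;
        bp->error  = EINVAL;
        bp->resid  = bp->bcount;
        return bp->error;
    }
    return 0;
}
```

The range test subtracts before comparing so that a very large starting block cannot overflow the arithmetic.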

Poll Entry Point

The pfxpoll() entry point is called by the kernel when a user process calls the poll() or select() system function asking for status on a character special device. To implement it, you need to understand the IRIX implementation of poll().

Use and Operation of poll(2)

The IRIX version of poll() allows a process to wait for events of different types to occur on any combination of devices, files, and STREAMS (see the poll(2) and select(2) reference pages). It is possible for multiple processes to be waiting for events on the same device.

It is up to you as the designer of a driver to decide which of the events that are documented in poll(2) are meaningful for your device. Other requested events simply never happen to the device.

Much of the complexity of poll() is handled by the IRIX kernel, but the kernel requires the assistance of any device driver that supports poll(). The driver is expected to allocate and hold a pollhead structure (declared in sys/poll.h) for each minor device that it supports. Allocation is simple; the driver merely calls the phalloc() kernel function. (The pfxstart() entry point is a suitable place for this call; see “Entry Point start()”.)

There are two phases to the operation of poll(). When the system function is called, the kernel calls the pfxpoll() entry point to find out if any requested events are pending at this time. If the kernel finds any events pending (on this or any other polled object), the poll() function returns to the user process. Nothing further is required.

However, when no requested event has happened, the user process expects the poll() function to block until an event has occurred. The kernel must implement this delay. It would be too inefficient for the kernel to repeatedly test for events. The kernel must rely on device drivers to notify it when an event has occurred.

Use of pollwakeup()

A device driver that supports pfxpoll() is required to notify the kernel whenever an event that the driver supports has occurred. The driver does this by calling a kernel function, pollwakeup(), passing the pollhead structure for the affected device, and bit flags for the events that have taken place. In the event that one or more user processes are blocked in a poll(), waiting for an event from this device, the call to pollwakeup() will release the sleeping processes. For an example, see “Calling pollwakeup()”.

Use of pollwakeup() Without Interrupts

If the device in question does not support interrupts, the driver cannot support poll() unless it can somehow get control to discover an event and report it to pollwakeup(). One possibility is that the driver could simulate interrupts by setting a succession of itimeout() delays. On each timeout the driver would test its device for a change of status, call pollwakeup() when an event has occurred, and schedule a new delay. (See “Waiting for Time to Pass” in Chapter 8.)

Entry Point poll()

The prototype for pfxpoll() is as follows:

int pfxpoll(dev_t dev, short events, int anyyet,
           short *reventsp, struct pollhead **phpp, 
           unsigned int *genp);

The argument values are

dev	A dev_t value from which you can extract the major and minor device numbers, or the device information from the hwgraph vertex.
events	Bit-flags for the events the user process is testing, as passed to poll() and declared in sys/poll.h.
*reventsp	A field to receive the bit-flags of events that have occurred, or to receive 0x0000 if no requested events have occurred.
anyyet and *phpp	When anyyet is zero and no events have occurred, the kernel requires the address of the pollhead structure for this minor device to be returned in *phpp.
*genp	A pointer to an unsigned integer that is used by the driver to store the current value of the pollhead's generation number at the time of the poll. (New in IRIX 6.5.)

Example 7-4 shows the pfxpoll() code of a hypothetical device driver. Only three event tests are supported: POLLIN and POLLRDNORM (treated as equivalent) and POLLOUT. The device driver maintains an array of pollhead structures, one for each supported minor device. These are presumably allocated during initialization.

Example 7-4. pfxpoll() Code for Hypothetical Driver

struct pollhead phds[MAXMINORS];
#define OUR_EVENTS (POLLIN|POLLOUT|POLLRDNORM)
int hypo_poll(dev_t dev, short events, int anyyet,
          short *reventsp, struct pollhead **phpp, unsigned int *genp)
{
   minor_t dminor = geteminor(dev);
   short happened = 0;
   short wanted = events & OUR_EVENTS;
   /* snapshot the generation number before probing the device */
   *genp = POLLGEN(&phds[dminor]);
   if (wanted & (POLLIN|POLLRDNORM))
   {
      if (device_has_data_ready(dminor))
         happened |= (POLLIN|POLLRDNORM);
   }
   if (wanted & POLLOUT)
   {
      if (device_ready_for_output(dminor))
         happened |= POLLOUT;
   }
   if (device_pending_error(dminor))
      happened |= POLLERR;
   if (0 == (*reventsp = happened))
   {
      /* the kernel wants the pollhead only when anyyet is zero */
      if (!anyyet) *phpp = &phds[dminor];
   }
   return 0;
}

The code in Example 7-4 begins by discarding any unsupported event flags that might have been requested, and passes back the driver's pollhead generation number before probing the device. Then it tests the remaining flags against the device status. If the device has an uncleared error, the code inserts the POLLERR event. If no events were detected, and if the kernel requested it, the address of the pollhead structure for this minor device is returned.

If no requested event has occurred, the process queues to await the requested events, provided that no event has occurred in the interim, before it is able to queue. The kernel determines this by comparing the pollhead generation number at the time of queueing with the generation number passed back at the initial request. Because a call to pollwakeup() increments the pollhead generation number, any difference between the current generation number and the one recorded at the initial request indicates that a device event has occurred, and the device must be queried again to determine whether it was a requested event. If the previous and current generation numbers are equal, the process queues.
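
The generation-number comparison can be modeled in isolation. The sketch below is not kernel code: hypo_pollhead_t reduces a pollhead to its generation counter, hypo_pollwakeup() models only the increment that a wakeup performs, and hypo_may_queue() models the kernel-side decision. All three names are invented for illustration.

```c
/* Simplified stand-in for a pollhead; only the generation
 * counter is modeled. */
typedef struct hypo_pollhead {
    unsigned int gen;
} hypo_pollhead_t;

/* Models the one effect of a wakeup that matters here:
 * each reported event advances the generation. */
void hypo_pollwakeup(hypo_pollhead_t *ph)
{
    ph->gen++;
}

/* The value a poll entry point would store through genp,
 * taken before the device is probed. */
unsigned int hypo_snapshot(const hypo_pollhead_t *ph)
{
    return ph->gen;
}

/* Kernel-side decision at queueing time:
 * 1 = no event slipped in, safe to queue;
 * 0 = an event occurred in the interim, poll the device again. */
int hypo_may_queue(const hypo_pollhead_t *ph, unsigned int gen_at_poll)
{
    return ph->gen == gen_at_poll;
}
```

Taking the snapshot before probing the device matters: an event that arrives after the probe but before queueing changes the counter, so the mismatch is detected and the wakeup is not lost.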

Memory Map Entry Points

A user process requests memory mapping by calling the system function mmap(). When the mapped object is a character device special file, the kernel calls the pfxmmap() or pfxmap() entry to validate and complete the mapping. To understand these entry points, you must understand the mmap() system function.

Concepts and Use of mmap()

The purpose of the mmap() system function (see the mmap(2) reference page) is to make the contents of a file directly accessible as part of the virtual address space of the user process. The results depend on the kind of file that is mapped:

When the mapped object is a normal file, the process can load and store data from the file as if it were an array in memory.
When the mapped object is a character device special file, the process can load and store data from device registers as if they were memory variables.
When the mapped object is a block of memory owned and prepared by a pseudo-device driver, the process gains access to some special piece of memory data that it would not normally be able to access.

In all cases, access is gained through normal load and store instructions, without the overhead of calling system functions such as read(). Furthermore, the same mapping can be executed by other processes, in which case the same memory, or file, or device is shared by multiple, concurrent processes. This is how shared memory segments are achieved.

Use of mmap()

The mmap() system function takes four key parameters:

the file descriptor for an open file, which can be either a normal disk file or a device special file
an offset within that file at which the mapped data is to start. For a normal file, this is a file offset; for a device file, it represents an address in the address space of the device or the bus
the length of data to be mapped
protection flags, showing whether the mapped data is read-only or read-write

When the mapped object is a normal file, the filesystem implements the mapping. The filesystem does not call the block device driver for assistance in mapping a file. It does call the block device driver pfxstrategy() entry to read and write blocks of file data as necessary, but the mapping of pages of data into pages of memory is controlled in the filesystem code.

When the mapped object is a device special file, the mmap() parameters are passed to the device driver at either its pfxmmap() or pfxmap() entry point. The device driver interprets the parameters in the context of the device, and uses a kernel function to create the mapping.

Persistent Mappings

Once a device or kernel memory has been mapped into some user address space, the mapping persists until the user process terminates or calls munmap() (see the munmap(2) reference page). In particular, the mapping does not end simply because the device special file is closed. You cannot assume, in the pfxclose() or pfxunload() entry points, that all mappings to devices have ended.

Entry Point map()

The pfxmap() entry point can be defined in either a character or a block driver (it is the only mapping entry point that a block driver can supply). The function prototype is

int pfxmap(dev_t dev, vhandl_t *vt,
          off_t off, int len, int prot);

The argument values are

dev	A dev_t value from which you can extract both the major and minor device numbers.
vt	The address of an opaque structure that describes the assigned address in the user process address space. The structure contents are subject to change.
off, len	The offset and length arguments passed to mmap() by the user process.
prot	Flags showing the access intentions of the user process.

The first task of the driver is to verify that the access specified in prot is allowed. The next task is to validate the off and len values: do they fall in the valid address space of the device?

When the device driver approves of a mapping, it uses a kernel function, v_mapphys(), to establish the mapping. This function (documented in the v_mapphys(D3) reference page) takes the vhandl_t, an address in kernel cached or uncached memory, and a length. It makes the specified region of kernel space a part of the address space of the user process.

For example, a pseudo-device driver that intends to share kernel virtual memory with user processes would first allocate the memory:

caddr_t kaddr = kmem_alloc (len, KM_CACHEALIGN);

It would then use the address of the allocated memory with the vhandl_t value it had received to map the allocated memory into the user space:

v_mapphys (vt, kaddr, len);

Note: There are no special precautions to take when mapping cached memory into user space, or when mapping device registers or bus addresses. However, you should almost never map uncached memory into user space. The effects of uncached memory access are hardware dependent and differ between multiprocessors and uniprocessors. Among uniprocessors, the IP26 and IP28 CPU modules have highly restrictive rules for the use of uncached memory (see “Uncached Memory Access in the IP26 and IP28” in Chapter 1). In general, mapping uncached memory makes a driver nonportable and is likely to lead to subtle failures that are hard to resolve.

Example 7-5 contains an edited fragment of code from a Silicon Graphics device driver. This pseudo-device driver, whose prefix is flash_, provides access to “flash” PROM in certain computer models. It allows a user process to map the PROM into user space.

Example 7-5. Edited Fragment of flash_map()

int flash_map(dev_t dev, vhandl_t *vt, off_t off, long len)
{
   long offset = (long) off; /*Actual offset in flash prom*/
/* Don't allow requests which exceed the flash prom size */
   if ((offset + len) > FLASHPROM_SIZE)
      return ENOSPC;
/* Don't allow non page-aligned offsets */
   if ((offset % NBPC) != 0)
      return EIO;
/* Only allow mapping of entire pages */
   if ((len % NBPC) != 0)
      return EIO;
   return v_mapphys(vt, FLASHMAP_ADDR + offset, len);
}

Note: Because there is no way for a driver to retract a successful call to v_mapphys(), your driver must return success to a pfxmap() call if v_mapphys() succeeds. In other words, you should make the call to v_mapphys() the last part of your pfxmap() routine, and only call it if you have determined that there have been no errors in any previous part of this routine. If there have been errors, the routine should return an error and not call v_mapphys(). If there have been no errors, then pfxmap() can return error or success based on the call to v_mapphys().

When the driver allocates some memory resource associated with the mapping, and when more than one mapping can be active at a time, the driver needs to tag each memory resource so it can be located when the pfxunmap() entry point is called. One answer is to use the v_gethandle() macro defined in ksys/ddmap.h. This macro takes a pointer to a vhandl_t and returns a unique pointer-sized integer that can be used to tag allocations. No other information in ksys/ddmap.h is supported for driver use.

Entry Point mmap()

The pfxmmap() (note: two letters “m”) entry can be used only in a character device driver. The prototype is

int pfxmmap(dev_t dev, off_t off, int prot);

The argument values are

dev	A dev_t value from which you can extract both the major and minor device numbers.
off	The offset argument passed to mmap() by the user process.
prot	Flags showing the access intentions of the user process.

The function is expected to return the page frame number (PFN) that corresponds to the offset off in the device address space. A PFN is an address divided by the page size. (See “Working With Page and Sector Units” in Chapter 8 for page unit conversion functions.)
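
The arithmetic is simple enough to show in a few lines. The sketch below is illustrative only: the 4096-byte page size, the device base address, and the window size are assumed values chosen for the example (a real driver obtains the page size from the kernel rather than hard-coding it), and the -1 result for an out-of-range offset is a convention of this sketch.

```c
#include <stdint.h>

#define HYPO_PAGE_SIZE 4096u          /* assumed page size */
#define HYPO_DEV_BASE  0x1F000000ull  /* assumed device base address */
#define HYPO_DEV_SIZE  0x4000ull      /* assumed 16 KB device window */

/* Convert an offset in the device window to a page frame number:
 * the device address divided by the page size. Returns -1 when
 * the offset falls outside the window. */
long long hypo_offset_to_pfn(uint64_t off)
{
    if (off >= HYPO_DEV_SIZE)
        return -1;
    return (long long)((HYPO_DEV_BASE + off) / HYPO_PAGE_SIZE);
}
```

With these assumed values, offsets 0x1000 and 0x1FFF both fall in the second page of the window and therefore yield the same PFN, 0x1F001.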

This entry point is supported only for compatibility with SVR4. When the kernel needs to map a character device, it looks first for pfxmap(). It calls pfxmmap() only when pfxmap() is not available. The differences between the two entry points are as follows:

This entry point receives no vhandl_t argument, so it cannot use v_mapphys(). It must calculate a page frame number, which means that it has to be aware of the current page size, obtainable from the ptob() kernel function (see the ptob(D3) reference page).
This entry point does not receive a length argument, so it has to assume a default length for every map (typically the page size).
When a mapping is created with this entry point, the pfxunmap() entry is not called.

Entry Point unmap()

The kernel calls the pfxunmap() entry point when a mapping that was created using the pfxmap() entry point is released. This entry should be supplied, even if it is an empty function, when the pfxmap() entry point is supplied. If it is not supplied, the munmap() system function returns the ENODEV error to the user process.

The pfxunmap() entry point is called only when the mapped region has been completely unmapped by all processes. For example, suppose a parent process calls mmap() to map a device. Then the parent creates one or more child processes using sproc(). Each child shares the address space, including the mapped segment. A process in the share group can terminate, or can explicitly call munmap() on the segment or part of the segment, but these actions do not result in a call to pfxunmap(). Only when the last process with access to the segment has fully unmapped the segment is pfxunmap() called.

On entry, the kernel has completed unmapping the object from the user process address space. This entry point does not need to do anything to affect the user address space; it only needs to release any resources that were allocated to support the mapping. The prototype is

int pfxunmap(dev_t dev, vhandl_t *vt);

The argument values are

dev	A dev_t value from which you can extract both the major and minor device numbers.
vt	The address of an opaque structure that describes the assigned address in the user process address space.

If the driver allocated no resources to support a mapping, no action is needed here; the entry point can consist of a “return 0” statement.

When the driver does allocate memory or a PIO map to support a mapping, and supports multiple mappings, the driver needs to identify the resource associated with this particular mapping in order to release it. The v_gethandle() macro returns a unique number based on the vt argument; this can be used to identify resources.
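One way to keep track of per-mapping resources is a small table keyed by the handle value. The sketch below is a user-space model under stated assumptions: the hypo_ names, the fixed table size, and the use of 0 to mark a free slot are all hypothetical, and a real driver would also protect the table with a lock.

```c
#include <stddef.h>
#include <stdint.h>

#define HYPO_MAX_MAPS 8      /* assumed limit on concurrent mappings */

struct hypo_map_tag {
    uintptr_t handle;        /* tag value; 0 marks a free slot */
    void     *resource;      /* memory or PIO map owned by the mapping */
};

static struct hypo_map_tag hypo_tags[HYPO_MAX_MAPS];

/* Record a resource under a handle; returns 0, or -1 if the table is full. */
int hypo_tag_add(uintptr_t handle, void *resource)
{
    for (size_t i = 0; i < HYPO_MAX_MAPS; i++) {
        if (hypo_tags[i].handle == 0) {
            hypo_tags[i].handle = handle;
            hypo_tags[i].resource = resource;
            return 0;
        }
    }
    return -1;
}

/* Find and forget the resource for a handle; returns NULL if unknown. */
void *hypo_tag_remove(uintptr_t handle)
{
    for (size_t i = 0; i < HYPO_MAX_MAPS; i++) {
        if (handle != 0 && hypo_tags[i].handle == handle) {
            void *res = hypo_tags[i].resource;
            hypo_tags[i].handle = 0;
            hypo_tags[i].resource = NULL;
            return res;
        }
    }
    return NULL;
}
```

In this model the map routine would call hypo_tag_add() with the handle derived from its vt argument, and the unmap routine would call hypo_tag_remove() with the handle derived from the vt it receives, then release whatever is returned.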

Interrupt Entry Point and Handler

In traditional UNIX, when a hardware device presents an interrupt, the kernel locates the device driver for the device and calls the pfxintr() entry point (see the intr(D2) reference page). In current practice, an entry point named pfxintr() is not given any special treatment—although driver writers often give this name to the interrupt-handling function even so.

A device driver must register a specific interrupt handler for each device. The kernel functions for doing this are bus-specific, and are discussed in the bus-specific chapters. For example, the means of registering a VME interrupt handler is discussed in Chapter 13, “Services for VME Drivers on Origin 2000/Onyx2”. However, the discussion of interrupts that follows is still relevant to any interrupt handler.

In principle an interrupt can happen at any time. Normally an interrupt occurs because at some previous time, the driver initiated a device operation. Some devices can interrupt without a preceding command.

Associating Interrupt to Driver

The association between an interrupt and the driver is established in different ways depending on the hardware.

The VECTOR statement establishes the interrupt level and the associated driver for devices on the EISA and VME busses.
For some VME devices, the interrupt level is set dynamically using vme_ivec_set() (see Chapter 13, “Services for VME Drivers on Origin 2000/Onyx2”).
For devices on the SCSI bus, all interrupts are handled by a single, low-level driver which notifies a callback function (see Chapter 16, “SCSI Device Drivers”).
For devices on the PCI bus, the driver registers an interrupt handler using pci_intr_connect() at the time the device is attached (see “Interrupt Signal Distribution” in Chapter 20).

In all cases, the driver specifies the interrupt handler as the address of a function to be called, with an argument to be passed to the function when it is called. This argument value is typically the address of a data structure in which the driver has stored information about the device. Alternatively, it could be the dev_t that names the device—from which the interrupt handler can get device information, see “Allocating Storage for Device Information”.

Interrupt Handler Operation

In general, the interrupt handler implements the following tasks.

When the driver supports multiple logical units, use its argument value to locate the data structure for the interrupting unit.
Determine the reason for the interrupt by interrogating the device. This is typically done with PIO loads and stores of device registers.
When the interrupt is a response to an I/O operation, note the success or failure of the operation.
When the driver top half is waiting for the interrupt, waken it.
If the driver supports polling, and the interrupt represents a pollable event, call pollwakeup().
If the device is not in an error state and another operation is waiting to be started, start it.

The details of each of these tasks depend on the hardware and on the design of the data structures used by the driver top half. In general, you should design an interrupt handler so that it does the least possible and exits as quickly as possible.
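The sequence of tasks listed above can be modeled in ordinary C. The sketch below is only an illustration: the device, its status bits, the acknowledge step, and every hypo_ name are invented, a counter stands in for wakening the top half, and the pollwakeup() step is left out.

```c
#include <stddef.h>

/* Hypothetical status bits for an imaginary device. */
#define HYPO_ST_DONE  0x1    /* the last operation finished */
#define HYPO_ST_ERROR 0x2    /* the last operation failed */

struct hypo_dev {
    volatile unsigned status_reg;  /* stands in for a PIO-mapped register */
    int last_error;                /* outcome noted for the top half */
    int top_half_waiting;          /* nonzero while the top half waits */
    int wakeups;                   /* stands in for posting a semaphore */
    int queued;                    /* operations waiting to be started */
    int started;                   /* operations handed to the device */
};

/* Model of a handler registered with `arg` pointing at the unit's data. */
void hypo_intr_model(void *arg)
{
    struct hypo_dev *dev = arg;            /* locate the interrupting unit */
    unsigned status = dev->status_reg;     /* ask the device why */

    if (!(status & HYPO_ST_DONE))
        return;                            /* nothing for this handler */

    dev->last_error = (status & HYPO_ST_ERROR) ? 1 : 0;   /* note outcome */
    dev->status_reg = 0;                   /* acknowledge (assumed protocol) */

    if (dev->top_half_waiting) {           /* waken the top half */
        dev->top_half_waiting = 0;
        dev->wakeups++;
    }
    if (!dev->last_error && dev->queued > 0) {   /* start the next operation */
        dev->queued--;
        dev->started++;
    }
}
```

The handler does a fixed, small amount of work on each entry and never loops or waits, which is in keeping with the advice to do the least possible and exit quickly.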

Completing Block I/O

In a block device driver, an I/O operation is represented by the buf_t structure. The pfxstrategy() routine starts operations and waits for them to complete (see “Entry Point strategy()”).

The interrupt entry point sets the residual count in b_resid. It can post an error using bioerror(). It posts the operation complete and wakens the pfxstrategy() routine by calling biodone(). If the pfxstrategy() entry has set the address of a completion callback function in the b_iodone field of the buf_t, biodone() invokes it. (For more discussion, see “Waiting for Block I/O to Complete” in Chapter 8.)

Completing Character I/O

In a character device driver, the driver top half typically awaits an interrupt by sleeping on a semaphore or synchronizing variable, and the interrupt routine posts the semaphore (see “Waiting for a General Event” in Chapter 8). Error information must be passed in driver variables according to some local convention.

Calling pollwakeup()

When the interrupt represents an event that can be reported by the driver's pfxpoll() entry point (see “Entry Point poll()”), the interrupt handler must report the event to the kernel, in case some user process is waiting in a poll() call. Hypothetical code to do this is shown in Example 7-6.

Example 7-6. Hypothetical Call to pollwakeup()

void
hypo_intr(int ivec)
{
   struct hypo_dev_info *pinfo;
   /* find_dev_info() is a hypothetical lookup from vector to device data */
   if (!(pinfo = find_dev_info(ivec)))
      return; /* not our device */
   ...
   if (pinfo->have_data_flag)
      pollwakeup(pinfo->phead, POLLIN | POLLRDNORM);
   if (pinfo->output_ok_flag)
      pollwakeup(pinfo->phead, POLLOUT);
   ...
}

Note: It's important that the call to pollwakeup() occurs after any state has been updated by the event interrupt routine.

Interrupts as Threads

In traditional UNIX design, and in versions of IRIX preceding IRIX 6.4, an interrupt is handled as an asynchronous trap. The hardware trap handler calls the driver's interrupt function as a subroutine. In these systems, when the interrupt handler code is entered the system is in an unknown state. As a result, the interrupt handler can use only a restricted set of kernel services, and no services that can sleep.

Starting with IRIX 6.4, the IRIX kernel does all its work under control of lightweight executable entities called “kernel threads.” When a device driver registers an interrupt handler, the kernel allocates a thread to execute that handler. The thread begins execution by waiting on an internal semaphore.

When a hardware interrupt occurs, the trap code merely posts the semaphore on which the handler's thread is waiting. Soon thereafter, the interrupt thread is scheduled to execute, and it calls the function registered by the driver.

The differences from previous releases are small. It is still true that the interrupt handler code is entered unpredictably, at a high priority; does little; and exits quickly. However, there are the following differences compared to earlier systems:

The interrupt handler can be preempted by kernel threads running at higher priorities.

Previously, an interrupt handler in a uniprocessor system could only be preempted by an interrupt from a device with higher hardware priority. In IRIX 6.4, the handler can be preempted by kernel threads running daemons and high-priority real-time tasks, in addition to other interrupt threads.
There are no restrictions on the kernel services an interrupt handler may call.

In particular, the interrupt handler is permitted to call services that could sleep. However, this is still typically not a good idea. For example, an interrupt handler should almost never allocate memory.
Mutual exclusion between the interrupt handler and the driver top half can be done with mutex locks, instead of requiring the use of spinlocks.
The handler can do more work, and more elaborate work, if that leads to a better design for the driver.

In IRIX 6.4, the driver writer has no control over the selection of interrupt thread priority. The kernel assigns a high relative priority to threads that execute interrupt handlers. However, higher priorities exist, and an interrupt thread can be preempted.

Mutual Exclusion

In historical UNIX systems, which were uniprocessor systems, when the only CPU is executing the interrupt handler, nothing else is executing. The hardware architecture ensured that top-half code could not preempt the interrupt handler; and the top half could use functions such as splhi() to block interrupts during critical sections (see “Priority Level Functions” in Chapter 8). An interrupt handler could only be preempted by an interrupt of higher priority—which would be an interrupt for a different driver, and so would have no conflicts with this driver over the use of data.

None of these comfortable assumptions are tenable any longer.

Hardware Exclusion Is Ineffective

In a multiprocessor, an interrupt can be taken on any CPU, while other CPUs continue to execute kernel or user code. This means that the top half code cannot block out interrupts using a function such as splhi(), because the interrupt could be taken on another CPU. Nor can the interrupt handler assume that it is safe; another CPU could be executing a top-half entry point to the same driver, for the same device, as an interrupt handler.

With the threaded kernel of IRIX 6.4, it is even possible for a process with an extremely high priority, in the same CPU (or in the only CPU of a uniprocessor), to enter the driver top half, preempting the thread that is executing the interrupt handler.

It is theoretically possible in a threaded kernel for a device to interrupt; for the kernel thread to be scheduled and enter the interrupt handler; and for the device to interrupt again, resulting in multiple concurrent entries to the same interrupt handler. However, IRIX prevents this. The interrupt handler for a device is entered serially. If you register the same handler function for multiple devices, it can be entered concurrently when different devices present interrupts at the same time.

Using Locking Between Top and Bottom Half

The only solution possible is that you must use a software lock of some kind to protect the data objects that can be accessed concurrently by top-half code and the interrupt handler. Before using that shared data, a function must acquire the lock. Options for the type of lock are discussed under “Designing for Multiprocessor Use”.

Interrupt Performance and Latency

Another interrupt cannot be handled from the same device until the interrupt handler function returns. The interrupt thread runs at very nearly the highest priority, so all but the most essential work is suspended in the interrupted CPU until the handler returns.

Support Entry Points

Certain driver entry points are used to support the operations of the kernel or the administration of the system.

Entry Point unreg()

The pfxunreg() entry point is called in a loadable driver, prior to the call to the pfxunload() entry point. This entry point is used by drivers that support the pfxattach() entry point (see “Attach and Detach Entry Points ”). Such drivers have to register with the kernel as supporting devices of certain types. Before unloading, a driver needs to retract this registration, so the kernel will not call the driver to attach another device.

If pfxunreg() returns a nonzero error code, the driver is not unloaded.

Entry Point unload()

The pfxunload() entry point is called when the kernel is about to dynamically remove a loadable driver from the running system. A driver can be unloaded either because all its devices are closed and a timeout has elapsed, or because the operator has used the ml command (see the ml(1) reference page). The kernel calls pfxunload() only when no device special files managed by the driver are open. If any device had been opened, the pfxclose() entry has been called.

It is not easy to retain state information about the device over the time when the driver is not in memory. The entire text and data of a loadable driver, including static variables, are removed and reloaded. Global variables defined in the descriptive file (see “Describing the Driver in /var/sysgen/master.d” in Chapter 9) remain in memory after the driver is unloaded, as does any allocated memory addressed from a hwgraph vertex (see “Attaching Information to Vertexes” in Chapter 8). Be sure not to store any addresses of driver code or driver static variables in global variables or vertex structures, since the driver addresses will be different when the driver is reloaded.

Other than data addressed from the hwgraph, allocated dynamic memory should be released. The driver should also release any process handles (see “Sending a Process Signal” in Chapter 8).

The driver is not required to unload. If the driver should not be unloaded at this time, it returns a nonzero return code to the call, and the kernel does not unload it. There are several reasons why a driver should not be unloaded.

A driver should never permit unloading when there is any kind of pointer to the driver held in any kernel data structure. It is a frequent design error to unload when there is a live pointer to the driver. Unpredictable kernel panics often result.

One example of a live pointer to a driver is a pending callback function. Any pending itimeout() or bufcall() timers should be cancelled before returning 0 from pfxunload(). Another example is a registered interrupt handler. The driver must disconnect any interrupt handler before unloading; or else refuse to unload.
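The decision logic described above can be sketched as follows. This is a user-space model, not a real entry point: the bookkeeping structure and its field names are hypothetical, and the choice of EBUSY is an assumption, since the text requires only that the return code be nonzero.

```c
#include <errno.h>

/* Hypothetical bookkeeping a driver might keep about live references. */
struct hypo_unload_state {
    int pending_timeouts;    /* callbacks requested and not yet cancelled */
    int connected_intrs;     /* interrupt handlers still registered */
};

/* Model of a pfxunload() decision: 0 allows the unload; a nonzero
 * error code tells the kernel to leave the driver loaded. */
int hypo_unload_model(const struct hypo_unload_state *st)
{
    if (st->pending_timeouts > 0)
        return EBUSY;        /* a callback could fire into unloaded code */
    if (st->connected_intrs > 0)
        return EBUSY;        /* an interrupt could call unloaded code */
    return 0;
}
```

A driver following this pattern would either cancel its callbacks and disconnect its handlers before making the check, or simply refuse the unload while any remain.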

Entry Point halt()

The kernel calls the pfxhalt() entry point, if one exists, while performing an orderly system shutdown (see the halt(1) reference page). No other driver entry points are called after this one. The prototype is simply

void pfxhalt(void);

Since the system is shutting down, there is no point in returning allocated memory. The only purpose this entry point can serve is to leave the device in a safe and stable condition. For example, this is the place at which a disk driver could command the heads of the drive to move to a safe zone for power off.

The driver cannot assume that interrupts are disabled or enabled. The driver cannot block waiting for device actions, so whatever commands it issues to the device must take effect immediately.

Entry Point size()

The pfxsize() entry point is required of block device drivers. It reports the size of the device in “sector” units, where a “sector” size is declared as NBPSCTR in sys/param.h (currently 512). The prototype is

int pfxsize(dev_t dev);

The device major and minor numbers can be extracted from the dev argument. The entry point is not called until pfxopen() has been called. Typically the driver will calculate the size of the medium during pfxopen().

Since the int return value is 32 bits in all systems, the largest possible block device is 1,024 gigabytes ((2³¹*512)/1,024³).
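The size limit can be checked with a few lines of C. The sketch below is a stand-alone calculation, not driver code: the hypo_ names are invented, the sector size is written out as the 512 quoted above rather than taken from sys/param.h, and a 32-bit int is assumed. Strictly, the largest positive int is one less than 2³¹, so the clamp in the second function stops one sector short of the rounded figure.

```c
#include <limits.h>
#include <stdint.h>

#define HYPO_SECTOR_BYTES 512u   /* value the text gives for NBPSCTR */

/* Largest byte count a signed 32-bit sector count can describe,
 * computed as 2^31 sectors to match the figure in the text. */
uint64_t hypo_max_device_bytes(void)
{
    return ((uint64_t)1 << 31) * HYPO_SECTOR_BYTES;
}

/* Convert a byte count to whole sectors, clamped to INT_MAX. */
int hypo_bytes_to_sectors(uint64_t bytes)
{
    uint64_t sectors = bytes / HYPO_SECTOR_BYTES;
    return sectors > (uint64_t)INT_MAX ? INT_MAX : (int)sectors;
}
```

Dividing the result of hypo_max_device_bytes() by 1,024³ gives 1,024, which is the gigabyte figure stated above.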

Entry Point print()

The pfxprint() entry point is called from the kernel to display a diagnostic message when an error is detected on a block device. The prototype and the complete logic of the entry point are shown in Example 7-7.

Example 7-7. Entry Point pfxprint()

#include <sys/cmn_err.h>
#include <sys/ddi.h>
int hypo_print(dev_t dev, char *str)
{
   cmn_err(CE_NOTE,"Error on dev %d: %s\n",geteminor(dev),str);
   return 0;
}

Handling 32-Bit and 64-Bit Execution Models

The pfxioctl() entry point can be passed a data structure from the user process address space; that is, the arg value can be a pointer to a structure or an array of data. In order to interpret such a structure, the driver has to know the execution model for which the user process was compiled.

The execution model is specified when code is compiled. The 32-bit model (compiler option -32 or -n32) uses 32-bit address values and a long int contains 32 bits. The 64-bit model (compiler option -64) uses 64-bit address values and a long int contains 64 bits. (The size of an unqualified int is 32 bits in both models.) The execution model is sometimes casually called the “ABI” (Application Binary Interface), but this is an improper use of that term. An ABI comprises calling conventions, public names, and structure definitions, as well as the execution model.

An IRIX kernel compiled to the 32-bit model contains 32-bit drivers and supports only 32-bit user processes. A kernel compiled to the 64-bit model contains 64-bit drivers, but it supports user processes compiled to either 32-bit or 64-bit models. Therefore, in a 64-bit kernel, a driver can be asked to interpret data produced by a 32-bit program.

This is true only of the pfxioctl() entry point. Other driver entry points move data to and from user space as streams or blocks of bytes—not as a structure with fields to be interpreted.

Since in other respects it is easy to make your driver portable between 64-bit and 32-bit systems, you should design your driver so that it can handle the case of operating in a 64-bit kernel, receiving ioctl() requests alternately from 32-bit and 64-bit programs.

The simplest way to do this is to define the arguments passed to the entry points in such a way that they have the same precision in either system. However, this is not always possible. To handle the general case, the driver must know to which model the user process was compiled.

In any top-half entry point (where there is a user process context), you find this out by calling the userabi() function (for which there is no reference page available). The prototype of userabi() (declared in sys/ddi.h) is

int userabi(__userabi_t *);

If there is no user process context, userabi() returns ESRCH. Otherwise it fills out a __userabi_t structure and returns 0. The structure of type __userabi_t (declared in sys/types.h) contains the fields listed below:

uabi_szint	Size of a user int (4).
uabi_szlong	Size of a user long (4 or 8).
uabi_szptr	Size of a user address (4 or 8).
uabi_szlonglong	Size of a user long long (8).

Store the value of uabi_szptr when opening a device. Then you can use it to choose between 32-bit and 64-bit declarations of a structure passed to pfxioctl() or an address passed to pfxpoll().
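Choosing between the two declarations of a structure might look like the following sketch. It is a user-space model under stated assumptions: the argument structure with one address and one long is invented for illustration, the hypo_ names are hypothetical, and the raw bytes are assumed to have been copied in from user space already.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical ioctl argument as a 32-bit program lays it out. */
struct hypo_args32 {
    uint32_t buf;        /* user address, 4 bytes */
    int32_t  len;        /* user long, 4 bytes */
};

/* The same argument as a 64-bit program lays it out. */
struct hypo_args64 {
    uint64_t buf;        /* user address, 8 bytes */
    int64_t  len;        /* user long, 8 bytes */
};

/* Decode a raw argument image, already copied in from user space,
 * into the 64-bit form. `szptr` is the saved user pointer size.
 * Returns 0 on success or -1 for an unexpected pointer size. */
int hypo_decode_args(const void *raw, int szptr, struct hypo_args64 *out)
{
    if (szptr == 8) {
        memcpy(out, raw, sizeof *out);
        return 0;
    }
    if (szptr == 4) {
        struct hypo_args32 a32;
        memcpy(&a32, raw, sizeof a32);
        out->buf = a32.buf;          /* zero-extend the address */
        out->len = a32.len;          /* sign-extend the length */
        return 0;
    }
    return -1;
}
```

Declaring the fields with fixed-width types keeps the two layouts the same no matter which model the driver itself is compiled for, and the rest of the driver then works only with the 64-bit form.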

In any part of the driver, including interrupt threads, you can get the current ABI by calling the kernel function get_current_abi(). It takes no argument. It returns an unsigned character value that can be decoded using macros and constants that are declared in the header file sys/kabi.h.

Designing for Multiprocessor Use

Multiprocessor computers are central to the Silicon Graphics product line and are becoming increasingly common. A device driver that is not multiprocessor-ready can be used in a multiprocessor, but it is likely to cause a performance bottleneck. By contrast, a multiprocessor-ready driver works well in a uniprocessor with no cost in performance.

The Multiprocessor Environment

A multiprocessor has two or more CPU modules, all of the same type. The CPUs execute independently, but all share the same main memory. Any CPU can execute the code of the IRIX kernel, and it is common for two or more CPUs to be executing kernel code, including driver code, simultaneously.

Uniprocessor Assumptions

Traditional UNIX architecture assumes a uniprocessor hardware environment with a hierarchy of interrupt levels. Ordinary code could be preempted by an interrupt, but an interrupt handler could only be preempted by an interrupt at a higher level.

This assumed hardware environment was reflected in the design of device drivers and kernel support functions.

In a uniprocessor, an upper-half driver entry point such as pfxopen() cannot be preempted except by an interrupt. It has exclusive access to driver variables except for those changed by the interrupt handler.
Once the interrupt handler has been entered, no other code can execute except the handler for an interrupt of a higher hardware level. The interrupt handler has exclusive access to driver variables.
The interrupt handler can use kernel functions such as splhi() to set the hardware interrupt mask, blocking interrupts of all kinds, and thus getting exclusive access to all memory including kernel data structures.

All of these assumptions fail in a multiprocessor.

Upper-half entry points can be entered concurrently on multiple CPUs. For example, one CPU can be executing pfxopen() while another CPU is in pfxstrategy(). Exclusive use of driver variables cannot be assumed.
An interrupt can be taken on one CPU while upper-half routines or a timeout function execute concurrently on other CPUs. The interrupt routine cannot assume exclusive use of driver variables.
Interrupt-level functions such as splhi() are meaningless, since at best they set the interrupt mask on the current CPU only. Other CPUs can accept interrupts at all levels. The interrupt handler can never gain exclusive access to kernel data.

The process of making a driver multiprocessor-ready consists of changing all code whose correctness depends on uniprocessor assumptions.

Protecting Common Data

Whenever a common resource can be updated by two processes concurrently, the resource must be protected by a lock that represents the exclusive right to update the resource. Before changing the resource, the software acquires the lock, claiming exclusive access. After changing the resource, the software releases the lock.

The IRIX kernel provides a set of functions for creating and using locks. It provides another set of functions for creating and using semaphore objects, which are like locks but sometimes more flexible. Both sets of functions are discussed under “Waiting and Mutual Exclusion” in Chapter 8.

Sleeping and Waking

Sometimes the lock is not available—some other process executing in another CPU has acquired the lock. When this happens, the requesting process is delayed in the lock function until the lock is free. To delay, or sleep, is allowed for upper-half entry points, because they execute (in effect) as subroutines of user processes.

Interrupt handlers and timeout functions are not permitted to sleep. They have no process identity and so there is no mechanism for saving and restoring their state. An interrupt handler can test a lock, and can claim the lock conditionally, but if a lock is already held, the handler must have some alternate way of storing data.

Synchronizing Within Upper-Half Functions

When designing an upper-half entry point, keep in mind that it could be executed concurrently with any other upper-half entry point, and that the one entry point could even be executed concurrently by multiple CPUs. Only a few entry points are immune:

The pfxinit(), pfxedtinit(), and pfxstart() entry points cannot be entered concurrently with each other or any other entry point (pfxstart() could be entered concurrently with the interrupt handler).
The pfxunload() and pfxhalt() entry points cannot be entered concurrently with any other entry point except for stray interrupts.
Certain entry points have no cause to use shared data; for example, pfxsize() and pfxprint() normally do not need to take any precautions.

Other upper-half entry points, and all STREAMS entry points, can be entered concurrently by multiple CPUs, when the driver is multiprocessor-aware. In earlier versions of IRIX, you could place a flag in the pfxdevflag of a character driver that made the kernel run the driver only on CPU 0. This effectively serialized all use of that driver. That feature is no longer supported. You must deal with concurrency.

Serializing on a Single Lock

You can create a single lock for upper-half serialization. Each upper-half function begins with read-only operations such as extracting the device minor number, getting device information from the hwgraph vertex, and testing and validating arguments. You allow these to execute concurrently on any CPU.

In each entry point, when the preliminaries are complete, you acquire the single lock, and release it just before returning. The result is that processes are serialized for I/O through the driver. If the driver supports only a single device, processes would be serialized in any case, waiting for the device to operate. Since the upper half can execute on any CPU, latency is more predictable.

Serializing on a Lock Per Device

When the driver supports multiple minor devices, you will normally have a data structure per device. An upper-half routine is concerned only with one device. You can define a lock in the data structure for each device instance, and acquire that lock as soon as the device information structure is known.

This method permits concurrent execution of upper-half requests for different minor devices, but it serializes access to any one device.

Coordinating Upper-Half and Interrupt Entry Points

Upper-half entry points prepare work for the device to do, and the interrupt handler reports the completion of the device action (see “Interrupt Handler Operation”). In a block device driver, this communication is relatively simple. In a character driver, you have more design options. The kernel functions mentioned in the following topics are covered under “Waiting and Mutual Exclusion” in Chapter 8.

Coordinating Through the buf_t

In a block device driver, the pfxstrategy() routine initiates a read or a write based on a buf_t structure (see “Entry Point strategy()”), and leaves the address of the buf_t where the interrupt routine can find it. Then pfxstrategy() calls the biowait() kernel function to wait for completion.

The pfxintr() entry point updates the buf_t (using bioerror() if necessary) and then uses biodone() to mark the buf_t as complete. This ends the wait for pfxstrategy().

Coordination in a Character Driver

In a character driver that supports interrupts, you design your own coordination mechanism. The simplest (and not recommended) would be based on using the kernel function sleep() in the upper half, and wakeup() in the interrupt routine. It is better to use a semaphore and use psema() in the upper half and vsema() in the interrupt handler.

If you need to allow for timeouts in addition to interrupts, you have to deal with the complication that the timeout function can be called concurrently with an interrupt. In this case it is better to use synchronization variables (see “Using Synchronization Variables” in Chapter 8).

Choice of Lock Type

In versions before IRIX 6.4, interrupt handlers could not use kernel services that can sleep. This prevented you from using normal locks to provide mutual exclusion between the upper half and the interrupt handler. The lock had to be a basic lock (see “Basic Locks” in Chapter 8), a type that is implemented as a spinning lock in a multiprocessor.

Now that interrupt handlers execute as kernel threads, they have the ability to sleep if necessary. This means that you can now use mutex locks (see “Using Mutex Locks” in Chapter 8) between the upper half and interrupt handler. Although you do not want an interrupt handler to be delayed, it is much better for a kernel thread to sleep briefly while waiting for a lock, than for it to spin in a tight loop. In general, mutex locks are more efficient than spinning locks.

In the event you must maintain a multiprocessor driver that operates in both IRIX 6.4 and an earlier, nonthreaded version, you can select the lock type at compile time using conditional compilation. Example 7-8 shows one technique.

Example 7-8. Conditional Choice of Mutual Exclusion Lock Type

#ifdef INTR_KTHREADS
#define INT_LOCK_TYPE mutex_t
#define INT_LOCK_INIT(p) MUTEX_INIT(p,MUTEX_DEFAULT,"DRIVER_NAME")
#define INT_LOCK_LOCK(p) MUTEX_LOCK(p,-1)
#define INT_LOCK_FREE(p) MUTEX_UNLOCK(p)
#else /* not a threaded kernel */
#define INT_LOCK_TYPE struct{lock_t lk; int cookie;}
#define INT_LOCK_INIT(p) LOCK_INIT(&(p)->lk,(uchar_t)-1,plhi,(lkinfo_t *)-1)
#define INT_LOCK_LOCK(p) ((p)->cookie=LOCK(&(p)->lk,plhi))
#define INT_LOCK_FREE(p) UNLOCK(&(p)->lk,(p)->cookie)
#endif

Converting a Uniprocessor Driver

As a general approach, you can convert a uniprocessor driver to make it multiprocessor-safe in the following steps:

If it currently uses the D_OLD flag, or has no pfxdevflag constant, convert it to use the current interface, with a pfxdevflag of D_MP.
Make sure it works in the original uniprocessor at the current release of IRIX.
Begin adding semaphores, locks, and other exclusion and synchronization tools. Continue to test the driver on the uniprocessor. It will never wait for a lock, but the coordination between upper half and interrupt handler should work.
Test on a multiprocessor.

In performing the conversion, you can look for calls to spl*() functions as marking points at which work is needed. These functions are used for mutual exclusion in a uniprocessor, but they are all ineffective or unnecessary in a multiprocessor-safe driver.

The code in Example 7-9 shows typical logic in a uniprocessor character driver.

Example 7-9. Uniprocessor Upper-Half Wait Logic

s = splvme();
flag |= WAITING;
while (flag & WAITING) {
   sleep(&flag, PZERO);
}
splx(s);

The upper half calls the splvme() function with the intention of blocking interrupts, and thus preventing execution of this driver's interrupt handler while the flag variable is updated. In a multiprocessor this is ineffective because at best it sets the interrupt level on the current CPU. The interrupt handler can execute on another CPU and change the variable. The corresponding interrupt handler is shown in the following example.

if (flag & WAITING) {
   wakeup(&flag);
   flag &= ~WAITING;
}

The interrupt handler could execute on another CPU, and test the flag after the upper half has called splvme() and before it has set WAITING in flag. The interrupt is effectively lost. This would happen rarely and would be hard to repeat, but it would happen and would be hard to trace. A more reliable, and simpler, technique is to use a semaphore. The driver defines a global semaphore:

static sema_t sleeper;

A driver with multiple devices would have a semaphore per device, perhaps as an array of sema_t items indexed by device minor number. The semaphore (or array) would be initialized to a starting value of 0 in the pfxinit() or pfxstart() entry, so that the first psema() call waits until the interrupt handler posts the semaphore:

void hypo_start()
{
...
   initnsema(&sleeper,0,"sleeper");
}

After the upper half started a device operation, it would await the interrupt using psema():

psema(&sleeper,PZERO);

The PZERO argument makes the wait immune to signals. If the driver should wake up when a signal is sent to the calling process (such as SIGINT or SIGTERM), the second argument can be PCATCH. A return value of -1 indicates the semaphore was posted by a signal, not by a vsema() call. The interrupt handler would use vsema() to post the semaphore.
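The reason a semaphore does not lose an early interrupt is that each post is remembered in a count. The single-threaded model below illustrates this under stated assumptions: the hypo_ names are invented, nothing actually blocks, and hypo_sema_try_wait() only reports whether a real wait would proceed at once or sleep.

```c
/* Single-threaded model of a counting semaphore (no real blocking). */
struct hypo_sema {
    int count;
};

void hypo_sema_init(struct hypo_sema *s, int value) { s->count = value; }

/* Models vsema(): record one event. */
void hypo_sema_post(struct hypo_sema *s) { s->count++; }

/* Models the decision inside psema(): returns 1 if the caller may
 * proceed at once, or 0 if a real semaphore would put it to sleep. */
int hypo_sema_try_wait(struct hypo_sema *s)
{
    if (s->count > 0) {
        s->count--;
        return 1;
    }
    return 0;
}
```

With a starting count of 0, a wait issued before any interrupt would sleep, while a post that arrives before the wait leaves the count at 1, so the later wait proceeds immediately and the event is not lost. With a starting count of 1, the first wait would proceed without any interrupt having occurred.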
