Chapter 1. Introduction to CXFS


Note: You should read through this entire book, especially Chapter 18, “Troubleshooting”, before attempting to install and configure a CXFS cluster.

This chapter discusses the following:


Note: In this book, Linux 64-bit refers to the SGI ProPack for Linux operating system running on the SGI Altix 3000 system.


What is CXFS?

CXFS is clustered XFS, a clustered filesystem for high-performance computing environments.

CXFS allows groups of computers to coherently share XFS filesystems among multiple hosts and storage devices while maintaining high performance. CXFS runs on storage area network (SAN) disks, such as Fibre Channel. A SAN is a high-speed, scalable network of servers and storage devices that provides storage resource consolidation, enhanced data access/availability, and centralized storage management. CXFS filesystems are mounted across the cluster by CXFS management software. All files in the filesystem are available to all nodes that mount the filesystem.

CXFS and IRIS FailSafe share the same infrastructure.

Comparison of XFS and CXFS

CXFS uses the same filesystem structure as XFS. A CXFS filesystem is initially created using the same mkfs command used to create standard XFS filesystems.
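For example, you might create the filesystem on a clustered XVM volume as follows (a hedged sketch; the volume name is hypothetical, and the mkfs options appropriate to your storage are covered in the mkfs_xfs man page):

  irix# mkfs /dev/cxvm/fs1vol

The difference from a standard XFS filesystem begins after this point, when the filesystem is mounted and managed by the cluster software rather than by the local mount command.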

The primary difference between XFS and CXFS filesystems is the way in which filesystems are mounted and managed:

  • In XFS:

    • Filesystems are mounted with the mount command directly by the system during boot via an entry in /etc/fstab or by the IRIX Filesystem Manager.

    • A filesystem resides on only one host.

    • The /etc/fstab file contains static information about filesystems. For more information, see the fstab man page.

  • In CXFS:

    • Filesystems are mounted using the CXFS Manager graphical user interface (GUI) or the cmgr command.

    • A filesystem is accessible to those hosts (nodes) in the cluster that are defined to mount it. CXFS filesystems are mounted across the cluster by CXFS management software. All files in the filesystem are visible to those hosts that are defined to mount the filesystem.

    • One node coordinates the updating of metadata (information that describes a file, such as the file's name, size, location, and permissions) on behalf of all nodes in a cluster; this is known as the metadata server.

      There is one active metadata server per CXFS filesystem; there can be multiple active metadata servers in a cluster, one for each CXFS filesystem.

    • The filesystem information is stored in the cluster database (CDB), which contains persistent static configuration information about the filesystems, nodes, and cluster. The CXFS cluster daemons manage the distribution of multiple synchronized copies of the cluster database across the CXFS administration nodes in the pool. The administrator can view the database and modify it using the GUI or the cmgr command.

      The GUI shows the static and dynamic state of the cluster. For example, suppose the database contains the static information that a filesystem is enabled for mount; the GUI will display the dynamic information showing one of the following:

      • An icon indicating that the filesystem is mounted (the static and dynamic states match)

      • An icon indicating that the filesystem is ready to be mounted but the procedure cannot complete because CXFS services have not been started (the static and dynamic states do not match, but this is expected under the current circumstances)

      • An error (red) icon indicating that the filesystem is supposed to be mounted (CXFS services have been started), but it is not (the static and dynamic states do not match, and there is a problem)

      The following commands can also be used to view the cluster state:

      • cmgr and cxfs-config show the static cluster state. These commands are available on nodes used for cluster administration.

      • clconf_info and cluster_status show both the static and dynamic cluster states. These commands are available on nodes used for cluster administration.

      • cxfs_info provides status information. This command is available on nodes that are CXFS clients but are not used for administration.

    • Information is not stored in the /etc/fstab file. (However, the CXFS filesystems do show up in the /etc/mtab file.) For CXFS, information is instead stored in the cluster database.
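For reference, the static /etc/fstab entry that an XFS filesystem would use on IRIX might look like the following (a hedged illustration; the device, raw device, and mount point names are hypothetical):

  /dev/dsk/dks0d1s7  /data  xfs  rw,raw=/dev/rdsk/dks0d1s7  0  0

A CXFS filesystem has no such entry; its mount information lives in the cluster database and is applied by the CXFS management software.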

Supported XFS Features

XFS features that are also present in CXFS include the following:

  • Reliability and fast (subsecond) recovery of a log-based filesystem.

  • 64-bit scalability to 9 million terabytes (9 exabytes) per file.

  • Speed: high bandwidth (megabytes per second), high transaction rates (I/O per second), and fast metadata operations.

  • Dynamically allocated metadata space.

  • Quotas. You can administer quotas from any administration node in the cluster just as if this were a regular XFS filesystem.

  • Filesystem reorganizer (defragmenter), which must be run from the CXFS metadata server for a given filesystem; an invocation sketch follows this list. See the fsr_xfs man page.

  • Restriction of access to files using file permissions and access control lists (ACLs). You can also use logical unit (LUN) masking or physical cabling to deny access from a specific host to a specific set of disks in the SAN.

  • Real-time volumes. CXFS can write to real-time files in real-time volumes on IRIX nodes. For more information about real-time volumes, see XVM Volume Manager Administrator's Guide.
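The following hedged sketch illustrates the defragmenter mentioned above (the mount point is hypothetical, and the options vary by release, so verify them against the fsr_xfs man page). You would log in to the active metadata server for the filesystem and run:

  irix# fsr_xfs -v /mnt/cxfs1

The -v option, where supported, reports each file as it is reorganized.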

CXFS preserves these underlying XFS features while distributing the I/O directly between the disks and the hosts. The efficient XFS I/O path uses asynchronous buffering techniques to avoid unnecessary physical I/O by delaying writes as long as possible. This allows the filesystem to allocate the data space efficiently and often contiguously. The data tends to be allocated in large contiguous chunks, which yields sustained high bandwidths.

The XFS directory structure is based on B-trees, which allow XFS to maintain good response times, even as the number of files in a directory grows to tens or hundreds of thousands of files.

When to Use CXFS

You should use CXFS when you have multiple nodes running applications that require high-bandwidth access to common filesystems.

CXFS performs best under the following conditions:

  • Data I/O operations are greater than 16 KB

  • Large files are being used (a lot of activity on small files will result in slower performance)

  • Read/write conditions are one of the following:

    • All processes that perform reads/writes for a given file reside on the same node.

    • The same file is read by processes on multiple nodes using buffered I/O, but there are no processes writing to the file.

    • The same file is read and written by processes on more than one node using direct-access I/O.

For most filesystem loads, the scenarios above represent the bulk of the file accesses. Thus, CXFS delivers fast local file performance. CXFS is also useful when the amount of data I/O is larger than the amount of metadata I/O. CXFS is faster than NFS because the data does not go through the network.

Performance Considerations

CXFS may not give optimal performance under the following circumstances, and extra consideration should be given to using CXFS in these cases:

  • When you want to access files only on the local host.

  • When distributed applications write to shared files that are memory mapped.

  • When exporting a CXFS filesystem via NFS, be aware that performance will be much better when the export is performed from a CXFS metadata server than when it is performed from a CXFS client.

  • When access would be as slow with CXFS as with network filesystems, such as with the following:

    • Small files

    • Low bandwidth

    • Lots of metadata transfer

    Metadata operations can take longer to complete through CXFS than on local filesystems. Metadata transaction examples include the following:

    • Opening and closing a file

    • Changing file size (usually extending a file)

    • Creating and deleting files

    • Searching a directory

    In addition, multiple processes on multiple hosts that are reading and writing the same file using buffered I/O can be slower with CXFS than when using a local filesystem. This performance difference comes from maintaining coherency among the distributed file buffers; a write into a shared, buffered file will invalidate data (pertaining to that file) that is buffered in other hosts.

Comparison of Network and CXFS Filesystems

Network filesystems and CXFS filesystems perform many of the same functions, but with important performance and functional differences noted here.

Network Filesystems

Accessing remote files over local area networks (LANs) can be significantly slower than accessing local files. The network hardware and software introduce delays that tend to significantly lower the transaction rates and the bandwidth. These delays are difficult to avoid in the client-server architecture of LAN-based network filesystems. The delays stem from the limits of the LAN bandwidth and latency and the shared path through the data server.

LAN bandwidths force an upper limit for the speed of most existing shared filesystems. This can be one to several orders of magnitude slower than the bandwidth possible across multiple disk channels to local or shared disks. The layers of network protocols and server software also tend to limit the bandwidth rates.

A shared fileserver can be a bottleneck for performance when multiple clients wait their turns for data, which must pass through the centralized fileserver. For example, NFS and Samba servers read data from disks attached to the server, copy the data into UDP/IP or TCP/IP packets, and then send it over a LAN to a client host. When many clients access the server simultaneously, the server's responsiveness degrades.


Note: You should not use multiple Samba servers to export the same CXFS filesystem. For more information, see “Samba” in Chapter 12.


CXFS Filesystems

CXFS is a clustered XFS filesystem that allows for logical file sharing, as with network filesystems, but with significant performance and functionality advantages. CXFS runs on top of a storage area network (SAN), where each host in the cluster has direct high-speed data channels to a shared set of disks.

Features

CXFS has the following unique features:

  • A peer-to-disk model for the data access. The shared files are treated as local files by all of the hosts in the cluster. Each host can read and write the disks at near-local disk speeds; the data passes directly from the disks to the host requesting the I/O, without passing through a data server or over a local area network (LAN). For the data path, each host is a peer on the SAN; each can have equally fast direct data paths to the shared disks.

    Therefore, adding disk channels and storage to the SAN can scale the bandwidth. On large systems, the bandwidth can scale to gigabytes and even tens of gigabytes per second. Compare this with a network filesystem with the data typically flowing over a 1- to 100-MB-per-second LAN.

    This peer-to-disk data path also removes the file-server data-path bottleneck found in most LAN-based shared filesystems.

  • Each host can buffer the shared disk much as it would for locally attached disks. CXFS maintains the coherency of these distributed buffers, preserving the advanced buffering techniques of the XFS filesystem.

  • A flat, single-system view of the filesystem; it is identical from all hosts sharing the filesystem and is not dependent on any particular host. The pathname is a normal POSIX pathname; for example, /u/username/directory.


    Note: A Windows CXFS client uses the same pathname to the filesystem as other clients beneath a preconfigured drive letter.

    The path does not vary if the metadata server moves from one node to another, if the metadata server name is changed, or if a metadata server is added or replaced. This simplifies storage management for administrators and users. Multiple processes on one host and processes distributed across multiple hosts have the same view of the filesystem, with performance similar on each host.

    This differs from typical network filesystems, which tend to include the name of the fileserver in the pathname. This difference reflects the simplicity of the SAN architecture with its direct-to-disk I/O compared with the extra hierarchy of the LAN filesystem that goes through a named server to get to the disks.

  • A full UNIX filesystem interface, including POSIX, System V, and BSD interfaces. This includes filesystem semantics such as mandatory and advisory record locks. No special record-locking library is required.

Restrictions

CXFS has the following restrictions:

  • Some filesystem semantics are not appropriate and not supported in shared filesystems. For example, the root filesystem is not an appropriate shared filesystem. Root filesystems belong to a particular host, with system files configured for each particular host's characteristics.

  • All processes using a named pipe must be on the same node.

  • Hierarchical storage management (HSM) applications must run on the metadata server.

  • The inode monitor device (imon) is not supported on CXFS filesystems. See “Initial Configuration Requirements and Recommendations” in Chapter 9.

The following XFS features are not supported in CXFS:

  • GRIO version 1.

  • Swap to a file residing on a CXFS filesystem.

Cluster Environment

This section discusses the following:

For details about CXFS daemons, communication paths, and the flow of metadata, see Appendix A, “CXFS Software Architecture”.

Terminology

This section defines the terminology necessary to understand CXFS. Also see the Glossary.

Cluster

A cluster is the set of systems (nodes) configured to work together as a single computing resource. A cluster is identified by a simple name and a cluster ID. A cluster running multiple operating systems is known as a multiOS cluster.

Only one cluster may be formed from a given pool of nodes.

Disks or logical units (LUNs) are assigned to clusters by recording the name of the cluster on the disk (or LUN). Thus, if any disk is accessible (via a Fibre Channel connection) from nodes in different clusters, then those clusters must have unique names. When members of a cluster send messages to each other, they identify their cluster via the cluster ID. Thus, if two clusters will be sharing the same network for communications, then they must have unique cluster IDs. In the case of multiOS clusters, both the names and IDs must be unique if the clusters share a network.

Because of the above restrictions on cluster names and cluster IDs, and because cluster names and cluster IDs cannot be changed once the cluster is created (without deleting the cluster and recreating it), SGI advises that you choose unique names and cluster IDs for each of the clusters within your organization.

Node

A node is an operating system (OS) image, usually an individual computer. (This use of the term node does not have the same meaning as a node in an SGI Origin 3000 or SGI 2000 system.)

A given node can be a member of only one pool and therefore only one cluster.

Pool

The pool is the set of nodes from which a particular cluster may be formed. Only one cluster may be configured from a given pool, and it need not contain all of the available nodes. (Other pools may exist, but each is disjoint from the other. They share no node or cluster definitions.)

A pool is first formed when you connect to a given CXFS administration node (one that is installed with cluster_admin) and define that node in the cluster database using the CXFS GUI or cmgr command. You can then add other nodes to the pool by defining them while still connected to the first node. (If you were to connect to a different node and then define it, you would be creating a second pool).

Figure 1-1 shows the concepts of pool and cluster.

Figure 1-1. Pool and Cluster Concepts


Cluster Database

The cluster database contains configuration information about nodes, the cluster, logging information, and configuration parameters. The cluster administration daemons manage the distribution of the cluster database (CDB) across the CXFS administration nodes in the pool.

The database consists of a collection of files; you can view and modify the contents of the database by using the CXFS Manager GUI and the cmgr, cluster_status, clconf_info, cxfs-config, and cxfs_info commands. You must connect the GUI to a CXFS administration node, and the cmgr, cluster_status, and clconf_info commands must run on a CXFS administration node. You can use the cxfs_info command on client-only nodes.
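For example, both status commands are typically invoked without arguments (a hedged sketch; output formats vary by release and are not shown here):

On a CXFS administration node:

  irix# /usr/cluster/bin/clconf_info

On a client-only node:

  irix# /usr/cluster/bin/cxfs_info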

Node Functions

A node can have one of the following functions:

  • CXFS metadata server-capable administration node (IRIX or Linux 64-bit).

    This node is installed with the cluster_admin software product, which contains the full set of CXFS cluster administration daemons (fs2d, clconfd, crsd, cad, and cmond; for more details about daemons, see Appendix A, “CXFS Software Architecture”.)

    This node type is capable of coordinating cluster activity and metadata. Metadata is information that describes a file, such as the file's name, size, location, and permissions. Metadata tends to be small, usually about 512 bytes per file in XFS. This differs from the data, which is the contents of the file. The data may be many megabytes or gigabytes in size.

    For each CXFS filesystem, one node is responsible for updating that filesystem's metadata. This node is referred to as the metadata server. Only nodes defined as server-capable nodes are eligible to be metadata servers.

    Multiple CXFS administration nodes can be defined as potential metadata servers for a given CXFS filesystem, but only one node per filesystem is chosen to be the active metadata server. All of the potential metadata servers for a given cluster must be either all IRIX or all Linux 64-bit. There can be multiple active metadata servers in the cluster, one per CXFS filesystem.

    Other nodes that mount a CXFS filesystem are referred to as CXFS clients. A CXFS administration node can function as either a metadata server or CXFS client, depending upon how it is configured and whether it is chosen to be the active metadata server.


    Note: Do not confuse metadata server and CXFS client with the traditional data-path client/server model used by network filesystems. Only the metadata information passes through the metadata server via the private Ethernet network; the data is passed directly to and from disk on the CXFS client via the Fibre Channel connection.

    You perform cluster administration tasks by using the cmgr command running on a CXFS administration node or by using the CXFS Manager GUI and connecting it to a CXFS administration node. For more details, see:

    There should be an odd number of server-capable administration nodes for quorum calculation purposes.

  • CXFS client administration node (IRIX or Linux 64-bit).

    This is a node that is installed with the cluster_admin software product but it cannot be a metadata server. This node type should only be used when necessary for coexecution with FailSafe.

  • CXFS client-only node (any supported CXFS operating system).

    This node is one that runs a minimal implementation of the cluster services. This node can safely mount CXFS filesystems but it cannot become a CXFS metadata server or perform cluster administration. Client-only nodes retrieve the information necessary for their tasks by communicating with an administration node. This node does not contain a copy of the cluster database.

    IRIX and Linux 64-bit nodes are client-only nodes if they are installed with the cxfs_client software package and defined as client-only nodes. Nodes that are running supported operating systems other than IRIX or Linux 64-bit are always configured as CXFS client-only nodes.

    For more information, see CXFS MultiOS Client-Only Guide for SGI InfiniteStorage.

Figure 1-2 shows nodes in a pool that are installed with cluster_admin and others that are installed with cxfs_client. Only those nodes with cluster_admin have the fs2d daemon and therefore a copy of the cluster database.

Figure 1-2. Installation Differences

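A quick, hedged way to confirm which set of software a node is running is to check for the cluster administration daemons with ps (the egrep pattern here is illustrative):

  irix# ps -ef | egrep 'fs2d|clconfd|crsd|cad|cmond'

On an administration node, fs2d and the other cluster administration daemons appear once the cluster daemons and CXFS services are started; on a client-only node, you would instead expect to see only the cxfs_client daemon.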

A standby node is a server-capable administration node that is configured as a potential metadata server for a given filesystem, but does not currently run any applications that will use that filesystem. (The node can run applications that use other filesystems.)

Ideally, all administration nodes will run the same version of the operating system. However, as of IRIX 6.5.18f, SGI supports a policy for CXFS that permits a rolling annual upgrade; see “Rolling Upgrades” in Chapter 8.

The following figures show different possibilities for metadata server and client configurations. The potential metadata servers are required to be CXFS administration nodes and must all run IRIX or all run Linux 64-bit; the other nodes could be client-only nodes.

Figure 1-3. Evenly Distributed Metadata Servers


Figure 1-4. Multiple Metadata Servers


In Figure 1-4, Node4 could be running any supported OS because it is a client-only node; it is not a potential metadata server.

Figure 1-5. One Metadata Server


In Figure 1-5, Node2, Node3, and Node4 could be running any supported OS because they are client-only nodes; they are not potential metadata servers.

Figure 1-6. Standby Mode


Figure 1-6 shows a configuration in which Node1 and Node2 are potential metadata servers for filesystems /a and /b:

  • Node1 is the active metadata server for /a

  • Node2 is the active metadata server for /b

Because standby mode is used, neither Node1 nor Node2 runs applications that use /a or /b. The figure shows one client-only node, but there could be several.

Membership

The nodes in a cluster must act together to provide a service. To act in a coordinated fashion, each node must know about all the other nodes currently active and providing the service. The set of nodes that are currently working together to provide a service is called a membership:

  • Cluster database membership (also known as fs2d membership or user-space membership) is the group of administration nodes that are accessible to each other. (Client-only nodes are not eligible for cluster database membership.) The nodes that are part of the cluster database membership work together to coordinate configuration changes to the cluster database.

  • CXFS kernel membership is the group of CXFS nodes in the cluster that can actively share filesystems, as determined by the CXFS kernel, which manages membership and heartbeating. The CXFS kernel membership may be a subset of the nodes defined in a cluster. All nodes in the cluster are eligible for CXFS kernel membership.

Heartbeat messages for each membership type are exchanged via a private network so that each node can verify each membership.

A cluster that is also running FailSafe has a FailSafe membership, which is the group of nodes that provide highly available (HA) resources for the cluster. For more information, see Appendix B, “Memberships and Quorums”, and the FailSafe Administrator's Guide for SGI InfiniteStorage.

Private Network

A private network is one that is dedicated to cluster communication and is accessible by administrators but not by users.


Note: A virtual local area network (VLAN) is not supported for a private network.

CXFS uses the private network for metadata traffic. The cluster software uses the private network to send the heartbeat/control messages necessary for the cluster configuration to function. Even small variations in heartbeat timing can cause problems. If there are delays in receiving heartbeat messages, the cluster software may determine that a node is not responding and therefore revoke its CXFS kernel membership; this causes it to either be reset or disconnected, depending upon the configuration.

Rebooting network equipment can cause the nodes in a cluster to lose communication and may result in the loss of CXFS kernel membership and/or cluster database membership; the cluster will move into a degraded state or shut down if communication between nodes is lost. Using a private network limits the traffic on the network and therefore will help avoid unnecessary resets or disconnects. Also, a network with restricted access is safer than one with user access because the messaging protocol does not prevent snooping (illicit viewing) or spoofing (in which one machine on the network masquerades as another).

Therefore, because the performance and security characteristics of a public network could cause problems in the cluster and because heartbeat is very timing-dependent, a private network is required.

The heartbeat and control network must be connected to all nodes, and all nodes must be configured to use the same subnet for that network.
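For example, each node might carry entries such as the following in /etc/hosts so that the private hostnames used in the cluster definition resolve identically on every node (the addresses and names shown are purely illustrative):

  10.0.10.1   cxfs1-priv
  10.0.10.2   cxfs2-priv
  10.0.10.3   cxfs3-priv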


Caution: If there are any network issues on the private network, fix them before trying to use CXFS.


For more information about network segments and partitioning, see Appendix B, “Memberships and Quorums”. For information about using IP filtering for the private network, see Appendix C, “IP Filtering Example for the CXFS Private Network”.

Relocation

Relocation is the process by which the metadata server moves from one node to another due to an administrative action; other services on the first node are not interrupted.


Note: Relocation is supported only on standby nodes. Relocation is disabled by default.

A standby node is a metadata server-capable administration node that is configured as a potential metadata server for a given filesystem, but does not currently run any applications that will use that filesystem. To use relocation, you must not run any applications on any of the potential metadata servers for a given filesystem; after the active metadata server has been chosen by the system, you can then run applications that use the filesystem on the active metadata server and client-only nodes.

To use relocation in standby mode, you must enable relocation on the metadata server (it is disabled by default) by setting the cxfs_relocation_ok parameter as follows:

  • IRIX:

    • Enable:

      irix# systune cxfs_relocation_ok 1

    • Disable:

      irix# systune cxfs_relocation_ok 0

  • Linux 64-bit:

    • Enable:

      [root@linux64 root]# sysctl -w fs.cxfs.cxfs_relocation_ok=1

    • Disable:

      [root@linux64 root]# sysctl -w fs.cxfs.cxfs_relocation_ok=0
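To confirm the current setting before relying on relocation (a hedged sketch; on both platforms, giving the parameter name without a value displays its current value):

  • IRIX:

      irix# systune cxfs_relocation_ok

  • Linux 64-bit:

      [root@linux64 root]# sysctl fs.cxfs.cxfs_relocation_ok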

CXFS kernel membership is not affected by relocation. However, users may experience a degradation in filesystem performance while the metadata server is relocating.

The following are examples of relocation triggers:

  • The system administrator uses the GUI or cmgr to relocate the metadata server.

  • The FailSafe CXFS resource relocates the IRIX metadata server. (FailSafe coexecution only applies to IRIX administration nodes.)

  • The system administrator unmounts the CXFS filesystem on an IRIX metadata server. (Unmounting on a Linux 64-bit metadata server does not trigger relocation; the Linux 64-bit server will just return an EBUSY flag.)

Recovery

Recovery is the process by which the metadata server moves from one node to another due to an interruption in services on the first node.


Note: Recovery is supported only on standby nodes.

To use recovery in standby mode, you must not run any applications on any of the potential metadata servers for a given filesystem; after the active metadata server has been chosen by the system, you can then run applications that use the filesystem on the active metadata server and client-only nodes.

The following are examples of recovery triggers:

  • A metadata server panic

  • A metadata server locks up, causing heartbeat timeouts on metadata clients

  • A metadata server loses connection to the heartbeat network

Figure 1-7 describes the difference between relocation and recovery for a metadata server. (Remember that there is one active metadata server per CXFS filesystem. There can be multiple active metadata servers within a cluster, one for each CXFS filesystem.)

Figure 1-7. Relocation versus Recovery


CXFS Tiebreaker

The CXFS tiebreaker node is used in the process of computing the CXFS kernel membership for the cluster when exactly half the nodes in the cluster are up and can communicate with each other. There is no default CXFS tiebreaker.

A tiebreaker should be used in addition to I/O fencing or reset; see “Isolating Failed Nodes”.

The CXFS tiebreaker differs from the FailSafe tiebreaker; see the FailSafe Administrator's Guide for SGI InfiniteStorage.

Isolating Failed Nodes

CXFS uses the following methods to isolate failed nodes:

  • I/O fencing, which isolates a problem node from the SAN by disabling a node's Fibre Channel ports so that it cannot access I/O devices, and therefore cannot corrupt data in the shared CXFS filesystem. I/O fencing can be applied to any node in the cluster. When fencing is applied, the rest of the cluster can begin immediate recovery.

    I/O fencing is required to protect data for the following nodes:

    • The following nodes running IRIX:

      • Silicon Graphics Fuel nodes

      • Silicon Graphics Octane nodes

      • Silicon Graphics Octane2 nodes

    • Nodes running operating systems other than IRIX or Linux 64-bit

    To support I/O fencing, these platforms require a Brocade Fibre Channel switch sold and supported by SGI. The fencing network connected to the Brocade switch must be physically separate from the private heartbeat network.


    Note: I/O fencing differs from zoning. Fencing is a generic cluster term that means to erect a barrier between a host and shared cluster resources. Zoning is the ability to define logical subsets of the switch (zones), with the ability to include or exclude hosts and media from a given zone. A host can access only media that are included in its zone. Zoning is one possible implementation of fencing.

    Zoning implementation is complex and does not have uniform availability across switches. Therefore, SGI chose to implement a simpler form of fencing: enabling/disabling a host's Fibre Channel ports.


  • Reset, which performs a system reset via a serial line connected to the system controller. The reset method applies only to nodes with system controllers.

  • I/O fencing and reset, which disables access to the SAN from the problem node and then, if the node is successfully fenced, performs an asynchronous reset if the node has a system controller; recovery begins without waiting for reset acknowledgment. This method applies to nodes with system controllers (required for reset).

  • CXFS shutdown, which stops CXFS kernel-based services on the node in response to a loss of CXFS kernel membership. The surviving cluster delays the beginning of recovery to allow the node time to complete the shutdown.

On nodes without system controllers, data integrity protection requires I/O fencing.

On nodes with system controllers, you would want to use I/O fencing for data integrity protection when CXFS is just a part of what the node is doing and therefore losing access to CXFS is preferable to having the system rebooted. An example of this would be a large compute server that is also a CXFS client. You would want to use reset for I/O protection on a node when CXFS is a primary activity and you want to get it back online fast; for example, a CXFS fileserver. However, I/O fencing cannot return a nonresponsive node to the cluster; this problem will require intervention from the system administrator.

You can specify how these methods are implemented by defining the failure action hierarchy, the set of instructions that determines which method is used; see “Define a Node with the GUI” in Chapter 10, and “Define a Node with cmgr” in Chapter 11. If you do not define a failure action hierarchy, the default is to perform a reset and a CXFS shutdown.

The rest of this section provides more details about I/O fencing and resets. For more information, see “Normal CXFS Shutdown” in Chapter 12.

I/O Fencing

I/O fencing does the following:

  • Preserves data integrity by preventing I/O from nodes that have been expelled from the cluster

  • Speeds the recovery of the surviving cluster, which can continue immediately rather than waiting for an expelled node to reset under some circumstances

When a node joins the CXFS kernel membership, the worldwide port name (WWPN) of its host bus adapter (HBA) is stored in the cluster database. If there are problems with the node, the I/O fencing software sends a message via the telnet protocol to the appropriate switch and disables the port.


Caution: The telnet port must be kept free in order for I/O fencing to succeed.

The Brocade Fibre Channel switch then blocks the problem node from communicating with the storage area network (SAN) resources via the corresponding HBA. Figure 1-8 describes this.
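If you need to inspect the switch state directly (a hedged sketch; the switch name is hypothetical, and switchshow is a standard Brocade Fabric OS command whose exact output depends on the firmware level), you can log in to the switch over the same telnet path that the fencing software uses:

  irix# telnet brocade1
  brocade1:admin> switchshow

Raising and lowering fences should normally be done through the GUI or cmgr as described later in this guide, not by disabling or enabling switch ports by hand.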

If users require access to nonclustered LUNs or devices in the SAN, these LUNs/devices must be accessed or mounted via an HBA that has been explicitly masked from fencing. For details on how to exclude HBAs from fencing for nodes, see:

For nodes running other supported operating systems, see CXFS MultiOS Client-Only Guide for SGI InfiniteStorage .

To recover, the affected node withdraws from the CXFS kernel membership, unmounts all file systems that are using an I/O path via fenced HBA(s), and then rejoins the cluster. This process is called fencing recovery and is initiated automatically. Depending on the failure action hierarchy that has been configured, a node may be reset (rebooted) before initiating fencing recovery. For information about setting the failure action hierarchy, see “Define a Node with cmgr” in Chapter 11, and “Define a Node with the GUI” in Chapter 10.

In order for a fenced node to rejoin the CXFS kernel membership, the current cluster leader must lower its fence to allow it to reprobe its XVM volumes and then remount its filesystems. If a node fails to rejoin the CXFS kernel membership, it may remain fenced. This is independent of whether the node was rebooted, because fencing is an operation applied on the Brocade Fibre Channel switch, not the affected node. In certain cases, it may therefore be necessary to manually lower a fence. For instructions, see “Lower the I/O Fence for a Node with the GUI” in Chapter 10, and “Lower the I/O Fence for a Node with cmgr” in Chapter 11.


Caution: When a fence is raised on an HBA, no further I/O is possible to the SAN via that HBA until the fence is lowered. This includes the following:

  • I/O that is queued in the kernel driver, on which user processes and applications may be blocked waiting for completion. These processes will return the EIO error code under UNIX, or display a warning dialog that I/O could not be completed under Windows.

  • I/O issued via the affected HBAs to nonclustered (local) logical units (LUNs) in the SAN or to other Fibre Channel devices such as tape storage devices.



Figure 1-8. I/O Fencing


For more information, see “Switches and I/O Fencing Tasks with the GUI” in Chapter 10, and “Switches and I/O Fencing Tasks with cmgr” in Chapter 11.


Note: I/O fencing cannot be used for FailSafe nodes. FailSafe nodes require the reset capability.


Reset

IRIX and Linux 64-bit nodes with system controllers can be reset via a serial line connected to the system controller. The reset can be one of the following methods:

  • Power Cycle shuts off power to the node and then restarts it

  • Reset simulates the pressing of the reset button on the front of the machine

  • NMI (nonmaskable interrupt) performs a core-dump of the operating system kernel, which may be useful when debugging a faulty machine

Figure 1-9 shows an example of the CXFS hardware components for a cluster using the reset capability and an Ethernet serial port multiplexer.


Note: The reset capability or the use of I/O fencing and switches is mandatory to ensure data integrity for all nodes. Clusters should have an odd number of server-capable nodes. (See “CXFS Recovery Issues in a Cluster with Only Two Server-Capable Nodes” in Appendix B.)

The reset connection has the same connection configuration as FailSafe; for more information, contact SGI professional or managed services.

Figure 1-9. Example of a Cluster using Reset


Nodes that have lost contact with the cluster will forcibly terminate access to shared disks. However, SGI requires the use of reset or I/O fencing to protect data integrity. Clusters should have an odd number of server-capable nodes. Reset is required for FailSafe.

The worst scenario is one in which the node does not detect the loss of communication but still allows access to the shared disks, leading to data corruption. For example, it is possible that one node in the cluster could be unable to communicate with other nodes in the cluster (due to a software or hardware failure) but still be able to access shared disks, despite the fact that the cluster does not see this node as an active member.

In this case, the reset will allow one of the other nodes to forcibly prevent the failing node from accessing the disk at the instant the error is detected and prior to recovery from the node's departure from the cluster, ensuring no further activity from this node.

In a case of a true network partition, where an existing CXFS kernel membership splits into two halves (each with half the total number of server-capable nodes), the following will happen:

  • If the CXFS tiebreaker and reset or fencing are configured, the half with the tiebreaker node will reset or fence the other half. The side without the tiebreaker will attempt to forcibly shut down CXFS services.

  • If there is no CXFS tiebreaker node but reset or fencing is configured, each half will attempt to reset or fence the other half using a delay heuristic. One half will succeed and continue. The other will lose the reset/fence race and be rebooted/fenced.

  • If there is no CXFS tiebreaker node and reset or fencing is not configured, then both halves will delay, each assuming that one will win the race and reset the other. Both halves will then continue running, because neither will have been reset or fenced, leading to likely data corruption.

    To avoid this situation, you should configure a tiebreaker node, and you must use reset or I/O fencing. However, if the tiebreaker node (in a cluster with only two server-capable nodes) fails, or if the administrator stops CXFS services, the other node will do a forced shutdown, which unmounts all CXFS filesystems.

If the network partition persists when the losing half attempts to form a CXFS kernel membership, it will have only half the number of server-capable nodes and be unable to form an initial CXFS kernel membership, preventing two CXFS kernel memberships in a single cluster.

The reset connections take the following forms:

  • Clusters of two nodes can be directly connected with serial lines.

  • Clusters of three or more nodes should be connected with a serial port multiplexer. Each node is defined to have an owner host, which is the node that has the ability to reset it.

For more information, contact SGI professional or managed services.

The Cluster Database and CXFS Clients

The distributed cluster database (CDB) is central to the management of the CXFS cluster. Multiple synchronized copies of the database are maintained across the CXFS administration nodes in the pool (that is, those nodes installed with the cluster_admin software package). For any given CXFS Manager GUI task or cmgr task, the CXFS cluster daemons must apply the associated changes to the cluster database and distribute the changes to each CXFS administration node before another task can begin.

The client-only nodes in the pool do not maintain a local synchronized copy of the full cluster database. Instead, one of the daemons running on a CXFS administration node provides relevant database information to those nodes. If the set of CXFS administration nodes changes, another node may become responsible for updating the client-only nodes.

Metadata Server Functions

The metadata server must perform cluster-coordination functions such as the following:

  • Metadata logging

  • File locking

  • Buffer coherency

  • Filesystem block allocation

All CXFS requests for metadata are routed over a TCP/IP network and through the metadata server, and all changes to metadata are sent to the metadata server. The metadata server uses the advanced XFS journal features to log the metadata changes. Because the size of the metadata is typically small, the bandwidth of a fast Ethernet local area network (LAN) is generally sufficient for the metadata traffic.

The operations to the CXFS metadata server are typically infrequent compared with the data operations directly to the disks. For example, opening a file causes a request for the file information from the metadata server. After the file is open, a process can usually read and write the file many times without additional metadata requests. When the file size or other metadata attributes for the file change, this triggers a metadata operation.

The following rules apply:

  • Any node installed with the cluster_admin product can be defined as a server-capable administration node.

  • Although you can configure multiple server-capable CXFS administration nodes to be potential metadata servers for a given filesystem, only the first of these nodes to mount the filesystem will become the active metadata server. The list of potential metadata servers for a given filesystem is ordered, but because of network latencies and other unpredictable delays, it is impossible to predict which node will become the active metadata server.

  • A single server-capable node in the cluster can be the active metadata server for multiple filesystems at once.

  • There can be multiple server-capable nodes that are active metadata servers, each with a different set of filesystems. However, a given filesystem has a single active metadata server on a single node.

  • If the last potential metadata server for a filesystem goes down while there are active CXFS clients, all of the clients will be forced out of the filesystem. (If another potential metadata server exists in the list, recovery will take place. For more information, see “Metadata Server Recovery” in Chapter 12.)

  • If you are exporting the CXFS filesystem to be used with other NFS clients, the filesystem should be exported from the active metadata server for best performance. For more information on NFS exporting of CXFS filesystems, see “NFS Export Scripts” in Chapter 12.
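As a hedged illustration of the NFS export point above (the mount point and client names are hypothetical, and export option syntax differs between IRIX and Linux 64-bit), the /etc/exports entry on the active metadata server might resemble:

  /mnt/cxfs1 -rw,access=nfsclient1:nfsclient2

See the exports man page on the exporting node for the option syntax it actually supports, and “NFS Export Scripts” in Chapter 12 for the CXFS-specific details.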

For more information, see “Flow of Metadata for Reads and Writes” in Appendix A.

System View

CXFS provides a single-system view of the filesystems; each host in the SAN has equally direct access to the shared disks and common pathnames to the files. CXFS lets you scale the shared-filesystem performance as needed by adding disk channels and storage to increase the direct host-to-disk bandwidth. The CXFS shared-file performance is not limited by LAN speeds or a bottleneck of data passing through a centralized fileserver. It combines the speed of near-local disk access with the flexibility, scalability, and reliability of clustering.

Hardware and Software Support

This section discusses the following:

Requirements

CXFS requires the following:

  • All server-capable administration nodes must run the same type of operating system, either IRIX or SGI ProPack for Linux 64-bit. This guide supports IRIX 6.5.24 and SGI ProPack 3.0 for Linux. (Additional release levels may be used in the cluster while performing an upgrade; see “Rolling Upgrades” in Chapter 8.)

  • A supported SAN hardware configuration using the following platforms for metadata servers:

    • An SGI Altix 3000 server

    • IRIX systems with system controllers (which means that the nodes can use either reset or I/O fencing for data integrity protection):

      • SGI Origin 300 server

      • SGI Origin 3000 series

      • SGI 2000 series

      • SGI Origin 200 server

      • Silicon Graphics Onyx2 system

      • Silicon Graphics Onyx 3000 series

      • Silicon Graphics Tezro

    • IRIX systems without system controllers (which means that the nodes require I/O fencing for data integrity protection):

      • Silicon Graphics Fuel visual workstation

      • Silicon Graphics Octane system

      • Silicon Graphics Octane2 system


    Note: For details about supported hardware, see the Entitlement Sheet that accompanies the release materials. Using unsupported hardware constitutes a breach of the CXFS license. CXFS does not support the Silicon Graphics O2 workstation and therefore it cannot be a CXFS reset server. CXFS does not support JBOD.


  • A private 100baseT or Gigabit Ethernet TCP/IP network connected to each node.


    Note: When using Gigabit Ethernet, do not use jumbo frames. For more information, see the tgconfig man page.


  • Serial lines or Brocade Fibre Channel switches sold and supported by SGI (which support I/O fencing). Either reset or I/O fencing is required for all nodes. Clusters should have an odd number of server-capable nodes. See Chapter 3, “Brocade Fibre Channel Switch Verification”.

  • At least one QLogic Fibre Channel host bus adapter (HBA):

    • IRIX:

      • QLA2200

      • QLA2310

      • QLA2342

      • QLA2344

    • Linux 64-bit:

      • QLA2200 (copper only)

      • QLA2342

      • QLA2344

  • RAID hardware as specified in Chapter 2, “SGI RAID Firmware”.

  • Adequate compute power for CXFS nodes, particularly metadata servers, which must deal with the required communication and I/O overhead. There should be at least 2 GB of RAM on the system.

    A metadata server must have at least 1 processor and 1 GB of memory more than what it would need for its normal workload (non-CXFS work). In general, this means that the minimum configuration would be 2 processors and 2 GB of memory. If the metadata server is also doing NFS or Samba serving, then more memory is recommended (and the nbuf and ncsize kernel parameters should be increased from their defaults).

    CXFS makes heavy use of memory for caching. If a very large number of files (tens of thousands) are expected to be open at any one time, additional memory over the minimum is also recommended. In addition, about half of a CPU should be allocated for each Gigabit Ethernet interface on the system if it is expected to run at close to full speed.

  • A FLEXlm license key for CXFS. Linux 64-bit also requires a license for XVM.

    XVM provides a mirroring feature. If you want to access a mirrored volume from a given node in the cluster, you must purchase the XFS Volume Plexing software option and obtain and install a FLEXlm license. Except for Linux 64-bit systems, which always require an XVM license, only those nodes that will access the mirrored volume must be licensed. For information about purchasing this license, see your sales representative.


    Note: Partitioned Origin 3000 and Onyx 3000 systems upgrading to IRIX 6.5.15f or later will require replacement licenses. Prior to IRIX 6.5.15f, these partitioned systems used the same license for all the partitions in the system. For more information, see the Start Here/Welcome and the following web page: http://www.sgi.com/support/licensing/partitionlic.html.


  • The XVM volume manager, which is provided as part of the IRIX release.

  • If you use I/O fencing and ipfilterd on a node, the ipfilterd configuration must allow communication between the node and the telnet port on the switch.

A cluster is supported with as many as 64 nodes, of which as many as 16 can be server-capable administration nodes. (See “Initial Configuration Requirements and Recommendations” in Chapter 9.)

A cluster in which both CXFS and FailSafe are run (known as coexecution) is supported with a maximum of 64 nodes, as many as 8 of which can run FailSafe. The administration nodes must run IRIX; FailSafe is not supported on Linux 64-bit nodes. Even when running with FailSafe, there is only one pool and one cluster. See “Overview of FailSafe Coexecution”, for further configuration details.

Requirements Specific to IRIX

  • The IRIX nodes in a cluster containing nodes running other operating systems must be running IRIX 6.5.16f or later.

  • IRIX nodes do not permit nested mount points on CXFS filesystems; that is, you cannot mount an IRIX XFS or CXFS filesystem on top of an existing CXFS filesystem.

Requirements Specific to Linux 64-bit

Using a Linux 64-bit node to support CXFS requires the following:

  • CXFS 3.2 for SGI Altix 3000 running SGI ProPack 3.0 for Linux

  • SGI ProPack 3.0 for Linux

  • An SGI Altix server

  • At least one QLogic QLA2310 or QLogic QLA2342 Fibre Channel host bus adapter (HBA)

IRIX nodes do not permit nested mount points on CXFS filesystems; that is, you cannot mount an IRIX XFS or CXFS filesystem on top of an existing CXFS filesystem. Although it is possible to mount other filesystems on top of a Linux 64-bit CXFS filesystem, this is not recommended.

Compatibility

CXFS is compatible with the following:

  • Data Migration Facility (DMF) and Tape Management Facility (TMF).

  • Trusted IRIX. CXFS has been qualified in an SGI Trusted IRIX cluster with the Data Migration Facility (DMF) and Tape Management Facility (TMF).

    If you want to run CXFS and Trusted IRIX, all server-capable administration nodes must run Trusted IRIX. Client-only nodes can be running IRIX. For more information, see Chapter 15, “Trusted IRIX and CXFS”.

  • FailSafe (coexecution). See the “Overview of FailSafe Coexecution”, and the FailSafe Administrator's Guide for SGI InfiniteStorage.

  • IRISconsole; see the IRISconsole Administrator's Guide. (CXFS does not support the Silicon Graphics O2 workstation as a CXFS node.)

  • A serial port multiplexer used for the reset capability.

Recommendations

SGI recommends the following when running CXFS:

  • You should isolate the power supply for the Brocade Fibre Channel switch from the power supply for a node and its system controller (MSC, MMSC, L2, or L1). You should avoid any possible situation in which a node can continue running while both the switch and the system controller lose power. Avoiding this situation will prevent the possibility of multiple clusters being formed due to a lack of reset (also known as split-brain syndrome).

  • If you use I/O fencing, SGI recommends that you use a switched network of at least 100baseT.

  • Only those nodes that you want to be potential metadata servers should be CXFS administration nodes (installed with the cluster_admin software product). CXFS client administration nodes should only be used when necessary for coexecution with IRIS FailSafe. All other nodes should be client-only nodes (installed with cxfs_client).

    The advantage to using client-only nodes is that they do not keep a copy of the cluster database; they contact an administration node to get configuration information. It is easier and faster to keep the database synchronized on a small set of nodes, rather than on every node in the cluster. In addition, if there are issues, there will be a smaller set of nodes on which you must look for problems.

  • Use a network switch rather than a hub for performance and control.

  • All nodes should be on the same physical network segment. Two clusters should not share the same private network.

  • A production cluster should be configured with a minimum of three server-capable nodes.

  • If you want to run CXFS and Trusted IRIX, have all nodes in the cluster run Trusted IRIX. You should configure your system such that all nodes in the cluster have the same user IDs, access control lists (ACLs), and capabilities.

  • As with any long-running jobs, you should use the IRIX checkpoint and restart feature. For more information, see the cpr man page.

For more configuration and administration suggestions, see “Initial Configuration Requirements and Recommendations” in Chapter 9.

Overview of FailSafe Coexecution

CXFS allows groups of computers to coherently share large amounts of data while maintaining high performance. The FailSafe product provides a general facility for providing highly available services.

You can therefore use FailSafe in a CXFS cluster (known as coexecution) to provide highly available services (such as NFS or web) running on a CXFS filesystem. This combination provides high-performance shared data access for highly available applications in a clustered system.

CXFS 6.5.10 or later and IRIS FailSafe 2.1 or later (plus relevant patches) may be installed and run on the same system.

A subset of nodes in a coexecution cluster can be configured to be used as FailSafe nodes; a coexecution cluster can have up to eight nodes that run FailSafe.

The cluster database contains configuration information about nodes, the cluster, logging information, and configuration parameters. If you are running CXFS, it also contains information about CXFS filesystems and CXFS metadata servers, which coordinate the information that describes a file, such as the file's name, size, location, and permissions; there is one active metadata server per CXFS filesystem. If you are running FailSafe, it also contains information about resources, resource groups, and failover policies. Figure 1-10 depicts the contents of a coexecution cluster database.

Figure 1-10. Contents of a Coexecution Cluster Database


In a coexecution cluster, a subset of the nodes can run FailSafe but all of the nodes must run CXFS. If you have both FailSafe and CXFS running, the products share a single cluster and a single database. There are separate configuration GUIs for FailSafe and CXFS; the cmgr command performs configuration tasks for both CXFS and FailSafe in command-line mode. You can also view cluster information with the cluster_status and clconf_info commands.

The administration nodes can perform administrative tasks for FailSafe or CXFS and they run the fs2d cluster database daemon, which manages the cluster database and propagates it to each administration node in the pool. All FailSafe nodes are administration nodes, but some CXFS nodes do not perform administration tasks and are known as client-only nodes.

For more information, see Chapter 14, “Coexecution with FailSafe”.

Cluster Manager Tools Overview

CXFS provides a set of tools to manage the cluster. These tools execute only on the appropriate node types:

  • Administration nodes:

    • cxfsmgr, which invokes the CXFS Manager graphical user interface (GUI)

    • cmgr (also known as cluster_mgr)

    • cluster_status

    • clconf_info

    • cxfs-config

  • Client-only nodes:

    • cxfs_info


Note: The GUI must be connected to a CXFS administration node, but it can be launched elsewhere; see “Starting the GUI” in Chapter 10.

You can perform CXFS configuration tasks using either the GUI or the cmgr cluster manager command. These tools update the cluster database, which persistently stores metadata and cluster configuration information.

Although these tools use the same underlying software command line interface (CLI) to configure and monitor a cluster, the GUI provides the following additional features, which are particularly important in a production system:

  • You can click any blue text to get more information about that concept or input field. Online help is also provided with the Help button.

  • The cluster state is shown visually for instant recognition of status and problems.

  • The state is updated dynamically for continuous system monitoring.

  • All inputs are checked for correct syntax before attempting to change the cluster configuration information. In every task, the cluster configuration will not update until you click OK.

  • Tasks take you step-by-step through configuration and management operations, making actual changes to the cluster configuration as you complete a task.

  • The graphical tools can be run securely and remotely on any IRIX workstation or any computer that has a Java-enabled web browser, including Windows and Linux computers and laptops.

The cmgr command is more limited in its functions. It enables you to configure and administer a cluster system only on a CXFS administration node (one that is installed with the cluster_admin software product). It provides a minimum of help and formatted output and does not provide dynamic status except when queried. However, an experienced administrator may find cmgr to be convenient when performing basic configuration tasks or isolated single tasks in a production environment, or when running scripts to automate some cluster administration tasks. You can use the build_cmgr_script command to automatically create a cmgr script based on the contents of the cluster database.
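For example (a hedged sketch that assumes build_cmgr_script reports the path of the script file it writes and that your cmgr release accepts a script file with -f; the file name shown is hypothetical):

  irix# build_cmgr_script
  irix# cmgr -f /tmp/cmgr_script_file

Replaying such a script is a convenient way to recreate a configuration or apply the same configuration to another cluster.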

After the associated changes are applied to all online database copies in the pool, the view area in the GUI will be updated. You can use the GUI or the cmgr, cluster_status, and clconf_info commands to view the state of the database. (The database is a collection of files, which you cannot access directly.) On a client-only node, you can use the cxfs_info command.

For more details, see the following:

Overview of the Installation and Configuration Steps

This section provides an overview of the installation, verification, and configuration steps for IRIX and for Linux 64-bit for SGI ProPack on SGI Altix 3000 systems.

CXFS Packages Installed

Different packages are installed on CXFS administration nodes and client-only nodes.

Client-Only Packages Installed

The following packages are installed on a client-only node:

  • Application binaries, documentation, and support tools:

    cxfs_client
    cxfs_util

  • Kernel libraries:

    cxfs
    eoe.sw.xvm

Administration Packages Installed

The following packages are installed on an administration node:

  • Application binaries, documentation, and support tools:

    cluster_admin
    cluster_control
    cluster_services
    cxfs_cluster
    cxfs_util

  • Kernel libraries:

    cxfs
    eoe.sw.xvm

  • GUI tools:

    sysadm_base
    sysadm_cluster
    sysadm_cxfs
    sysadm_xvm

CXFS Commands Installed

Different commands are installed on CXFS administration nodes and client-only nodes.

Client-Only Commands Installed

The following commands are shipped as part of the CXFS client-only package:

/usr/cluster/bin/cxfs_client (the CXFS client service)
/usr/cluster/bin/cxfs-config
/usr/cluster/bin/cxfsdump
/usr/cluster/bin/cxfslicense

These commands provide all of the services needed to include an IRIX or a Linux 64-bit client-only node.

For more information, see the cxfs_client man page.
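A simple, hedged check that the client service is active on a client-only node (the grep pattern is illustrative):

  irix# ps -ef | grep cxfs_client

The cxfs_client daemon must be running before the node can join the cluster and mount CXFS filesystems.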

Administration Commands Installed

The following commands are shipped as part of the CXFS administration package:

/usr/cluster/bin/clconf_info
/usr/cluster/bin/clconf_stats
/usr/cluster/bin/clconf_status
/usr/cluster/bin/clconfd
/usr/cluster/bin/cxfs-config
/usr/cluster/bin/cxfs_shutdown
/usr/cluster/bin/cxfsdump
/usr/cluster/bin/hafence
/usr/cluster/bin/cxfslicense

IRIX Overview

Following is the order of installation and configuration steps for an IRIX node:

  1. Install IRIX 6.5.24 according to the IRIX 6.5 Installation Instructions (if not already done).

  2. Install and verify the SGI RAID. See Chapter 2, “SGI RAID Firmware”.

  3. Install and verify the Brocade Fibre Channel switch. See Chapter 3, “Brocade Fibre Channel Switch Verification”.

  4. Obtain and install the CXFS license. If you want to access an XVM mirrored volume from a given node in the cluster, you must purchase a mirroring software option and obtain and install a FLEXlm license. Only those nodes that will access the mirrored volume must be licensed. For information about purchasing this license, see your sales representative. See Chapter 4, “Obtaining CXFS and XVM FLEXlm Licenses”.

  5. Prepare the node, including adding a private network. See “Adding a Private Network” in Chapter 5.

  6. Install the CXFS software (a brief command sketch follows these steps). See Chapter 6, “IRIX CXFS Installation”.

  7. Configure the cluster to define the new node in the pool, add it to the cluster, start CXFS services, and mount filesystems. See “Guided Configuration Tasks” in Chapter 10.
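The following hedged sketch expands on step 6 for an administration node (the distribution path is hypothetical, and Chapter 6 lists the exact subsystems to select for your configuration):

  irix# inst -f /CDROM/dist
  Inst> install cluster_admin cluster_control cluster_services cxfs_cluster cxfs_util
  Inst> go
  Inst> quit

A client-only IRIX node would instead install the cxfs_client and cxfs_util packages.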

Linux 64-bit on SGI Altix Overview

Following is the order of installation and configuration steps for a Linux 64-bit node:

  1. Read the release notes README file for the Linux 64-bit platform to learn about any late-breaking changes in the installation procedure.

  2. Install the SGI ProPack 3.0 for Linux release, according to the directions in the SGI ProPack documentation. Ensure that you select the SGI Licensed package group for installation.

  3. Install and verify the SGI RAID. See Chapter 2, “SGI RAID Firmware”.

  4. Install and verify the Brocade Fibre Channel switch. See Chapter 3, “Brocade Fibre Channel Switch Verification”.

  5. Obtain and install the CXFS license. If you want to access an XVM mirrored volume from a given node in the cluster, you must purchase a mirroring software option and obtain and install a FLEXlm license. Only those nodes that will access the mirrored volume must be licensed. For information about purchasing this license, see your sales representative. See Chapter 4, “Obtaining CXFS and XVM FLEXlm Licenses”.

  6. Prepare the node, including adding a private network. See “Adding a Private Network” in Chapter 5.

  7. Install the CXFS software. See Chapter 7, “Linux 64-bit CXFS Installation”.

  8. Configure the cluster to define the new node in the pool, add it to the cluster, start CXFS services, and mount filesystems. See “Guided Configuration Tasks” in Chapter 10.