Chapter 10. SGI Infinite Network Bandwidth

This chapter describes SGI Infinite Network Bandwidth and covers the following topics:

  • “SGI Infinite Network Bandwidth Overview”

  • “Configuring a Network Stripe”

  • “Verifying a Network Stripe”

  • “Stripe Interface MTUs”

  • “Network Stripe Monitoring”

  • “Tuning a Network Stripe Interface”

  • “Troubleshooting SGI Infinite Network Bandwidth Software”

SGI Infinite Network Bandwidth Overview

This section provides an overview of SGI Infinite Network Bandwidth and covers the following topics:

  • “New Terminology Used in This Guide”

  • “Definition of SGI Infinite Network Bandwidth”

  • “Hardware Requirements”

New Terminology Used in This Guide

This section describes terminology associated with SGI Infinite Network Bandwidth.

  • Physical interface

    A network interface that represents an actual physical connection to an external network.

  • Logical or virtual interface

    A network interface that has no physical connections to an external network, but still provides a connection across a network made up of physical interfaces.

  • Tunnelling

    A method of encapsulating data of an arbitrary type inside another protocol header to provide a method of transport for that data across a network of a different type. A tunnel requires two endpoints that understand both the encapsulating protocol and the encapsulated data payload.

  • Link pair

    A link pair is defined as a pair of interconnected physical interfaces on different machines that can be used by a Network Stripe Interface. They are defined by the primary IPv4 addresses of the physical interfaces.

  • Network stripe interface

    A logical point-to-point network interface that uses tunnelling across multiple link pairs to send data between hosts. The data that is sent is distributed evenly ("striped") across all the physical interfaces, therefore allowing the logical interface to use the combined bandwidth of all the physical interfaces belonging to it. A network stripe interface that is made up of n link pairs is often referred to as an n-stripe or an n-link stripe.

Definition of SGI Infinite Network Bandwidth

SGI Infinite Network Bandwidth is very similar in concept to disk striping. Like disk striping, it is not a new concept. However, unlike disk striping, network stripes are difficult to scale efficiently.

Striped disks are commonly used when the bandwidth that a single disk can provide to a filesystem is not enough for the required purpose. Multiple disks are striped together so that the filesystem blocks are spread evenly across all the disks in the stripe set, allowing data to be retrieved from or stored to many disks in parallel. This greatly increases the bandwidth available to the filesystem.

SGI Infinite Network Bandwidth network stripes aggregate a number of physical network interfaces into a single virtual interface that is able to spread data evenly across all the physical interfaces in the stripe set. This allows multiple packets to be sent and received in parallel. This greatly increases the available bandwidth to the virtual interface.

As with a filesystem built on a striped disk array, applications are able to make use of the increased bandwidth supplied by the network stripe interface without modification. That is, the virtual interface is completely transparent to applications but provides significantly higher throughput.

In ideal conditions, a network stripe interface configured from a set of n identical interfaces will provide n times the bandwidth of the slowest interface in the stripe set. In real world conditions, the number of packets that can be sent over a network stripe interface is generally limited by CPU speed.

Hardware Requirements

There are no particular hardware requirements for a network stripe. However, it is worth keeping the following points in mind when deciding whether to use a network stripe:

  • SGI Infinite Network Bandwidth has been designed for use with Ethernet networks. Therefore, using non-Ethernet devices may give unpredictable results.

  • The network stripe driver is optimized for high bandwidth, low delay networks that have low packet loss and minimal packet reordering issues. Use on high delay networks or on networks that suffer from excessive packet loss is not recommended.

  • The network stripe driver is optimized for homogeneous physical link interfaces. It works with different interface types, but it is limited in packet rate by the slowest interface, and limited in packet size by the interface with the smallest Maximum Transmission Unit (MTU). For example, striping a 100Mbit Ethernet and a Gigabit Ethernet interface will only result in a network stripe with the performance of 2x100Mbit Ethernet.

  • Not all applications can benefit from a network stripe. If the application is latency bound, then a network stripe will not improve performance. If the application is not capable of using all the bandwidth available to a single interface on your machine, then a network stripe will not improve performance.

    If you do have an application that is constrained by the bandwidth available to a single interface, for example, NFS, FTP, or rsync, then a network stripe may be a way of improving throughput.

  • While SGI Infinite Network Bandwidth runs on any machine that is supported by IRIX 6.5.23, or later, the number and speed of the CPUs in the machines being used for stripe interfaces greatly impacts performance. Faster CPUs result in significantly better stripe performance.

For best results using Gigabit Ethernet network stripes, SGI recommends the following:

  • Using an SGI Origin 300 or SGI Origin 3000 series system with a minimum of four 500 MHz processors.

  • Using jumbo frames (9k MTU) on physical links where available for better performance and lower CPU utilization.

    To configure the MTU size, see “MTU Size Settings” in the SGI IRIS Release 2 Gigabit Ethernet Board User's Guide.

  • For jumbo frames, only Gigabit Ethernet PCI-X network cards are supported. They appear as tg interfaces on your system. Striping with jumbo frames on eg hardware interface types is not reliable due to hardware limitations and is therefore not recommended or supported.

    You can use the hinv(1M), ifconfig(1M), or netstat(1) commands to determine if your system has tg interfaces, as follows:

    Using the hinv(1M) command:

    root@mig127# hinv
    2 500 MHZ IP35 Processors
    CPU: MIPS R14000 Processor Chip Revision: 2.4
    FPU: MIPS R14010 Floating Point Chip Revision: 2.4
    Main memory size: 2048 Mbytes
    Instruction cache size: 32 Kbytes
    Data cache size: 32 Kbytes
    Secondary unified instruction/data cache size: 2 Mbytes
    Integral SCSI controller 0: Version QL12160, low voltage differential
      Disk drive: unit 1 on SCSI controller 0
    Integral SCSI controller 1: Version QL12160, low voltage differential
    IOC3/IOC4 serial port: tty3
    IOC3/IOC4 serial port: tty4
    Gigabit Ethernet: tg1, module 001c01, PCI bus 2 slot 1
    Gigabit Ethernet: tg2, module 001c01, PCI bus 2 slot 2
    Integral Fast Ethernet: ef0, version 1, module 001c01, pci 4
    IOC3/IOC4 external interrupts: 1
    USB controller: type OHCI
    

    Using the ifconfig(1M) command with the -a option:

    root@mig127# ifconfig -a
    ef0: flags=8415c43<UP,BROADCAST,RUNNING,FILTMULTI,MULTICAST,CKSUM,DRVRLOCK,LINK0,IPALIAS,IPV6>
            inet 128.162.246.127 netmask 0xffffff00 broadcast 128.162.246.255
    tg1:flags=8c85802<BROADCAST,MULTICAST,CKSUM,DRVRLOCK,LINKDOWN,IPALIAS,HIGHBW,IPV6>
    tg2: flags=8c04802<BROADCAST,MULTICAST,DRVRLOCK,IPALIAS,HIGHBW,IPV6>
    cl0: flags=404800<MULTICAST,DRVRLOCK,IPALIAS>
    lo0: flags=8001849<UP,LOOPBACK,RUNNING,MULTICAST,CKSUM,IPV6>
            inet 127.0.0.1 netmask 0xff000000
    

    Using the netstat(1) command with the -iq options:

    root@mig127# netstat -iq
    
    Name   Mtu   Network     Address            Ipkts Ierrs    Opkts Oerrs q max drop
    ef0    1500  eagan-24    mig127.americas.   699398     0   561505     0  0  0    0
    tg1*   1500  none        none                   0     0        0     0  0 2048   0
    tg2*   1500  none        none                   0     0        0     0  0  0    0
    cl0*   57344 none        none                   0     0        0     0  0 64    0
    lo0    32992 loopback    localhost          36871     0    36871     0  0 50    0
    

  • Connecting the two sets of tg cards back-to-back. It is also possible to connect them through a switch that can support the bandwidth of multiple Gigabit Ethernet links simultaneously. If jumbo frames are being used, the switch must be able to support them as well.

Configuring a Network Stripe

To configure a network stripe on your system you need to use the ifconfig(1M) command. There are several new ifconfig parameters used to manipulate network stripe interfaces, as follows:

  • stripe laddr raddr

    This parameter adds a physical link to a stripe, where laddr is the primary address of the local physical interface the stripe will use, and raddr is the corresponding address of the remote physical interface to which the stripe will send packets and from which it will receive packets. If the stripe interface that is the target of this command does not exist, it is created as a side effect of running this command. If the stripe interface does already exist, it must be down for this command to succeed.

    To add a link pair to an is interface that is already up, the interface must first be brought down, as shown in the following example. In this example, two additional link pairs are added to the stripe interface is0:

    > ifconfig is0 down
    > ifconfig is0 stripe 10.1.3.7 10.1.3.8
    > ifconfig is0 stripe 10.1.2.7 10.1.2.8
    

  • -stripe laddr

    This parameter removes a physical link pair from the current stripe set. The link pair that is removed is the one that matches the local address laddr. The stripe interface must be down for this command to succeed. If the link being removed is the last physical link on the stripe interface, the stripe interface will also be deleted as a side effect of this command.

    In the following example, assume that is0 already has two link pairs in its link set. To remove one (or both) of the link pairs from the stripe, the is0 interface must first be brought down; then the -stripe option is used to remove the link pairs from the stripe, as follows:

    > ifconfig is0 down
    > ifconfig is0 -stripe 10.1.2.7
    > ifconfig is0 -stripe 10.1.3.7
    

  • stripelist

    The stripelist option lists the physical link pairs configured on a stripe interface. This information is also displayed when the ifconfig command is executed with the -a option and it encounters a stripe interface. This option only shows the link pairs that were present at the time the stripe interface was last brought up.

    In the following example, the stripelist option lists all the link pairs that belong to the stripe interface is0.

    > ifconfig is0 stripelist
                  link local 10.1.2.7 remote 10.1.2.8
                  link local 10.1.3.7 remote 10.1.3.8
    

    Because a network stripe interface is a point-to-point interface, the ifconfig command requires you to give it both a source and destination address when you configure it.

    A network stripe interface is created automatically when one of the following ifconfig configuration commands is executed. However, the stripe will not function until the stripe has both a source and destination address defined and it has at least one physical link configured.

    The ifconfig parameter commands create a stripe interface either explicitly or implicitly, as follows:

    • Explicitly by defining the source and destination addresses for the stripe interface, as follows:

      > ifconfig is0 10.0.0.194 dst 10.0.0.195
      

    • Implicitly, by adding the first physical link to a previously undefined stripe interface, as follows:

      > ifconfig is0 stripe 192.168.1.194 192.168.1.195
      

    The stripe will not be correctly configured until both of the above commands have been executed. The stripe must have both a source and destination address defined and it must have at least one physical link configured.

    Note that the stripe interface will come up as soon as it has both a source and destination address defined. When the stripe interface is up, you cannot change the physical link pair configuration. Hence, if you define the interface first by giving it a source and destination address, you must mark it down before adding the physical link pairs, and then mark it up before it becomes active.

Procedure 10-1. Configuring a Simple Stripe Interface

    To create a simple 1-stripe from scratch, execute the following commands:

    > ifconfig is0 10.0.0.194 dst 10.0.0.195
    > ifconfig is0 down
    > ifconfig is0 stripe 192.168.1.194 192.168.1.195
    > ifconfig is0 up
    

    The following command sequence is equivalent to the above sequence, but is less verbose:

    > ifconfig is0 stripe 192.168.1.194 192.168.1.195
    > ifconfig is0 10.0.0.194 dst 10.0.0.195
    

    Procedure 10-2. Creating a 2-link stripe

      The following steps describe how to create a 2-link stripe:

      Machine A has two tg interfaces, tg1 and tg2, with IP addresses 192.168.1.194 and 192.168.2.194, respectively. Machine B has two tg interfaces, tg1 and tg2, with IP addresses 192.168.1.195 and 192.168.2.195, respectively.

      To set up a stripe using the tg1 and tg2 interfaces, execute the following commands on machines A and B:

      • On machine A create the stripe interface by defining its physical links, as follows:

        Machine A > ifconfig is0 stripe 192.168.1.194 192.168.1.195
        Machine A > ifconfig is0 stripe 192.168.2.194 192.168.2.195
        

      • On machine B create the stripe interface by defining its physical links, as follows:

        Machine B > ifconfig is0 stripe 192.168.1.195 192.168.1.194
        Machine B > ifconfig is0 stripe 192.168.2.195 192.168.2.194
        

      • On machine A define the stripe interface endpoint addresses and bring up the stripe, as follows:

        Machine A > ifconfig is0 10.0.0.194 10.0.0.195
        

      • On machine B define the stripe interface endpoint addresses and bring up the stripe, as follows:

        Machine B > ifconfig is0 10.0.0.195 10.0.0.194
        

      At this point, you should be able to verify the configuration of the stripe and test its connectivity.

      Verifying a Network Stripe

      This section describes how you can verify that your network stripe is correctly configured.

      Procedure 10-3. Checking Stripe Configuration

        The following steps describe how to use the ifconfig(1M) command to ensure that a stripe configuration is set up correctly.

        • Execute the ifconfig command, as follows:

          > ifconfig -a
          

          Output similar to the following appears:

          is0:flags=8050d1<UP,POINTOPOINT,RUNNING,NOARP,CKSUM,DRVRLOCK,HIGHBW>
                  inet 10.0.0.194 --> 10.0.0.195 netmask 0xff000000
                  link local 192.168.2.194 remote 192.168.2.195
                  link local 192.168.1.194 remote 192.168.1.195
          

          The output should show an additional entry for the striping interface (is0) similar to what is shown above. There should be a link line for each link pair you entered and the addresses displayed should match the addresses you entered in the configuration stage in the form:

          link local laddr remote raddr
          

          A verbose listing will also show the default send and receive socket space the interface will use if it has been configured. See “Tuning a Network Stripe Interface”, for more information on how to configure these parameters.

        Procedure 10-4. Verifying Stripe Connectivity

          Once it has been confirmed that the stripe interface configuration at both the local and remote ends is correct and both interfaces are up, you can test the stripe interface with ping(1).

          Using the remote stripe interface address of the 2-stripe configuration example above, you should see output similar to the following if your striping software is working correctly:

          > ping 10.0.0.195
            PING 10.0.0.195 (10.0.0.195): 56 data bytes
            64 bytes from 10.0.0.195: icmp_seq=0 ttl=254 time=0.488 ms
            64 bytes from 10.0.0.195: icmp_seq=1 ttl=254 time=0.373 ms
            64 bytes from 10.0.0.195: icmp_seq=2 ttl=254 time=0.335 ms
            64 bytes from 10.0.0.195: icmp_seq=3 ttl=254 time=0.379 ms
            64 bytes from 10.0.0.195: icmp_seq=4 ttl=254 time=0.327 ms
          
            ----10.0.0.195 PING Statistics----
            5 packets transmitted, 5 packets received, 0.0% packet loss
            round-trip min/avg/max = 0.327/0.380/0.488 ms
            >
          

          At this point, it is also useful to monitor both ends of the connection with the netstat -C command to determine whether pings have been sent down each of the physical links and, if problems are occurring, to isolate where packets are going missing.
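
          One simple way to confirm that the ping traffic is actually being spread across the physical links is to compare the per-interface Ipkts and Opkts counters reported by netstat(1) before and after the test; the counters for every tg interface in the stripe set should increase. For example:

          > netstat -i
          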

          Once you have verified connectivity in this manner, you can run further tests using ttcp(1) or the test utility of your choice to exercise the interface. If you have problems, please refer to the Troubleshooting section below.

          Stripe Interface MTUs

          This section describes the optimal Maximum Transmission Unit (MTU) sizes to use with SGI Infinite Network Bandwidth software and covers the following topics:

          Default MTU Calculation

          The MTU of a stripe interface is determined by the MTU of the physical links that the stripe uses. When a stripe is brought up, it scans each of the physical link interfaces and determines what the smallest interface MTU is from their current MTU settings. It then uses this value as the basis for the stripe interface MTU.

          The stripe interface MTU is slightly smaller than the smallest MTU of the physical links because some space is taken up by the headers the stripe adds to each packet. This overhead is automatically taken into account by the stripe interface when it calculates its MTU.

          Note that this MTU calculation means it is not worthwhile to mix physical links that have different MTUs: the stripe interface cannot make use of the larger MTUs that some links provide and therefore does not benefit from them.
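
          Once the stripe is up, you can check the MTU it has settled on by comparing the Mtu column for the stripe interface with that of its physical links in netstat(1) output; for example:

          > netstat -in
          

          The is0 entry should show an MTU slightly smaller than the smallest MTU of the physical links in its stripe set.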

          MTUs and Performance

          Currently, the upper bound of performance for a stripe interface is determined by the number of packets per second that the stripe interface can deliver, in order, to the receiver. This is largely due to the fact that TCP will classify out-of-order packets as congestion and throttle throughput. Hence, the stripe interface needs to deliver packets to TCP in order to maximize throughput.

          As a result of this, the stripe interface needs to serialize the packet delivery at some point, and at this point the stripe is limited by the number of packets a single CPU can process. As of the IRIX 6.5.23 release, a stripe interface can reorder approximately 135,000-140,000 packets per second.

          To put this number in context, 140,000 packets per second is approximately fifteen 100Mbit Ethernet interfaces full of traffic or 1.75 Gigabit Ethernet interfaces using 1500-byte MTUs.
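
          As a rough cross-check of these figures, the theoretical per-link packet rates can be derived from the link speed divided by the packet size; a quick sketch with bc(1), counting payload bits only and ignoring Ethernet framing overhead (so the results are slightly lower than the approximate figures quoted in this section):

          > echo "10^8 / (1500 * 8)" | bc
          8333
          > echo "10^9 / (1500 * 8)" | bc
          83333
          > echo "10^9 / (9000 * 8)" | bc
          13888
          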

          When this limitation is reached, there are only two possible ways to improve throughput. The first is to decrease the number of packets that need to be sent; in other words, the amount of data carried in each packet must be increased, which requires interfaces with larger MTUs. For Gigabit Ethernet, the solution is to use jumbo frames. The second way is to decrease the amount of CPU time required to process each packet.

          By default, the reorder thread floats between CPUs; it tends to run on one of the NIC interrupt CPUs (usually bouncing between them), but you can bind it to a separate CPU (see “The Reorder Thread” below). For example, you could do this for a 2-link stripe:

          • Bind reorder thread to CPU 1

          • Bind link 1 NIC interrupts to CPU 2

          • Bind link 2 NIC interrupts to CPU 3

          Gigabit Ethernet with Jumbo Frames

          For Gigabit Ethernet, large gains in throughput and decreases in CPU utilization can be attained by using jumbo (9000-byte) frames rather than standard (1500-byte) frames.

          A link using jumbo frames running at 100% utilization will only send approximately 15,000 packets per second compared to 85,000 packets per second for a 1500-byte MTU link. Hence, the packet reordering limitation will not be an issue until a stripe with many more links is configured.

          Using jumbo frames also reduces the CPU overhead of transmitting and receiving data across the link. Less work needs to be done to send the same amount of data, and hence far fewer CPU resources are used.

          SGI recommends that you do not use jumbo frames for stripe interfaces over eg Gigabit Ethernet interfaces as there are hardware limitations that make them unreliable in striping environments.

          Gigabit Ethernet with L2TCPSEG

          If each of the local link interfaces is a tg Gigabit Ethernet interface and they all have the IFF_L2TCPSEG capability enabled (new in IRIX 6.5.23), the stripe interface will operate in a slightly different mode. Use ifconfig(1M) to determine whether all your interfaces have this capability enabled, as follows:

          
          tg0:flags=8f15c43<UP,BROADCAST,RUNNING,FILTMULTI,MULTICAST,CKSUM,DRVRLOCK,LINK0,L2IPFRAG,L2TCPSEG,IPALIAS,HIGHBW,IPV6>
                       inet 192.168.1.195 netmask 0xffffff00 broadcast 192.168.1.255
                       recvspace 262144 sendspace 262144 
                       speed 1000.00 Mbit/s full-duplex
          tg1:flags=8f15c43<UP,BROADCAST,RUNNING,FILTMULTI,MULTICAST,CKSUM,DRVRLOCK,LINK0,L2IPFRAG,L2TCPSEG,IPALIAS,HIGHBW,IPV6>
                       inet 192.168.2.195 netmask 0xffffff00 broadcast 192.168.2.255
                       recvspace 262144 sendspace 262144 
                       speed 1000.00 Mbit/s full-duplex
          

          The capability can be enabled for tg class interfaces by changing options in /etc/config/tgconfig.options. For specific information on how to enable this functionality, please consult the tgconfig(1M) man page.

          When all the interfaces have this flag set, the stripe interface will set its MTU to 32992 bytes and the stripe interface will require less CPU time to send TCP or UDP data.

          The receive side processes packets in exactly the same manner as previously, but due to the different internal fragmentation of the data being sent, the CPU overhead of reordering the packets is reduced by approximately 10-15%. With this mode enabled, a stripe interface can reorder approximately 150,000-155,000 packets per second.


          Note: Even if stripe endpoints have different MTUs (one end of the stripe link supports L2TCPSEG and the other does not), the stripe will still communicate correctly as long as the physical link pair MTUs are the same. This MTU change is transparent to the physical links and only affects the internal fragmentation of the data on the wire.



          Note: If you use the L2TCPSEG mode of the stripe, it is highly recommended that you enable the strict IP fragment reassembly option on both machines, as this provides a significant improvement in performance when packets are dropped by the network. This should be done even if none of the traffic over the stripe interface uses UDP. The option can be enabled by executing the following systune(1M) command:

          > systune ipv4_strict_reassembly 1


          Network Stripe Monitoring

          Monitoring of a network stripe interface may be done through all of the existing network interface monitoring tools such as netstat(1) and PCP. The kernel driver for the stripe interfaces also exports a number of metrics to PCP to indicate the health and efficiency of the stripe interfaces on the machine.

          The metrics exported from the kernel are global metrics. That is, the statistics are accumulated from all active stripe interfaces and are not broken out into per interface statistics. Despite this, a stripe that is not operating correctly can still be diagnosed from the exported statistics as error conditions tend to be uncommon. For help in diagnosing problems, you should refer to the long help text for each PCP metric available for the stripe interface.

          The stripe metrics are exported from the network.is node. You can get a list of the metrics and a one line description of each metric from pminfo(1) by executing the command, as follows:

          > pminfo -t network.is
          

          A more detailed description of each metric can be obtained from pminfo(1) by executing the command, as follows:

          > pminfo -T network.is
          

          The current values of each metric can be obtained from pminfo(1) by executing the command, as follows:

          > pminfo -f network.is
          

          A far more useful way of viewing the stripe metrics is to use pmchart(1). This provides graphical views of the metrics over time and is the preferred method of viewing the metric values in real time.

          The pmchart software is part of the licensed PCP monitoring product. For more information on pmchart, see Chapter 4. “Monitoring System Performance” in the Performance Co-Pilot for IRIX Advanced User's and Administrator's Guide.

          For more information on how to use pminfo(1) and pmchart(1), see their respective man pages.
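
          If you want to follow a single metric from the command line rather than graphically, pmval(1) (also part of PCP) can sample it at a fixed interval; a minimal sketch, assuming pmval is installed and using a two-second sampling interval (the metric shown is only an example):

          > pmval -t 2sec network.is.window_stalls
          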

          Tuning a Network Stripe Interface

          This section describes how to tune a network stripe interface and covers the following topics:

          • “Stripe Interface System Tunable Parameters”

          • “ifconfig Parameters”

          • “Binding and Placement”

          • “Application I/O Sizes”

          Stripe Interface System Tunable Parameters

          There are two global stripe system tunable parameters that can affect the performance of a stripe interface (an example of adjusting them follows this list). They are as follows:

          • is_rrorder - default value = 5

            This controls the size of the send quota for each physical link. It determines how many packets the stripe interface feeds to a link interface before moving on to the next link interface. The quota is 2^is_rrorder packets (32 by default).

            The more links you have in your stripe interface, the more likely you are to want a smaller is_rrorder so that you can keep packets flowing on every link. The alternative to reducing is_rrorder is to increase the default send and receive socket buffer sizes on the stripe interface (see “ifconfig Parameters” below). This allows more packets to be sent at once, and hence makes it more likely that you can keep each link busy.

          • is_reorder_flush_delay - default value = 5

            This is the delay in ticks (1 tick = 10ms) between executions of the window stall detection routine. This determines how quickly lost packets are detected.

            On links that suffer from frequent packet loss, setting a lower value may help maintain optimal throughput. It is not generally advisable to increase this parameter as it can cause undesirable interactions with TCP timeouts and retries that will further harm throughput when packet loss occurs.
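
          For example, to inspect the current values and then lower them on a stripe with many links, you could use systune(1M) as follows (the new values are purely illustrative; experiment to find what suits your configuration):

          > systune is_rrorder
          > systune is_rrorder 4
          > systune is_reorder_flush_delay 3
          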

          ifconfig Parameters

          The following ifconfig parameters may need to be set, in addition to any application-specific tuning, to allow applications using a stripe interface to perform well (an example combining them follows this list):

          • [-]highbw

            This needs to be set when the stripe output packet rate goes over 30,000-40,000 packets per second. This reduces the CPU overhead of TCP by reducing the number of ACKs that are sent.

            Note that close attention must be paid if the stripe interface is making use of the L2TCPSEG functionality, as the stripe interface output packet rate may be substantially lower than the physical link output packet rate.

          • sspace n 

            This is the default send socket buffer space in bytes that applications will be given when they create a TCP socket that uses the stripe interface. If you do not specify a value, the system default (60KB) will be used.

            In general, the higher the bandwidth of the stripe interface, the larger n needs to be. Experimentation is the only way to determine the best value for a given configuration. However, the following table can be used as a basis for experimentation.

          Table 10-1. Throughput and Estimated Socket Buffer Size

          Throughput               Estimated Socket Buffer Size
          less than 50MB/s         System default (60KB) to 128KB
          50-100MB/s               128KB to 256KB
          100-150MB/s              256KB to 512KB
          150-200MB/s              512KB to 1MB
          greater than 200MB/s     1MB to 2MB


          • rspace n 

            This is the default receive socket buffer space in bytes that applications will be given when they create a TCP socket that uses the stripe interface. If you do not specify a value, the system default (60KB) will be used.

            In general, tuning of this parameter is identical to sspace, and more often than not, will have the same value as sspace.

            For further information on the configuration of these parameters, please refer to the ifconfig(1M) man page.
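
          For example, to enable the highbw flag and set 512KB default send and receive socket buffers on the stripe interface used earlier in this chapter (the buffer sizes are illustrative starting points taken from Table 10-1, not recommendations for every configuration), you could execute:

          > ifconfig is0 highbw
          > ifconfig is0 sspace 524288
          > ifconfig is0 rspace 524288
          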

          Binding and Placement

          In general, binding of physical network interface interrupts and placement of applications becomes more important as the bandwidth of a stripe interface increases. As bandwidth increases, the latencies inherent in nonuniform memory access (NUMA) architectures will play a part in the maximum throughput that an application can achieve.

          For example, it would not be unexpected to see a 10% drop in maximum throughput for each hop between the node the application is running on and the node on which the physical interface interrupt routines are running.

          Hence, for high bandwidth stripe interfaces, it is desirable to cluster applications, physical interface interrupt routines and stripe reorder threads on CPUs and nodes as topologically close to each other as possible. This is usually necessary for stripe interfaces that make use of Gigabit Ethernet physical interfaces.

          For low bandwidth stripe interfaces, there is no noticeable advantage to carefully clustering applications, interrupt routines and reorder threads close together - the bandwidth is not high enough for memory latency differences to affect throughput.

          Note that the definition of a low bandwidth stripe interface will change with CPU power and machine architecture; a low bandwidth stripe on an Origin 3000 class machine with 700 MHz R14000 CPUs is likely to be considered a high bandwidth stripe on an Origin 2000 with 195 MHz R10000 CPUs. Hence, the binding and placement considerations for the stripe interfaces on these two machines are going to be very different despite the fact that they carry the same amount of data.

          tg Gigabit Ethernet Interfaces

          For tg class Gigabit Ethernet interfaces using standard frames (1500-byte MTU), it is recommended that you bind each physical interface to a separate CPU due to the interrupt load of these interfaces.

          If you are running jumbo frames (9000-byte MTU), then you can bind multiple interfaces to one CPU without seeing any performance degradation. How many you can bind to a single CPU depends on your specific hardware.

          For specific information on how to bind the interrupt CPU for a tg class Gigabit Ethernet interface using the /etc/config/tgconfig.options file, see the tgconfig(1M) man page.

          The Reorder Thread

          Several XTHREADs (for more information on thread classes, see the realtime(5) man page) are created when a stripe is configured. The only XTHREAD that will greatly affect performance is the reorder thread. The reorder thread of a stripe is responsible for delivering reordered packets to the upper network stack layers, and hence there are situations where the placement of this thread may improve performance.

          When a new stripe interface isX is configured, a reorder thread called is_reorderX gets created. This thread can be bound to a particular CPU by editing the /var/sysgen/system/irix.sm file and adding the statement, as follows:

          XTHREAD is_reorderX CPU Y
          

          Where:

          • X is the stripe interface unit number (isX) to which you wish the XTHREAD statement to apply; and

          • Y is the number of the CPU to which you wish to bind the reorder thread.

          If you have multiple stripe interfaces, then you can add multiple statements to bind the different threads as required.

          In normal situations, the reorder thread moves between CPUs, running on whatever CPU the scheduler decides is most appropriate. In most cases, it is not necessary to bind the reorder thread to a CPU. However, binding this thread to a particular CPU is useful if you wish to tightly control the locality of applications using the stripe, and it may improve stripe performance in some cases. If you do bind it, it is best to place this thread (which can use up to an entire CPU) on the same node as, or a node adjacent to, the node on which the applications using the stripe interface are running.

          Applications

          The placement of applications using a stripe interface will affect the maximum bandwidth that the stripe interface can provide to the application. Ideally, an application that requires as much bandwidth as the stripe interface can provide should be located on the same node as the physical interface interrupts and the reorder thread.

          However, this is not always possible given the CPU load that the physical interface interrupts and the reorder thread can generate. Hence placing the application on an adjacent node or splitting the interrupt and reorder thread load over two nodes and running the application on one (or both) of these nodes may be the best solution.

          Placement of applications can be controlled in various ways. The simplest is to use runon(1) or to build CPU sets to define the locality in which the application can run. For more information on CPU sets, please refer to the cpuset(5) man page.
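
          As a minimal sketch, assuming the stripe's NIC interrupts and reorder thread have been placed on CPUs 1 and 2 of a node, you could pin a bandwidth-hungry application (here an illustrative copy from an NFS filesystem mounted over the stripe; the path is hypothetical) to CPU 3 on the same node:

          > runon 3 cp /nfs/data/bigfile /dev/null
          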

          Application I/O Sizes

          Given the bandwidth a stripe interface can provide, application I/O patterns can have a major effect on throughput. Throughput can be limited on either the send or receive side if an application does not use optimal I/O patterns.

          For optimal throughput, SGI recommends the following:

          • All I/O buffers should be page-aligned and an exact multiple of the page size. This allows the network stack to do zero-copy transmit and receive where possible, which significantly reduces the CPU overhead of sending and receiving data.

          • Transmit I/O buffers should not be reused or freed immediately after they are sent if you are using TCP. Doing so will defeat the transmit zero-copy path and result in significant extra CPU load. Release or reuse the buffers after a short period of time (say 1-2 milliseconds) to ensure that the transmit path has finished with the buffers first.

          • I/O buffers should be as large as possible to reduce the system call overhead as much as possible. That is, with larger buffers a smaller percentage of time is spent in system calls, leaving more time available to send or receive data.

            In general, buffer sizes in the range of a quarter to twice the socket buffer size are most likely to be optimal for high bandwidth applications. For example, if you have a stripe interface with a default socket buffer space of 1MB on both the send and receive sides, then it is likely that the optimal I/O buffer sizes will fall somewhere in the range of 256KB to 2MB.

            It is worth noting that the optimal I/O buffer sizes will change from application to application, and will also vary between different stripe interface configurations. Experimentation is the only way to find the optimal buffer sizes for any given application and setup.

          It is also worth noting that if the application has to do a non-trivial amount of data processing between sending or receiving data, the application may become CPU bound and will not be able to sustain the throughput a stripe interface can provide. In these cases, you can either run multiple instances of the application until the stripe interface is saturated (for example, running multiple parallel FTP sessions), or the application may need to be threaded so it can make use of multiple CPUs and hence be able to send, receive, and process the data without being limited by a single CPU.

          Troubleshooting SGI Infinite Network Bandwidth Software

          This section describes network stripe troubleshooting and covers the following topics:

          • “Testing Connectivity”

          • “Diagnosing Performance Problems”

          Testing Connectivity

          If any of the physical link pairs have problems communicating with each other, then the stripe will have trouble communicating. The first step in diagnosing any stripe problem is determining whether there is a problem with a link pair. If there is, the link pair should either be fixed or removed from the stripe interfaces before you attempt to reuse the stripe interface. A basic procedure for discovering whether there is a problem with a link pair is as follows:

          • Execute a ping test, as described in “Verifying a Network Stripe”, first across each physical link in the stripe set, and then across the stripe interface itself. This does the following:

            • Demonstrates that each physical link is configured correctly.

            • Ensures that each physical link pair has connectivity; and

            • If all links have connectivity, then the stripe interfaces should be able to communicate with each other.

          • As root, execute a short flood ping test, first across each physical link in the stripe set, and then across the stripe interface itself. A simple flood ping test of 1024 packets is as follows:

            > ping -c 1024 -f 10.0.0.195
            PING 10.0.0.195 (10.0.0.195): 56 data bytes
            .
            
            ----10.0.0.195 PING Statistics----
            1024 packets transmitted, 1024 packets received, 0.0% packet loss
            round-trip min/avg/max = 0.208/0.215/0.686 ms
            4456.7 packets/sec sent, 4465.9 packets/sec received
            >
            

            This does the following:

            • Demonstrates that higher single packet rates do not trigger problems on the physical link pairs.

            • Cycles the stripe interface through all of its physical links, hence potentially locating the link that may be causing problems.

          • Execute a TCP throughput test using ttcp. Do not use the default parameters of ttcp because this will execute a test that is far too short to be useful. The test should run for at least 10 seconds to allow TCP to reach full speed; a sketch of a suitable invocation is shown after this list. While the TCP test is running, you should monitor the stripe interfaces for abnormalities as described in “Diagnosing Performance Problems”.
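
          For example, the following ttcp(1) invocations run a transmit test from machine A to machine B over the stripe addresses used earlier in this chapter (the buffer length and count are illustrative; scale the count so that the test runs for at least 10 seconds on your stripe):

          Machine B > ttcp -r -s -l 262144
          Machine A > ttcp -t -s -l 262144 -n 16384 10.0.0.195
          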

          If a problem is not discovered with a link pair, then the stripe interfaces at both ends should be brought down and then back up, and you should re-run the stripe interface ping tests.

          If the problem only shows up in the TCP tests, then it may be your tuning of the stripe interfaces (for example, placement of threads, applications, and so on) that is non-optimal and is therefore contributing to reordering or efficiency degradations.

          If the problem still persists, you should destroy both stripe interfaces and recreate them from scratch.
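
          For example, to destroy the stripe interface configured on machine A in Procedure 10-2 so that it can be recreated from scratch (recall that removing the last link pair also deletes the interface), you could execute the following, repeat the equivalent commands on machine B, and then recreate the stripe:

          Machine A > ifconfig is0 down
          Machine A > ifconfig is0 -stripe 192.168.1.194
          Machine A > ifconfig is0 -stripe 192.168.2.194
          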

          Diagnosing Performance Problems

          If your application is not performing as you expect over a stripe interface, you should attempt to tune the stripe and/or applications according to the guidelines given in “Tuning a Network Stripe Interface”.

          However, factors outside of application and stripe tuning can affect the performance of the stripe. Incorrect tuning can also lead to performance problems. The key to diagnosing these problems is the monitoring of the stripe driver PCP metrics as described in “Network Stripe Monitoring”.

          The following metrics are indicators of packet loss or extreme packet reordering problems:

          • network.is.in_underflow

          • network.is.drain_underflow

          • network.is.window_stalls

          • network.is.window_flush_null

          • network.is.window_flush_skipped

          • network.is.err_bad_version

          The following metrics are indicators of packet reordering problems:

          • network.is.up_disordered

          • network.is.window_stalls

          • network.is.window_seqno_fixup

          • network.is.window_flush_nlinks

          The following metrics can be used to gain insight into the efficiency of your stripe configuration (a sketch of retrieving the ratio counters follows this list):

          • network.is.in_window

          • network.is.in_overlap

          • network.is.up_ordered

          • network.is.link_empty_headers

          • network.is.link_header_alloc

          • network.is.link_soft_cksums

          • ratio of network.is.outq_wakeups to network.is.outq_drains

            The higher the ratio, the more efficient the stripe interface output.

          • ratio of network.is.reorder_wakeups to network.is.reorder_drains

            The higher the ratio, the more efficient the stripe interface reordered packet delivery.
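
          For example, the counters needed to compute these ratios can be fetched by naming several metrics in a single pminfo(1) invocation:

          > pminfo -f network.is.outq_wakeups network.is.outq_drains network.is.reorder_wakeups network.is.reorder_drains
          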

          Another common performance problem is that the stripe interface throughput fluctuates between different speeds. This is commonly caused by the application being rescheduled to run on a node closer to or further away from the stripe interface. If you see this happening, it is worthwhile to monitor which CPU your application is running on and to determine whether the changes in throughput coincide with changes in the CPU the application is running on. If this is the problem, you should consider binding your application to a CPU or cpuset to stop this variation from occurring. See “Binding and Placement” for details on how to do this.