[ofa-general] SRP on RHEL 5.3/OFED 1.3 vs RHEL 5.1/OFED 1.2?

John Valdes valdes at anl.gov
Fri May 22 11:40:06 PDT 2009


Hi all,

We have a storage array (a DDN 9550) attached to 8 servers via IB.
This setup has been running fine for the last 1.5 years or so, with
the servers running RHEL 5.1 and the OFED (OpenIB) 1.2 stack that's
included with RHEL 5.1.

Recently, we tried to upgrade to new servers running RHEL 5.3 with
its bundled OFED 1.3 stack, but now we're seeing frequent timeouts
resulting in LUN resets and SCSI command aborts between the servers
and the DDN.  As far as we can tell, our IB setup on the servers under
5.3 is identical to the setup under 5.1, so we don't know why we're
seeing the timeouts and resets.  

Is anyone aware of any changes when using IB SRP w/ RHEL 5.3 and OFED
1.3 vs RHEL 5.1/OFED 1.2 which might be causing this?

For reference, here are some of the details of our setup:

OLD CONFIGURATION
-----------------
* SuperMicro P4DP6 motherboard, w/ dual Xeon CPUs (x86, single core
  "Prestonia"), all circa 2002 hardware
* Cisco SFS-HCA-X2T7-A1 IB HCA (aka Mellanox Cougar Cub), 133 MHz PCI-X,
  128 MB memory, Firmware v3.5.917, dual port (port 1 attached to DDN)
* RHEL 5.1 w/ bundled OFED/OpenIB 1.2
* ib_mthca module loaded w/o any extra options
* ib_srp module loaded w/ option "srp_sg_tablesize=255"
* Connection to DDN established using "srp_daemon" invoked as:
  "srp_daemon -coe" with options "max_sect=8192,max_cmd_per_lun=5"
  given in /etc/srp_daemon.conf (Note that due to a bug in the OFED
  1.2 srp_daemon, the "max_sect=8192" option is ignored, which is OK
  since we weren't taking advantage of that option).
* 7 DDN LUNs are accessed by all 8 servers as clustered logical
  volumes (under RedHat's CLVM) holding GFS filesystems.
* 8 unique (not-shared) DDN LUNs are accessed by the servers (one LUN
  per server) as a plain disk holding an ext3 filesystem.

NEW CONFIGURATION
-----------------
* SuperMicro H8DME-2 motherboard, w/ dual quad-core AMD Opteron 2342, x86_64
* Cisco SFS-HCA-X2T7-A1 IB HCA (aka Mellanox Cougar Cub), 133 MHz PCI-X,
  128 MB memory, Firmware v3.5.917, dual port (port 1 attached to DDN)
  --same card as in old configuration, physically moved to new servers
* RHEL 5.3 w/ bundled OFED/OpenIB 1.3
* ib_mthca module loaded w/o any extra options
* ib_srp module loaded w/ option "srp_sg_tablesize=255"
* Connection to DDN established using "srp_daemon" invoked as:
  "srp_daemon -coe -f /etc/ofed/srp_daemon.conf" with options
  "max_sect=8192,max_cmd_per_lun=5" given in srp_daemon.conf
* 7 DDN LUNs are accessed by all 8 servers as clustered logical
  volumes (under RedHat's CLVM) holding GFS filesystems.
* 8 unique (not-shared) DDN LUNs are accessed by the servers (one LUN
  per server) as a plain disk holding an ext3 filesystem.
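
For concreteness, the module option and daemon settings listed above
would typically be written roughly as follows.  This is only a sketch:
the /etc/modprobe.conf location and the leading "a" (allow) rule
keyword in srp_daemon.conf are assumptions about the usual layout, not
copied verbatim from the servers.

```
# /etc/modprobe.conf (assumed location for module options on RHEL 5)
options ib_srp srp_sg_tablesize=255

# /etc/ofed/srp_daemon.conf (path as passed to "srp_daemon -coe -f ...")
# Assumed rule syntax: "a" = allow, followed by comma-separated options
a max_sect=8192,max_cmd_per_lun=5
```

One difference worth noting between the two setups: under OFED 1.2
the "max_sect=8192" option was ignored, so if the OFED 1.3 srp_daemon
honors it, the new servers may be issuing larger requests than the old
ones did even though the configuration text is the same.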


With the new configuration, timeouts/resets have frequently occurred
when starting up CLVM on the servers (e.g., when the servers scan the
LUNs looking for the Linux (clustered) LVM data) as well as when doing
I/O to the mounted filesystems.  Just to make sure the CLVM/GFS setup
wasn't causing problems, we tested the plain ext3 filesystem on the
non-shared LUN from one of the new servers, and when doing a simple
"dd" to the LUN, we were still seeing timeouts and LUN resets.

Does any of this sound familiar to anyone?  Do you have a recommended
IB/SRP setup for RHEL 5.3?

John

----------------------------------------------------------------------
John Valdes                  Mathematics and Computer Science Division
valdes at anl.gov                             Argonne National Laboratory


