[ofa-general] What causes "SRP abort called" error?

John Valdes valdes at anl.gov
Tue Jan 8 15:44:17 PST 2008


Hello,

I'm new to SRP & IB, so please bear with me... 

We have a storage server running RHEL 5.1 with the bundled OFED 1.2
stack; the server is directly attached to an IB port on a DDN 9550.
It had been running OK for about a week, but today we're getting a
continuous stream of SRP abort errors:

  # tail /var/log/messages
  [...]
  Jan  8 17:00:59 server kernel: SRP abort called
  Jan  8 17:01:59 server kernel: SRP abort called
  Jan  8 17:02:04 server kernel: SRP reset_device called
  Jan  8 17:02:09 server kernel: ib_srp: SRP reset_host called
  Jan  8 17:02:11 server kernel: ib_srp: connection closed
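
To get a feel for how often each of these messages occurs, the log can
be tallied by message type. This is only a minimal sketch: the function
name srp_event_summary is made up for illustration, and the patterns are
taken solely from the message formats shown in the excerpt above.

```shell
# Tally SRP error-handler messages in a syslog file, grouped by type.
# Assumption: messages look like the excerpt above, i.e. "kernel: " followed
# by an optional "ib_srp: " prefix and then the message text.
srp_event_summary() {
    awk '
        /kernel: (ib_srp: )?SRP (abort|reset_device|reset_host) called/ ||
        /kernel: ib_srp: connection closed/ {
            # Strip the timestamp, host name and optional ib_srp prefix
            sub(/.*kernel: (ib_srp: )?/, "")
            count[$0]++
        }
        # Output order is unspecified; pipe through sort if it matters
        END { for (msg in count) printf "%d %s\n", count[msg], msg }
    ' "${1:-/var/log/messages}"
}
```

Running `srp_event_summary /var/log/messages | sort -rn` lists the most
frequent message first, which shows whether aborts usually escalate to
device and host resets or mostly stand alone.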

How can I determine the cause of the aborts?  The physical connection
between the server and the DDN seems to be OK (the error counts in
/sys/class/infiniband/mthca0/ports/1/counters/* are all zero), and the
SM (opensm) is still running.  Are the aborts being triggered by the
server or by the storage target (the DDN)?  I'm guessing something is
timing out, but what, and why?
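
For reference, the counter check described above can be scripted roughly
as follows. This is a sketch under assumptions: the function name
nonzero_counters is hypothetical, and files ending in _data or _packets
are skipped on the assumption that they are traffic counters (expected
to be non-zero on a working link) rather than error counters.

```shell
# Print each port counter whose value is non-zero, as "name value".
# The default directory is the one mentioned in the message above.
nonzero_counters() {
    dir="${1:-/sys/class/infiniband/mthca0/ports/1/counters}"
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        name=$(basename "$f")
        case "$name" in
            *_data|*_packets) continue ;;   # assumed traffic counters, not errors
        esac
        val=$(cat "$f")
        case "$val" in
            ''|*[!0-9]*) continue ;;        # skip anything that is not a plain number
            *[1-9]*) echo "$name $val" ;;   # contains a non-zero digit
        esac
    done
    return 0
}
```

No output means every remaining counter reads zero. Running it twice a
few minutes apart and comparing the results would show whether any
counter is still increasing while the aborts are being logged.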

To complicate matters, the LUNs on the DDN are shared with 7 other
servers as clustered LVM volumes with GFS filesystems.  Each of the
other servers has its own, direct IB connection to the DDN.

Any suggestions on how to track down the cause of the aborts would be
welcome. 

Thanks,

John

----------------------------------------------------------------------
John Valdes                  Mathematics and Computer Science Division
valdes at anl.gov                             Argonne National Laboratory


