[ofa-general] What causes "SRP abort called" error?

Tue Jan 8 20:45:03 PST 2008

On Tue, Jan 08, 2008 at 10:48:48PM -0500, David Dillow wrote:
> 
> The aborts are caused by a command timing out in the SCSI mid-layer and
> its error handling taking over -- more details about the escalation are
> in Documentation/scsi_eh.txt and assorted files.

Thanks for decoding those errors/warnings.

> You can turn up the SCSI logging facilities to track down the command
> that is dying, but expect that to be _very_ noisy on a busy system.

From coincident errors on the DDN, it looks like these are SCSI Write
commands (2A) that are failing.
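
For reference, a minimal sketch of how the opcode byte from such an error
line maps to a command name. The table below holds only a handful of
standard SCSI opcodes, and the helper name decode_opcode is made up for
illustration; it is not part of any DDN or SRP tooling:

```python
# Small lookup of common SCSI opcodes (first byte of the CDB); not exhaustive.
SCSI_OPCODES = {
    0x00: "TEST UNIT READY",
    0x12: "INQUIRY",
    0x28: "READ(10)",
    0x2A: "WRITE(10)",
    0x88: "READ(16)",
    0x8A: "WRITE(16)",
}


def decode_opcode(hex_byte):
    """Return a name for a hex opcode string such as '2A' (hypothetical helper)."""
    value = int(hex_byte, 16)
    return SCSI_OPCODES.get(value, "unknown opcode 0x%02X" % value)


print(decode_opcode("2A"))  # prints WRITE(10)
```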

> I've often seen this during the initial bus scan when adding a target to
> SRP, and I've seen it happen under heavy load once -- maybe more, but I
> saw it today for sure.

In our case, I'm pretty sure it is heavy load.  Well, I didn't see
what was going on at the time this started, but the targets (LUNs)
were already mounted, and we've been seeing heavy load on the DDN
recently. 

> I am curious, though, what command could be getting stuck
> for long enough for the mid-layer to time it out -- I think the default
> timeout for the sd driver is 60 seconds, and the INQUIRY timeout is 5
> seconds. I just cannot account for what could be taking that long.

I'm curious too as to why WRITEs are taking so long. :) I think we're
overloading the DDN, but it could be something else going on.  This is
a freshly installed configuration (only about a week old), with 6 GFS
file servers reading and writing to ~6 shared LUNs on the DDN over
IB/SRP (which in turn are shared off the servers via NFS to a ~350
node HPC cluster).  We've been running an identical setup for a few
years with another DDN, but over FC.  I think we still have a few
things to tune/optimize for IB.

That said, after talking with DDN support, it's looking like something
got wedged on the DDN which was causing the timeouts.
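
On the timeouts themselves: the per-device command timeout the mid-layer
uses is exposed in sysfs as /sys/block/<dev>/device/timeout (in seconds).
A rough sketch for listing it across devices follows; the function name
read_scsi_timeouts and its sys_block parameter are my own for
illustration, and it assumes a Linux host with sysfs mounted:

```python
import os


def read_scsi_timeouts(sys_block="/sys/block"):
    """Map each block device name to its SCSI command timeout in seconds."""
    timeouts = {}
    if not os.path.isdir(sys_block):
        # No sysfs here (non-Linux host); nothing to report.
        return timeouts
    for dev in sorted(os.listdir(sys_block)):
        path = os.path.join(sys_block, dev, "device", "timeout")
        try:
            with open(path) as f:
                timeouts[dev] = int(f.read().strip())
        except (OSError, ValueError):
            # Device has no readable timeout attribute (e.g. loop, ram); skip it.
            continue
    return timeouts


if __name__ == "__main__":
    for dev, secs in read_scsi_timeouts().items():
        print("%s: %d s" % (dev, secs))
```

Comparing these values against how long the array actually takes to
answer under load should show whether the aborts are just the timeout
firing on slow WRITEs.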

> Do your targets come back after this? During the scans, mine do, but
> today's under load effectively left the target dead. Rebooting the
> server brought it back.

Yes, after unwedging the DDN, the targets were fully accessible on the
server again.

Thanks for the reply.

John