[ofw] SRP bug?

Leonid Keller leonid at mellanox.co.il
Thu Jan 3 01:29:36 PST 2008


Hi Bill,

I didn't get a reply to my previous mail.
What's the status of this problem?
Do you still need our help?
Could you provide us with remote access?

Leonid

> -----Original Message-----
> From: Leonid Keller 
> Sent: Wednesday, December 26, 2007 11:51 AM
> To: 'Bill Boas'; 'Randy Kreiser'; rob at systemfabricworks.com; 
> ofw at lists.openfabrics.org
> Cc: Gilad Shainer; Aviram Gutman
> Subject: RE: [ofw] SRP bug?
> 
> Hi Bill, see the answers below.
> 
> > -----Original Message-----
> > From: Bill Boas [mailto:bboas at systemfabricworks.com]
> > Sent: Tuesday, December 25, 2007 12:06 AM
> > To: 'Randy Kreiser'; Leonid Keller;
> > rob at systemfabricworks.com; ofw at lists.openfabrics.org
> > Subject: RE: [ofw] SRP bug?
> > 
> > Randy, thanks for responding to Leonid's questions.
> > 
> > Leonid,
> > 
> > The US Gov customers are anxious to learn if we are making progress 
> > understanding the cause(s) of this bug.
> > 
> > Do you have access to suitable hardware and software in the Mellanox
> > facilities where you are (Yokneam) to duplicate this bug and run
> > further tests to diagnose the root causes?
> 
> No. We do not have the same HW and, which may be more
> important, the same SW.
> The SRP target driver you are working with is a
> home-made development, based on a (2-year-old) 1.X.0 IB Gold
> Mellanox SRP driver, which, as far as I know, had some bugs in it.
> Maybe not all of them have been fixed by your guys.
> On the setups we have here, all tests pass.
> I believe it would be worthwhile to have remote access to
> a similar setup in order to continue investigating the problem.
> 
> > 
> > Bill.
> > 
> > Bill Boas
> > VP, Business  Development
> > System Fabric Works
> > 510-375-8840
> > bboas at systemfabricworks.com
> > www.systemfabricworks.com
> > 
> > 
> > -----Original Message-----
> > From: ofw-bounces at lists.openfabrics.org 
> > [mailto:ofw-bounces at lists.openfabrics.org] On Behalf Of
> > Randy Kreiser
> > Sent: Monday, December 24, 2007 10:39 AM
> > To: 'Leonid Keller'; rob at systemfabricworks.com; 
> > ofw at lists.openfabrics.org
> > Subject: RE: [ofw] SRP bug?
> > 
> > Hi Leonid, answers are below!
> > 
> > Randy
> > 
> > -----Original Message-----
> > From: Leonid Keller [mailto:leonid at mellanox.co.il]
> > Sent: Sunday, December 23, 2007 3:53 AM
> > To: Randy Kreiser; rob at systemfabricworks.com; 
> > ofw at lists.openfabrics.org
> > Subject: RE: [ofw] SRP bug?
> > 
> > Hi Randy,
> > 
> > Thank you for the reply. 
> > A few more questions:
> > Does it fail with ModeFlags=0 (the default value)?
> > 
> > 	A) Yes, it fails with settings of 0, 1 and 3
> > 
> > What SW runs on the target side?
> > 
> > 	A) We are running a target driver DDN version 3.08
> > 
> > What kind of device is that appliance (from SRP point of view) ?
> > 
> > 	A) Dumb block device (RAID controller with 8 luns).
> > 
> > Does data transfer work in raw mode (without formatting) ?
> > 
> > 	A) Yes, we set up a CXFS client running Windows, and it reads and
> > writes to your heart's content!
> > 
> > (you can check that with Iometer)
> > TIA
> > 
> > Leonid
> > 
> > > -----Original Message-----
> > > From: Randy Kreiser [mailto:rkreiser at datadirectnet.com]
> > > Sent: Friday, December 21, 2007 4:53 PM
> > > To: Leonid Keller; rob at systemfabricworks.com; 
> > > ofw at lists.openfabrics.org
> > > Subject: RE: [ofw] SRP bug?
> > > 
> > > Leonid, I set the registry value you wanted to "1" and it fails
> > > much quicker, but that was the only change I saw; it still fails
> > > the format.
> > > 
> > > Randy
> > > 
> > > -----Original Message-----
> > > From: Leonid Keller [mailto:leonid at mellanox.co.il]
> > > Sent: Thursday, December 20, 2007 4:50 AM
> > > To: rob at systemfabricworks.com; ofw at lists.openfabrics.org
> > > Cc: Randy Kreiser
> > > Subject: RE: [ofw] SRP bug?
> > > 
> > > Hi Rob,
> > > 
> > > Thank you for the elaborate analysis. It seems right.
> > > I'd like to get some more information; maybe you or someone else
> > > can help.
> > > 
> > > Did this trace come from an IB sniffer?
> > > (Otherwise we can't be sure that the corruption happens on the
> > > initiator's side.)
> > > 
> > > How often does it happen?
> > > 
> > > How can one reproduce it?
> > > 
> > > What SRP target is being used?
> > > 
> > > Could we ask someone (and whom?) to perform experiments?
> > > For example, I'd suggest setting ModeFlags to 1 in
> > > HKLM\SYSTEM\CurrentControlSet\Services\ibsrp\parameters,
> > > restarting the SRP driver, and rerunning the test.
> > > 
> > > Leonid
> > > 
> > > 
> > > 
> > > > -----Original Message-----
> > > > From: ofw-bounces at lists.openfabrics.org 
> > > > [mailto:ofw-bounces at lists.openfabrics.org] On Behalf Of
> > > > Robert H.B. Netzer
> > > > Sent: Tuesday, December 18, 2007 8:38 PM
> > > > To: ofw at lists.openfabrics.org
> > > > Cc: 'Randy Kreiser'
> > > > Subject: [ofw] SRP bug?
> > > > 
> > > > I have recently been shown a trace of an SRP session between the
> > > > WinOF 1.0.1 SRP initiator and a DDN S2A9550 storage appliance that
> > > > has the following suspicious SRP_CMD.  It seems to contain a bad
> > > > virtual address.  Here is the payload of the send from the
> > > > initiator to the appliance (this is a few hundred cmds into the
> > > > stream):
> > > > 
> > > > 02000000 00200100 EF010000 00000000
> > > > 00000000 00000000 00000000 00000000
> > > > 2A000064 00220000 20000000 00000000
> > > > 00000000 05A83364 A8002201 00000010
> > > > 00004000 03006209 04006309 AA002301
> > > > 00004000
> > > > 
> > > > Consulting the SRP and SCSI specs and decoding this:
> > > > 
> > > > The first row indicates that it's an SRP_CMD, that there is one
> > > > data-out buffer descriptor, and that it's an "indirect data buffer
> > > > descriptor" (type 2h, encoded in the high nibble of the sixth byte
> > > > above).
> > > > 
> > > > The SCSI CDB starts in the third row and is a write (10-byte CDB).
> > > > The length is 20h blocks (16k bytes).
> > > > 
> > > > The data-out buffer descriptor starts at byte 48 (fourth row) and
> > > > consists of a 16-byte "indirect table memory descriptor", a
> > > > four-byte total length (00004000), and one 16-byte "partial memory
> > > > descriptor" (there is one of these because the data-out buffer
> > > > descriptor count, the 7th byte in the SRP_CMD, is 1).
> > > > 
> > > > The suspicious part is the partial memory descriptor, which is this
> > > > (copying the last four words from above): 03006209 04006309
> > > > AA002301 00004000.  This is a virtual address of 03006209 04006309,
> > > > a memory handle (AA002301) that looks like the other ones earlier
> > > > in the trace, and a data length of 16k.
> > > > 
> > > > The SRP stream gets into trouble when the target does an RDMA Read
> > > > Request using this virtual address -- it looks bogus.
> > > > 
> > > > I'm hoping that someone can double-check my decoding of this
> > > > packet, and perhaps Tzachi could take a look.
> > > > 
> > > > Rob Netzer
> > > > System Fabric Works, Inc.
> > > > 
> > > > 
> > > > _______________________________________________
> > > > ofw mailing list
> > > > ofw at lists.openfabrics.org
> > > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ofw
> > > > 
> > > 
> > > 
> > > 
> > > 
> > 
> > 
> > 
> > 
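
For anyone who wants to double-check the decoding in Rob's original
message, the steps he describes can be sketched in a few lines of Python.
This is only an illustration, not code from WinOF or the DDN target: the
helper name decode_srp_cmd and the dictionary keys are made up for this
note, the 512-byte block size is an assumption chosen to match "20h blocks
(16k bytes)", and the offsets assume no additional CDB bytes (byte 31 of
the payload is 00), so that the data-out descriptor starts at byte 48 as
stated in the message.

```python
import struct

# Payload words copied from the SRP_CMD trace quoted in the thread,
# laid out here as 16 bytes (four words) per row.
PAYLOAD_HEX = """
02000000 00200100 EF010000 00000000
00000000 00000000 00000000 00000000
2A000064 00220000 20000000 00000000
00000000 05A83364 A8002201 00000010
00004000 03006209 04006309 AA002301
00004000
"""

BLOCK_SIZE = 512  # assumption: 512-byte blocks, so 20h blocks = 16k bytes


def decode_srp_cmd(payload: bytes) -> dict:
    """Hypothetical helper: pull out only the fields discussed in the thread."""
    cdb = payload[32:48]  # 16-byte CDB field, third row of the dump
    # Indirect data-out descriptor at byte 48, all fields big-endian:
    # table descriptor (8-byte VA, 4-byte handle, 4-byte length),
    # followed by the 4-byte total length.
    table_va, table_key, table_len, total_len = struct.unpack_from(
        ">QIII", payload, 48
    )
    count = payload[6]  # data-out buffer descriptor count (7th byte)
    # One 16-byte partial memory descriptor (VA, handle, length) per count.
    descriptors = [
        struct.unpack_from(">QII", payload, 68 + 16 * i) for i in range(count)
    ]
    return {
        "opcode": payload[0],                # 02h = SRP_CMD
        "data_out_format": payload[5] >> 4,  # high nibble of the sixth byte
        "data_out_count": count,
        "cdb_opcode": cdb[0],                # 2Ah = WRITE(10)
        "lba": struct.unpack(">I", cdb[2:6])[0],
        "transfer_blocks": struct.unpack(">H", cdb[7:9])[0],
        "table": (table_va, table_key, table_len),
        "total_len": total_len,
        "descriptors": descriptors,
    }


payload = bytes.fromhex("".join(PAYLOAD_HEX.split()))
cmd = decode_srp_cmd(payload)
print("transfer bytes:", cmd["transfer_blocks"] * BLOCK_SIZE)
for va, key, length in cmd["descriptors"]:
    print(f"VA=0x{va:016X} handle=0x{key:08X} len=0x{length:X}")
```

Run against the payload above, the single partial memory descriptor comes
out with the same virtual address, memory handle, and length that Rob
quotes, so the decoding in the message appears self-consistent; whether the
address is actually invalid on the initiator is a separate question that
the sketch cannot answer.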


