[Scst-devel] [ofa-general] WinOF_2_0_5/SRP initiator: slow reads and eventually hangs

Vladislav Bolkhovitin vst at vlnb.net
Wed Sep 23 12:11:29 PDT 2009


Chris Worley, on 09/22/2009 02:00 AM wrote:
> On Mon, Sep 21, 2009 at 10:59 AM, Vladislav Bolkhovitin <vst at vlnb.net> wrote:
>> Chris Worley, on 09/19/2009 01:31 AM wrote:
>>> On Mon, Sep 7, 2009 at 5:58 AM, Vladislav Bolkhovitin <vst at vlnb.net>
>>> wrote:
>>>> Chris Worley, on 09/06/2009 05:41 PM wrote:
>>>>> On Sun, Sep 6, 2009 at 3:36 PM, Chris Worley<worleys at gmail.com> wrote:
>>>>>> On Sun, Sep 6, 2009 at 3:17 PM, Bart Van
>>>>>> Assche<bart.vanassche at gmail.com>
>>>>>> wrote:
>>>>>>> On Fri, Sep 4, 2009 at 1:20 AM, Chris Worley <worleys at gmail.com>
>>>>>>> wrote:
>>>>>>>> On Thu, Sep 3, 2009 at 11:38 AM, Chris Worley<worleys at gmail.com>
>>>>>>>> wrote:
>>>>>>>>> I've used a couple of initiators (different systems) w/ different
>>>>>>>>> OSes, w/ different IB cards (all QDR) and different IB stacks
>>>>>>>>> (built-in vs. OFED) and can repeat the problem in all but the
>>>>>>>>> RHEL5.2/OFED 1.4.1 target and initiator (but, if the initiator is
>>>>>>>>> WinOF and the target is RHEL5.2/OFED1.4.1, then the problem does
>>>>>>>>> repeat).
>>>>>>>> Here's a twist: I used the Ubuntu initiator w/ one of the RHEL
>>>>>>>> targets, and the RHEL initiator (same machine as was running WinOF
>>>>>>>> from the beginning of this thread) w/ one of the Ubuntu targets: in
>>>>>>>> both cases, the problem does not repeat.
>>>>>>>>
>>>>>>>> That makes it sound like OFED is the cure on either side of the
>>>>>>>> connection, but does not explain the issue w/ WinOF (which does fail
>>>>>>>> w/ either Ubuntu or RHEL targets).
>>>>>>> These results are strange. Regarding the Linux-only tests, I was
>>>>>>> assuming failure of a single component (Ubuntu SRP initiator, OFED SRP
>>>>>>> initiator, Ubuntu IB driver, OFED IB driver or SRP target), but for
>>>>>>> each of these components there is at least one test that passes and at
>>>>>>> least one test that fails. So either my assumption is wrong or one of
>>>>>>> the above test results is not repeatable. Do you have the time to
>>>>>>> repeat the Linux-only tests ?
>>>>>> Last night I was rerunning the RHEL5.2 initiator w/ Ubuntu client, and
>>>>>> the problem repeated; now, I can't repeat the case where it didn't
>>>>>> fail.  Still, no errors, other than the eventual timeouts previously
>>>>>> shown; the target thinks all is fine, the initiator is stuck.
>>>>> ... and I haven't had any success w/ Ubuntu target and initiator, 8.10 or
>>>>> 9.04.
>>>> 1. Try with kernel parameter maxcpus=1. It will somewhat reduce the races
>>>> you may have, although not eliminate them.
>>> I finally got around to this test... 1 CPU works very well, w/o hangs
>>> (will test all night to see if this holds true), 2 or more don't.
>>> This is dual-socket NHM, so I can't specify more than one processor
>>> w/o getting more than one socket.
>> Where 1 CPU works well, on the target or initiator?
> 
> That was on the target.
> 
>> The race is on the
>> corresponding host.
>>
>> I'd suggest you reproduce the problem with the latest SCST trunk, with lockdep
>> enabled on the suspected host (better on both) and the mgmt_minor trace level
>> enabled on the target. Then, after the hang, leave the system alone for about
>> half an hour, then send Bart and me (privately, compressed) the kernel logs
>> from both systems, starting from the early boot messages.
> 
> I believe I comprehensively tested w/ Lockdep and complete scst
> messages dumps on the target (and lockdep on the initiator) and came
> up with no messages or lock issues salient to the issue.
> 
> If you think I should repeat this, I will.

You didn't leave it for half an hour and didn't send us the logs, did you?

But since Bart reproduced something similar, it isn't too important now, 
although still desired.

>> If you have dmesg only output, please enable printk timestamps
>> (CONFIG_PRINTK_TIME).
> 
> Ubuntu has been pretty good about that.
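
To double-check a saved log before sending it, something like the following
could be used. It is only a minimal sketch, not part of SCST: the function
names are made up for illustration, and it assumes raw dmesg output where each
line starts with the "[  seconds.micros]" prefix that CONFIG_PRINTK_TIME adds.
It reports how many lines carry a timestamp and how much time the log covers,
which also shows whether the half hour after the hang was captured.

```python
import re

# Assumed line format: "[ 1234.567890] message", optionally preceded by a
# "<N>" log-level marker as seen in raw /proc/kmsg output.
TIMESTAMP_RE = re.compile(r"^(?:<\d+>)?\[\s*(\d+\.\d+)\]\s")


def timestamp_of(line):
    """Return the printk timestamp in seconds, or None if the line has none."""
    match = TIMESTAMP_RE.match(line)
    return float(match.group(1)) if match else None


def timestamped_fraction(lines):
    """Fraction of non-empty lines that start with a printk timestamp."""
    non_empty = [line for line in lines if line.strip()]
    if not non_empty:
        return 0.0
    stamped = sum(1 for line in non_empty if timestamp_of(line) is not None)
    return stamped / len(non_empty)


def covered_seconds(lines):
    """Seconds between the first and last timestamped line (0.0 if fewer than two)."""
    stamps = [t for t in map(timestamp_of, lines) if t is not None]
    return stamps[-1] - stamps[0] if len(stamps) >= 2 else 0.0
```

Feed it the lines of the saved log, e.g. open(path).read().splitlines(). A
fraction well below 1.0 suggests timestamps were not enabled for the whole
capture.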
> 
> Thanks,
> 
> Chris
>>> Chris
>>>> 2. Try different hardware, including the motherboard. You may be hitting
>>>> something like http://lkml.org/lkml/2007/7/31/558 (not exactly it, of course).
>>>>
>>>>> Chris
>>>>>> Chris
>>>>>>> Bart.
>>>>>>>
>>
> 
