[Scst-devel] [ofa-general] WinOF_2_0_5/SRP initiator: slow reads and eventually hangs

Vladislav Bolkhovitin vst at vlnb.net
Mon Sep 21 09:59:24 PDT 2009


Chris Worley, on 09/19/2009 01:31 AM wrote:
> On Mon, Sep 7, 2009 at 5:58 AM, Vladislav Bolkhovitin <vst at vlnb.net> wrote:
>> Chris Worley, on 09/06/2009 05:41 PM wrote:
>>> On Sun, Sep 6, 2009 at 3:36 PM, Chris Worley<worleys at gmail.com> wrote:
>>>> On Sun, Sep 6, 2009 at 3:17 PM, Bart Van Assche<bart.vanassche at gmail.com>
>>>> wrote:
>>>>> On Fri, Sep 4, 2009 at 1:20 AM, Chris Worley <worleys at gmail.com> wrote:
>>>>>> On Thu, Sep 3, 2009 at 11:38 AM, Chris Worley<worleys at gmail.com> wrote:
>>>>>>> I've used a couple of initiators (different systems) w/ different
>>>>>>> OSes, w/ different IB cards (all QDR) and different IB stacks
>>>>>>> (built-in vs. OFED) and can repeat the problem in all but the
>>>>>>> RHEL5.2/OFED 1.4.1 target and initiator (but, if the initiator is
>>>>>>> WinOF and the target is RHEL5.2/OFED1.4.1, then the problem does
>>>>>>> repeat).
>>>>>> Here's a twist: I used the Ubuntu initiator w/ one of the RHEL
>>>>>> targets, and the RHEL initiator (same machine as was running WinOF
>>>>>> from the beginning of this thread) w/ one of the Ubuntu targets: in
>>>>>> both cases, the problem does not repeat.
>>>>>>
>>>>>> That makes it sound like OFED is the cure on either side of the
>>>>>> connection, but does not explain the issue w/ WinOF (which does fail
>>>>>> w/ either Ubuntu or RHEL targets).
>>>>> These results are strange. Regarding the Linux-only tests, I was
>>>>> assuming failure of a single component (Ubuntu SRP initiator, OFED SRP
>>>>> initiator, Ubuntu IB driver, OFED IB driver or SRP target), but for
>>>>> each of these components there is at least one test that passes and at
>>>>> least one test that fails. So either my assumption is wrong or one of
>>>>> the above test results is not repeatable. Do you have the time to
>>>>> repeat the Linux-only tests ?
>>>> Last night I was rerunning the RHEL5.2 initiator w/ Ubuntu client, and
>>>> the problem repeated; now, I can't repeat the case where it didn't
>>>> fail.  Still, no errors, other than the eventual timeouts previously
>>>> shown; the target thinks all is fine, the initiator is stuck.
>>> ... and I haven't had any success w/ Ubuntu target and initiator, 8.10 or
>>> 9.04.
>> 1. Try with kernel parameter maxcpus=1. It will somewhat relax possible races
>> you have, although not completely.
> 
> I finally got around to this test... 1 CPU works very well, w/o hangs
> (will test all night to see if this holds true), 2 or more don't.
> This is dual-socket NHM, so I can't specify more than one processor
> w/o getting more than one socket.

Where does 1 CPU work well: on the target or on the initiator? The race
is on the corresponding host.

I'd suggest that you reproduce the problem with the latest SCST trunk,
with lockdep enabled on the suspected host (better on both) and the
mgmt_minor trace level enabled on the target. Then, after the hang,
leave the system alone for about half an hour and send Bart and me
(privately, compressed) the kernel logs from both systems, starting
from the early boot messages.

If you have only dmesg output, please enable printk timestamps
(CONFIG_PRINTK_TIME).
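
Before rebooting into the test kernel, it may help to confirm that the
debug options are actually set. Below is a minimal sketch, not a
definitive check: CONFIG_PRINTK_TIME is the option named above, while
CONFIG_PROVE_LOCKING and CONFIG_LOCKDEP are my assumption of what
"lockdep enabled" amounts to in the kernel config. The helper name
missing_options is hypothetical.

```python
import re

# CONFIG_PRINTK_TIME is named in the message above.
# CONFIG_PROVE_LOCKING / CONFIG_LOCKDEP are an assumption about which
# options "lockdep enabled" corresponds to; adjust for your kernel.
WANTED = ("CONFIG_PRINTK_TIME", "CONFIG_PROVE_LOCKING", "CONFIG_LOCKDEP")


def missing_options(config_text, wanted=WANTED):
    """Return the wanted options that are not set to 'y' in a kernel config text."""
    # Lines such as "# CONFIG_FOO is not set" do not match and count as missing.
    enabled = set(re.findall(r"^(CONFIG_\w+)=y\s*$", config_text, flags=re.MULTILINE))
    return [opt for opt in wanted if opt not in enabled]
```

Feed it the text of the running kernel's config, for example the
contents of /boot/config-<release> or the unpacked /proc/config.gz
where the distribution provides one. An empty list means all of the
listed options are built in.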

> Chris
>> 2. Try different hardware, including the motherboard. You may be hitting
>> something like http://lkml.org/lkml/2007/7/31/558 (not exactly it, of course)
>>
>>> Chris
>>>> Chris
>>>>> Bart.
>>>>>



