[ofa-general] using SDP for block device traffic: several problems

Amir Vadai amirv at mellanox.co.il
Wed Jul 8 05:26:18 PDT 2009


OK - the bug is now reproduced on your kernel + config.

Will check it now.

On 07/08/2009 01:33 PM, Amir Vadai wrote:
> see below
>
> On 07/08/2009 01:17 PM, Lars Ellenberg wrote:
>   
>> On Wed, Jul 08, 2009 at 11:12:15AM +0300, Amir Vadai wrote:
>>> Hi Lars,
>>>
>>> I opened a bug in our bugzilla (https://bugs.openfabrics.org/show_bug.cgi?id=1672).
>>>
>>> I couldn't reproduce it on my setup: SLES 10 SP2, stock kernel, same OFED git version.
>>> I will now try to install a 2.6.27 kernel and check again.
>>
>> With a "normal" kernel config, I needed to run full-load bi-directional
>> network traffic over IPoIB as well as SDP, with multiple stream sockets,
>> to eventually trigger it after a few minutes
>> (several hundred megabytes per second).
>>
>> With the "debug" kernel config,
>> I was able to reproduce it with only one socket,
>> within milliseconds.
>>
>> My .config is attached.
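>>
>> For illustration, a minimal sketch of the kind of load generator I mean
>> (a sketch only: the names and port are mine, and AF_INET_SDP = 27 follows
>> the usual libsdp convention - a plain AF_INET build run under
>> LD_PRELOAD=libsdp.so would do as well):
>>
>> #include <arpa/inet.h>
>> #include <netinet/in.h>
>> #include <pthread.h>
>> #include <stdio.h>
>> #include <string.h>
>> #include <sys/socket.h>
>> #include <unistd.h>
>>
>> #define AF_INET_SDP 27        /* assumption: the libsdp convention */
>> #define NSOCK 4               /* several parallel stream sockets */
>> #define CHUNK (32 * 1024)     /* large-ish writes */
>>
>> static const char *server_ip;
>>
>> static void *blast(void *arg)
>> {
>>     char buf[CHUNK];
>>     struct sockaddr_in sa = { 0 };
>>     int fd = socket(AF_INET_SDP, SOCK_STREAM, 0);
>>
>>     (void)arg;
>>     memset(buf, 0x5a, sizeof(buf));
>>     sa.sin_family = AF_INET;  /* SDP reuses the IPv4 address format */
>>     sa.sin_port = htons(7777);
>>     inet_pton(AF_INET, server_ip, &sa.sin_addr);
>>     if (fd < 0 || connect(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
>>         perror("connect");
>>         return NULL;
>>     }
>>     for (;;)                  /* saturate the link until killed */
>>         if (write(fd, buf, sizeof(buf)) < 0) {
>>             perror("write");
>>             break;
>>         }
>>     close(fd);
>>     return NULL;
>> }
>>
>> int main(int argc, char **argv)
>> {
>>     pthread_t tid[NSOCK];
>>     int i;
>>
>>     if (argc != 2) {
>>         fprintf(stderr, "usage: %s <server-ip>\n", argv[0]);
>>         return 1;
>>     }
>>     server_ip = argv[1];
>>     for (i = 0; i < NSOCK; i++)
>>         pthread_create(&tid[i], NULL, blast, NULL);
>>     for (i = 0; i < NSOCK; i++)
>>         pthread_join(tid[i], NULL);
>>     return 0;
>> }
>>
>> Compile with gcc -pthread and run one instance in each direction
>> (against a peer listening on the port) to get the bi-directional load.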
>
> I will test it with your config and kernel version.
>   
>>> BTW, what type of servers do you use? Are they low- or high-end servers?
>>
>> This is the second cluster that shows this bug.  I first experienced it
>> when using SDP sockets from within kernel space.
>> I was then able to reproduce it in userland,
>> which I thought might make it easier for you to reproduce.
>>
>> The current test cluster is a slightly aged 2U Supermicro dual quad-core
>> with 4 GB RAM, and it has proved to be very reliable hardware in all
>> tests up to now.  It may be a little slow on interrupts.
>>
>> Tail of /proc/cpuinfo:
>> processor       : 7
>> vendor_id       : GenuineIntel
>> cpu family      : 6
>> model           : 15
>> model name      : Intel(R) Xeon(R) CPU           E5310  @ 1.60GHz
>> stepping        : 7
>> cpu MHz         : 1599.984
>> cache size      : 4096 KB
>> physical id     : 1
>> siblings        : 4
>> core id         : 3
>> cpu cores       : 4
>> apicid          : 7
>> initial apicid  : 7
>> fpu             : yes
>> fpu_exception   : yes
>> cpuid level     : 10
>> wp              : yes
>> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
>> mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe
>> syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl pni
>> monitor ds_cpl vmx tm2 ssse3 cx16 xtpr dca lahf_lm
>> bogomips        : 3201.35
>> clflush size    : 64
>> cache_alignment : 64
>> address sizes   : 36 bits physical, 48 bits virtual
>> power management:
>>
>> The IB setup is a direct link; lspci says:
>> 09:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX IB QDR, PCIe 2.0 5GT/s] (rev a0)
>>
>>
>> Because IPoIB works just fine, I don't think we have issues
>> with the IB setup, or with the hardware in general.
>> Only when using SDP does it break: it "forgets" bytes, or corrupts data.
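>>
>> To tell the two failure modes apart, a receive-side check along these
>> lines works (a sketch, not my actual test code; AF_INET_SDP and the
>> port are assumptions as above): the sender stamps consecutive 32-bit
>> counter values into the stream, so a value from further ahead means
>> dropped bytes, while a mangled value at the expected position means
>> corruption in place.
>>
>> #include <netinet/in.h>
>> #include <stdint.h>
>> #include <stdio.h>
>> #include <sys/socket.h>
>> #include <unistd.h>
>>
>> #define AF_INET_SDP 27        /* assumption: the libsdp convention */
>>
>> /* assumes reads return whole 32-bit words; real code would buffer
>>  * a trailing partial word */
>> int main(void)
>> {
>>     int lfd = socket(AF_INET_SDP, SOCK_STREAM, 0);
>>     int fd;
>>     struct sockaddr_in sa = { 0 };
>>     uint32_t buf[8192], expect = 0;
>>     ssize_t n, i;
>>
>>     sa.sin_family = AF_INET;
>>     sa.sin_port = htons(7777);
>>     sa.sin_addr.s_addr = htonl(INADDR_ANY);
>>     if (lfd < 0 || bind(lfd, (struct sockaddr *)&sa, sizeof(sa)) < 0 ||
>>         listen(lfd, 1) < 0) {
>>         perror("setup");
>>         return 1;
>>     }
>>     fd = accept(lfd, NULL, NULL);
>>     if (fd < 0) {
>>         perror("accept");
>>         return 1;
>>     }
>>     while ((n = read(fd, buf, sizeof(buf))) > 0) {
>>         for (i = 0; i < n / 4; i++, expect++) {
>>             if (buf[i] == expect)
>>                 continue;
>>             if (buf[i] > expect)      /* stream jumped ahead */
>>                 fprintf(stderr, "forgot %u words before #%u\n",
>>                         buf[i] - expect, expect);
>>             else                      /* value mangled in place */
>>                 fprintf(stderr, "corrupt word #%u: got %u\n",
>>                         expect, buf[i]);
>>             expect = buf[i];          /* resync and keep checking */
>>         }
>>     }
>>     return 0;
>> }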
>
> I'm testing on low-end SDR machines, and sometimes we have bugs that
> only show up on high-bandwidth setups.
> You have such a setup.
>   
>> What I do "differently" from the (assumed to be) typical SDP user is
>> sending large-ish messages at once (up to ~32 kB), possibly from
>> unaligned buffers; see the sketch below.
>>
>> That is apparently a mode SDP has not exercised much yet,
>> otherwise the recently fixed page leak would have been noticed by
>> someone much earlier.
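>>
>> "Unaligned" here just means the buffer handed to write() need not start
>> on a page or even word boundary.  A trivial sketch (the helper name is
>> mine; fd would be a connected SDP stream socket):
>>
>> #include <stdlib.h>
>> #include <string.h>
>> #include <unistd.h>
>>
>> static ssize_t send_unaligned(int fd)
>> {
>>     char *raw = malloc(32 * 1024 + 8);
>>     char *msg;
>>     ssize_t n;
>>
>>     if (raw == NULL)
>>         return -1;
>>     msg = raw + 3;            /* 3 bytes in: not even word aligned */
>>     memset(msg, 0x5a, 32 * 1024);
>>     n = write(fd, msg, 32 * 1024);    /* up to ~32 kB in one call */
>>     free(raw);
>>     return n;
>> }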
>>
> - Amir
>   


