[ofw] OpenSm issues
Smith, Stan
stan.smith at intel.com
Wed Mar 16 09:56:17 PDT 2011
From: Uri Habusha [mailto:urih at mellanox.co.il]
Sent: Wednesday, March 16, 2011 5:10 AM
To: Smith, Stan; ofw at lists.openfabrics.org; Gilad Margalit
Cc: Tziporet Koren
Subject: OpenSm issues
Hi Stan,
Recently we resumed running the regression in debug mode. Each night we encounter many issues with OpenSM. Below are 3 different issues related to OpenSM.
Who is responsible for OpenSM? What testing is done?
Please advise how to proceed with investigating and fixing these failures.
Hello,
True, I was likely the last person to touch OpenSM, although at this time I do not have any cycles to address winOFED issues.
Unfortunately, you are on your own debug path.
Perhaps discussions with the new OFED for Linux OpenSM maintainer Alex Netes [alexne at mellanox.com] might shed some light on the failures?
Tzachi and Leo maintained OpenSM long before I became involved.
As always, a stack backtrace without any operational/environmental context is difficult, at best, to make any sense of.
W.r.t. OpenSM testing:
1) All osmtest flavors passed.
2) A single OpenSM (multiple Mellanox switches) configuring a 53-node HPC cluster.
3) Multiple Windows OpenSMs tested for master/slave and failover operation.
4) Multiple Windows and Linux OpenSMs tested for master/slave and failover operation.
Microsoft HPC validation has used the current OpenSM on larger HPC clusters?
Stan.
Thanks, Uri
0: kd> kb
RetAddr : Args to Child : Call Site
00000000`ff3f2c36 : 00000000`000a6f00 00000000`00000000 00000000`00000000 00000000`ff368e60 : ntdll!DbgBreakPoint
00000000`ff3ecfbc : 00000000`00602ba0 00000000`006fdde0 00000000`00000001 00000000`74da554c : opensm!osm_vendor_send+0x106 [s:\builds\7523\trunk\ulp\opensm\user\libvendor\osm_vendor_ibumad.c @ 1057]
00000000`ff3ed26f : 00000000`000cf7a0 00000000`006fdde0 00000000`00000001 00000000`ff367eb8 : opensm!vl15_send_mad+0x8c [s:\builds\7523\trunk\ulp\opensm\user\opensm\osm_vl15intf.c @ 81]
00000000`74db2d3a : 00000000`000cf7a0 00000000`00000000 00000000`00000000 00000000`00000000 : opensm!vl15_poller+0x16f [s:\builds\7523\trunk\ulp\opensm\user\opensm\osm_vl15intf.c @ 151]
00000000`76c2be3d : 00000000`000cf7b8 00000000`00000000 00000000`00000000 00000000`00000000 : complibd!cl_thread_callback+0x1a [s:\builds\7523\trunk\core\complib\user\cl_thread.c @ 49]
00000000`76d66611 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : kernel32!BaseThreadInitThunk+0xd
00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : ntdll!RtlUserThreadStart+0x1d
3: kd> kb
RetAddr : Args to Child : Call Site
00000000`74fd3c88 : 00000000`0016f748 00000000`00000000 00000000`00000000 00000000`00000000 : ntdll!DbgBreakPoint
00000000`ff91d1ae : 00000000`0016f748 00000000`001afe10 00000000`00000001 00000000`ff897eb8 : complibd!cl_qlist_remove_head+0x98 [s:\builds\7523\trunk\inc\complib\cl_qlist.h @ 1220]
00000000`74fe2d3a : 00000000`0016f700 00000000`00000000 00000000`00000000 00000000`00000000 : opensm!vl15_poller+0xae [s:\builds\7523\trunk\ulp\opensm\user\opensm\osm_vl15intf.c @ 138]
00000000`76e6466d : 00000000`0016f718 00000000`00000000 00000000`00000000 00000000`00000000 : complibd!cl_thread_callback+0x1a [s:\builds\7523\trunk\core\complib\user\cl_thread.c @ 49]
00000000`76f98791 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : kernel32!BaseThreadInitThunk+0xd
00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : ntdll!RtlUserThreadStart+0x1d
3: kd> kb
RetAddr : Args to Child : Call Site
00000000`779e7396 : 00000000`00000002 00000001`00000023 00000000`005bd360 00000000`00000003 : ntdll!RtlReportCriticalFailure+0x2f
00000000`779e86c2 : 00000000`00000000 000601d8`02b138dc 00000000`00000000 00000000`00000000 : ntdll!RtlpReportHeapFailure+0x26
00000000`779ea0c4 : 00000000`005b0000 00000000`00000000 00000000`005bd200 00000000`005bd360 : ntdll!RtlpHeapHandleError+0x12
00000000`7797ea00 : 00000000`005b0000 00000000`001b2d30 00000000`005bd270 00000000`0000029c : ntdll!RtlpLogHeapFailure+0xa4
00000000`779729ac : 00000000`005b0000 00000001`00000002 00000000`000000e0 00000000`000000f0 : ntdll!RtlpAllocateHeap+0x2105
000007fe`ffad1332 : 00000000`00000003 00000000`000000e0 00000000`2821b917 00000000`00000000 : ntdll!RtlAllocateHeap+0x16c
00000000`ff6514cc : 00000000`00000000 00000000`00000000 00000000`005bd370 00000000`000000b0 : msvcrt!malloc+0x70
00000000`ff6ad144 : 00000000`000ff630 00000000`001b2990 00000000`00000100 00000000`00d4f900 : opensm!osm_mad_pool_get+0x7c [s:\builds\7523\trunk\ulp\opensm\user\opensm\osm_mad_pool.c @ 86]
000007fe`f9542a1a : 00000000`001b2940 00000000`00000000 00000000`00000000 00000000`00000000 : opensm!umad_receiver+0x3b4 [s:\builds\7523\trunk\ulp\opensm\user\libvendor\osm_vendor_ibumad.c @ 314]
00000000`7771f56d : 00000000`001b2940 00000000`00000000 00000000`00000000 00000000`00000000 : complibd!cl_thread_callback+0x1a [s:\builds\7523\trunk\core\complib\user\cl_thread.c @ 49]
00000000`77953281 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : kernel32!BaseThreadInitThunk+0xd
00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : ntdll!RtlUserThreadStart+0x1d
Uri Habusha
Windows SW Development Lead
Mellanox Technologies
P.OBox 586, Yokneam 20692
Israel
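
Since the three `kb` traces above are hard to compare by eye, a small script can reduce each one to its call sites. The following is only a minimal sketch, assuming the frame layout seen in this thread (RetAddr, four arguments, then `module!symbol+0xoffset [source @ line]`). The names `parse_frame` and `innermost_source_frame` are hypothetical helpers written for illustration; they are not part of WinDbg, OpenSM, or complib.

```python
import re

# One frame of `kb` output, in the layout shown in this thread (an assumption):
#   <RetAddr> : <arg1> <arg2> <arg3> <arg4> : module!symbol+0xOFF [source @ line]
ADDR = r"[0-9a-f]{8}`[0-9a-f]{8}"
FRAME_RE = re.compile(
    r"^(?P<ret>" + ADDR + r")\s*:\s*"
    r"(?P<args>(?:" + ADDR + r"\s+){3}" + ADDR + r")\s*:\s*"
    r"(?P<module>[^!\s]+)!(?P<symbol>[^\s+]+)"
    r"(?:\+0x(?P<offset>[0-9a-f]+))?"
    r"(?:\s+\[(?P<source>.+?)\s+@\s+(?P<line>\d+)\])?\s*$",
    re.IGNORECASE,
)


def parse_frame(text):
    """Return a dict describing one frame, or None for non-frame lines
    (prompts such as '3: kd> kb', the column header, blank lines)."""
    m = FRAME_RE.match(text.strip())
    if m is None:
        return None
    d = m.groupdict()
    return {
        "module": d["module"],
        "symbol": d["symbol"],
        "offset": int(d["offset"], 16) if d["offset"] else 0,
        "source": d["source"],                       # None when no symbols/source
        "line": int(d["line"]) if d["line"] else None,
    }


def parse_trace(trace):
    """Parse every frame line in a multi-line trace, innermost frame first."""
    frames = (parse_frame(line) for line in trace.splitlines())
    return [f for f in frames if f is not None]


def innermost_source_frame(frames):
    """First frame that carries source information, i.e. the deepest frame
    in code built with source-level symbols. Returns None if there is none."""
    for f in frames:
        if f["source"] is not None:
            return f
    return None


if __name__ == "__main__":
    # Two frames copied from the second trace in this message.
    sample = "\n".join([
        "3: kd> kb",
        "RetAddr : Args to Child : Call Site",
        r"00000000`74fd3c88 : 00000000`0016f748 00000000`00000000 "
        r"00000000`00000000 00000000`00000000 : ntdll!DbgBreakPoint",
        r"00000000`ff91d1ae : 00000000`0016f748 00000000`001afe10 "
        r"00000000`00000001 00000000`ff897eb8 : "
        r"complibd!cl_qlist_remove_head+0x98 "
        r"[s:\builds\7523\trunk\inc\complib\cl_qlist.h @ 1220]",
    ])
    top = innermost_source_frame(parse_trace(sample))
    print(top["module"], top["symbol"], top["source"], top["line"])
```

Running each trace through `innermost_source_frame` gives one line per failure (module, function, file, line number), which is easier to attach to a report together with the operational context requested above.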