[ofw] FYI - WinOF RC4 build status - waiting on patch review &commit.

Tzachi Dar tzachid at mellanox.co.il
Wed Oct 22 06:48:15 PDT 2008


I have decided to apply the latest version of this patch (with the
changes that were sent today by Anatoly, plus one change from me).

The bad reason for this check-in is that I have realized that even
without this check-in we are still seeing blue screens from time to
time (see below for more details). In the past we saw this blue screen,
but then we were not able to reproduce it. Since ipoib will need more
debugging anyway, we have more time to catch any new issues that are
introduced by this check-in.

Please see below for more answers.

Thanks
Tzachi

Here is the call stack for the bug that happened. I wonder if anyone
has seen it in recent builds.
1: kd> kb
ChildEBP RetAddr  Args to Child              
ba256b14 ba022042 badb0d00 00000000 01c93302 nt!KiTrap0E+0x2a7
ba256b94 ba019599 86f686c8 88f04630 80828f34 ipoib!cl_qmap_remove_item+0x2e
[s:\builds\3329\branches\mlnx_winof_2-0\core\complib\cl_map.c @ 1005]
ba256bc4 ba01aa04 00040002 896d6818 00020002 ipoib!__endpt_mgr_reset_all+0xc3
[s:\builds\3329\branches\mlnx_winof_2-0\ulp\ipoib\kernel\ipoib_port.c @ 4558]
ba256c80 ba0154da 86f684c0 ba256d28 ba65d420 ipoib!ipoib_port_down+0x186
[s:\builds\3329\branches\mlnx_winof_2-0\ulp\ipoib\kernel\ipoib_port.c @ 5719]
ba256ca4 ba630096 00020002 87dad6f8 87dad660 ipoib!__ipoib_pnp_cb+0x29a
[s:\builds\3329\branches\mlnx_winof_2-0\ulp\ipoib\kernel\ipoib_adapter.c @ 710]
ba256d00 ba630f7a e16abcb8 8a8ac0fc 8ab3930c ibbus!__pnp_notify_user+0x13c
[s:\builds\3329\branches\mlnx_winof_2-0\core\al\kernel\al_pnp.c @ 555]
ba256d14 ba6310ea 8ab3930c 8ab39138 8ab39310 ibbus!__pnp_process_port_backward+0x7e
[s:\builds\3329\branches\mlnx_winof_2-0\core\al\kernel\al_pnp.c @ 1318]
ba256d48 ba6313b4 8ab39138 8a8ac008 8a9c1154 ibbus!__pnp_check_ports+0x12c
[s:\builds\3329\branches\mlnx_winof_2-0\core\al\kernel\al_pnp.c @ 1416]
ba256d70 ba62293b 8ad16904 8a9c1128 8a9c10bc ibbus!__pnp_check_events+0xac
[s:\builds\3329\branches\mlnx_winof_2-0\core\al\kernel\al_pnp.c @ 1566]
ba256d88 ba622aef 8a9c10bc 00000000 8a9cfaa0 ibbus!__cl_async_proc_worker+0x23
[s:\builds\3329\branches\mlnx_winof_2-0\core\complib\cl_async_proc.c @ 153]
ba256d9c ba622ecf 8a9c10bc 8ab29020 ba256ddc ibbus!__cl_thread_pool_routine+0x35
[s:\builds\3329\branches\mlnx_winof_2-0\core\complib\cl_threadpool.c @ 66]
ba256dac 80948bb2 8a9cfaa0 00000000 00000000 ibbus!__thread_callback+0x21
[s:\builds\3329\branches\mlnx_winof_2-0\core\complib\kernel\cl_thread.c @ 49]
ba256ddc 8088d4d2 ba622eae 8a9cfaa0 00000000 nt!PspSystemThreadStartup+0x2e
00000000 00000000 00000000 00000000 00000000 nt!KiThreadStartup+0x16

> -----Original Message-----
> From: Smith, Stan [mailto:stan.smith at intel.com] 
> Sent: Wednesday, October 22, 2008 12:48 AM
> To: Tzachi Dar; ofw at lists.openfabrics.org
> Subject: RE: [ofw] FYI - WinOF RC4 build status - waiting on 
> patch review &commit.
> 
> Tzachi Dar wrote:
> > Since this patch is changing a very sensitive area in IPOIB 
> there are 
> > three things that I would like to ask:
> >
> > 1) Taking into consideration that even without this patch 
> there is a 
> > cluster with 2000 nodes, how important is this patch?
> 
> Any method which we can reduce transaction pressure on the SA 
> is good in terms of large MPI job startup. OFED testing on 
> large clusters demonstrated SA transaction rates were a 
> limiting factor in large node count MPI job startup times. 
> Following closely behind SA transaction times in terms of 
> cost, were ARP reply processing times.
> 
> I'm not familiar with the 2000 node system you speak of or 
> what was actually accomplished on the 2000 nodes? An MPI job 
> I suspect?
This cluster was indeed running MPI.

> 
> Would a patch such as this reduce MPI startup time, by some 
> factor (at least 2000 less SA query operations)?
> Reducing startup time by minutes is very good, small numbers 
> of seconds...interesting but not so important.
> I don't know if this patch would have that kind of effect on 
> a large system; you or the Voltaire patch developers would 
> know better.
> 
This patch only affects the process of IPOIB coming up. In other
words, the time needed to run an MPI job will not be reduced by even
a second. On the other hand, if there are nodes that are already up
and then they recognize a new SM, this is when the patch helps. In this
case, the workload on the SM is reduced.



> 
> >
> > 2) Will it be possible to check this in only to the trunk 
> and not to 
> > the branch?
> 
> Certainly possible. I see the question as where does one 
> believe the highest degree of testing will occur? Release 
> branch or mainline?
> 
> >
> > 3) How much testing did Voltaire do with this patch?
> 
> Since the patch involves using already acquired local MAD 
> information instead of an SA query, and the MAD information 
> had been used before, then there exists some degree of 
> confidence in the MAD data.
> What's the possibility the data has gone bad?
> 
> What problems are there in getting the data to the correct consumers?
> 
> In my limited understanding, I could see the patch being 
> fairly easy to determine if it's working or not; yes?
> 

It is relatively easy to see that this patch is working just great for
its good flow. My real fear is what happens if the local_mad fails.
I have reviewed the code (fixed one bug) and it seems fine; let's hope
that there are no other bugs introduced.


> I do not know the extent of testing which was applied?
> Perhaps Voltaire developers can enlighten us?
> 
> The WinOF release members are ready to start digesting RC4; 
> the point being if/when you feel the patch is good to go - 
> others can assist in testing.
> 
> 
> From a WinOF 2.0 release schedule point of view:
> 
> 1) what problems are resolved by including this patch
> 2) Do those problems merit further delay in WinOF 2.0 release?
> 3) How long will it take to verify patch correctness?
> 
> Your questions and the WinOF release schedule impact 
> questions can all be discussed in the upcoming WWG meeting Wednesday.
> 
> Thank you for the good questions.
> Looking forward to a lively discussion.
> 
> Stan.
> 
> 
> 
> >
> > Thanks
> > Tzachi
> >
> >> -----Original Message-----
> >> From: ofw-bounces at lists.openfabrics.org 
> >> [mailto:ofw-bounces at lists.openfabrics.org] On Behalf Of Smith, Stan
> >> Sent: Tuesday, October 21, 2008 6:56 PM
> >> To: ofw at lists.openfabrics.org
> >> Subject: [ofw] FYI - WinOF RC4 build status - waiting on 
> patch review 
> >> &commit.
> >>
> >>
> >> Waiting for review & commit of 'Using ib_local_mad instead of SM 
> >> query' patch.
> >>
> >> Stan.
> >> _______________________________________________
> >> ofw mailing list
> >> ofw at lists.openfabrics.org
> >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ofw
> 
> 


