Andrew,<br><br><div><span class="gmail_quote">On 7/19/07, <b class="gmail_sendername">Andrew Friedley</b> <<a href="mailto:afriedle@open-mpi.org">afriedle@open-mpi.org</a>> wrote:</span><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

Finally was able to have the SM switched over from Cisco on the switch<br>to OpenSM on a node.  Responses inline below..<br><br>Sean Hefty wrote:<br>>> Now the more interesting part.  I'm now able to run on a 128 node

<br>>> machine using open SM running on a node (before, I was running on an 8<br>>> node machine which I'm told is running the Cisco SM on a Topspin<br>>> switch).  On this machine, if I run my benchmark with two processes

<br>>> per node (instead of one, i.e. mpirun -np 16 with 8 nodes), I'm able<br>>> to join  > 750 groups simultaneously from one QP on each process.  To<br>>> make this stranger, I can join only 4 groups running the same thing on

<br>>> the 8-node machine.<br>><br>> Are the switches and HCAs in the two setups the same?  If you run the<br>> same SM on both clusters, do you see the same results?<br><br>The switches are different.  The 8 node machine uses a Topspin switch,

<br>the 128 node machine uses a Mellanox switch.  Looking at `ibstat` the<br>HCAs appear to be the same (MT23108), though HCAs on the 128 node<br>machine have firmware 3.2.0, where 3.5.0 is on the 8 node machine.  Does<br>

this matter?<br><br>Running OpenSM now, I still do not see the same results.  Behavior is<br>now the same as the 128 node machine, except when running two processes<br>per node (in which case I can join as many groups as I like on the 128

<br>node machine).  On the 8 node machine I am still limited to 4 groups in<br>this case.  </blockquote><div><br>

I'm not quite parsing what is the same and what is different in the results (I presume the only variable is the SM).<br>

</div><br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">This makes me think the switch is involved, is this correct?</blockquote><div><br>

I doubt it. It is either the end station, the SM, or a combination of the two.<br>

</div><br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">><br>>> While doing so I noticed that the time from calling<br>>> rdma_join_multicast() to the event arrival stayed fairly constant (in

<br>>> the .001sec range), while the time from the join call to actually<br>>> receiving messages on the group steadily increased from around .1 secs<br>>> to around 2.7 secs with 750+ groups.  Furthermore, this time does not

<br>>> drop back to .1 secs if I stop the benchmark and run it (or any of my<br>>> other multicast code) again.  This is understandable within a single<br>>> program run, but the fact that behavior persists across runs concerns

<br>>> me -- feels like a bug, but I don't have much concrete here.<br>><br>> Even after all nodes leave all multicast groups, I don't believe that<br>> there's a requirement for the SA to reprogram the switches immediately.

<br>>  So if the switches or the configuration of the switches are part of the<br>> problem, I can imagine seeing issues between runs.<br>><br>> When rdma_join_multicast() reports the join event, it means either: the

<br>> SA has been notified of the join request, or, if the port has already<br>> joined the group, that a reference count on the group has been<br>> incremented.  The SA may still require time to program the switch

<br>> forwarding tables.<br><br>OK this makes sense, but I still don't see where all the time is going.<br>  Should the fact that the switches haven't been reprogrammed since<br>leaving the groups really affect how long it takes to do a subsequent

<br>join?  I'm not convinced.</blockquote><div><br>

It takes time for the SM to recalculate the multicast tree. While

leaves can be lazy, I forget whether joins are synchronous or not. </div><br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">Is this time being consumed by the switches when they are asked to

<br>reprogram their tables (I assume some sort of routing table is used<br>internally)?</blockquote><div><br>

Reprogramming the switch tables is relatively quick compared to the SM's policy for rerouting multicast in response to joins, leaves, and group creation or deletion.<br>

<br>

</div>-- Hal<br>

<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">  What could they be doing that takes so long to do that? Is it something that a firmware change on the switch could alleviate?

<br><br>Andrew<br></blockquote>

</div><br>