[ofa-general] Limited number of multicast groups that can be joined?

Hal Rosenstock hal.rosenstock at gmail.com
Thu Jul 19 10:32:03 PDT 2007


Andrew,

On 7/19/07, Andrew Friedley <afriedle at open-mpi.org> wrote:
>
> Finally was able to have the SM switched over from Cisco on the switch
> to OpenSM on a node.  Responses inline below..
>
> Sean Hefty wrote:
> >> Now the more interesting part.  I'm now able to run on a 128 node
> >> machine using open SM running on a node (before, I was running on an 8
> >> node machine which I'm told is running the Cisco SM on a Topspin
> >> switch).  On this machine, if I run my benchmark with two processes
> >> per node (instead of one, i.e. mpirun -np 16 with 8 nodes), I'm able
> >> to join  > 750 groups simultaneously from one QP on each process.  To
> >> make this stranger, I can join only 4 groups running the same thing on
> >> the 8-node machine.
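
For reference, the join loop being described would look roughly like the
sketch below against librdmacm. This is only an illustration, not Andrew's
actual benchmark: it assumes the rdma_cm_id was created with RDMA_PS_UDP,
its address already resolved, and a UD QP already created with
rdma_create_qp(); the group addresses and counts are made up.

    /* Join many IPoIB multicast groups on one rdma_cm_id and time how
     * long each join takes to report RDMA_CM_EVENT_MULTICAST_JOIN.
     * Assumes id is bound to event channel ch and already has a UD QP. */
    #include <stdio.h>
    #include <string.h>
    #include <time.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <rdma/rdma_cma.h>

    static int join_many_groups(struct rdma_cm_id *id,
                                struct rdma_event_channel *ch,
                                int ngroups)
    {
        for (int i = 0; i < ngroups; i++) {
            struct sockaddr_in mc;
            struct rdma_cm_event *ev;
            struct timespec t0, t1;

            memset(&mc, 0, sizeof(mc));
            mc.sin_family = AF_INET;
            /* one made-up multicast group address per iteration */
            mc.sin_addr.s_addr = htonl(0xE0010100 + i);  /* 224.1.1.0 + i */

            clock_gettime(CLOCK_MONOTONIC, &t0);
            if (rdma_join_multicast(id, (struct sockaddr *)&mc, NULL))
                return -1;

            /* the join is reported via RDMA_CM_EVENT_MULTICAST_JOIN */
            if (rdma_get_cm_event(ch, &ev))
                return -1;
            if (ev->event != RDMA_CM_EVENT_MULTICAST_JOIN) {
                fprintf(stderr, "join %d failed: %s\n", i,
                        rdma_event_str(ev->event));
                rdma_ack_cm_event(ev);
                return -1;
            }
            rdma_ack_cm_event(ev);

            clock_gettime(CLOCK_MONOTONIC, &t1);
            printf("group %d: join event after %.6f s\n", i,
                   (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
        }
        return 0;
    }

Note that this only times join request to join event; it says nothing about
when traffic actually starts flowing on the group, which is the other number
discussed below.
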
> >
> > Are the switches and HCAs in the two setups the same?  If you run the
> > same SM on both clusters, do you see the same results?
>
> The switches are different.  The 8 node machine uses a Topspin switch,
> the 128 node machine uses a Mellanox switch.  Looking at `ibstat` the
> HCAs appear to be the same (MT23108), though HCAs on the 128 node
> machine have firmware 3.2.0, whereas the 8 node machine has 3.5.0.  Does
> this matter?
>
> Running OpenSM now, I still do not see the same results.  Behavior is
> now the same as the 128 node machine, except when running two processes
> per node (in which case I can join as many groups as I like on the 128
> node machine).  On the 8 node machine I am still limited to 4 groups in
> this case.


I'm not quite parsing which results are the same and which are different
(and I presume the only variable is the SM).

> This makes me think the switch is involved, is this correct?


I doubt it. It is either the end station, the SM, or a combination of the two.

>
> >> While doing so I noticed that the time from calling
> >> rdma_join_multicast() to the event arrival stayed fairly constant (in
> >> the .001sec range), while the time from the join call to actually
> >> receiving messages on the group steadily increased from around .1 secs
> >> to around 2.7 secs with 750+ groups.  Furthermore, this time does not
> >> drop back to .1 secs if I stop the benchmark and run it (or any of my
> >> other multicast code) again.  This is understandable within a single
> >> program run, but the fact that behavior persists across runs concerns
> >> me -- feels like a bug, but I don't have much concrete here.
> >
> > Even after all nodes leave all multicast groups, I don't believe that
> > there's a requirement for the SA to reprogram the switches immediately.
> >  So if the switches or the configuration of the switches are part of the
> > problem, I can imagine seeing issues between runs.
> >
> > When rdma_join_multicast() reports the join event, it means either: the
> > SA has been notified of the join request, or, if the port has already
> > joined the group, that a reference count on the group has been
> > incremented.  The SA may still require time to program the switch
> > forwarding tables.
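
To actually see the gap Sean describes, the time until the first datagram
arrives on the group has to be measured separately from the join event. A
rough sketch of that second measurement, assuming a receive buffer was
already posted with ibv_post_recv() on the UD QP bound to the joined
rdma_cm_id, and that cq (an illustrative name) is that QP's receive
completion queue:

    /* After RDMA_CM_EVENT_MULTICAST_JOIN has been acked, busy-poll the
     * receive CQ until the first datagram really shows up; the SA/SM
     * may still be programming switch forwarding tables in this window. */
    #include <stdio.h>
    #include <time.h>
    #include <infiniband/verbs.h>

    static double time_to_first_packet(struct ibv_cq *cq)
    {
        struct ibv_wc wc;
        struct timespec t0, t1;
        int n;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        do {
            n = ibv_poll_cq(cq, 1, &wc);
        } while (n == 0);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        if (n < 0 || wc.status != IBV_WC_SUCCESS) {
            fprintf(stderr, "receive completion failed\n");
            return -1.0;
        }
        return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    }

That number, added to the join-event latency, is roughly the .1 to 2.7
second figure quoted above.
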
>
> OK this makes sense, but I still don't see where all the time is going.
>   Should the fact that the switches haven't been reprogrammed since
> leaving the groups really affect how long it takes to do a subsequent
> join?  I'm not convinced.


It takes time for the SM to recalculate the multicast tree. While leaves can
be lazy, I forget whether joins are synchronous or not.

> Is this time being consumed by the switches when they are asked to
> reprogram their tables (I assume some sort of routing table is used
> internally)?


Reprogramming the tables themselves is relatively quick; the larger cost is
the SM's policy-driven rerouting of multicast in response to
joins/leaves/group creation/deletion.

-- Hal

> What could they be doing that takes so long?
> Is it something that a firmware change on the switch could alleviate?
>
> Andrew

