[Openib-windows] LID change event

Sun Jul 9 00:47:45 PDT 2006

> -----Original Message-----
> From: ftillier.sst at gmail.com [mailto:ftillier.sst at gmail.com] 
> On Behalf Of Fabian Tillier
> Sent: Friday, July 07, 2006 8:19 PM
> To: Yossi Leybovich
> Cc: openib-windows at openib.org
> Subject: Re: [Openib-windows] LID change event
> 
> Hi Yossi,
> 
> On 7/7/06, Yossi Leybovich <sleybo at mellanox.co.il> wrote:
> >
> >
> > > -----Original Message-----
> > > From: ftillier.sst at gmail.com [mailto:ftillier.sst at gmail.com] On 
> > > Behalf Of Fabian Tillier
> > > Sent: Thursday, July 06, 2006 9:57 PM
> > >
> > > Hi Yossi,
> > >
> > > On 7/4/06, Yossi Leybovich <sleybo at mellanox.co.il> wrote:
> > > > Hi
> > > >
> > > > We found more cases that IPoIB discover duplicate LID in its
> > > > endptlist (even after we clean the LID list in ipoib_reset_all)
> > > > This can be cause from old packets in the network (recv packets
> > > > create p_src endpnt if it does not exist and the packet can carry
> > > > the old LID) I think that this patch reduce the possibility of
> > > > getting duplicate entries in the LID.
> > > > It insert to the LIDs list only when the path record query is back
> > > > (with the av).
> > >
> > > Not inserting into the LID map until the AV is created means that we
> > > won't ever report unicast packets until we've tried to send to that
> > > node.  I don't know how big of an issue this is, since most
> > > communication start with an ARP exchange.
> > >
> > > However, there are cases where discarding unicast traffic like this
> > > is the wrong thing to do.  Think of two systems, A and B.  B
> > > resolves A's IP address via ARP (A responded, so all is well).  A
> > > now loses its link, but B doesn't - this flushes all of A's endpoint
> > > entries since the port went down - all endpoints lose their LID
> > > assignment.  B now tries to send unicast packets to A - it doesn't
> > > need to ARP again since it just did.  The packets, when received by
> > > A, fail any lookup by LID, and are discarded.
> > >
> >
> > Isn't this what will happen if the SM will change A LID.
> > If A LID is changed by the SM after the link is up(I am not really 
> > sure that the SM allowed to do that ), if B will try to send to the 
> > old LID the packets will still be discarded.
> 
> This may happen when the SM changes the LID, but we don't 
> want it to happen when the LID does not change and we had a 
> port go down due to a cabling change.  I don't know if it's 
> valid for the SM to change the LID while the port is in the 
> ACTIVE state.
> 

We also clear the AVs when the port moves to down, even though nothing
in the network has changed.
One solution is to move the endpnt_mgr_reset_all call out of
ipoib_port_down and call it when we get an SM/LID_CHANGE event instead.
Even if we clear the LID map, we will discard only receive traffic
without a GRH (Linux); the remote node will time out and issue an ARP,
and all is back to normal.
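
A rough sketch of the idea, as a toy model rather than the actual driver code. The types, the EV_* event names, on_port_event and toy_reset_all are all made up for illustration, and the reset here is assumed to clear only LIDs and AVs while keeping the endpoints:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ENDPTS 8

/* Hypothetical, simplified endpoint; the real driver keeps more state. */
typedef struct {
    uint16_t lid;     /* 0 means "LID unknown" */
    int      has_av;  /* nonzero if an address vector exists */
    int      in_use;
} endpt_t;

typedef struct { endpt_t endpts[MAX_ENDPTS]; } endpt_mgr_t;

/* Hypothetical event names, not the driver's real PnP event codes. */
typedef enum {
    EV_PORT_DOWN, EV_PORT_ACTIVE, EV_SM_CHANGE, EV_LID_CHANGE
} port_event_t;

/* Assumed behaviour: forget every LID and AV, keep the endpoints. */
static void toy_reset_all(endpt_mgr_t *mgr)
{
    assert(mgr != NULL);
    for (size_t i = 0; i < MAX_ENDPTS; i++) {
        mgr->endpts[i].lid = 0;
        mgr->endpts[i].has_av = 0;
    }
}

static void on_port_event(endpt_mgr_t *mgr, port_event_t ev)
{
    switch (ev) {
    case EV_SM_CHANGE:
    case EV_LID_CHANGE:
        /* LIDs may have been reassigned: cached LIDs and AVs are stale. */
        toy_reset_all(mgr);
        break;
    case EV_PORT_DOWN:
    case EV_PORT_ACTIVE:
        /* Link flap only: nothing in the network changed, keep state. */
        break;
    }
}
```

With this split, a cabling change that takes the port down and up again leaves the LID map alone, and only an SM or LID change flushes it.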

> > > > More over same as we create endpt entry in recv_arp (with LID 0
> > > > because source LID may not be the original initiator) we should do
> > > > that  in recv_get_endpt function as well and wait to the LID from
> > > > the path record query.
> > >
> > > Looking at it, I think recv_arp is wrong, and should include the LID.
> > > Otherwise further unicast traffic will be discarded.
> > >
> > > > I also add assert to check for duplication in the path_record_cb
> > > >
> > > > Another option is:
> > > > To check in each insertion to the LIDs list if the LID already
> > > > exist in the list , if yes remove the entry from the LIDs list and
> > > > zero the LID field of the endpt struct.
> > >
> > > I think the right thing to do is to remove the old entry, and
> > > replace it with the new anytime the LID changes.  We can't require
> > > every packet to include the GRH, as the IPoIB draft states that
> > > implementations must handle receiving packets without a GRH.
> >
> > This will also solve the cleanup I made when the SM changed(first part
> > of the patch)
> >
> > > I have to think about this a little more - I don't know what to do
> > > with the "old" endpoint if a new one is being inserted with a
> > > duplicate LID.  Do we just set its LID to zero, or do we remove it
> > > all together?
> >
> > I think we should set the LID to 0 clear its av (if exist) and remove
> > it from the LIDs list.
> > We should keep it in the MAC/GID list so that new sends to that
> > destination will issue pr query to resolve the LID and send the packet.
> 
> This will introduce the possibility of duplicates being found 
> when inserting into the MAC and GID maps.  Say a node X has 
> its LID changed, so it clears all the endpoint LIDs and 
> removes them from the LID map.  If that node now receives a 
> packet from some other node, it will try to create an 
> endpoint for that node, and could fail inserting into the MAC 
> and GID maps because that endpoint already exists (even if 
> the sender did not change LIDs).  I think we need to keep the 
> endpoint around, but we need to trap duplicate insertions 
> into all the maps now when a packet is received.
> 
I don't understand.
We remove the endpnt only from the LID map; we leave it in the
MAC/GID map.

In case we try to send to a remote node, IPoIB will find it in the
MAC/GID map; there is no AV, so IPoIB will issue a query that will
fill in both the AV and the LID.

In case we receive a packet from another node:
1: Packets with a GRH: we have its MAC/GID in the endpnt map, so IPoIB
will find it and be able to update the endpnt LID from the completion
entry. Nothing will be created or inserted into the MAC/GID map, and
nothing will be discarded.
2: Packets without a GRH: IPoIB will not find it in the LID map and will
discard it (it will not try to create an endpnt).
The remote node will time out, issue an ARP again, and all is back
to normal.
IPoIB is allowed to discard received packets; this packet might have
been destined for another node that used this LID before the change.
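
The send and receive handling described above could look roughly like this. It is only a toy model under my own assumptions, not the driver code: all names are hypothetical, the MAC map is left out (the GID stands in for it), and a GRH packet from an unknown GID is assumed to create a new endpoint. When a LID is assigned, any old owner loses its LID and AV but stays findable by GID:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_ENDPTS 8
#define GID_LEN 16

/* Hypothetical endpoint record; lid == 0 means "not in the LID map". */
typedef struct {
    uint8_t  gid[GID_LEN];
    uint16_t lid;
    int      has_av;
    int      in_use;
} endpt_t;

typedef struct { endpt_t endpts[MAX_ENDPTS]; } endpt_mgr_t;
typedef enum { RX_DELIVER, RX_DISCARD } rx_result_t;
typedef enum { TX_SEND, TX_QUERY_PATH } tx_result_t;

static endpt_t *find_by_gid(endpt_mgr_t *mgr, const uint8_t *gid)
{
    for (size_t i = 0; i < MAX_ENDPTS; i++) {
        endpt_t *ep = &mgr->endpts[i];
        if (ep->in_use && memcmp(ep->gid, gid, GID_LEN) == 0)
            return ep;
    }
    return NULL;
}

static endpt_t *find_by_lid(endpt_mgr_t *mgr, uint16_t lid)
{
    if (lid == 0)
        return NULL;                      /* LID 0 is never in the map */
    for (size_t i = 0; i < MAX_ENDPTS; i++) {
        endpt_t *ep = &mgr->endpts[i];
        if (ep->in_use && ep->lid == lid)
            return ep;
    }
    return NULL;
}

/* Give a nonzero lid to ep, evicting any previous owner of that LID. */
static void assign_lid(endpt_mgr_t *mgr, endpt_t *ep, uint16_t lid)
{
    endpt_t *old = find_by_lid(mgr, lid);
    if (old != NULL && old != ep) {
        old->lid = 0;                     /* out of the LID map ...       */
        old->has_av = 0;                  /* ... and its AV is now stale  */
    }
    ep->lid = lid;
    assert(find_by_lid(mgr, lid) == ep);  /* no duplicate LID remains */
}

/* src_gid is NULL for a packet without GRH; src_lid comes from the
 * completion entry. */
static rx_result_t recv_lookup(endpt_mgr_t *mgr, const uint8_t *src_gid,
                               uint16_t src_lid)
{
    if (src_gid == NULL) {
        /* Case 2: the LID is the only key; never create an endpoint. */
        return find_by_lid(mgr, src_lid) != NULL ? RX_DELIVER : RX_DISCARD;
    }
    /* Case 1: look up by GID and refresh the LID. */
    endpt_t *ep = find_by_gid(mgr, src_gid);
    for (size_t i = 0; ep == NULL && i < MAX_ENDPTS; i++) {
        if (!mgr->endpts[i].in_use) {     /* unknown sender: create it */
            ep = &mgr->endpts[i];
            memset(ep, 0, sizeof(*ep));
            memcpy(ep->gid, src_gid, GID_LEN);
            ep->in_use = 1;
        }
    }
    if (ep == NULL)
        return RX_DISCARD;                /* table full */
    if (src_lid != 0)
        assign_lid(mgr, ep, src_lid);
    return RX_DELIVER;
}

/* Send side: without an AV we must issue a path record query first. */
static tx_result_t send_lookup(const endpt_t *ep)
{
    return ep->has_av ? TX_SEND : TX_QUERY_PATH;
}

/* Path record query completion: fills in both the LID and the AV. */
static void path_query_done(endpt_mgr_t *mgr, endpt_t *ep, uint16_t lid)
{
    assign_lid(mgr, ep, lid);
    ep->has_av = 1;
}
```

The point of the sketch is that the endpoint is never dropped from the GID side, so a later send finds it, sees no AV, and goes through the path record query again.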

> > Any way we need to come up with something because running over windows
> > CCP 8 nodes cluster get us to scenarios when LIDs changed and that
> > hang IPoIB.
> 
> I agree this needs to be fixed, though realistically the SM 
> should not be going up and down and reassigning LIDs in an 
> 8-node cluster unless the SM is misconfigured.
> 

The Windows CCP is not stable; we lose nodes from time to time, and this
forces us to move the SM from node to node.
That causes the LIDs to be reconfigured from time to time.
Without the fix we get a blue screen, which is unacceptable.

> It's just a matter of finding the proper way to handle it at 
> this point.

The critical part for us is the part that clears the LID map after an
SM_CHANGE event. Can we agree to apply it?
The parts that handle duplicate LIDs caused by old packets on the
network, or the new assert in the path query, can wait for a full solution.

> 
> - Fab
> 
> 
>