[ofa-general] Re: [PATCH] ib/ipoib: handle Gratuitous ARP & bonding failover race also for connected mode neighbours

Mon Jan 28 04:48:00 PST 2008

On Thu, 17 Jan 2008, Or Gerlitz wrote:
> move a little up the code that checks for a situation where the remote GID stored in the ipoib_neigh is
> different than the one present in the neighbour (handle Gratuitous ARP) or that a bonding fail over has
> happened but the neighbour still has a pointer to an ipoib_neigh created not by the current slave. This
> will cause the driver to apply the check also for connected mode neighbours.

OK, Roland, I am now confident that this patch is needed; see my reasoning below.
Please apply to 2.6.25, later I will also send it to -stable. Here goes:

Basically ipoib-cm is not totally broken wrt bonding AND connected mode --without-- this
patch being applied, but OTOH it does not function as it should. My setup has a client node
xmitting udp unicast to a server node where the server node is bonded (ib0 and ib1 are
enslaved by bond0). I tried three types of fail-over, each of which causes the
bonding at the server node to send a gratuitous ARP; without this patch no action is
taken on it by ipoib at the client side:

A) using "primary slave up" (*)
B) taking an interface down
C) taking a port down

In the "primary slave up" fail-over case, since the non-active slave interface is up and running,
the traffic keeps going through it, so at the client side there is forever a neighbour pointing
to GID X while the traffic goes to (the QP associated with) GID Y.
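
To make the stale state above concrete, here is a minimal user-space sketch of the kind of
check the patch moves up. It is not the driver code: the demo_* names and the integer dev_id
are made up for illustration, and the only layout assumption is that the 20-byte IPoIB
hardware address carries the GID after its first 4 bytes (flags + QPN).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define GID_LEN      16  /* an IB GID is 128 bits */
#define IPOIB_HA_LEN 20  /* 4 bytes (flags + QPN) followed by the 16-byte GID */

/* Hypothetical stand-in for the per-neighbour state cached by the driver. */
struct demo_ipoib_neigh {
    uint8_t dgid[GID_LEN]; /* remote GID recorded when the entry was created */
    int     dev_id;        /* stand-in for the device (slave) that created it */
};

/*
 * Return true when the cached entry no longer matches the neighbour:
 * either the GID inside the hardware address changed (gratuitous ARP),
 * or the entry was created by a device other than the one transmitting
 * now (bonding fail-over).
 */
static bool demo_neigh_is_stale(const struct demo_ipoib_neigh *n,
                                const uint8_t ha[IPOIB_HA_LEN],
                                int xmit_dev_id)
{
    assert(n != NULL && ha != NULL);
    return memcmp(n->dgid, ha + 4, GID_LEN) != 0 || n->dev_id != xmit_dev_id;
}
```

In this model, case A is the situation where such a check would return true on every xmit,
but nothing on the connected mode path ever evaluates it.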

In the interface down fail-over case, the going-down code closes the RX QP. Since the connected
mode (cm) is implemented over RC (...), this causes a send completion with an IB_WC_RETRY_EXC_ERR
error to be generated by the HCA; ipoib_cm_handle_tx_wc calls ipoib_neigh_free, and when the next
xmit is called from the stack, ipoib creates a new ipoib_neigh, this time against the correct GID.

In the port going down case, again the RC implementation causes the retry exceeded error to
take place, and from here it's the same as in the previous case.
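
The difference between case A and cases B/C can be modelled with a small user-space sketch.
Everything here (the demo_* types and functions) is hypothetical and heavily simplified; it
only mirrors the sequence described above: an error completion frees the cached entry, and
the next xmit recreates it from whatever GID the neighbour currently holds.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define GID_LEN 16

/* Simplified completion status; only the two values the sketch needs. */
enum demo_wc_status { DEMO_WC_SUCCESS, DEMO_WC_RETRY_EXC_ERR };

/* Hypothetical cached entry, created against one specific remote GID. */
struct demo_neigh {
    uint8_t dgid[GID_LEN];
};

/* Hypothetical neighbour: the current GID (as updated by ARP) plus the cache. */
struct demo_neighbour {
    uint8_t gid[GID_LEN];
    struct demo_neigh *cached; /* NULL when no entry exists */
};

/* On a failed send completion, drop the cached entry. */
static void demo_handle_tx_wc(struct demo_neighbour *nb, enum demo_wc_status st)
{
    assert(nb != NULL);
    if (st != DEMO_WC_SUCCESS) {
        free(nb->cached);
        nb->cached = NULL;
    }
}

/* On xmit, reuse the cached entry if present, otherwise create one
 * against the neighbour's current GID. No staleness check is done,
 * which is the behaviour without the patch in this model. */
static struct demo_neigh *demo_xmit(struct demo_neighbour *nb)
{
    assert(nb != NULL);
    if (nb->cached == NULL) {
        nb->cached = malloc(sizeof(*nb->cached));
        if (nb->cached == NULL)
            return NULL;
        memcpy(nb->cached->dgid, nb->gid, GID_LEN);
    }
    return nb->cached;
}
```

If the neighbour's GID changes and no error completion ever arrives (case A), demo_xmit keeps
returning the entry built against the old GID; only after demo_handle_tx_wc sees
DEMO_WC_RETRY_EXC_ERR (cases B and C) does the next xmit pick up the new one.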

Other than all the above, gratuitous ARP is used in other HA schemes such as a floating IP address
between I/O targets; since the connected mode ignores it, such a scheme will not work without the patch.

Or

(*) the bonding HA mode enables you to select a primary slave which, once
up, is made the active slave. So to cause this failover, I
take the primary (e.g. ib0) down, and then fail-over happens to the second
slave (e.g. ib1); now I take the primary up and a second fail-over happens.
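
The primary-slave behaviour in the footnote can be sketched as a small selection function.
This is only an illustration of the rule as described above, not the bonding driver's logic;
demo_slave and demo_pick_active are made-up names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical view of one enslaved interface. */
struct demo_slave {
    const char *name;
    bool up;
};

/*
 * Pick the slave that should be active:
 *  - the primary, whenever it is up;
 *  - otherwise the current active slave, if it is still up;
 *  - otherwise the first slave that is up.
 * Returns the slave index, or -1 when no slave is up.
 * Pass current = -1 when there is no active slave yet.
 */
static int demo_pick_active(const struct demo_slave *s, size_t n,
                            size_t primary, int current)
{
    assert(s != NULL && primary < n);
    if (s[primary].up)
        return (int)primary;
    if (current >= 0 && (size_t)current < n && s[current].up)
        return current;
    for (size_t i = 0; i < n; i++) {
        if (s[i].up)
            return (int)i;
    }
    return -1;
}
```

With ib0 as primary, taking ib0 down moves the result to ib1 (first fail-over), and bringing
ib0 back up moves it to ib0 again (second fail-over) even though ib1 never went down.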