[ofw] RE: NetworkDirect over WinVerbs

Fab Tillier ftillier at windows.microsoft.com
Mon Feb 9 20:24:01 PST 2009


>  I did not implement MWs, but that could be added.  Are the MW
> interfaces sufficient?

The MW support in the current ND provider for IB doesn't use memory windows.  When a Bind request comes in, it performs a zero-byte RDMA write, and returns the MR's RKey in the ND_MW_DESCRIPTOR.  This means that all registrations have remote access enabled.
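Roughly like this - struct layout and byte order are from memory, so treat the field names and the swapping as approximations of what's in ndspi.h:

    #include <windows.h>
    #include <stdlib.h>                   // _byteswap_uint64

    // Approximation of ND_MW_DESCRIPTOR - check the real header.
    // Values travel in network byte order.
    struct MwDescriptor {
        UINT64 Base;                      // VA of the registered buffer
        UINT64 Length;                    // length of the buffer
        UINT32 Token;                     // on IB: the MR's RKey
    };

    HRESULT PostZeroByteRdmaWrite() { return S_OK; }  // stand-in stub

    // Emulated Bind: no MW object is created.  The MR's RKey becomes
    // the descriptor's token, and a zero-byte RDMA write is posted so
    // the Bind completes through the normal work-queue path.
    HRESULT EmulatedBind(UINT32 rkey, const void* buf, SIZE_T len,
                         MwDescriptor* desc)
    {
        desc->Base   = _byteswap_uint64((UINT64)(ULONG_PTR)buf);
        desc->Length = _byteswap_uint64(len);
        desc->Token  = rkey;              // byte order per the SPI
        return PostZeroByteRdmaWrite();
    }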

> Does ND ever use the Rkey from memory registration?  I thought we
> discussed the possibility of having Winverbs only return the Lkey if the
> memory registration access rights did not include remote access.  (The
> winverbs implementation doesn't do this, but it wouldn't be hard to
> change.)

MSMPI doesn't use the RKey, but it does use the output of the Bind call, an ND_MW_DESCRIPTOR, which it expects to be able to send to the other side and have the other side use to perform the appropriate operation (Read or Write).  This lets provider implementations vary with respect to memory windows being zero-based or not, etc.  For IB, the RKey goes into the MW descriptor since the HCA driver doesn't support MWs.
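From the peer's point of view the descriptor is opaque addressing information.  Something like this (stand-in names again, reusing the MwDescriptor sketch above):

    // Stand-in for the real RDMA write post.
    HRESULT PostRdmaWrite(UINT64 va, UINT32 key,
                          const void* src, SIZE_T len);

    // The peer plugs the received descriptor straight into its RDMA
    // op.  It never needs to know whether the provider's windows are
    // zero-based (Base == 0, offsets relative) or VA-based like the
    // IB provider's.
    HRESULT WriteThroughDescriptor(const MwDescriptor& desc,
                                   UINT64 offset,
                                   const void* src, SIZE_T len)
    {
        UINT64 remoteVa  = _byteswap_uint64(desc.Base) + offset;
        UINT32 remoteKey = desc.Token;
        return PostRdmaWrite(remoteVa, remoteKey, src, len);
    }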

>> - that disconnection mappings are a bit funny.
>> IWVConnectEndpoint::NotifyDisconnect will return in error when the
>> endpoint gets disconnected (STATUS_CONNECTION_DISCONNECTED), or when
>> the DREQ times out (STATUS_TIMEOUT).  It seems a bit unnatural that a
>> user calling Disconnect would cause a NotifyDisconnect request to
>> time out.
>
> The error code can change, but the time out of the DREQ does indicate
> that a disconnect message was not received from the remote side.

TIMEOUT to me means "try again, you might have better luck."  I think in this case a DREQ timeout means the other side is toast, and is really more like a successful disconnect.
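If the codes stay as they are, a shim on top of WinVerbs ends up translating roughly like this (a sketch only - the NTSTATUS plumbing is approximate):

    #define WIN32_NO_STATUS
    #include <windows.h>
    #undef WIN32_NO_STATUS
    #include <ntstatus.h>                 // STATUS_* values
    #include <winternl.h>                 // NTSTATUS typedef

    // Fold both "peer sent a DREQ" and "our DREQ went unanswered"
    // into a completed disconnect.  A dead peer is not a retryable
    // condition, so STATUS_TIMEOUT should not surface as try-again.
    HRESULT TranslateNotifyDisconnect(NTSTATUS wvStatus)
    {
        switch (wvStatus) {
        case STATUS_CONNECTION_DISCONNECTED:  // orderly DREQ from peer
        case STATUS_TIMEOUT:                  // DREQ timed out: peer gone
            return S_OK;                      // done either way
        default:
            return HRESULT_FROM_NT(wvStatus);
        }
    }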

>> The INDConnector::Disconnect call looks like it should map first to a
>> call to IWVConnectEndpoint::NotifyDisconnect, followed by a call to
>> IWVConnectEndpoint::Disconnect.
>
>  You may want to map INDConnector::NotifyDisconnect() to
> IWVConnectEndpoint::NotifyDisconnect().  The INDConnector::Disconnect()
> call wants to map to EP::Disconnect() and QP::Modify().  I.e., the
> ND disconnect overlapped operation is the result of asynchronously
> wanting to modify the QP to error, not the actual disconnect.

MSMPI doesn't use NotifyDisconnect because it has internal handshaking.  That said, the overlappedness of the INDConnector::Disconnect call is not just for the QP state change - the state change is delayed until the disconnection process is complete.
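Roughly (overlapped plumbing elided, and the helpers are stand-ins for the real WinVerbs calls):

    // Stand-ins for the real calls:
    void StartCmDisconnect();        // wraps EP::Disconnect(): sends DREQ
    void WaitForCmDisconnectDone();  // completes on DREP or DREQ timeout
    HRESULT ModifyQpToError();       // wraps QP::Modify() to error

    // The QP is not moved to error when the DREQ goes out, only once
    // the CM-level disconnect has finished - that is what the
    // overlapped INDConnector::Disconnect completion is waiting on.
    HRESULT NdDisconnect()
    {
        StartCmDisconnect();
        WaitForCmDisconnectDone();
        return ModifyQpToError();    // pending requests now flush safely
    }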

>> INDConnector::Disconnect is expected to flush all pending requests from
>> the associated INDEndpoint when Disconnect completes (QP transition to
>> ERROR).  The transition to error can't happen when the DREQ is sent,
>> only when it either times out or a DREP is received.
>  I thought transitioning to error was okay when sending the DREQ, just
> not when receiving it.  The user should ensure that all previously
> posted sends have completed before calling disconnect.

In the case where you have both sides handshaking to disconnect, you *will* run into issues where one side receives the last message, processes it, and moves the QP to error before the sender's HW has received the ACK from the receiver's HW.  When you move the QP to error, the HW stops generating ACKs/retries/etc. for that QP, leaving the sender to time out (which it eventually does), at which point the send completes in error (retry exceeded).

Not at all what an app would expect, but that's what the HW does.

So the QP transition to error needs to happen after the IB-level disconnection happens.  Sure, you could add an arbitrary delay and hope things work, but the disconnect handshake at the HW level lets things tear down politely: the client on each side can be expected not to call Disconnect until all of its sends have completed locally, which naturally delays the DREP (and thus the QP transition at the sender) until it is safe.
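From the app's side the polite sequence looks roughly like this (stand-in names, not actual ND calls):

    // Stand-ins for the real operations:
    void DrainLocalSendCompletions();   // reap our own send completions
    void StartDisconnect();             // kicks off INDConnector::Disconnect
    void WaitForDisconnectCompletion(); // its overlapped completion
    void ReapFlushedRequests();         // leftover receives complete flushed

    // The symmetric rule is what makes this safe: neither side's QP
    // goes to error while the other side's HW still needs ACKs.
    void PoliteTeardown()
    {
        DrainLocalSendCompletions();    // all local sends done first
        StartDisconnect();              // only now does the DREQ go out;
                                        // the peer, following the same
                                        // rule, delays its DREP likewise
        WaitForDisconnectCompletion();  // QP transitions to error here
        ReapFlushedRequests();
    }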

-Fab


