[openib-general] SDP RDMA support for synchronous socket operations

Michael S. Tsirkin mst at mellanox.co.il
Wed Aug 3 06:55:46 PDT 2005


Libor, all,
Tomorrow, I plan to start working on SDP zcopy support for synchronous
send/recv socket operations. Both kernel-level and user-level initiators
should continue to be supported. I don't plan to work on sendfile
support yet.

I hope to finish the implementation and some basic testing within the next
two weeks. The development will be done on a branch, to be opened tomorrow.
My plan is to merge updates from trunk to stay in sync as much as possible.

What follows is a raw design draft. Comments are welcome.

MST

------------------------------------------------------------------------

What needs to be done to support zcopy for synchronous send/recv socket
operations.

Draft rev 2

Currently only AIO is supported for zcopy.
We reuse the zcopy infrastructure for send/recv socket operations.

Main differences between AIO and send/recv operations:

- send/recv have more flags: MSG_DONTWAIT, MSG_OOB, MSG_WAITALL, MSG_PEEK.
- after a send/recv call returns, it's illegal for the HCA to touch the
  application's buffer.
- send/recv must support data sizes too big to be transferred in
  one InfiniBand operation (SOCK_STREAM applications don't seem
  to expect to get EMSGSIZE).
- send/recv have the ability to block until an operation completes.
  This has to be implemented by SDP.
- with send/recv, an operation is revoked by a signal, unlike AIO, which
  is canceled explicitly by the application.
- typically, there is only one outstanding send/recv operation on a
  specific socket.
- send/recv must be supported for kernel and user-space consumers.
  The current AIO code seems to support only user-level consumers.


Design draft covers: send side, receive side, send/send deadlock prevention.

----------------------- 

Send side:

Operation:

- Attempt zcopy if the message is bigger than send bcopy threshold

- If the operation is too big to fit in a single FMR,
  split it into multiple buffers (iocbs) and queue them for processing.
  Q: limit the number of FMRs used by a single socket?

  Block till the last iocb completes.

- If no FMRs are available
  Force bcopy transfer

- On signal, locate and cancel all queued iocbs.
  This may need to block, in which case we block in
  uninterruptible state (with a timeout)

  If iocbs can't be canceled within a predefined time,
  treat this as a transport error, trigger an abortive close
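
The splitting step above could be sketched roughly as follows. This is a
userspace illustration only, not SDP code: SDP_FMR_MAX_BYTES, struct
sdp_chunk and sdp_split_request() are hypothetical names, and the 128 KiB
limit is an assumed value (the real limit depends on the FMR pool setup).
Each chunk stands in for one queued iocb.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed per-FMR mapping limit, for illustration only. */
#define SDP_FMR_MAX_BYTES (128u * 1024u)

/* One piece of the user buffer; stands in for one queued iocb. */
struct sdp_chunk {
    size_t offset;  /* offset into the user buffer */
    size_t len;     /* bytes covered by this chunk */
};

/*
 * Split a request of `total` bytes into chunks of at most
 * SDP_FMR_MAX_BYTES. At most `max_chunks` entries are written to `out`.
 * Returns the number of chunks the request needs, so the caller can
 * tell when `out` was too small (return value > max_chunks).
 */
static size_t sdp_split_request(size_t total, struct sdp_chunk *out,
                                size_t max_chunks)
{
    size_t n = 0;
    size_t off = 0;

    assert(out != NULL || max_chunks == 0);

    while (off < total) {
        size_t len = total - off;

        if (len > SDP_FMR_MAX_BYTES)
            len = SDP_FMR_MAX_BYTES;
        if (n < max_chunks) {
            out[n].offset = off;
            out[n].len = len;
        }
        off += len;
        n++;
    }
    return n;
}
```

Returning the required count rather than failing leaves room for the open
question above: a per-socket FMR limit would just be a small max_chunks,
with the remainder queued later or sent by bcopy.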


Options/Socket flags:

- MSG_OOB out of band data
  Force bcopy transfer
  Q: What to do if src avail are outstanding?
  A: 

- MSG_DONTWAIT/O_NONBLOCK non-blocking operation
  Force bcopy transfer
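
Putting the send-side rules together, the zcopy/bcopy choice might look
like the sketch below. All names here (struct sdp_send_state,
sdp_send_mode(), the enum) are made up for illustration, and the sketch
does not answer the open question about MSG_OOB with SrcAvail outstanding.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/socket.h>   /* MSG_OOB, MSG_DONTWAIT */

enum sdp_xfer_mode { SDP_XFER_BCOPY, SDP_XFER_ZCOPY };

/* Hypothetical per-socket state relevant to the decision. */
struct sdp_send_state {
    bool   nonblocking;        /* O_NONBLOCK set on the socket */
    size_t send_bcopy_thresh;  /* zcopy only above this size */
    size_t free_fmrs;          /* FMRs currently available */
};

/* Pick the transfer mechanism for one send call. */
static enum sdp_xfer_mode sdp_send_mode(const struct sdp_send_state *sk,
                                        size_t len, int msg_flags)
{
    assert(sk != NULL);

    /* OOB data and non-blocking operation always force bcopy. */
    if (msg_flags & (MSG_OOB | MSG_DONTWAIT))
        return SDP_XFER_BCOPY;
    if (sk->nonblocking)
        return SDP_XFER_BCOPY;

    /* No FMRs available: nothing to register the buffer with. */
    if (sk->free_fmrs == 0)
        return SDP_XFER_BCOPY;

    /* Attempt zcopy only if the message is bigger than the threshold. */
    if (len <= sk->send_bcopy_thresh)
        return SDP_XFER_BCOPY;

    return SDP_XFER_ZCOPY;
}
```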

----------------------- 

Receive side:


Operation:

- Attempt zcopy if the message is bigger than rcv bcopy threshold

- If the operation is too big to fit in a single FMR,
  split it into multiple buffers (iocbs) and queue them for processing.

  Block till the last iocb completes.

- With MSG_WAITALL:
  Don't post sink available.

- Without MSG_WAITALL:
  Post exactly one sink available at a time.
  Registration can still be pipelined with RDMA.
  Note that recv without MSG_WAITALL may return a shorter message
  than what was sent. This is OK:
  "For stream-based sockets, such as SOCK_STREAM, message boundaries shall
   be ignored. In this case, data shall be returned to the user as soon as it
   becomes available, and no data shall be discarded."

- If no FMRs are available
  Force bcopy transfer

- On signal, locate and cancel all queued iocbs
  This may need to block, in which case we block in
  uninterruptible state (with a timeout)

  If iocbs can't be canceled within a predefined time,
  treat this as a transport error, trigger an abortive close
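
The cancel-on-signal step (same on the send and receive side) could be
sketched as below. This is a userspace model under loud assumptions: the
timeout is modeled as a bounded number of polls, and sdp_fake_pending() is
a stand-in for the hardware completing cancellations. In the kernel this
would presumably be a timed uninterruptible wait rather than a loop. All
names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum sdp_cancel_result {
    SDP_CANCEL_OK,     /* all iocbs canceled in time */
    SDP_CANCEL_ABORT   /* timed out: transport error, abortive close */
};

/*
 * Wait until pending(ctx) reports no iocbs left, giving up after
 * max_polls attempts. The poll bound stands in for the predefined
 * timeout from the draft.
 */
static enum sdp_cancel_result
sdp_cancel_wait(unsigned (*pending)(void *), void *ctx, unsigned max_polls)
{
    unsigned i;

    assert(pending != NULL);

    for (i = 0; i < max_polls; i++) {
        if (pending(ctx) == 0)
            return SDP_CANCEL_OK;
    }
    return SDP_CANCEL_ABORT;
}

/*
 * Stand-in for the HCA: each poll completes one outstanding
 * cancellation and returns how many iocbs are still pending.
 */
static unsigned sdp_fake_pending(void *ctx)
{
    unsigned *left = ctx;

    if (*left > 0)
        (*left)--;
    return *left;
}
```

The caller would map SDP_CANCEL_ABORT to the abortive close, and
SDP_CANCEL_OK to returning to the application with whatever byte count
was transferred before the signal.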


Options/Socket flags:

- MSG_OOB out of band data
  Handle as it arrives
  Q: Should we force bcopy transfer (SendSm)?
  A: 

- MSG_DONTWAIT/O_NONBLOCK non-blocking operation
  Force bcopy transfer

- MSG_PEEK peek data in buffer
  Force bcopy transfer
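
The receive-side rules, taken together, might reduce to something like
this sketch. Names (struct sdp_recv_state, sdp_plan_recv(), etc.) are
hypothetical. MSG_OOB is deliberately left out since that question is
still open, and "may post sink available" means at most one outstanding
at a time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/socket.h>   /* MSG_DONTWAIT, MSG_PEEK, MSG_WAITALL */

enum sdp_rx_mode { SDP_RX_BCOPY, SDP_RX_ZCOPY };

/* Hypothetical per-socket state relevant to the decision. */
struct sdp_recv_state {
    bool   nonblocking;       /* O_NONBLOCK set on the socket */
    size_t rcv_bcopy_thresh;  /* zcopy only above this size */
    size_t free_fmrs;         /* FMRs currently available */
};

struct sdp_recv_plan {
    enum sdp_rx_mode mode;
    bool may_post_sink_avail; /* allowed to post one SinkAvail */
};

/* Decide how to service one recv call. */
static struct sdp_recv_plan sdp_plan_recv(const struct sdp_recv_state *sk,
                                          size_t len, int msg_flags)
{
    struct sdp_recv_plan plan = { SDP_RX_BCOPY, false };

    assert(sk != NULL);

    /* Non-blocking operation and MSG_PEEK force bcopy. */
    if ((msg_flags & (MSG_DONTWAIT | MSG_PEEK)) || sk->nonblocking)
        return plan;
    /* No FMRs, or buffer not bigger than the threshold: bcopy. */
    if (sk->free_fmrs == 0 || len <= sk->rcv_bcopy_thresh)
        return plan;

    plan.mode = SDP_RX_ZCOPY;
    /* With MSG_WAITALL we never post sink available. */
    plan.may_post_sink_avail = !(msg_flags & MSG_WAITALL);
    return plan;
}
```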

----------------------- 
Send/Send deadlock prevention (quoting the SDP spec):

Receive side detects deadlock if:
. A SrcAvail is received; and
. No ULP receive buffer is posted; and
. The local Data Source has a SrcAvail outstanding.

There are several ways to resolve this deadlock.
Resolve in the following way:
. The Data Sink could send a SendSm message to force the use of
  the Bcopy data transfer mechanism.
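
As a sketch, the detection rule quoted above is a three-way conjunction
evaluated when a SrcAvail arrives. The struct and function names below are
hypothetical and only model the decision, not the message handling.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum sdp_action {
    SDP_ACT_NONE,        /* no deadlock: handle SrcAvail normally */
    SDP_ACT_SEND_SENDSM  /* deadlock: send SendSm to force bcopy */
};

/* Hypothetical connection state sampled when a SrcAvail is received. */
struct sdp_conn_state {
    bool recv_buf_posted;             /* a ULP receive buffer is posted */
    bool local_src_avail_outstanding; /* our own SrcAvail is in flight */
};

/*
 * Called on receipt of a SrcAvail. Deadlock is detected if no ULP
 * receive buffer is posted and the local Data Source has a SrcAvail
 * outstanding; the Data Sink then answers with SendSm.
 */
static enum sdp_action sdp_on_src_avail(const struct sdp_conn_state *c)
{
    assert(c != NULL);

    if (!c->recv_buf_posted && c->local_src_avail_outstanding)
        return SDP_ACT_SEND_SENDSM;
    return SDP_ACT_NONE;
}
```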

-- 
MST


