[ofa-general] general questions about librdmacm
Scott M. Ferris
sferris at acm.org
Fri Sep 26 08:54:08 PDT 2008
On Wed, Sep 24, 2008 at 03:43:56PM -0700, Steven Dake wrote:
>
> Totem is a reliable virtual synchrony multicast protocol which transmits
> a message from any node to all nodes in a collection of computers
> (called the configuration or membership). It has a few requirements:
>
> unreliable datagram multicast
> unreliable datagram unicast
> ability to bind to a specific port and interface
> ability to poll() (POLLIN) via system call for new multicast datagram
> messages
It's been a few years since I read the totem paper, but if I recall
correctly, totem also has certain ordering requirements about unicast
and multicast, which the paper asserts are true for Ethernet.
It's not clear to me that InfiniBand will provide the same guarantees
in all cases. If unicasts and multicasts are sent on different queue
pairs, I'm not sure any ordering is guaranteed. I can also imagine
the IB virtual lane (VL) feature potentially reordering delivery if
messages end up in different VLs. I'd recommend talking to someone
more knowledgeable in InfiniBand than I am to check what you need to do
to meet the totem ordering requirement.
> Few questions:
> 1) I would like to continue to use IP addressing but it looks like I
> have to use a different addressing model in librdmacm. I looked at the
> examples in the library and it isn't clear to me whether they use IP
> addressing or some other addressing model. I see references to IPoverIB
> but I don't see any information in the wiki on the topic. Anyone have
> links to documentation on the topic of node addressing?
IPoIB is fairly transparent to you. The OFED software provides Linux
netdevices (e.g. ib0, ib1) which pass IP traffic just like an Ethernet
netdevice would. Normal IP routing controls what interface gets used.
For the most part IP-based software just works. Low-level things like
DHCP daemons will notice differences, since the hardware addresses are
larger (20 bytes for IPoIB, versus 6 for Ethernet).
If IPoIB connected mode is being used, unicasts and multicasts will be
sent on different queue pairs, since the unicasts will use a
connection, and the multicasts can't. You may need to disable IPoIB
connected mode in order to get the ordering guarantees totem needs,
since I'm not sure any ordering is guaranteed between messages queued
to different queue pairs.
If you want to use native IB protocols instead of IPoIB, librdmacm
will be easier than using libibcm. The RDMA communication manager
uses IPoIB to resolve IP addresses to native IB addresses, so you can
avoid changing your addressing model. The rdma_cm(7) man page
describes how to bind and connect.
> 2) The library doesn't have any non blocking (kernel wait queue based)
> polling mechanism that I can see. Am I missing a call here?
Look at ibv_create_comp_channel(3) and ibv_get_cq_event(3). The
completion channel has a file descriptor that you can pass to poll().
> 3) Of course using the standard socket API would be highly desired as
> it requires less code changes. Is there some other library I should be
> using?
IPoIB gives you the standard socket API.
RDMA CM uses a similar set of calls for binding and connecting,
though there are some differences. rdma_cm_ids are somewhat like
socket fds. Once you have a multicast or unicast rdma_cm_id, you use
the IB verbs API (libibverbs) to send and receive.
--
Scott M. Ferris,
sferris at acm.org