[libfabric-users] gni assertion

Howard Pritchard hppritcha at gmail.com
Fri Apr 14 09:56:22 PDT 2017


Hi John,

The control messages that the GNI provider uses internally (GNI SMSG
messages) go through a circular queue mechanism with credits, which
protects the queue from being overrun.  The RX hardware CQs can also be
overrun if you are sending many small messages to a single endpoint.
For that case there's a plan B fallback: go through all of the mailboxes
(the circular queues) and look for new messages until the RX CQ is no
longer in the overflow state.  You can think of the RX CQ as an
optimization for notifying us which mailbox to check for new messages.
This procedure has been used for years in Cray MPI (since the first
Cray XEs came out), so it's gotten lots of testing over the years.
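
To make the mechanism above concrete, here is a toy model in Python.  It is
only a sketch of the idea, not the GNI provider's code: the names Mailbox,
Receiver, deliver and progress, and the slot and CQ depths, are all made up
for illustration.

```python
from collections import deque


class Mailbox:
    """Toy per-peer circular queue guarded by sender credits (hypothetical)."""

    def __init__(self, slots):
        self.queue = deque()
        self.credits = slots  # one credit per free slot

    def send(self, msg):
        if self.credits == 0:
            return False  # no credit: sender must back off and retry later
        self.credits -= 1
        self.queue.append(msg)
        return True

    def drain(self):
        msgs = list(self.queue)
        self.queue.clear()
        self.credits += len(msgs)  # consuming messages returns credits
        return msgs


class Receiver:
    """Toy receiver: the RX CQ only hints at which mailbox has new messages."""

    def __init__(self, n_peers, slots, cq_depth):
        self.mailboxes = [Mailbox(slots) for _ in range(n_peers)]
        self.cq = deque()
        self.cq_depth = cq_depth
        self.cq_overflow = False

    def deliver(self, peer, msg):
        if not self.mailboxes[peer].send(msg):
            return False
        if len(self.cq) < self.cq_depth:
            self.cq.append(peer)
        else:
            # The notification is lost, but the message is safe in the mailbox.
            self.cq_overflow = True
        return True

    def progress(self):
        received = []
        if self.cq_overflow:
            # Plan B: ignore the CQ and scan every mailbox.
            self.cq.clear()
            self.cq_overflow = False
            for mailbox in self.mailboxes:
                received.extend(mailbox.drain())
        else:
            while self.cq:
                received.extend(self.mailboxes[self.cq.popleft()].drain())
        return received
```

In this model a CQ overflow never loses a message, because the credits
bound what can sit in each mailbox; overflow only costs a full scan.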

Small messages injected via fi_tsend and friends use the SMSG protocol,
while larger messages go through a combination of an SMSG control message
and RDMA reads.
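
As a rough illustration of that split, the Python sketch below models the
two paths.  Everything in it is an assumption made for the example: the
EAGER_LIMIT threshold, the Node class, and the control message layout are
hypothetical and do not reflect the provider's actual values or structures.

```python
EAGER_LIMIT = 1024  # hypothetical threshold in bytes, not the provider's value


class Node:
    """Toy endpoint with memory that a peer may read remotely (hypothetical)."""

    def __init__(self):
        self.memory = {}  # key -> registered send buffer
        self.inbox = []   # payloads delivered to this node
        self._next_key = 0

    def register(self, data):
        key = self._next_key
        self._next_key += 1
        self.memory[key] = data
        return key


def handle_control(src, dst, ctrl):
    if ctrl[0] == "EAGER":
        # Small message: the payload travelled inside the control message.
        dst.inbox.append(ctrl[1])
    else:
        # Large message: the receiver pulls the data, standing in for an RDMA read.
        _, key, length = ctrl
        dst.inbox.append(src.memory[key][:length])
        del src.memory[key]  # stand-in for telling the sender the read is done


def send(src, dst, payload):
    if len(payload) <= EAGER_LIMIT:
        ctrl = ("EAGER", payload)
    else:
        ctrl = ("RNDV", src.register(payload), len(payload))
    handle_control(src, dst, ctrl)
    return ctrl[0]
```

For example, with a = Node() and b = Node(), send(a, b, b"x" * 16) takes
the eager path and send(a, b, b"y" * 4096) takes the read path.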

Hope this helps some,

Howard



2017-04-14 6:54 GMT-06:00 Biddiscombe, John A. <biddisco at cscs.ch>:

> Howard
>
>
>
> > This looks like heap corruption somehow.  Could you try rebuilding
> > libfabric with --enable-debug and set FI_LOG_LEVEL to warn and see if
> > that gives more info?
>
>
>
> After making some changes to our code, the problem seems to have gone
> away. We limit the number of messages that a node can send at a time to a
> smaller number, and things seem to behave better.
>
>
>
> A more general question though:
>
> Does libfabric have any flow control mechanism built in? If I send a large
> number of messages from many nodes to a single node, what behaviour can I
> expect from libfabric once the preposted receives are exhausted? Will
> messages be resent, or will the network layer transition into an error
> state from which it is difficult to recover?
>
>
>
> Experiments indicate that libfabric is handling large numbers of messages
> without returning errors, but I’m curious to know what knobs/controls exist
> to allow us to adjust behaviour and manage flow control.
>
>
>
> Thanks
>
>
>
> JB
>
>
>