[libfabric-users] Help with mrail
Latham, Robert J.
robl at mcs.anl.gov
Thu Nov 21 14:12:08 PST 2019
I'm trying to figure out how to use libfabric's multi-rail support to
drive the multiple InfiniBand cards on the ORNL Summit machine.
There is only one netdev interface name on these nodes (ib0), but each
node can see four infiniband ports. The HCAs on Summit have two
ports. I want to use 'ofi_mrail' to drive those two ports.
How am I supposed to specify FI_OFI_MRAIL_ADDR in a scenario where I
don't know ahead of time which hosts I will run on?
- `FI_OFI_MRAIL_ADDR=ib0,ib3` won't work: there is only one ib0 device
- 'ibv_devices' reports four devices: 'mlx5_0', 'mlx5_1', 'mlx5_2', and
'mlx5_3'. Is it possible to express those devices in an
FI_ADDR_STR? 'fi_verbs://' was about as far as I got. Can I just use
e.g. "fi_verbs://mlx5_0,fi_verbs://mlx5_2"?
- 'fi_info -v -p verbs' shows those mlx5_[0-3] devices in the 'domain'
field, but only mlx5_0 has a non-null 'src_addr'. So it seems to me
that libfabric knows about these ports. I'm just not sure how to ask
it to use them.
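To make the second bullet concrete, here is a minimal Python sketch that builds the candidate value being asked about. The 'fi_verbs://<device>' form is only the guess from the question above; whether libfabric accepts it is exactly what is unknown. The helper names are hypothetical, and the only system detail assumed is that Linux lists RDMA devices under /sys/class/infiniband.

```python
import os


def list_verbs_devices(sysfs="/sys/class/infiniband"):
    """Return RDMA device names (e.g. mlx5_0) found in sysfs, or [] if absent."""
    try:
        return sorted(os.listdir(sysfs))
    except OSError:
        return []


def candidate_mrail_addr(devices, prefix="fi_verbs://"):
    """Join device names into a comma-separated list.

    ASSUMPTION: the 'fi_verbs://<device>' form is the unverified guess
    from the question, not a documented FI_ADDR_STR format.
    """
    return ",".join(prefix + dev for dev in devices)


if __name__ == "__main__":
    # Print the string one would try exporting as FI_OFI_MRAIL_ADDR.
    print(candidate_mrail_addr(list_verbs_devices()))
```

Computing the list on each node at job start would sidestep the "don't know ahead of time which hosts" problem, provided some address form based on device names turns out to be valid.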
For background, I'm going to quote liberally from a just-published SC19
paper that describes the Summit topology far better than I could:
The [Infiniband card] connects directly to both CPUs and both CPUs can
directly use both ports. When Linux enumerates the PCI tree, each CPU
reports both ports on the HCA so Linux recognizes four ports. Socket 0
maps virtual ports V0 and V1 to the physical ports P0 and P1 and socket
1 maps virtual ports V2 and V3 also to P0 and P1 as shown in Figure 2.
The figure also shows socket0 striping over virtual ports V0 and V3. In
this case, the V3 data will cross the SMP bus from socket 0 to socket 1
and then down PCIe to physical port P1. By default, processes on Summit
do not stripe and all socket 0 traffic uses virtual port V0 and all
socket 1 traffic uses virtual port V3.
(I've attached 'Figure 2', referenced in the above text)
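The mapping in the quoted passage can be written down as a small Python table, which makes it easier to see which virtual port choices cross the SMP bus. This is only a sketch of the description above: the names are hypothetical, the V2-to-P0 pairing is inferred from the order in which the ports are listed, and nothing here says which mlx5_N device corresponds to which virtual port.

```python
# Virtual port -> (socket that exposes it, physical port it maps to).
# ASSUMPTION: V2 -> P0 is inferred from listing order; the quoted text
# states explicitly only that V3 reaches P1 through socket 1.
VIRTUAL_PORTS = {
    "V0": (0, "P0"),
    "V1": (0, "P1"),
    "V2": (1, "P0"),
    "V3": (1, "P1"),
}

# Default (non-striping) choice described in the quoted text.
DEFAULT_PORT = {0: "V0", 1: "V3"}


def route(socket, vport):
    """Return (physical_port, crosses_smp_bus) for traffic from a socket."""
    owner, physical = VIRTUAL_PORTS[vport]
    return physical, owner != socket


if __name__ == "__main__":
    # Socket 0 striping over V0 and V3, as in the figure.
    for vport in ("V0", "V3"):
        print(0, vport, route(0, vport))
```

Under this model, the default assignment keeps each socket on a port it owns, while striping from one socket over both physical ports sends part of the traffic across the SMP bus.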