[Users] IPoIB not working on Windows 2008 r2 - need help

Hal Rosenstock hal.rosenstock at gmail.com
Fri Jun 7 11:43:17 PDT 2013


On Fri, Jun 7, 2013 at 12:29 PM, Orion Poplawski <orion at cora.nwra.com> wrote:

> On 06/07/2013 04:31 AM, Hal Rosenstock wrote:
>
>> You should check what saquery -m 0xc000 says after looking at saquery -g
>> to
>> make sure that the IPoIB broadcast group ( ff12:401b:ffff::ffff:ffff )
>> says
>> MLID 0xc000.
>>
>
> I switched to winOFED 3.1 but I'm seeing the same behavior.  saquery
> reports:
>
>                 PortGid.................fe80::1:5:ad00:c:5c3d (andrew
> mthca0)
>                 PortGid.................fe80::1:19:bbff:ff00:5851 (saga
> mthca0)
>                 PortGid.................fe80::1:19:bbff:ff00:3899 (HP
> Lion Cub 128MB)
>                 PortGid.................fe80::1:1a:4bff:ff0c:20c9 (HP
> Lion Cub 128MB)
>                 PortGid.................fe80::5:ad00:c:5ced (MT25204
> InfiniHostLx Mellanox Technologies)
>                 PortGid.................fe80::1:17:8ff:ffd0:9df9
> (alexandria2 HCA-1)
>
> The windows machine is fe80::5:ad00:c:5ced.
>

OK; it's fine then at SM layer as had been thought earlier by Susan.


>
> Running wireshark on the windows machine again after it stops responding,
> I see the ping requests coming in and arp requests going out for the ip
> address of the remote ping sender, but no arp responses coming in.
>

Would you send me the wireshark file? I'd like to look at more details in the
packets. I think there may be some incompatibility between Windows and Linux
IPoIB. In particular, I recall something "funny" about MAC address emulation
in Windows.
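
For reference when reading the captures: per RFC 4391, an IPoIB hardware
address is 20 bytes, made up of one reserved/flags byte, a 24-bit QPN, and
the 16-byte port GID. Below is a minimal Python sketch that splits one apart,
so the GID in an ARP reply can be compared against the saquery PortGid
output. The function name parse_ipoib_hwaddr is only illustrative, and the
sample value is the address seen in the Linux-side capture in this thread.

```python
import ipaddress


def parse_ipoib_hwaddr(text):
    """Split a 20-byte IPoIB hardware address into (flags, qpn, gid).

    Accepts plain hex ("8000...") or colon-separated hex ("80:00:...").
    """
    raw = bytes.fromhex(text.replace(":", ""))
    if len(raw) != 20:
        raise ValueError("expected 20 bytes, got %d" % len(raw))
    flags = raw[0]                            # reserved/flags byte
    qpn = int.from_bytes(raw[1:4], "big")     # 24-bit queue pair number
    gid = ipaddress.IPv6Address(raw[4:])      # 16-byte port GID
    return flags, qpn, str(gid)


if __name__ == "__main__":
    flags, qpn, gid = parse_ipoib_hwaddr(
        "80000404FE80000000000001001A4BFFFF0C20C9")
    print("flags=0x%02x qpn=0x%06x gid=%s" % (flags, qpn, gid))
```

The GID part is printed in the same compressed IPv6 notation that saquery
uses, so the two can be compared directly.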


>
> Running wireshark on the linux ping sender I see lots of:
>
>  24.162827              ->              ARP 192.168.2.5 is at
> 80000404FE80000000000001001A4BFFFF0C20C9
>  24.162833              ->              ARP 192.168.2.5 is at
> 80000404FE80000000000001001A4BFFFF0C20C9
>
> But they don't seem to be making it back.
>
> opensm is going crazy on my subnet master, sending a flood of pings to the
> windows machine:
>
To be clear, you mean the opensm machine, not opensm itself.

>
> 10:11:29.395397 IP saga.cora.nwra.com > fontdbib.cora.nwra.com: ICMP echo
> request, id 55637, seq 4, length 64
> 10:11:29.395459 ARP, Request who-has fontdbib.cora.nwra.com tell
> saga.cora.nwra.com, length 56
> 10:11:29.395477 ARP, Request who-has fontdbib.cora.nwra.com tell
> saga.cora.nwra.com, length 56
> 10:11:29.395484 ARP, Request who-has fontdbib.cora.nwra.com tell
> saga.cora.nwra.com, length 56
> 10:11:29.395491 IP saga.cora.nwra.com > fontdbib.cora.nwra.com: ICMP echo
> request, id 55637, seq 4, length 64
> 10:11:29.395609 ARP, Request who-has fontdbib.cora.nwra.com tell
> saga.cora.nwra.com, length 56
> 10:11:29.395624 ARP, Request who-has fontdbib.cora.nwra.com tell
> saga.cora.nwra.com, length 56
> 10:11:29.395631 ARP, Request who-has fontdbib.cora.nwra.com tell
> saga.cora.nwra.com, length 56
>
> so I restarted it.
>
> So it is really causing havoc with the ib network.
>
> I shut down the ipoib interface on the windows machine.
>
> I'm still seeing lots of arp traffic on the linux box I was pinging from
> though, which seems crazy:
>
> 10:28:35.836691 ARP, Request who-has fontdbib.cora.nwra.com tell
> earthib.cora.nwra.com, length 56
> 10:28:35.836696 ARP, Reply earthib.cora.nwra.com is-at
> 80:00:04:04:fe:80:00:00:00:00:00:01:00:1a:4b:ff:ff:0c:20:c9, length 56
> 10:28:35.836700 ARP, Reply earthib.cora.nwra.com is-at
> 80:00:04:04:fe:80:00:00:00:00:00:01:00:1a:4b:ff:ff:0c:20:c9, length 56
> 10:28:35.836704 ARP, Reply earthib.cora.nwra.com is-at
> 80:00:04:04:fe:80:00:00:00:00:00:01:00:1a:4b:ff:ff:0c:20:c9, length 56
> 10:28:35.836708 ARP, Reply earthib.cora.nwra.com is-at
> 80:00:04:04:fe:80:00:00:00:00:00:01:00:1a:4b:ff:ff:0c:20:c9, length 56
> 10:28:35.836713 ARP, Reply earthib.cora.nwra.com is-at
> 80:00:04:04:fe:80:00:00:00:00:00:01:00:1a:4b:ff:ff:0c:20:c9, length 56
> 10:28:35.836717 ARP, Reply earthib.cora.nwra.com is-at
> 80:00:04:04:fe:80:00:00:00:00:00:01:00:1a:4b:ff:ff:0c:20:c9, length 56
> 10:28:35.836721 ARP, Reply earthib.cora.nwra.com is-at
> 80:00:04:04:fe:80:00:00:00:00:00:01:00:1a:4b:ff:ff:0c:20:c9, length 56
> 10:28:35.836726 ARP, Reply earthib.cora.nwra.com is-at
> 80:00:04:04:fe:80:00:00:00:00:00:01:00:1a:4b:ff:ff:0c:20:c9, length 56
> 10:28:35.836730 ARP, Reply earthib.cora.nwra.com is-at
> 80:00:04:04:fe:80:00:00:00:00:00:01:00:1a:4b:ff:ff:0c:20:c9, length 56
> 10:28:35.836734 ARP, Reply earthib.cora.nwra.com is-at
> 80:00:04:04:fe:80:00:00:00:00:00:01:00:1a:4b:ff:ff:0c:20:c9, length 56
> 10:28:35.836739 ARP, Reply earthib.cora.nwra.com is-at
> 80:00:04:04:fe:80:00:00:00:00:00:01:00:1a:4b:ff:ff:0c:20:c9, length 56
> 10:28:35.836743 ARP, Reply earthib.cora.nwra.com is-at
> 80:00:04:04:fe:80:00:00:00:00:00:01:00:1a:4b:ff:ff:0c:20:c9, length 56
> 10:28:35.836747 ARP, Reply earthib.cora.nwra.com is-at
> 80:00:04:04:fe:80:00:00:00:00:00:01:00:1a:4b:ff:ff:0c:20:c9, length 56
> 10:28:35.836752 ARP, Reply earthib.cora.nwra.com is-at
> 80:00:04:04:fe:80:00:00:00:00:00:01:00:1a:4b:ff:ff:0c:20:c9, length 56
> 10:28:35.836756 ARP, Reply earthib.cora.nwra.com is-at
> 80:00:04:04:fe:80:00:00:00:00:00:01:00:1a:4b:ff:ff:0c:20:c9, length 56
> 10:28:35.836760 ARP, Reply earthib.cora.nwra.com is-at
> 80:00:04:04:fe:80:00:00:00:00:00:01:00:1a:4b:ff:ff:0c:20:c9, length 56
> 10:28:35.836765 ARP, Reply earthib.cora.nwra.com is-at
> 80:00:04:04:fe:80:00:00:00:00:00:01:00:1a:4b:ff:ff:0c:20:c9, length 56
> 10:28:35.836769 ARP, Reply earthib.cora.nwra.com is-at
> 80:00:04:04:fe:80:00:00:00:00:00:01:00:1a:4b:ff:ff:0c:20:c9, length 56
> 10:28:35.836795 ARP, Reply earthib.cora.nwra.com is-at
> 80:00:04:04:fe:80:00:00:00:00:00:01:00:1a:4b:ff:ff:0c:20:c9, length 56
> 10:28:35.836804 ARP, Reply earthib.cora.nwra.com is-at
> 80:00:04:04:fe:80:00:00:00:00:00:01:00:1a:4b:ff:ff:0c:20:c9, length 56
> 10:28:35.836808 ARP, Reply earthib.cora.nwra.com is-at
> 80:00:04:04:fe:80:00:00:00:00:00:01:00:1a:4b:ff:ff:0c:20:c9, length 56
> 10:28:35.836813 ARP, Reply earthib.cora.nwra.com is-at
> 80:00:04:04:fe:80:00:00:00:00:00:01:00:1a:4b:ff:ff:0c:20:c9, length 56
> 10:28:35.836817 ARP, Reply earthib.cora.nwra.com is-at
> 80:00:04:04:fe:80:00:00:00:00:00:01:00:1a:4b:ff:ff:0c:20:c9, length 56
> 10:28:35.836822 ARP, Reply earthib.cora.nwra.com is-at
> 80:00:04:04:fe:80:00:00:00:00:00:01:00:1a:4b:ff:ff:0c:20:c9, length 56
> 10:28:35.836827 ARP, Reply earthib.cora.nwra.com is-at
> 80:00:04:04:fe:80:00:00:00:00:00:01:00:1a:4b:ff:ff:0c:20:c9, length 56
> 10:28:35.836832 ARP, Reply earthib.cora.nwra.com is-at
> 80:00:04:04:fe:80:00:00:00:00:00:01:00:1a:4b:ff:ff:0c:20:c9, length 56
> 10:28:35.836836 ARP, Reply earthib.cora.nwra.com is-at
> 80:00:04:04:fe:80:00:00:00:00:00:01:00:1a:4b:ff:ff:0c:20:c9, length 56
> 10:28:35.836850 ARP, Reply earthib.cora.nwra.com is-at
> 80:00:04:04:fe:80:00:00:00:00:00:01:00:1a:4b:ff:ff:0c:20:c9, length 56
> ^C10:28:35.836855 ARP, Reply earthib.cora.nwra.com is-at
> 80:00:04:04:fe:80:00:00:00:00:00:01:00:1a:4b:ff:ff:0c:20:c9, length 56
>
>
> why so bloody many arp replies?
>
It might help if you would give the hostnames of the various machines,
particularly the ones running OpenSM and the Windows machine.
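
To get a quick picture of which hosts are generating the ARP traffic, a
tally per sender may be easier to read than the raw dump. The following is a
rough Python sketch, assuming unwrapped tcpdump text lines in the
"ARP, Request who-has X tell Y" and "ARP, Reply X is-at ADDR" forms quoted
above. The tally_arp helper is hypothetical, not part of any tool.

```python
import re
from collections import Counter

# Patterns for the two tcpdump ARP line forms quoted in this thread.
REQUEST_RE = re.compile(r"ARP, Request who-has (\S+) tell (\S+?),")
REPLY_RE = re.compile(r"ARP, Reply (\S+) is-at (\S+?),")


def tally_arp(lines):
    """Count ARP requests per asking host and ARP replies per replying host."""
    counts = Counter()
    for line in lines:
        m = REQUEST_RE.search(line)
        if m:
            counts[("request", m.group(2))] += 1  # host after "tell"
            continue
        m = REPLY_RE.search(line)
        if m:
            counts[("reply", m.group(1))] += 1    # host before "is-at"
    return counts


if __name__ == "__main__":
    sample = [
        "10:11:29.395459 ARP, Request who-has fontdbib.cora.nwra.com "
        "tell saga.cora.nwra.com, length 56",
        "10:28:35.836696 ARP, Reply earthib.cora.nwra.com is-at "
        "80:00:04:04:fe:80:00:00:00:00:00:01:00:1a:4b:ff:ff:0c:20:c9, "
        "length 56",
    ]
    for (kind, host), n in tally_arp(sample).most_common():
        print("%6d  %-7s %s" % (n, kind, host))
```

Feeding it the full capture text instead of the two sample lines would show
at a glance whether one host accounts for most of the replies.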

>
> There is an over active ib_mad1 process on the machine as well.
>
>
>

What is your Linux machine's configuration? Is it a dual-port
configuration? What is running on each port? Is there an SM on the second
port?
-- Hal