[ofa-general] mlx4: errors and failures on OOM

Bernd Schubert bs_lists at aakef.fastmail.fm
Sat Apr 11 13:33:50 PDT 2009


Hello,

last week I had issues with Lustre failures, which turned out to be
failures of many clients that ran into out-of-memory conditions due to bad user space jobs
(and no protection against that by the queuing system).

Anyway, I don't think IB is supposed to fail when the OOM killer activates.

Errors for 0x001b0d0000008ede "Cisco Switch"
   5: [XmtDiscards == 270]
         Link info:     38    5[  ]  ==( 4X 5.0 Gbps)==>  0x00188b9097fe2a81    1[  ] "eul0605 HCA-1"
   16: [XmtDiscards == 132]
         Link info:     38   16[  ]  ==( 4X 5.0 Gbps)==>  0x00188b9097fe2a01    1[  ] "eul0616 HCA-1"

I used a script to monitor the fabric for failures every 5 minutes, and the messages
above came up just when the OOM killer activated on the clients. A rough sketch of that
polling loop follows, and after it are syslogs from one of these clients.
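
(Simplified for illustration; ibcheckerrors is used here just as an example
diagnostic, this is not the exact script I run.)

#!/usr/bin/env python3
# Sketch of the monitoring loop: run ibcheckerrors every 5 minutes and log
# whatever it reports, so fabric error counters (like the XmtDiscards above)
# can be correlated with client syslogs afterwards.
import subprocess
import sys
import time

INTERVAL = 300  # 5 minutes
last_report = None

while True:
    proc = subprocess.run(["ibcheckerrors"],
                          stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    report = proc.stdout.decode(errors="replace")
    if report != last_report:
        # only log when the counters change between passes
        sys.stderr.write(time.strftime("%b %d %H:%M:%S\n") + report)
        last_report = report
    time.sleep(INTERVAL)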

Apr  4 08:50:38 eul0605 kernel: Lustre: Request x50173 sent from MGC172.17.31.247 at o2ib to NID 172.17.31.247 at o2ib 51s ago has timed out (limit 
300s).
Apr  4 08:50:38 eul0605 kernel: Lustre: Skipped 30 previous similar messages
Apr  4 08:50:38 eul0605 kernel: LustreError: 166-1: MGC172.17.31.247 at o2ib: Connection to service MGS via nid 172.17.31.247 at o2ib was lost; in 
progress operations using this service will fail.
Apr  4 08:50:38 eul0605 kernel: Lustre: home1-MDT0000-mdc-0000010430fa0800: Connection to service home1-MDT0000 via nid 172.17.31.247 at o2ib was 
lost; in progress operations using this service will wait for recovery to complete.
Apr  4 08:50:38 eul0605 kernel: Lustre: Skipped 7 previous similar messages
Apr  4 08:50:38 eul0605 kernel: Lustre: tmp-OST0003-osc-0000010423750000: Connection to service tmp-OST0003 via nid 172.17.31.231 at o2ib was lost; in 
progress operations using this service will wait for recovery to complete.
Apr  4 08:50:38 eul0605 kernel: Lustre: Skipped 29 previous similar messages
Apr  4 08:50:38 eul0605 kernel: LustreError: 3799:0:(events.c:66:request_out_callback()) @@@ type 4, status -113  req at 000001041bcbb800 x50205/t0 
o250->MGS at 172.17.31.247@o2ib:26/25 lens 304/456 e 0 to 1 dl 1238828031 ref 2 fl Rpc:N/0/0 rc 0/0
Apr  4 08:50:38 eul0605 kernel: LustreError: 3799:0:(events.c:66:request_out_callback()) Skipped 31 previous similar messages
Apr  4 08:50:38 eul0605 kernel: Lustre: Request x50205 sent from MGC172.17.31.247 at o2ib to NID 172.17.31.247 at o2ib 51s ago has timed out (limit 
300s).

===> So somehow Lustre lost the network connection. On the server side the
logs simply show that this node no longer answered pings.


Apr  4 08:52:58 eul0605 kernel: Lustre: Skipped 31 previous similar messages
Apr  4 08:52:59 eul0605 kernel: Lustre: Changing connection for MGC172.17.31.247 at o2ib to MGC172.17.31.247 at o2ib_1/172.17.31.246 at o2ib
Apr  4 08:52:59 eul0605 kernel: Lustre: Skipped 61 previous similar messages
Apr  4 08:53:00 eul0605 kernel: oom-killer: gfp_mask=0xd2

[...]

Apr  4 08:53:05 eul0605 kernel: Out of Memory: Killed process 10612 (gamos).
Apr  4 08:53:10 eul0605 kernel: 3212 pages swap cached
Apr  4 08:53:10 eul0605 kernel: Out of Memory: Killed process 10292 (tcsh).

===> And here we see that gamos had consumed all the memory again.

Apr  4 08:53:10 eul0605 kernel: LustreError: 3799:0:(events.c:66:request_out_callback()) @@@ type 4, status -113  req at 0000010430f8f800 x50237/t0 
o250->MGS at MGC172.17.31.247@o2ib_1:26/25 lens 304/456 e 0 to 1 dl 1238828107 ref 2 fl Rpc:N/0/0 rc 0/0
Apr  4 08:53:10 eul0605 kernel: LustreError: 3799:0:(events.c:66:request_out_callback()) Skipped 31 previous similar messages
Apr  4 08:53:10 eul0605 kernel: Lustre: Request x50237 sent from MGC172.17.31.247 at o2ib to NID 172.17.31.246 at o2ib 50s ago has timed out (limit 
300s).
Apr  4 08:53:10 eul0605 kernel: Lustre: Skipped 31 previous similar messages
Apr  4 08:53:10 eul0605 kernel: ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11
Apr  4 08:53:10 eul0605 kernel: ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11
Apr  4 08:53:10 eul0605 kernel: ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11
Apr  4 08:53:10 eul0605 kernel: ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11
Apr  4 08:53:10 eul0605 kernel: ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11
Apr  4 08:53:10 eul0605 kernel: ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11
Apr  4 08:53:10 eul0605 kernel: ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11
Apr  4 08:53:11 eul0605 kernel: ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11
Apr  4 08:53:11 eul0605 kernel: ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11

===> So we see the reason why Lustre lost the network connection: InfiniBand is down.
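
(For reference, the negative status values in these messages are Linux errno
codes with the sign flipped; a couple of lines of Python decode them:)

import errno
import os

# The status values in the log lines above are negated Linux errno codes.
for status in (-113, -11):
    code = -status
    print(status, errno.errorcode.get(code, "?"), os.strerror(code))

# -113 EHOSTUNREACH No route to host             (the Lustre request_out_callback)
# -11  EAGAIN Resource temporarily unavailable   (the ib0 multicast join)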


In most cases IB recovers from that situation, but not always. If it fails
entirely, ibnetdiscover or ibclearerrors will report that they can't resolve
the route to these nodes.
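
(A quick way to check for that case is to look for the node in the
ibnetdiscover topology; just a sketch, using the node name from the error
report above:)

import subprocess
import sys

# Check whether a node is still present in the fabric topology reported by
# ibnetdiscover (node description taken from the switch error report above).
NODE = "eul0605 HCA-1"

proc = subprocess.run(["ibnetdiscover"],
                      stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
topology = proc.stdout.decode(errors="replace")

if NODE in topology:
    print("%s is still visible in the fabric" % NODE)
else:
    print("%s is missing from the topology" % NODE)
    sys.exit(1)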


This is with the drivers from OFED-1.3.1. Any ideas why OOM causes issues with IB?


Thanks,
Bernd


