[openib-general] Re: ib0: ipoib_ib_post_receive failed for buf 111 ib0: failed to allocate receive buffer

Helen Chen hycsw at ca.sandia.gov
Thu Oct 13 13:38:31 PDT 2005


Roland,

Thank you for your response.  That fixed my initial buffer
allocation failure.  After we tuned Lustre and reran the
same IOZONE tests, we hit the following problem.
Was there an actual network interruption? If so, it is
no longer apparent; the two nodes can ping each other over IPoIB.
Please advise.

Thanks,
Helen

---- Dmesg Report from Lustre server -----
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 1846
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 2846
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 3846
Lustre: A connection with 192.168.2.79 timed out; the network or that node may be down.
LustreError: 10501:0:(socknal_cb.c:2264:ksocknal_check_peer_timeouts()) Timeout out conn->0xc0a8024f ip 192.168.2.79:1021
LustreError: 10793:0:(ldlm_lib.c:506:target_handle_reconnect()) 460e5_lov2_7d3910bb5c reconnecting

----- Dmesg from Lustre client (192.168.2.79) ------
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 1965
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 2965
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 3965
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 4965
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 5965
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 6965
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 7965
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 8965
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 9965
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 10965
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 11965
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 12965
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 13965
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 14965
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 15965
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 16965
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 17965
Lustre: 10035:0:(socknal_cb.c:1326:ksocknal_process_receive()) [f6256000] EOF from 0xc0a80253 ip 192.168.2.83:988
LustreError: 10169:0:(client.c:568:ptlrpc_check_status()) @@@ type == PTL_RPC_MSG_ERR, err == -107 req at d3585600 x13853/t0
o400->on5-ost2_UUID at NID_on5-ib_UUID:6 lens 64/64 ref 1 fl Rpc:RN/0/0 rc 0/-107
LustreError: Connection to service on5-ost2 via nid 192.168.2.76 was lost; in progress operations using this service will wait for recovery to
complete.
Lustre: 10169:0:(import.c:142:ptlrpc_set_import_discon()) OSC_on8_on5-ost2_MNT_on8-ib_2: connection lost to on5-ost2_UUID at NID_on5-ib_UUID
LustreError: This client was evicted by on5-ost2; in progress operations using this service will fail.
LustreError: 10413:0:(rw.c:1253:ll_readpage()) page c1538cc0 map f6193328 index 825344 flags 20001023 count 3 priv e91da940: lock match failed: rc -5
LustreError: 10169:0:(client.c:502:ptlrpc_import_delay_req()) @@@ IMP_INVALID req at d01f2200 x13862/t0 o3->on5-ost2_UUID at NID_on5-ib_UUID:6 lens 328/280
ref 2 fl Rpc:/0/0 rc 0/0
LustreError: 10169:0:(client.c:502:ptlrpc_import_delay_req()) @@@ IMP_INVALID req at d51ea400 x13868/t0 o3->on5-ost2_UUID at NID_on5-ib_UUID:6 lens 328/280
ref 2 fl Rpc:/0/0 rc 0/0
LustreError: 10169:0:(client.c:502:ptlrpc_import_delay_req()) previously skipped 4 similar messages
LustreError: 10169:0:(client.c:502:ptlrpc_import_delay_req()) @@@ IMP_INVALID req at d3c7ea00 x13880/t0 o3->on5-ost2_UUID at NID_on5-ib_UUID:6 lens 328/280
ref 2 fl Rpc:/0/0 rc 0/0
LustreError: 10169:0:(client.c:502:ptlrpc_import_delay_req()) previously skipped 11 similar messages
Lustre: A connection with 192.168.2.75 timed out; the network or that node may be down.
LustreError: 10041:0:(socknal_cb.c:2264:ksocknal_check_peer_timeouts()) Timeout out conn->0xc0a8024b ip 192.168.2.75:988
Lustre: Connection restored to service on5-ost2 using nid 192.168.2.76.
Lustre: 10496:0:(import.c:714:ptlrpc_import_recovery_state_machine()) OSC_on8_on5-ost2_MNT_on8-ib_2: connection restored to
on5-ost2_UUID at NID_on5-ib_UUID
LustreError: 10169:0:(client.c:945:ptlrpc_expire_one_request()) @@@ timeout (sent at 1129234515, 101s ago) req at f6233e00 x13850/t0
o400->on12-mds2_UUID at NID_on12-ib_UUID:12 lens 64/64 ref 1 fl Rpc:N/0/0 rc 0/0
LustreError: Connection to service on12-mds2 via nid 192.168.2.83 was lost; in progress operations using this service will wait for recovery to
complete.
Lustre: 10169:0:(import.c:142:ptlrpc_set_import_discon()) MDC_on8_on12-mds2_MNT_on8-ib_2: connection lost to on12-mds2_UUID at NID_on12-ib_UUID
Lustre: Connection restored to service on3-ost2 using nid 192.168.2.74.
Lustre: 10170:0:(import.c:714:ptlrpc_import_recovery_state_machine()) OSC_on8_on3-ost2_MNT_on8-ib_2: connection restored to
on3-ost2_UUID at NID_on3-ib_UUID
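
For anyone reading the dmesg output above, here is a minimal Python sketch for summarizing it. It pulls the latency values out of the "transmit timeout" lines and decodes the hex peer address that socknal prints next to the dotted IP. The function names and regular expressions are my own, written only against the lines quoted in this message; they are not part of IPoIB or Lustre, and the unit of the latency value is not stated in the log, so the script does not assume one.

```python
import ipaddress
import re

# Hypothetical patterns, written only against the lines quoted in this message.
TIMEOUT_RE = re.compile(r"^(\w+): transmit timeout: latency (\d+)")
CONN_RE = re.compile(r"conn->0x([0-9a-fA-F]{8}) ip (\d+\.\d+\.\d+\.\d+)")


def timeout_latencies(lines):
    """Return the latency values from 'transmit timeout' lines, in log order."""
    values = []
    for line in lines:
        match = TIMEOUT_RE.match(line.strip())
        if match:
            values.append(int(match.group(2)))
    return values


def decode_conn(hex_addr):
    """Convert a 32-bit hex peer address such as 'c0a8024f' to dotted-quad form."""
    return str(ipaddress.IPv4Address(int(hex_addr, 16)))


# The first lines of the server-side report, copied from above.
sample = [
    "NETDEV WATCHDOG: ib0: transmit timed out",
    "ib0: transmit timeout: latency 1846",
    "NETDEV WATCHDOG: ib0: transmit timed out",
    "ib0: transmit timeout: latency 2846",
    "NETDEV WATCHDOG: ib0: transmit timed out",
    "ib0: transmit timeout: latency 3846",
    "LustreError: 10501:0:(socknal_cb.c:2264:ksocknal_check_peer_timeouts()) "
    "Timeout out conn->0xc0a8024f ip 192.168.2.79:1021",
]

latencies = timeout_latencies(sample)
# Differences between consecutive reports.
deltas = [b - a for a, b in zip(latencies, latencies[1:])]
print(latencies, deltas)

# Cross-check the hex address against the IP printed on the same line.
conn = CONN_RE.search(sample[-1])
print(decode_conn(conn.group(1)), conn.group(2))
```

On the lines above, each report is exactly 1000 higher than the previous one, on both the server and the client. One possible reading is that the transmit queue stayed stalled across successive watchdog checks rather than timing out repeatedly on separate occasions, but that is an interpretation, not something the log states.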



