[Openib-windows] Win IBhost stop receive broadcast packets

Anatoly Lisenko anatolyl at voltaire.com
Tue Jan 2 03:56:50 PST 2007


Hi,

I saw a problem with the Windows IBhost stack: rebooting the InfiniBand
switch can cause ping loss that persists even after the switch is back up.

I started investigating this anomaly and found:

1. The IB stack doesn't receive broadcast ARP packets.

2. All other packets (unicast and multicast) are received.

3. The HCA port's rx packet counter increases each time a broadcast packet
arrives.

4. It seems that the firmware drops these packets (I don't see any
completions).

 

I examined the logs and saw that we somehow end up in a state where:

1. The HCA port is joined to the bcast group.

2. The IPoIB QP is detached from the bcast group.

 

This is the stack backtrace of the mlnx_detach_mcast call:

f7125d10 f68ae6ab mthca!mlnx_detach_mcast+0x13
[n:\win-ibhost\trunk\hw\mthca\kernel\hca_mcast.c @ 142]

f7125d38 f68c9c30 ibbus!__cleanup_mcast+0x24b
[n:\win-ibhost\trunk\core\al\al_mcast.c @ 304]

f7125d70 f6820212 ibbus!async_destroy_cb+0x420
[n:\win-ibhost\trunk\core\al\al_common.c @ 665]

f7125d8c f6825dc2 ibbus!__cl_async_proc_worker+0x92
[n:\win-ibhost\trunk\core\complib\cl_async_proc.c @ 153]

f7125da0 f6827c3a ibbus!__cl_thread_pool_routine+0x52
[n:\win-ibhost\trunk\core\complib\cl_threadpool.c @ 67]

f7125dac 80948bb2 ibbus!__thread_callback+0x2a
[n:\win-ibhost\trunk\core\complib\kernel\cl_thread.c @ 49]

f7125ddc 8088d4d2 nt!PspSystemThreadStartup+0x2e

00000000 00000000 nt!KiThreadStartup+0x16

 

 

mthca WPP log:

00000662          kernel   1236     600       2          312
01\02\2007-13:28:02:781            mlnx_query_ca()===>

00000663          kernel   1236     600       2          321
01\02\2007-13:28:02:781            mlnx_query_ca() :port 0 gid0:

00000664          kernel   1236     600       2          322
01\02\2007-13:28:02:781            mlnx_query_ca() :
0xfe80000000-0x08f14398095

00000665          kernel   1236     600       2          323
01\02\2007-13:28:02:781            mlnx_query_ca() :port 1 gid0:

00000666          kernel   1236     600       2          324
01\02\2007-13:28:02:781            mlnx_query_ca() :
0xfe80000000-0x08f14398096

00000667          kernel   1236     600       2          325
01\02\2007-13:28:02:781            mlnx_query_ca() :Space required 1898
used 1898

00000668          kernel   1236     600       2          326
01\02\2007-13:28:02:781            mlnx_conv_hca_cap() :Port 1 port_guid
0x8f10403980095

00000669          kernel   1236     600       2          327
01\02\2007-13:28:02:781            mlnx_conv_hca_cap() :Port 2 port_guid
0x8f10403980096

00000670          kernel   1236     600       2          328
01\02\2007-13:28:02:781            mlnx_query_ca()<===

00000671          kernel   4          276       2          339
01\02\2007-13:28:02:859            mlnx_attach_mcast()===>

00000672          kernel   4          276       2          340
01\02\2007-13:28:02:859            mlnx_attach_mcast() :mcasth 89930EA8,
qp_p 88A56E78, mlid c0, mgid ffff1b4012ff`ffffffff00000000

00000678          kernel   4          276       2          346
01\02\2007-13:28:02:859            completes with ERROR status
IB_SUCCESS

00000681          kernel   4          276       2          349
01\02\2007-13:28:02:859            mlnx_enable_cq_notify()===>

00000682          kernel   4          276       2          350
01\02\2007-13:28:02:859            completes with ERROR status
IB_SUCCESS

00000683          kernel   4          276       2          357
01\02\2007-13:28:02:859            mlnx_attach_mcast()===>

00000684          kernel   4          276       2          358
01\02\2007-13:28:02:859            mlnx_attach_mcast() :mcasth 898F1D68,
qp_p 88A56E78, mlid 1c0, mgid ffff1b4012ff`100000000000000

00000685          kernel   4          276       2          359
01\02\2007-13:28:02:859            completes with ERROR status
IB_SUCCESS

00000686          kernel   4          276       2          362
01\02\2007-13:28:02:859            mlnx_attach_mcast()===>

00000687          kernel   4          276       2          363
01\02\2007-13:28:02:859            mlnx_attach_mcast() :mcasth 884352D8,
qp_p 88A56E78, mlid 2c0, mgid ffff051412ff`30000c280010000

00000688          kernel   4          276       2          364
01\02\2007-13:28:02:859            completes with ERROR status
IB_SUCCESS

00000689          kernel   0          0          3          129
01\02\2007-13:28:02:750            mlnx_enable_cq_notify()===>

00000690          kernel   0          0          3          130
01\02\2007-13:28:02:750            completes with ERROR status
IB_SUCCESS

...

00000776          kernel   0          0          3          373
01\02\2007-13:28:03:109            mlnx_enable_cq_notify()===>

00000777          kernel   0          0          3          374
01\02\2007-13:28:03:109            completes with ERROR status
IB_SUCCESS

00000778          kernel   4          272       3          375
01\02\2007-13:28:03:296            mlnx_detach_mcast() :mcasth 89918F40,
qp_p 88A56E78, mlid 2c0, mgid ffff051412ff`30000c280010000

00000779          kernel   4          272       3          376
01\02\2007-13:28:03:296            completes with ERROR status
IB_SUCCESS

00000780          kernel   4          272       3          377
01\02\2007-13:28:03:296            mlnx_detach_mcast() :mcasth 88DB3F00,
qp_p 88A56E78, mlid 1c0, mgid ffff1b4012ff`100000000000000

00000781          kernel   4          272       3          378
01\02\2007-13:28:03:296            completes with ERROR status
IB_SUCCESS

00000782          kernel   4          272       3          379
01\02\2007-13:28:03:296            mlnx_detach_mcast() :mcasth 88A48DA0,
qp_p 88A56E78, mlid 6c0, mgid ffff051412ff`ffffa8ff00ff0000

00000783          kernel   4          272       3          380
01\02\2007-13:28:03:296            completes with ERROR status
IB_SUCCESS

00000784          kernel   4          272       3          381
01\02\2007-13:28:03:296            mlnx_detach_mcast() :mcasth 88A94DD0,
qp_p 88A56E78, mlid c0, mgid ffff1b4012ff`ffffffff00000000

00000785          kernel   4          272       3          382
01\02\2007-13:28:03:296            completes with ERROR status
IB_SUCCESS

00000786          kernel   4          280       1          383
01\02\2007-13:28:22:781            mlnx_query_ca()===>

00000787          kernel   4          280       1          384
01\02\2007-13:28:22:781            mlnx_query_ca() :port 0 gid0:

00000788          kernel   4          280       1          385
01\02\2007-13:28:22:781            mlnx_query_ca() :
0xfe80000000-0x08f14398095

00000789          kernel   4          280       1          386
01\02\2007-13:28:22:781            mlnx_query_ca() :port 1 gid0:

00000790          kernel   4          280       1          387
01\02\2007-13:28:22:781            mlnx_query_ca() :
0xfe80000000-0x08f14398096

00000791          kernel   4          280       1          388
01\02\2007-13:28:22:781            mlnx_query_ca() :Space required 1898
used 1898

00000792          kernel   4          280       1          389
01\02\2007-13:28:22:781            mlnx_conv_hca_cap() :Port 1 port_guid
0x8f10403980095

00000793          kernel   4          280       1          390
01\02\2007-13:28:22:781            mlnx_conv_hca_cap() :Port 2 port_guid
0x8f10403980096

00000794          kernel   4          280       1          391
01\02\2007-13:28:22:781            mlnx_query_ca()<===

 

 

IPoIB WPP log:

 

00000130          kernel   0          0          0          130
01\02\2007-13:28:01:468            [IPoIB] :ipoib_check_for_hang():]

00000131          kernel   4          280       0          133
01\02\2007-13:28:02:781            [IPoIB] :__endpt_mgr_reset_all():[

00000132          kernel   4          280       0          134
01\02\2007-13:28:02:781            [IPoIB] :__endpt_destroying():[

00000133          kernel   4          280       0          135
01\02\2007-13:28:02:781            [IPoIB] :__endpt_destroying():]

...

00000140          kernel   4          280       0          150
01\02\2007-13:28:02:781            [IPoIB] :__endpt_destroying():[

00000141          kernel   4          280       0          151
01\02\2007-13:28:02:781            [IPoIB] :__endpt_destroying():]

00000142          kernel   4          280       0          152
01\02\2007-13:28:02:781            [IPoIB] :__endpt_mgr_reset_all():]

00000143          kernel   4          280       0          160
01\02\2007-13:28:02:781            [IPoIB] :ipoib_port_down():]

00000144          kernel   4          280       1          131
01\02\2007-13:28:02:781            [IPoIB] :__ipoib_pnp_cb() :Link DOWN!

00000145          kernel   4          280       1          132
01\02\2007-13:28:02:781            [IPoIB] :ipoib_port_down():[

00000146          kernel   4          312       1          153
01\02\2007-13:28:02:781            [IPoIB] :__endpt_cleanup():[

00000147          kernel   4          312       1          154
01\02\2007-13:28:02:781            [IPoIB] :__endpt_cleanup() :Leaving
MCast group

00000148          kernel   4          312       1          164
01\02\2007-13:28:02:781            [IPoIB] :__endpt_cleanup():]

00000149          kernel   4          312       1          165
01\02\2007-13:28:02:781            [IPoIB] :__endpt_free():[

00000150          kernel   4          312       1          166
01\02\2007-13:28:02:781            [IPoIB] :__endpt_free():]

00000151          kernel   4          280       1          170
01\02\2007-13:28:02:781            [IPoIB] :ipoib_port_up():[

00000152          kernel   4          280       1          171
01\02\2007-13:28:02:781            [IPoIB] :ipoib_port_up():]

00000153          kernel   4          308       2          145
01\02\2007-13:28:02:781            [IPoIB] :__endpt_cleanup():[

00000154          kernel   4          308       2          147
01\02\2007-13:28:02:781            [IPoIB] :__endpt_cleanup() :Leaving
MCast group

00000155          kernel   4          308       2          161
01\02\2007-13:28:02:781            [IPoIB] :__endpt_cleanup():]

00000156          kernel   4          308       2          162
01\02\2007-13:28:02:781            [IPoIB] :__endpt_free():[

00000157          kernel   4          308       2          163
01\02\2007-13:28:02:781            [IPoIB] :__endpt_free():]

00000158          kernel   4          276       2          191
01\02\2007-13:28:02:859            [IPoIB] :__bcast_cb():[

00000159          kernel   4          276       2          192
01\02\2007-13:28:02:859            [IPoIB] :__endpt_mgr_add_bcast():[

00000160          kernel   4          276       2          193
01\02\2007-13:28:02:859            [IPoIB] :ipoib_endpt_create():[

00000161          kernel   4          276       2          194
01\02\2007-13:28:02:859            [IPoIB] :ipoib_endpt_create():]

00000162          kernel   4          276       2          195
01\02\2007-13:28:02:859            [IPoIB] :ipoib_endpt_set_mcast():[

00000163          kernel   4          276       2          196
01\02\2007-13:28:02:859            [IPoIB] :ipoib_endpt_set_mcast()
:Create av for MAC: 00-00-00-00-00-00

00000164          kernel   4          276       2          197
01\02\2007-13:28:02:859            [IPoIB] :__create_mcast_av():[

00000165          kernel   4          276       2          198
01\02\2007-13:28:02:859            [IPoIB] :__create_mcast_av():]

00000166          kernel   4          276       2          199
01\02\2007-13:28:02:859            [IPoIB] :ipoib_endpt_set_mcast():]

00000167          kernel   4          276       2          200
01\02\2007-13:28:02:859            [IPoIB]
:__endpt_mgr_insert_locked():[

00000168          kernel   4          276       2          201
01\02\2007-13:28:02:859            [IPoIB] :__endpt_mgr_insert_locked()
:insert  :  MAC: FF-FF-FF-FF-FF-FF

00000169          kernel   4          276       2          202
01\02\2007-13:28:02:859            [IPoIB] :__endpt_mgr_insert():[

00000170          kernel   4          276       2          203
01\02\2007-13:28:02:859            [IPoIB] :__endpt_mgr_insert():]

00000171          kernel   4          276       2          204
01\02\2007-13:28:02:859            [IPoIB] :__endpt_mgr_add_bcast():]

00000172          kernel   4          276       2          205
01\02\2007-13:28:02:859            [IPoIB] :__ib_mgr_activate():[

00000173          kernel   4          276       2          206
01\02\2007-13:28:02:859            [IPoIB] :__ib_mgr_activate():]

00000174          kernel   4          276       2          207
01\02\2007-13:28:02:859            [IPoIB] :ipoib_set_active():[

00000175          kernel   4          276       2          208
01\02\2007-13:28:02:859            [IPoIB] :ipoib_port_join_mcast():[

00000176          kernel   4          276       2          209
01\02\2007-13:28:02:859            [IPoIB] :__endpt_mgr_ref():[

00000177          kernel   4          276       2          210
01\02\2007-13:28:02:859            [IPoIB] :__endpt_mgr_ref() :Look for
:     MAC: 01-00-5E-00-00-01

00000178          kernel   4          276       2          211
01\02\2007-13:28:02:859            [IPoIB] :__endpt_mgr_ref() :Failed
endpoint lookup.[IpoIB] :__endpt_mgr_ref():]

00000179          kernel   4          276       2          212
01\02\2007-13:28:02:859            [IPoIB] :ipoib_endpt_create():[

00000180          kernel   4          276       2          213
01\02\2007-13:28:02:859            [IPoIB] :ipoib_endpt_create():]

00000181          kernel   4          276       2          214
01\02\2007-13:28:02:859            [IPoIB]
:__endpt_mgr_insert_locked():[

00000182          kernel   4          276       2          215
01\02\2007-13:28:02:859            [IPoIB] :__endpt_mgr_insert_locked()
:insert  :  MAC: 01-00-5E-00-00-01

00000183          kernel   4          276       2          216
01\02\2007-13:28:02:859            [IPoIB] :__endpt_mgr_insert():[

00000184          kernel   4          276       2          217
01\02\2007-13:28:02:859            [IPoIB] :__endpt_mgr_insert():]

00000185          kernel   4          276       2          218
01\02\2007-13:28:02:859            [IPoIB] :ipoib_port_join_mcast():]

00000186          kernel   4          276       2          219
01\02\2007-13:28:02:859            [IPoIB] :ipoib_port_join_mcast():[

00000187          kernel   4          276       2          220
01\02\2007-13:28:02:859            [IPoIB] :__endpt_mgr_ref():[

00000188          kernel   4          276       2          221
01\02\2007-13:28:02:859            [IPoIB] :__endpt_mgr_ref() :Look for
:     MAC: 01-80-C2-00-00-03

00000189          kernel   4          276       2          222
01\02\2007-13:28:02:859            [IPoIB] :__endpt_mgr_ref() :Failed
endpoint lookup.[IpoIB] :__endpt_mgr_ref():]

00000190          kernel   4          276       2          223
01\02\2007-13:28:02:859            [IPoIB] :ipoib_endpt_create():[

00000191          kernel   4          276       2          224
01\02\2007-13:28:02:859            [IPoIB] :ipoib_endpt_create():]

00000192          kernel   4          276       2          225
01\02\2007-13:28:02:859            [IPoIB]
:__endpt_mgr_insert_locked():[

00000193          kernel   4          276       2          226
01\02\2007-13:28:02:859            [IPoIB] :__endpt_mgr_insert_locked()
:insert  :  MAC: 01-80-C2-00-00-03

00000194          kernel   4          276       2          227
01\02\2007-13:28:02:859            [IPoIB] :__endpt_mgr_insert():[

00000195          kernel   4          276       2          228
01\02\2007-13:28:02:859            [IPoIB] :__endpt_mgr_insert():]

00000196          kernel   4          276       2          229
01\02\2007-13:28:02:859            [IPoIB] :ipoib_port_join_mcast():]

00000197          kernel   4          276       2          230
01\02\2007-13:28:02:859            [IPoIB] :ipoib_resume_oids():[

00000198          kernel   4          276       2          231
01\02\2007-13:28:02:859            [IPoIB] :ipoib_resume_oids():]

00000199          kernel   4          276       2          232
01\02\2007-13:28:02:859            [IPoIB] :ipoib_set_active() :Link UP!

00000200          kernel   4          276       2          233
01\02\2007-13:28:02:859            [IPoIB] :ipoib_set_active():]

00000201          kernel   4          276       2          234
01\02\2007-13:28:02:859            [IPoIB] :__bcast_cb():]

00000202          kernel   4          276       2          235
01\02\2007-13:28:02:859            [IPoIB] :__mcast_cb():[

00000203          kernel   4          276       2          236
01\02\2007-13:28:02:859            [IPoIB] :ipoib_endpt_set_mcast():[

00000204          kernel   4          276       2          237
01\02\2007-13:28:02:859            [IPoIB] :ipoib_endpt_set_mcast()
:Create av for MAC: 01-00-5E-00-00-01

00000205          kernel   4          276       2          238
01\02\2007-13:28:02:859            [IPoIB] :__create_mcast_av():[

00000206          kernel   4          276       2          239
01\02\2007-13:28:02:859            [IPoIB] :__create_mcast_av():]

00000207          kernel   4          276       2          240
01\02\2007-13:28:02:859            [IPoIB] :ipoib_endpt_set_mcast():]

00000208          kernel   4          276       2          241
01\02\2007-13:28:02:859            [IPoIB] :__mcast_cb():]

00000209          kernel   4          276       2          242
01\02\2007-13:28:02:859            [IPoIB] :__mcast_cb():[

00000210          kernel   4          276       2          243
01\02\2007-13:28:02:859            [IPoIB] :ipoib_endpt_set_mcast():[

00000211          kernel   4          276       2          244
01\02\2007-13:28:02:859            [IPoIB] :ipoib_endpt_set_mcast()
:Create av for MAC: 01-80-C2-00-00-03

00000212          kernel   4          276       2          245
01\02\2007-13:28:02:859            [IPoIB] :__create_mcast_av():[

00000213          kernel   4          276       2          246
01\02\2007-13:28:02:859            [IPoIB] :__create_mcast_av():]

00000214          kernel   4          276       2          247
01\02\2007-13:28:02:859            [IPoIB] :ipoib_endpt_set_mcast():]

00000215          kernel   4          276       2          248
01\02\2007-13:28:02:859            [IPoIB] :__mcast_cb():]

00000216          kernel   4          320       3          136
01\02\2007-13:28:02:781            [IPoIB] :__endpt_cleanup():[

00000217          kernel   4          320       3          139
01\02\2007-13:28:02:781            [IPoIB] :__endpt_cleanup():]

00000218          kernel   4          320       3          141
01\02\2007-13:28:02:781            [IPoIB] :__endpt_free():[

00000219          kernel   4          320       3          143
01\02\2007-13:28:02:781            [IPoIB] :__endpt_free():]

00000220          kernel   4          320       3          144
01\02\2007-13:28:02:781            [IPoIB] :__endpt_cleanup():[

00000221          kernel   4          320       3          146
01\02\2007-13:28:02:781            [IPoIB] :__endpt_cleanup() :Leaving
MCast group

00000222          kernel   4          320       3          155
01\02\2007-13:28:02:781            [IPoIB] :__endpt_cleanup():]

00000223          kernel   4          320       3          156
01\02\2007-13:28:02:781            [IPoIB] :__endpt_free():[

00000224          kernel   4          320       3          157
01\02\2007-13:28:02:781            [IPoIB] :__endpt_free():]

00000225          kernel   4          320       3          158
01\02\2007-13:28:02:781            [IPoIB] :__endpt_cleanup():[

00000226          kernel   4          320       3          159
01\02\2007-13:28:02:781            [IPoIB] :__endpt_cleanup() :Leaving
MCast group

00000227          kernel   4          320       3          167
01\02\2007-13:28:02:781            [IPoIB] :__endpt_cleanup():]

00000228          kernel   4          320       3          168
01\02\2007-13:28:02:781            [IPoIB] :__endpt_free():[

00000229          kernel   4          320       3          169
01\02\2007-13:28:02:781            [IPoIB] :__endpt_free():]

00000230          kernel   0          0          3          172
01\02\2007-13:28:02:781            [IPoIB] :__port_info_cb():[

00000231          kernel   0          0          3          173
01\02\2007-13:28:02:781            [IPoIB] :__endpt_mgr_add_local():[

00000232          kernel   0          0          3          174
01\02\2007-13:28:02:781            [IPoIB] :ipoib_endpt_create():[

00000233          kernel   0          0          3          175
01\02\2007-13:28:02:781            [IPoIB] :ipoib_endpt_create():]

00000234          kernel   0          0          3          176
01\02\2007-13:28:02:781            [IPoIB] :__endpt_mgr_insert():[

00000235          kernel   0          0          3          177
01\02\2007-13:28:02:781            [IPoIB] :__endpt_mgr_insert():]

00000236          kernel   0          0          3          178
01\02\2007-13:28:02:781            [IPoIB] :__endpt_mgr_add_local():]

00000237          kernel   0          0          3          179
01\02\2007-13:28:02:781            [IPoIB] :__port_info_cb() :Received
port info: link width = 2.

00000238          kernel   0          0          3          180
01\02\2007-13:28:02:781            [IPoIB] :ipoib_set_rate():[

00000239          kernel   0          0          3          181
01\02\2007-13:28:02:781            [IPoIB] :ipoib_set_rate() :Link speed
is 2.5Gs

00000240          kernel   0          0          3          182
01\02\2007-13:28:02:781            [IPoIB] :ipoib_set_rate() :Link width
is 4X

00000241          kernel   0          0          3          183
01\02\2007-13:28:02:781            [IPoIB] :ipoib_set_rate():]

00000242          kernel   0          0          3          184
01\02\2007-13:28:02:781            [IPoIB] :__port_get_bcast():[

00000243          kernel   0          0          3          185
01\02\2007-13:28:02:781            [IPoIB] :__port_get_bcast():]

00000244          kernel   0          0          3          186
01\02\2007-13:28:02:781            [IPoIB] :__port_info_cb():]

00000245          kernel   2624     2732     3          187
01\02\2007-13:28:02:781            [IPoIB] :__bcast_get_cb():[

00000246          kernel   2624     2732     3          188
01\02\2007-13:28:02:781            [IPoIB] :__port_join_bcast():[

00000247          kernel   2624     2732     3          189
01\02\2007-13:28:02:781            [IPoIB] :__port_join_bcast():]

00000248          kernel   2624     2732     3          190
01\02\2007-13:28:02:781            [IPoIB] :__bcast_get_cb():]

00000249          kernel   0          0          3          249
01\02\2007-13:28:03:109            [IPoIB] :__endpt_mgr_get_by_gid():[

00000250          kernel   0          0          3          250
01\02\2007-13:28:03:109            [IPoIB] :__endpt_mgr_get_by_gid():]

...

00000339          kernel   0          0          3          339
01\02\2007-13:28:03:109            [IPoIB] :__endpt_mgr_get_by_gid():[

00000340          kernel   0          0          3          340
01\02\2007-13:28:03:109            [IPoIB] :__endpt_mgr_get_by_gid():]

00000341          kernel   0          0          0          341
01\02\2007-13:28:03:468            [IPoIB] :ipoib_check_for_hang():[
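
The entries in the IPoIB log above are grouped per CPU, so they do not appear in the order they were logged. A minimal sketch for re-ordering them follows. It assumes the header columns are message number, name, process id, thread id, CPU and sequence number; these column meanings are my assumption from the values, and the helper names are made up for illustration. The sample records are copied from the log above.

```python
import re

# Three records copied from the IPoIB WPP log above, in the order they appear.
SAMPLE = r"""
00000147          kernel   4          312       1          154
01\02\2007-13:28:02:781            [IPoIB] :__endpt_cleanup() :Leaving
MCast group

00000158          kernel   4          276       2          191
01\02\2007-13:28:02:859            [IPoIB] :__bcast_cb():[

00000221          kernel   4          320       3          146
01\02\2007-13:28:02:781            [IPoIB] :__endpt_cleanup() :Leaving
MCast group
"""

# msg number, name, pid, tid, cpu, sequence, timestamp, message (assumed layout)
ENTRY = re.compile(
    r"^(\d{8})\s+(\w+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\S+)\s+(.*)$"
)


def parse(text):
    """Parse blank-line separated entries into dictionaries."""
    entries = []
    for chunk in re.split(r"\n\s*\n", text.strip()):
        line = " ".join(part.strip() for part in chunk.splitlines())
        match = ENTRY.match(line)
        if match is None:
            continue  # skip anything that is not a log entry, e.g. "..."
        _msg_no, _name, _pid, tid, cpu, seq, stamp, message = match.groups()
        entries.append(
            {"seq": int(seq), "cpu": int(cpu), "tid": int(tid),
             "time": stamp, "msg": message}
        )
    return entries


def in_sequence_order(text):
    """Return the entries sorted by the assumed sequence-number column."""
    return sorted(parse(text), key=lambda entry: entry["seq"])


if __name__ == "__main__":
    for entry in in_sequence_order(SAMPLE):
        print(entry["seq"], entry["time"], f"cpu{entry['cpu']}", entry["msg"])
```

Sorted this way, the two "Leaving MCast group" sample entries (sequence 146 and 154) come before the __bcast_cb entry (sequence 191), which may make it easier to compare against the mthca log.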

 

 

Thanks,

Anatoly

 

-------------- next part --------------
A non-text attachment was scrubbed...
Name: mthca_wpp_flags_0x400.log
Type: application/octet-stream
Size: 29023 bytes
Desc: mthca_wpp_flags_0x400.log
URL: <http://lists.openfabrics.org/pipermail/ofw/attachments/20070102/9e5263c3/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ipoib_wpp_flag_0x122.log
Type: application/octet-stream
Size: 23777 bytes
Desc: ipoib_wpp_flag_0x122.log
URL: <http://lists.openfabrics.org/pipermail/ofw/attachments/20070102/9e5263c3/attachment-0001.obj>

