[Openib-windows] Win IBhost stop receive broadcast packets

Yossi Leybovich sleybo at dev.mellanox.co.il
Tue Jan 2 05:11:06 PST 2007


Is it reproducible? If so, how?
Does the SM reside on the switch?
 
 
I think there was an error in the join process against the SM.
In normal IPoIB operation, __bcast_cb() should be called after IPoIB issues
the query from __port_join_bcast().
 
In your log there is no call to the __bcast_cb() callback, which means that
IBAL did not get the answer and failed to generate a timeout for the query.
 
Can you collect IBAL traces so we can be sure whether the query returned?
Can you also get IB traces between the SM and IPoIB?


  _____  

From: openib-windows-bounces at openib.org
[mailto:openib-windows-bounces at openib.org] On Behalf Of Anatoly Lisenko
Sent: Tuesday, January 02, 2007 1:57 PM
To: leonid at mellanox.co.il; openib-windows at openib.org
Cc: Tzahi Oved
Subject: [Openib-windows] Win IBhost stop receive broadcast packets



Hi,

 

I saw a problem with the Windows IB host stack: rebooting the InfiniBand
switch can cause ping loss (even after the switch comes back up).

I started to investigate this anomaly and saw:

1. The IB stack doesn't receive broadcast ARP packets.

2. All other packets (unicast and multicast) are received.

3. The HCA port's RX packet counter increases each time a broadcast packet arrives.

4. It seems that the firmware drops these packets (I don't see any completions).

 

I examined the logs and saw that we somehow end up in a state where:

1. The HCA port is joined to the bcast group.

2. The IPoIB QP is detached from the bcast group.
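
A state like this can be checked mechanically from the mthca WPP log by pairing
each mlnx_attach_mcast() with a later mlnx_detach_mcast() for the same QP and
MGID. The following is only a minimal Python sketch, not part of the driver. It
assumes that the last column of each header line is a global sequence number,
and that a record may wrap over several lines as it does in this mail. The
helper names parse_records and attached_groups are hypothetical.

```python
import re

# Header line: entry number, "kernel", three numeric columns, then a final
# column that is ASSUMED here to be a global sequence number.
HEADER = re.compile(r"^(\d{8})\s+\S+\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s*$")
MCAST = re.compile(
    r"mlnx_(attach|detach)_mcast\(\)\s*:mcasth\s+(\w+),\s*"
    r"qp_p\s+(\w+),\s*mlid\s+(\w+),\s*mgid\s+(\S+)"
)


def parse_records(text):
    """Return (seq, message) pairs sorted by the assumed sequence column."""
    records = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        m = HEADER.match(line)
        if m:
            current = [int(m.group(5)), []]
            records.append(current)
        elif current is not None:
            # Continuation of a wrapped record.
            current[1].append(line)
    return sorted((seq, " ".join(parts)) for seq, parts in records)


def attached_groups(text):
    """Map (qp, mgid) -> mlid for groups still attached at the end of the log."""
    attached = {}
    for _seq, msg in parse_records(text):
        m = MCAST.search(msg)
        if not m:
            continue
        op, _handle, qp, mlid, mgid = m.groups()
        if op == "attach":
            attached[(qp, mgid)] = mlid
        else:
            attached.pop((qp, mgid), None)
    return attached


# Two records copied from the mthca WPP log in this mail.
SAMPLE = r"""
00000672          kernel   4          276       2          340
01\02\2007-13:28:02:859            mlnx_attach_mcast() :mcasth 89930EA8,
qp_p 88A56E78, mlid c0, mgid ffff1b4012ff`ffffffff00000000

00000784          kernel   4          272       3          381
01\02\2007-13:28:03:296            mlnx_detach_mcast() :mcasth 88A94DD0,
qp_p 88A56E78, mlid c0, mgid ffff1b4012ff`ffffffff00000000
"""

if __name__ == "__main__":
    # An empty result means the QP is attached to no group at the end.
    print(attached_groups(SAMPLE))
```

For the two sample records the result is empty: the QP was attached to the
broadcast MGID and later detached, with no attach after that.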

 

This is the stack backtrace of the mlnx_detach_mcast function:

f7125d10 f68ae6ab mthca!mlnx_detach_mcast+0x13
[n:\win-ibhost\trunk\hw\mthca\kernel\hca_mcast.c @ 142]

f7125d38 f68c9c30 ibbus!__cleanup_mcast+0x24b
[n:\win-ibhost\trunk\core\al\al_mcast.c @ 304]

f7125d70 f6820212 ibbus!async_destroy_cb+0x420
[n:\win-ibhost\trunk\core\al\al_common.c @ 665]

f7125d8c f6825dc2 ibbus!__cl_async_proc_worker+0x92
[n:\win-ibhost\trunk\core\complib\cl_async_proc.c @ 153]

f7125da0 f6827c3a ibbus!__cl_thread_pool_routine+0x52
[n:\win-ibhost\trunk\core\complib\cl_threadpool.c @ 67]

f7125dac 80948bb2 ibbus!__thread_callback+0x2a
[n:\win-ibhost\trunk\core\complib\kernel\cl_thread.c @ 49]

f7125ddc 8088d4d2 nt!PspSystemThreadStartup+0x2e

00000000 00000000 nt!KiThreadStartup+0x16

 

 

Mthca wpp log:

00000662          kernel   1236     600       2          312
01\02\2007-13:28:02:781            mlnx_query_ca()===>

00000663          kernel   1236     600       2          321
01\02\2007-13:28:02:781            mlnx_query_ca() :port 0 gid0:

00000664          kernel   1236     600       2          322
01\02\2007-13:28:02:781            mlnx_query_ca() :
0xfe80000000-0x08f14398095

00000665          kernel   1236     600       2          323
01\02\2007-13:28:02:781            mlnx_query_ca() :port 1 gid0:

00000666          kernel   1236     600       2          324
01\02\2007-13:28:02:781            mlnx_query_ca() :
0xfe80000000-0x08f14398096

00000667          kernel   1236     600       2          325
01\02\2007-13:28:02:781            mlnx_query_ca() :Space required 1898 used
1898

00000668          kernel   1236     600       2          326
01\02\2007-13:28:02:781            mlnx_conv_hca_cap() :Port 1 port_guid
0x8f10403980095

00000669          kernel   1236     600       2          327
01\02\2007-13:28:02:781            mlnx_conv_hca_cap() :Port 2 port_guid
0x8f10403980096

00000670          kernel   1236     600       2          328
01\02\2007-13:28:02:781            mlnx_query_ca()<===

00000671          kernel   4          276       2          339
01\02\2007-13:28:02:859            mlnx_attach_mcast()===>

00000672          kernel   4          276       2          340
01\02\2007-13:28:02:859            mlnx_attach_mcast() :mcasth 89930EA8,
qp_p 88A56E78, mlid c0, mgid ffff1b4012ff`ffffffff00000000

00000678          kernel   4          276       2          346
01\02\2007-13:28:02:859            completes with ERROR status IB_SUCCESS

00000681          kernel   4          276       2          349
01\02\2007-13:28:02:859            mlnx_enable_cq_notify()===>

00000682          kernel   4          276       2          350
01\02\2007-13:28:02:859            completes with ERROR status IB_SUCCESS

00000683          kernel   4          276       2          357
01\02\2007-13:28:02:859            mlnx_attach_mcast()===>

00000684          kernel   4          276       2          358
01\02\2007-13:28:02:859            mlnx_attach_mcast() :mcasth 898F1D68,
qp_p 88A56E78, mlid 1c0, mgid ffff1b4012ff`100000000000000

00000685          kernel   4          276       2          359
01\02\2007-13:28:02:859            completes with ERROR status IB_SUCCESS

00000686          kernel   4          276       2          362
01\02\2007-13:28:02:859            mlnx_attach_mcast()===>

00000687          kernel   4          276       2          363
01\02\2007-13:28:02:859            mlnx_attach_mcast() :mcasth 884352D8,
qp_p 88A56E78, mlid 2c0, mgid ffff051412ff`30000c280010000

00000688          kernel   4          276       2          364
01\02\2007-13:28:02:859            completes with ERROR status IB_SUCCESS

00000689          kernel   0          0          3          129
01\02\2007-13:28:02:750            mlnx_enable_cq_notify()===>

00000690          kernel   0          0          3          130
01\02\2007-13:28:02:750            completes with ERROR status IB_SUCCESS

...

00000776          kernel   0          0          3          373
01\02\2007-13:28:03:109            mlnx_enable_cq_notify()===>

00000777          kernel   0          0          3          374
01\02\2007-13:28:03:109            completes with ERROR status IB_SUCCESS

00000778          kernel   4          272       3          375
01\02\2007-13:28:03:296            mlnx_detach_mcast() :mcasth 89918F40,
qp_p 88A56E78, mlid 2c0, mgid ffff051412ff`30000c280010000

00000779          kernel   4          272       3          376
01\02\2007-13:28:03:296            completes with ERROR status IB_SUCCESS

00000780          kernel   4          272       3          377
01\02\2007-13:28:03:296            mlnx_detach_mcast() :mcasth 88DB3F00,
qp_p 88A56E78, mlid 1c0, mgid ffff1b4012ff`100000000000000

00000781          kernel   4          272       3          378
01\02\2007-13:28:03:296            completes with ERROR status IB_SUCCESS

00000782          kernel   4          272       3          379
01\02\2007-13:28:03:296            mlnx_detach_mcast() :mcasth 88A48DA0,
qp_p 88A56E78, mlid 6c0, mgid ffff051412ff`ffffa8ff00ff0000

00000783          kernel   4          272       3          380
01\02\2007-13:28:03:296            completes with ERROR status IB_SUCCESS

00000784          kernel   4          272       3          381
01\02\2007-13:28:03:296            mlnx_detach_mcast() :mcasth 88A94DD0,
qp_p 88A56E78, mlid c0, mgid ffff1b4012ff`ffffffff00000000

00000785          kernel   4          272       3          382
01\02\2007-13:28:03:296            completes with ERROR status IB_SUCCESS

00000786          kernel   4          280       1          383
01\02\2007-13:28:22:781            mlnx_query_ca()===>

00000787          kernel   4          280       1          384
01\02\2007-13:28:22:781            mlnx_query_ca() :port 0 gid0:

00000788          kernel   4          280       1          385
01\02\2007-13:28:22:781            mlnx_query_ca() :
0xfe80000000-0x08f14398095

00000789          kernel   4          280       1          386
01\02\2007-13:28:22:781            mlnx_query_ca() :port 1 gid0:

00000790          kernel   4          280       1          387
01\02\2007-13:28:22:781            mlnx_query_ca() :
0xfe80000000-0x08f14398096

00000791          kernel   4          280       1          388
01\02\2007-13:28:22:781            mlnx_query_ca() :Space required 1898 used
1898

00000792          kernel   4          280       1          389
01\02\2007-13:28:22:781            mlnx_conv_hca_cap() :Port 1 port_guid
0x8f10403980095

00000793          kernel   4          280       1          390
01\02\2007-13:28:22:781            mlnx_conv_hca_cap() :Port 2 port_guid
0x8f10403980096

00000794          kernel   4          280       1          391
01\02\2007-13:28:22:781            mlnx_query_ca()<===

 

 

Ipoib wpp log:

 

00000130          kernel   0          0          0          130
01\02\2007-13:28:01:468            [IPoIB] :ipoib_check_for_hang():]

00000131          kernel   4          280       0          133
01\02\2007-13:28:02:781            [IPoIB] :__endpt_mgr_reset_all():[

00000132          kernel   4          280       0          134
01\02\2007-13:28:02:781            [IPoIB] :__endpt_destroying():[

00000133          kernel   4          280       0          135
01\02\2007-13:28:02:781            [IPoIB] :__endpt_destroying():]

...

00000140          kernel   4          280       0          150
01\02\2007-13:28:02:781            [IPoIB] :__endpt_destroying():[

00000141          kernel   4          280       0          151
01\02\2007-13:28:02:781            [IPoIB] :__endpt_destroying():]

00000142          kernel   4          280       0          152
01\02\2007-13:28:02:781            [IPoIB] :__endpt_mgr_reset_all():]

00000143          kernel   4          280       0          160
01\02\2007-13:28:02:781            [IPoIB] :ipoib_port_down():]

00000144          kernel   4          280       1          131
01\02\2007-13:28:02:781            [IPoIB] :__ipoib_pnp_cb() :Link DOWN!

00000145          kernel   4          280       1          132
01\02\2007-13:28:02:781            [IPoIB] :ipoib_port_down():[

00000146          kernel   4          312       1          153
01\02\2007-13:28:02:781            [IPoIB] :__endpt_cleanup():[

00000147          kernel   4          312       1          154
01\02\2007-13:28:02:781            [IPoIB] :__endpt_cleanup() :Leaving MCast
group

00000148          kernel   4          312       1          164
01\02\2007-13:28:02:781            [IPoIB] :__endpt_cleanup():]

00000149          kernel   4          312       1          165
01\02\2007-13:28:02:781            [IPoIB] :__endpt_free():[

00000150          kernel   4          312       1          166
01\02\2007-13:28:02:781            [IPoIB] :__endpt_free():]

00000151          kernel   4          280       1          170
01\02\2007-13:28:02:781            [IPoIB] :ipoib_port_up():[

00000152          kernel   4          280       1          171
01\02\2007-13:28:02:781            [IPoIB] :ipoib_port_up():]

00000153          kernel   4          308       2          145
01\02\2007-13:28:02:781            [IPoIB] :__endpt_cleanup():[

00000154          kernel   4          308       2          147
01\02\2007-13:28:02:781            [IPoIB] :__endpt_cleanup() :Leaving MCast
group

00000155          kernel   4          308       2          161
01\02\2007-13:28:02:781            [IPoIB] :__endpt_cleanup():]

00000156          kernel   4          308       2          162
01\02\2007-13:28:02:781            [IPoIB] :__endpt_free():[

00000157          kernel   4          308       2          163
01\02\2007-13:28:02:781            [IPoIB] :__endpt_free():]

00000158          kernel   4          276       2          191
01\02\2007-13:28:02:859            [IPoIB] :__bcast_cb():[

00000159          kernel   4          276       2          192
01\02\2007-13:28:02:859            [IPoIB] :__endpt_mgr_add_bcast():[

00000160          kernel   4          276       2          193
01\02\2007-13:28:02:859            [IPoIB] :ipoib_endpt_create():[

00000161          kernel   4          276       2          194
01\02\2007-13:28:02:859            [IPoIB] :ipoib_endpt_create():]

00000162          kernel   4          276       2          195
01\02\2007-13:28:02:859            [IPoIB] :ipoib_endpt_set_mcast():[

00000163          kernel   4          276       2          196
01\02\2007-13:28:02:859            [IPoIB] :ipoib_endpt_set_mcast() :Create
av for MAC: 00-00-00-00-00-00

00000164          kernel   4          276       2          197
01\02\2007-13:28:02:859            [IPoIB] :__create_mcast_av():[

00000165          kernel   4          276       2          198
01\02\2007-13:28:02:859            [IPoIB] :__create_mcast_av():]

00000166          kernel   4          276       2          199
01\02\2007-13:28:02:859            [IPoIB] :ipoib_endpt_set_mcast():]

00000167          kernel   4          276       2          200
01\02\2007-13:28:02:859            [IPoIB] :__endpt_mgr_insert_locked():[

00000168          kernel   4          276       2          201
01\02\2007-13:28:02:859            [IPoIB] :__endpt_mgr_insert_locked()
:insert  :  MAC: FF-FF-FF-FF-FF-FF

00000169          kernel   4          276       2          202
01\02\2007-13:28:02:859            [IPoIB] :__endpt_mgr_insert():[

00000170          kernel   4          276       2          203
01\02\2007-13:28:02:859            [IPoIB] :__endpt_mgr_insert():]

00000171          kernel   4          276       2          204
01\02\2007-13:28:02:859            [IPoIB] :__endpt_mgr_add_bcast():]

00000172          kernel   4          276       2          205
01\02\2007-13:28:02:859            [IPoIB] :__ib_mgr_activate():[

00000173          kernel   4          276       2          206
01\02\2007-13:28:02:859            [IPoIB] :__ib_mgr_activate():]

00000174          kernel   4          276       2          207
01\02\2007-13:28:02:859            [IPoIB] :ipoib_set_active():[

00000175          kernel   4          276       2          208
01\02\2007-13:28:02:859            [IPoIB] :ipoib_port_join_mcast():[

00000176          kernel   4          276       2          209
01\02\2007-13:28:02:859            [IPoIB] :__endpt_mgr_ref():[

00000177          kernel   4          276       2          210
01\02\2007-13:28:02:859            [IPoIB] :__endpt_mgr_ref() :Look for :
MAC: 01-00-5E-00-00-01

00000178          kernel   4          276       2          211
01\02\2007-13:28:02:859            [IPoIB] :__endpt_mgr_ref() :Failed
endpoint lookup.[IpoIB] :__endpt_mgr_ref():]

00000179          kernel   4          276       2          212
01\02\2007-13:28:02:859            [IPoIB] :ipoib_endpt_create():[

00000180          kernel   4          276       2          213
01\02\2007-13:28:02:859            [IPoIB] :ipoib_endpt_create():]

00000181          kernel   4          276       2          214
01\02\2007-13:28:02:859            [IPoIB] :__endpt_mgr_insert_locked():[

00000182          kernel   4          276       2          215
01\02\2007-13:28:02:859            [IPoIB] :__endpt_mgr_insert_locked()
:insert  :  MAC: 01-00-5E-00-00-01

00000183          kernel   4          276       2          216
01\02\2007-13:28:02:859            [IPoIB] :__endpt_mgr_insert():[

00000184          kernel   4          276       2          217
01\02\2007-13:28:02:859            [IPoIB] :__endpt_mgr_insert():]

00000185          kernel   4          276       2          218
01\02\2007-13:28:02:859            [IPoIB] :ipoib_port_join_mcast():]

00000186          kernel   4          276       2          219
01\02\2007-13:28:02:859            [IPoIB] :ipoib_port_join_mcast():[

00000187          kernel   4          276       2          220
01\02\2007-13:28:02:859            [IPoIB] :__endpt_mgr_ref():[

00000188          kernel   4          276       2          221
01\02\2007-13:28:02:859            [IPoIB] :__endpt_mgr_ref() :Look for :
MAC: 01-80-C2-00-00-03

00000189          kernel   4          276       2          222
01\02\2007-13:28:02:859            [IPoIB] :__endpt_mgr_ref() :Failed
endpoint lookup.[IpoIB] :__endpt_mgr_ref():]

00000190          kernel   4          276       2          223
01\02\2007-13:28:02:859            [IPoIB] :ipoib_endpt_create():[

00000191          kernel   4          276       2          224
01\02\2007-13:28:02:859            [IPoIB] :ipoib_endpt_create():]

00000192          kernel   4          276       2          225
01\02\2007-13:28:02:859            [IPoIB] :__endpt_mgr_insert_locked():[

00000193          kernel   4          276       2          226
01\02\2007-13:28:02:859            [IPoIB] :__endpt_mgr_insert_locked()
:insert  :  MAC: 01-80-C2-00-00-03

00000194          kernel   4          276       2          227
01\02\2007-13:28:02:859            [IPoIB] :__endpt_mgr_insert():[

00000195          kernel   4          276       2          228
01\02\2007-13:28:02:859            [IPoIB] :__endpt_mgr_insert():]

00000196          kernel   4          276       2          229
01\02\2007-13:28:02:859            [IPoIB] :ipoib_port_join_mcast():]

00000197          kernel   4          276       2          230
01\02\2007-13:28:02:859            [IPoIB] :ipoib_resume_oids():[

00000198          kernel   4          276       2          231
01\02\2007-13:28:02:859            [IPoIB] :ipoib_resume_oids():]

00000199          kernel   4          276       2          232
01\02\2007-13:28:02:859            [IPoIB] :ipoib_set_active() :Link UP!

00000200          kernel   4          276       2          233
01\02\2007-13:28:02:859            [IPoIB] :ipoib_set_active():]

00000201          kernel   4          276       2          234
01\02\2007-13:28:02:859            [IPoIB] :__bcast_cb():]

00000202          kernel   4          276       2          235
01\02\2007-13:28:02:859            [IPoIB] :__mcast_cb():[

00000203          kernel   4          276       2          236
01\02\2007-13:28:02:859            [IPoIB] :ipoib_endpt_set_mcast():[

00000204          kernel   4          276       2          237
01\02\2007-13:28:02:859            [IPoIB] :ipoib_endpt_set_mcast() :Create
av for MAC: 01-00-5E-00-00-01

00000205          kernel   4          276       2          238
01\02\2007-13:28:02:859            [IPoIB] :__create_mcast_av():[

00000206          kernel   4          276       2          239
01\02\2007-13:28:02:859            [IPoIB] :__create_mcast_av():]

00000207          kernel   4          276       2          240
01\02\2007-13:28:02:859            [IPoIB] :ipoib_endpt_set_mcast():]

00000208          kernel   4          276       2          241
01\02\2007-13:28:02:859            [IPoIB] :__mcast_cb():]

00000209          kernel   4          276       2          242
01\02\2007-13:28:02:859            [IPoIB] :__mcast_cb():[

00000210          kernel   4          276       2          243
01\02\2007-13:28:02:859            [IPoIB] :ipoib_endpt_set_mcast():[

00000211          kernel   4          276       2          244
01\02\2007-13:28:02:859            [IPoIB] :ipoib_endpt_set_mcast() :Create
av for MAC: 01-80-C2-00-00-03

00000212          kernel   4          276       2          245
01\02\2007-13:28:02:859            [IPoIB] :__create_mcast_av():[

00000213          kernel   4          276       2          246
01\02\2007-13:28:02:859            [IPoIB] :__create_mcast_av():]

00000214          kernel   4          276       2          247
01\02\2007-13:28:02:859            [IPoIB] :ipoib_endpt_set_mcast():]

00000215          kernel   4          276       2          248
01\02\2007-13:28:02:859            [IPoIB] :__mcast_cb():]

00000216          kernel   4          320       3          136
01\02\2007-13:28:02:781            [IPoIB] :__endpt_cleanup():[

00000217          kernel   4          320       3          139
01\02\2007-13:28:02:781            [IPoIB] :__endpt_cleanup():]

00000218          kernel   4          320       3          141
01\02\2007-13:28:02:781            [IPoIB] :__endpt_free():[

00000219          kernel   4          320       3          143
01\02\2007-13:28:02:781            [IPoIB] :__endpt_free():]

00000220          kernel   4          320       3          144
01\02\2007-13:28:02:781            [IPoIB] :__endpt_cleanup():[

00000221          kernel   4          320       3          146
01\02\2007-13:28:02:781            [IPoIB] :__endpt_cleanup() :Leaving MCast
group

00000222          kernel   4          320       3          155
01\02\2007-13:28:02:781            [IPoIB] :__endpt_cleanup():]

00000223          kernel   4          320       3          156
01\02\2007-13:28:02:781            [IPoIB] :__endpt_free():[

00000224          kernel   4          320       3          157
01\02\2007-13:28:02:781            [IPoIB] :__endpt_free():]

00000225          kernel   4          320       3          158
01\02\2007-13:28:02:781            [IPoIB] :__endpt_cleanup():[

00000226          kernel   4          320       3          159
01\02\2007-13:28:02:781            [IPoIB] :__endpt_cleanup() :Leaving MCast
group

00000227          kernel   4          320       3          167
01\02\2007-13:28:02:781            [IPoIB] :__endpt_cleanup():]

00000228          kernel   4          320       3          168
01\02\2007-13:28:02:781            [IPoIB] :__endpt_free():[

00000229          kernel   4          320       3          169
01\02\2007-13:28:02:781            [IPoIB] :__endpt_free():]

00000230          kernel   0          0          3          172
01\02\2007-13:28:02:781            [IPoIB] :__port_info_cb():[

00000231          kernel   0          0          3          173
01\02\2007-13:28:02:781            [IPoIB] :__endpt_mgr_add_local():[

00000232          kernel   0          0          3          174
01\02\2007-13:28:02:781            [IPoIB] :ipoib_endpt_create():[

00000233          kernel   0          0          3          175
01\02\2007-13:28:02:781            [IPoIB] :ipoib_endpt_create():]

00000234          kernel   0          0          3          176
01\02\2007-13:28:02:781            [IPoIB] :__endpt_mgr_insert():[

00000235          kernel   0          0          3          177
01\02\2007-13:28:02:781            [IPoIB] :__endpt_mgr_insert():]

00000236          kernel   0          0          3          178
01\02\2007-13:28:02:781            [IPoIB] :__endpt_mgr_add_local():]

00000237          kernel   0          0          3          179
01\02\2007-13:28:02:781            [IPoIB] :__port_info_cb() :Received port
info: link width = 2.

00000238          kernel   0          0          3          180
01\02\2007-13:28:02:781            [IPoIB] :ipoib_set_rate():[

00000239          kernel   0          0          3          181
01\02\2007-13:28:02:781            [IPoIB] :ipoib_set_rate() :Link speed is
2.5Gs

00000240          kernel   0          0          3          182
01\02\2007-13:28:02:781            [IPoIB] :ipoib_set_rate() :Link width is
4X

00000241          kernel   0          0          3          183
01\02\2007-13:28:02:781            [IPoIB] :ipoib_set_rate():]

00000242          kernel   0          0          3          184
01\02\2007-13:28:02:781            [IPoIB] :__port_get_bcast():[

00000243          kernel   0          0          3          185
01\02\2007-13:28:02:781            [IPoIB] :__port_get_bcast():]

00000244          kernel   0          0          3          186
01\02\2007-13:28:02:781            [IPoIB] :__port_info_cb():]

00000245          kernel   2624     2732     3          187
01\02\2007-13:28:02:781            [IPoIB] :__bcast_get_cb():[

00000246          kernel   2624     2732     3          188
01\02\2007-13:28:02:781            [IPoIB] :__port_join_bcast():[

00000247          kernel   2624     2732     3          189
01\02\2007-13:28:02:781            [IPoIB] :__port_join_bcast():]

00000248          kernel   2624     2732     3          190
01\02\2007-13:28:02:781            [IPoIB] :__bcast_get_cb():]

00000249          kernel   0          0          3          249
01\02\2007-13:28:03:109            [IPoIB] :__endpt_mgr_get_by_gid():[

00000250          kernel   0          0          3          250
01\02\2007-13:28:03:109            [IPoIB] :__endpt_mgr_get_by_gid():]

...

00000339          kernel   0          0          3          339
01\02\2007-13:28:03:109            [IPoIB] :__endpt_mgr_get_by_gid():[

00000340          kernel   0          0          3          340
01\02\2007-13:28:03:109            [IPoIB] :__endpt_mgr_get_by_gid():]

00000341          kernel   0          0          0          341
01\02\2007-13:28:03:468            [IPoIB] :ipoib_check_for_hang():[

 

 

Thanks,

Anatoly

 
