[openib-general] slab error in kmem_cache_destroy(): cache `ib_mad': Can't free all objects

Hal Rosenstock halr at voltaire.com
Mon May 2 05:14:19 PDT 2005


On Mon, 2005-05-02 at 04:40, Michael S. Tsirkin wrote:
> Hi!
> I have this script to unload all modules:
> 
> killall opensm
> sleep 3
> killall -9 opensm
> modprobe -r ib_ipoib
> modprobe -r ib_umad
> modprobe -r ib_mthca
> 
> So, I try to unload the modules while opensm may still be dying.
> Every now and then I see this crash (below), which seems to indicate
> a race condition or leak somewhere around ib_mad or ib_umad.
> My guess is a mad may still be outstanding.
> 
> Ideas, anyone?
> 
> Further, I'm looking at the mad agent logic and it seems a bit weird that
> ./core/agent.c does kmem_cache_free in agent_send_handler,
> while all allocs are in mad.c.

This has been brought up before on the list. I think much of this could
(and should) be rewritten now using Sean's helper functions.

> What prevents an agent from deregistering while a send is outstanding? 
> What'll free the mad_priv then?

Nothing. I think there is some missing code in ib_agent_port_close to
handle this scenario.
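
To make the missing step concrete, here is a rough userspace sketch of what
"drain on close" could look like. It is only a model under stated assumptions:
the struct and function names (pending_send, agent_port, agent_post_send,
agent_send_done, agent_port_close) are hypothetical and are not the ones in
agent.c or mad.c, malloc/free stand in for the ib_mad slab cache, and there is
no locking.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical userspace model; none of these names come from agent.c or mad.c. */
struct pending_send {
    struct pending_send *next;
    void *mad_priv;                  /* stands in for a buffer from the ib_mad cache */
};

struct agent_port {
    struct pending_send *send_list;  /* sends posted but not yet completed */
    int live_objects;                /* buffers allocated and not yet freed */
};

/* Post a send: allocate the buffer and track it until it completes. */
int agent_post_send(struct agent_port *port)
{
    struct pending_send *s = malloc(sizeof(*s));

    if (!s)
        return -1;
    s->mad_priv = malloc(256);
    if (!s->mad_priv) {
        free(s);
        return -1;
    }
    s->next = port->send_list;
    port->send_list = s;
    port->live_objects++;
    return 0;
}

/* Send completion: release the entry at the head of the list, if any. */
void agent_send_done(struct agent_port *port)
{
    struct pending_send *s = port->send_list;

    if (!s)
        return;
    port->send_list = s->next;
    free(s->mad_priv);
    free(s);
    port->live_objects--;
}

/* Port close: release every send that never completed.
 * Returns the number of entries that had to be drained. */
int agent_port_close(struct agent_port *port)
{
    int drained = 0;

    while (port->send_list) {
        agent_send_done(port);
        drained++;
    }
    /* After the drain nothing may be left allocated. */
    assert(port->live_objects == 0);
    return drained;
}
```

The point of the sketch is only that close walks the list of outstanding
sends and frees what the completion handler never got to, so nothing stays
allocated once the port is gone.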

However, unless a MAD from the SM was directed locally and was still
pending, that would not explain why the ib_mad cache could not be
destroyed. I will see if I can recreate this and work up a patch.
 
> log dump below.
> 
> Thanks,
> 
> MST
> 
> This is with 2.6.11 + rev 2235 (latest bits as of now), x86_64 (Intel Nocona).
> 
> 
> May  2 10:36:12 swlab156 kernel: slab error in kmem_cache_destroy(): cache `ib_mad': Can't free all objects
> May  2 10:36:12 swlab156 kernel: 
> May  2 10:36:12 swlab156 kernel: Call Trace:<ffffffff801592af>{kmem_cache_destroy+184} <ffffffff88010714>{:ib_mad:ib_mad_cleanup_module+28} 
> May  2 10:36:12 swlab156 kernel:        <ffffffff8014c044>{sys_delete_module+487} <ffffffff8022991c>{__up_write+28} 
> May  2 10:36:12 swlab156 kernel:        <ffffffff80162952>{sys_munmap+74} <ffffffff8010e0d2>{system_call+126} 
> May  2 10:36:12 swlab156 kernel:        
> May  2 10:36:12 swlab156 kernel: ib_mad: Failed to destroy ib_mad cache

Has this been occurring for a while, or is this new (with the recent
changes to mad handling)?

> Any attempt to load ib_mad after that fails:
> 
> 
> May  2 10:36:25 swlab156 kernel: kmem_cache_create: duplicate cache ib_mad
> May  2 10:36:25 swlab156 kernel: ----------- [cut here ] --------- [please bite here ] ---------
> May  2 10:36:25 swlab156 kernel: Kernel BUG at slab:1472
> May  2 10:36:25 swlab156 kernel: invalid operand: 0000 [1] SMP 
> May  2 10:36:25 swlab156 kernel: CPU 1 
> May  2 10:36:25 swlab156 kernel: Modules linked in: ib_mad ib_core
> May  2 10:36:25 swlab156 kernel: Pid: 14102, comm: modprobe Not tainted 2.6.11-openib
> May  2 10:36:25 swlab156 kernel: RIP: 0010:[kmem_cache_create+1384/1539] <ffffffff801598b6>{kmem_cache_create+1384}
> May  2 10:36:25 swlab156 kernel: RIP: 0010:[<ffffffff801598b6>] <ffffffff801598b6>{kmem_cache_create+1384}
> May  2 10:36:25 swlab156 kernel: RSP: 0018:ffff81015d8c7ee8  EFLAGS: 00010202
> May  2 10:36:25 swlab156 kernel: RAX: 000000000000002a RBX: ffff81015fd69670 RCX: ffffffff804572a8
> May  2 10:36:25 swlab156 kernel: RDX: ffffffff804572a8 RSI: 0000000000000296 RDI: ffffffff8055f0c0
> May  2 10:36:25 swlab156 kernel: RBP: ffff81015fd69480 R08: ffff81015e0976c0 R09: 0000000000000000
> May  2 10:36:25 swlab156 kernel: R10: 0000000000000000 R11: 0000000000000080 R12: ffffffff8055f0c0
> May  2 10:36:25 swlab156 kernel: R13: 0000000000002000 R14: ffff810000000000 R15: 0000000000000080
> May  2 10:36:25 swlab156 kernel: FS:  00002aaaaade26e0(0000) GS:ffffffff80583180(0000) knlGS:0000000000000000
> May  2 10:36:25 swlab156 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> May  2 10:36:25 swlab156 kernel: CR2: 00002aaaaaacc000 CR3: 000000013f67a000 CR4: 00000000000006e0
> May  2 10:36:25 swlab156 kernel: Process modprobe (pid: 14102, threadinfo ffff81015d8c6000, task ffff81015dcc57f0)
> May  2 10:36:25 swlab156 kernel: Stack: ffffffffffffff80 0000000000000000 0000000000000000 ffffffff88010951 
> May  2 10:36:25 swlab156 kernel:        0000000000000180 ffffffff8045a000 ffffffff88013000 ffffffff80459fc0 
> May  2 10:36:25 swlab156 kernel:        ffffffff80459fc0 00007ffffffff408 
> May  2 10:36:25 swlab156 kernel: Call Trace:<ffffffff88015033>{:ib_mad:ib_mad_init_module+51} <ffffffff8014ba19>{sys_init_module+298} 
> May  2 10:36:25 swlab156 kernel:        <ffffffff8010e0d2>{system_call+126} 
> May  2 10:36:25 swlab156 kernel: 
> May  2 10:36:25 swlab156 kernel: Code: 0f 0b e5 be 3e 80 ff ff ff ff c0 05 48 8b 1b 48 8b 03 0f 18 
> May  2 10:36:25 swlab156 kernel: RIP <ffffffff801598b6>{kmem_cache_create+1384} RSP <ffff81015d8c7ee8>

That is because the destruction didn't work. I'm not sure it should be
expected to. (This is a second-level issue which will go away when the
first-level one is fixed.)
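
A toy model of the two-step failure, in case it helps. This is a userspace
sketch, not the slab allocator: the toy_cache_* names and the fixed-size
registry are made up for illustration, and it assumes (as the logs above
suggest) that a failed destroy leaves the cache name registered. Where the
model returns NULL on a duplicate name, the kernel in the log hits a BUG
instead.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_CACHES 8

/* Hypothetical stand-in for a named slab cache. */
struct toy_cache {
    const char *name;
    int in_use;         /* nonzero while the name is registered */
    int live_objects;   /* objects allocated and not yet freed */
};

static struct toy_cache registry[MAX_CACHES];

/* Register a cache; refuse a name that is already registered. */
struct toy_cache *toy_cache_create(const char *name)
{
    struct toy_cache *free_slot = NULL;
    int i;

    for (i = 0; i < MAX_CACHES; i++) {
        if (registry[i].in_use) {
            if (strcmp(registry[i].name, name) == 0)
                return NULL;    /* duplicate cache name */
        } else if (!free_slot) {
            free_slot = &registry[i];
        }
    }
    if (!free_slot)
        return NULL;            /* registry full */
    free_slot->name = name;
    free_slot->in_use = 1;
    free_slot->live_objects = 0;
    return free_slot;
}

void toy_cache_alloc(struct toy_cache *c)
{
    c->live_objects++;
}

void toy_cache_free(struct toy_cache *c)
{
    assert(c->live_objects > 0);
    c->live_objects--;
}

/* Destroy fails while objects are outstanding, and the name stays registered. */
int toy_cache_destroy(struct toy_cache *c)
{
    if (c->live_objects != 0)
        return -1;              /* "Can't free all objects" */
    c->in_use = 0;
    return 0;
}
```

With one object still outstanding, toy_cache_destroy() fails and a second
toy_cache_create("ib_mad") is refused. Once the object is freed, destroy
succeeds and the name can be reused, which is why fixing the leak should make
the reload problem disappear as well.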

-- Hal



