[openib-general] slab error in kmem_cache_destroy(): cache `ib_mad': Can't free all objects
Hal Rosenstock
halr at voltaire.com
Mon May 2 05:14:19 PDT 2005
On Mon, 2005-05-02 at 04:40, Michael S. Tsirkin wrote:
> Hi!
> I have this script to unload all modules:
>
> killall opensm
> sleep 3
> killall -9 opensm
> modprobe -r ib_ipoib
> modprobe -r ib_umad
> modprobe -r ib_mthca
>
> So, I try to unload the modules while opensm may still be dying.
> Every now and then I see this crash (below), which seems to indicate
> a race condition or leak somewhere around ib_mad or ib_umad.
> My guess is a mad may still be outstanding.
>
> Ideas, anyone?
>
> Further, I'm looking at the mad agent logic and it seems a bit weird that
> ./core/agent.c does kmem_cache_free in agent_send_handler,
> while all allocs are in mad.c.
This has been brought up before on the list. I think much of this could
(and should) be rewritten now using Sean's helper functions.
> What prevents an agent from deregistering while a send is outstanding?
> What'll free the mad_priv then?
Nothing. I think there is some missing code in ib_agent_port_close to
handle this scenario.
However, that would only explain the ib_mad cache failing to be
destroyed if a MAD from the SM had been directed locally and was still
pending. I will see if I can recreate this and work up a patch.
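
To make the scenario concrete, here is a minimal userspace sketch of the
idea, not the actual ib_mad code. Every name in it (toy_cache, toy_agent,
toy_post_send, toy_agent_close and so on) is hypothetical, and the "cache"
is only a counter of live objects. It shows why a close path that does not
reclaim pending sends leaves the cache impossible to destroy.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy stand-in for a slab cache: it only counts live objects. */
struct toy_cache {
    size_t live;
};

/* Toy stand-in for a MAD buffer sitting on an agent's send list. */
struct toy_mad {
    struct toy_mad *next;
};

/* Hypothetical agent that remembers every send it has not completed. */
struct toy_agent {
    struct toy_cache *cache;
    struct toy_mad *pending;
};

/* Allocate a MAD from the cache and queue it as an outstanding send. */
struct toy_mad *toy_post_send(struct toy_agent *agent)
{
    struct toy_mad *mad = malloc(sizeof(*mad));

    if (!mad)
        return NULL;
    mad->next = agent->pending;
    agent->pending = mad;
    agent->cache->live++;
    return mad;
}

/* Completion path: unlink one MAD from the pending list and free it. */
void toy_send_done(struct toy_agent *agent, struct toy_mad *mad)
{
    struct toy_mad **pp = &agent->pending;

    while (*pp && *pp != mad)
        pp = &(*pp)->next;
    if (!*pp)
        return; /* not on this agent's list: nothing to do */
    assert(agent->cache->live > 0);
    *pp = mad->next;
    free(mad);
    agent->cache->live--;
}

/* Close path that drains pending sends; returns how many it reclaimed. */
size_t toy_agent_close(struct toy_agent *agent)
{
    size_t reclaimed = 0;

    while (agent->pending) {
        toy_send_done(agent, agent->pending);
        reclaimed++;
    }
    return reclaimed;
}

/* Like the real slab destroy, this succeeds only if nothing is live. */
bool toy_cache_destroy(const struct toy_cache *cache)
{
    return cache->live == 0;
}
```

In this model, posting two sends and completing only one leaves
toy_cache_destroy() returning false until toy_agent_close() has reclaimed
the remaining one. Whether the real fix belongs in ib_agent_port_close or
elsewhere is a separate question.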
> log dump below.
>
> Thanks,
>
> MST
>
> This is with 2.6.11 + rev 2235 (latest bits as of now), x86_64 (Intel Nocona).
>
>
> May 2 10:36:12 swlab156 kernel: slab error in kmem_cache_destroy(): cache `ib_mad': Can't free all objects
> May 2 10:36:12 swlab156 kernel:
> May 2 10:36:12 swlab156 kernel: Call Trace:<ffffffff801592af>{kmem_cache_destroy+184} <ffffffff88010714>{:ib_mad:ib_mad_cleanup_module+28}
> May 2 10:36:12 swlab156 kernel: <ffffffff8014c044>{sys_delete_module+487} <ffffffff8022991c>{__up_write+28}
> May 2 10:36:12 swlab156 kernel: <ffffffff80162952>{sys_munmap+74} <ffffffff8010e0d2>{system_call+126}
> May 2 10:36:12 swlab156 kernel:
> May 2 10:36:12 swlab156 kernel: ib_mad: Failed to destroy ib_mad cache
Has this been occurring for a while, or is this new (with the recent
changes to mad handling)?
> Any attempt to load ib_mad after that fails:
>
>
> May 2 10:36:25 swlab156 kernel: kmem_cache_create: duplicate cache ib_mad
> May 2 10:36:25 swlab156 kernel: ----------- [cut here ] --------- [please bite here ] ---------
> May 2 10:36:25 swlab156 kernel: Kernel BUG at slab:1472
> May 2 10:36:25 swlab156 kernel: invalid operand: 0000 [1] SMP
> May 2 10:36:25 swlab156 kernel: CPU 1
> May 2 10:36:25 swlab156 kernel: Modules linked in: ib_mad ib_core
> May 2 10:36:25 swlab156 kernel: Pid: 14102, comm: modprobe Not tainted 2.6.11-openib
> May 2 10:36:25 swlab156 kernel: RIP: 0010:[kmem_cache_create+1384/1539] <ffffffff801598b6>{kmem_cache_create+1384}
> May 2 10:36:25 swlab156 kernel: RIP: 0010:[<ffffffff801598b6>] <ffffffff801598b6>{kmem_cache_create+1384}
> May 2 10:36:25 swlab156 kernel: RSP: 0018:ffff81015d8c7ee8 EFLAGS: 00010202
> May 2 10:36:25 swlab156 kernel: RAX: 000000000000002a RBX: ffff81015fd69670 RCX: ffffffff804572a8
> May 2 10:36:25 swlab156 kernel: RDX: ffffffff804572a8 RSI: 0000000000000296 RDI: ffffffff8055f0c0
> May 2 10:36:25 swlab156 kernel: RBP: ffff81015fd69480 R08: ffff81015e0976c0 R09: 0000000000000000
> May 2 10:36:25 swlab156 kernel: R10: 0000000000000000 R11: 0000000000000080 R12: ffffffff8055f0c0
> May 2 10:36:25 swlab156 kernel: R13: 0000000000002000 R14: ffff810000000000 R15: 0000000000000080
> May 2 10:36:25 swlab156 kernel: FS: 00002aaaaade26e0(0000) GS:ffffffff80583180(0000) knlGS:0000000000000000
> May 2 10:36:25 swlab156 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> May 2 10:36:25 swlab156 kernel: CR2: 00002aaaaaacc000 CR3: 000000013f67a000 CR4: 00000000000006e0
> May 2 10:36:25 swlab156 kernel: Process modprobe (pid: 14102, threadinfo ffff81015d8c6000, task ffff81015dcc57f0)
> May 2 10:36:25 swlab156 kernel: Stack: ffffffffffffff80 0000000000000000 0000000000000000 ffffffff88010951
> May 2 10:36:25 swlab156 kernel: 0000000000000180 ffffffff8045a000 ffffffff88013000 ffffffff80459fc0
> May 2 10:36:25 swlab156 kernel: ffffffff80459fc0 00007ffffffff408
> May 2 10:36:25 swlab156 kernel: Call Trace:<ffffffff88015033>{:ib_mad:ib_mad_init_module+51} <ffffffff8014ba19>{sys_init_module+298}
> May 2 10:36:25 swlab156 kernel: <ffffffff8010e0d2>{system_call+126}
> May 2 10:36:25 swlab156 kernel:
> May 2 10:36:25 swlab156 kernel: Code: 0f 0b e5 be 3e 80 ff ff ff ff c0 05 48 8b 1b 48 8b 03 0f 18
> May 2 10:36:25 swlab156 kernel: RIP <ffffffff801598b6>{kmem_cache_create+1384} RSP <ffff81015d8c7ee8>
That is because the earlier destruction failed, so the old ib_mad cache
is still registered under that name. I'm not sure a reload should be
expected to work in that state. (This is a second-level issue which will
go away when the first-level one is fixed.)
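
As a rough illustration of that second-level failure, here is a userspace
sketch under stated assumptions. The registry, toy_named_create and
toy_named_destroy are hypothetical names, and returning NULL stands in for
the BUG the kernel hits in the log above. A cache whose destroy failed
keeps its name, so creating it again is rejected.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX_CACHES 8

/* Toy named cache: a name, a count of live objects, and a slot flag. */
struct toy_named_cache {
    const char *name;
    size_t live;
    bool in_use;
};

/* Hypothetical global registry, standing in for the kernel's cache list. */
static struct toy_named_cache toy_registry[TOY_MAX_CACHES];

/* Create a cache; returns NULL if the name is taken or no slot is free. */
struct toy_named_cache *toy_named_create(const char *name)
{
    struct toy_named_cache *slot = NULL;

    assert(name != NULL);
    for (size_t i = 0; i < TOY_MAX_CACHES; i++) {
        struct toy_named_cache *c = &toy_registry[i];

        if (c->in_use && strcmp(c->name, name) == 0)
            return NULL; /* duplicate cache name */
        if (!c->in_use && !slot)
            slot = c;
    }
    if (!slot)
        return NULL;
    slot->name = name;
    slot->live = 0;
    slot->in_use = true;
    return slot;
}

/* Destroy a cache; fails with -1 and keeps the name if objects are live. */
int toy_named_destroy(struct toy_named_cache *cache)
{
    if (cache->live != 0)
        return -1;
    cache->in_use = false;
    return 0;
}
```

With one object still live, toy_named_destroy() fails and a second
toy_named_create("ib_mad") returns NULL. Once the object is freed and the
destroy succeeds, the name can be reused.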
-- Hal