[ewg] possible bug in rds?
Eli Cohen
eli at dev.mellanox.co.il
Wed Mar 10 08:13:01 PST 2010
Hi Andy,
in our regression tests we've encountered a kernel oops with the
following stack dump:
<start quote>
Call trace:
Mar 1 05:45:50 sw134 kernel: mlx4_en: eth2: Link Down
Mar 1 05:46:00 sw134 kernel: mlx4_en: eth2: Link Up
Mar 1 05:46:00 sw134 kernel: ADDRCONF(NETDEV_CHANGE): eth2: link becomes ready
Mar 1 05:46:01 sw134 /usr/sbin/cron[16940]: (root) CMD (/mswg/projects/test_suite2/etc/check_daemon.csh >/dev/null)
Mar 1 05:46:01 sw134 /usr/sbin/cron[16941]: (root) CMD (/usr/check_mswg.csh >/dev/null)
Mar 1 05:46:01 sw134 /usr/sbin/cron[16942]: (root) CMD (/.autodirect/LIT/CRONTABS/do_it_now.sh > /dev/null)
Mar 1 05:46:03 sw134 kernel: Unable to handle kernel paging request at 0000000000200200 RIP:
Mar 1 05:46:03 sw134 kernel: <ffffffff88427b8e>{:rdma_cm:rdma_destroy_id+399}
Mar 1 05:46:03 sw134 kernel: PGD 0
Mar 1 05:46:03 sw134 kernel: Oops: 0002 [1] SMP
Mar 1 05:46:03 sw134 kernel: last sysfs file: /class/infiniband/mlx4_0/ports/1/gids/127
Mar 1 05:46:03 sw134 kernel: CPU 0
Mar 1 05:46:03 sw134 kernel: Modules linked in: 8021q mst_pciconf mst_pci rdma_ucm rds_tcp rds_rdma rds ib_ucm ib_sdp rdma_cm iw_cm ib_addr ib_cm ib_sa ib_uverbs ib_umad mlx4_en mlx4_core ib_mad ib_core memtrack autofs4 cpufreq_ondemand cpufreq_userspace cpufreq_powersave powernow_k8 freq_table nfs lockd nfs_acl sunrpc ipv6 af_packet dock button battery ac apparmor nls_iso8859_1 nls_cp437 vfat fat loop dm_mod ohci_hcd ide_cd cdrom generic ehci_hcd shpchp pci_hotplug i2c_piix4 i2c_core usbcore mptctl tg3 floppy ext3 jbd edd fan thermal processor mptsas mptscsih sg mptbase scsi_transport_sas sata_svw libata serverworks sd_mod scsi_mod ide_disk ide_core
Mar 1 05:46:03 sw134 kernel: Pid: 15000, comm: krdsd Tainted: GU 2.6.16.60-0.54.5-smp #1
Mar 1 05:46:03 sw134 kernel: RIP: 0010:[<ffffffff88427b8e>] <ffffffff88427b8e>{:rdma_cm:rdma_destroy_id+399}
Mar 1 05:46:03 sw134 kernel: RSP: 0018:ffff81000dad7dd8 EFLAGS: 00010206
Mar 1 05:46:03 sw134 kernel: RAX: 0000000000100100 RBX: ffff81012d2ba740 RCX: 0000000000200200
Mar 1 05:46:03 sw134 kernel: RDX: ffff81010ee445b8 RSI: ffff8101248c0048 RDI: ffff81012bdaf800
Mar 1 05:46:03 sw134 kernel: RBP: ffff81010ee44400 R08: 0000000000000000 R09: 0000000000000000
Mar 1 05:46:03 sw134 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff8101248c0048
Mar 1 05:46:03 sw134 kernel: R13: ffff8101248c0290 R14: ffffffff8846ca40 R15: 0000000000000000
Mar 1 05:46:03 sw134 kernel: FS: 00002b2f96622ae0(0000) GS:ffffffff803dc000(0000) knlGS:0000000000000000
Mar 1 05:46:03 sw134 kernel: CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
Mar 1 05:46:03 sw134 kernel: CR2: 0000000000200200 CR3: 0000000000101000 CR4: 00000000000006e0
Mar 1 05:46:03 sw134 kernel: Process krdsd (pid: 15000, threadinfo ffff81000dad6000, task ffff810126bf5040)
Mar 1 05:46:03 sw134 kernel: Stack: ffff81012bdaf800 ffff81012bdaf800 ffff81010e7fa000 ffffffff884891f4
Mar 1 05:46:03 sw134 kernel: ffff81000100f700 0000230b363c2cf0 000000000002220f 000000000f5eb200
Mar 1 05:46:03 sw134 kernel: ffff8101248c0048 ffffffff802f0652
Mar 1 05:46:03 sw134 kernel: Call Trace: <ffffffff884891f4>{:rds_rdma:rds_ib_conn_shutdown+477}
Mar 1 05:46:03 sw134 kernel: <ffffffff802f0652>{mutex_lock+13} <ffffffff8846cae3>{:rds:rds_shutdown_worker+163}
Mar 1 05:46:04 sw134 kernel: <ffffffff80144e8e>{run_workqueue+139} <ffffffff8014559c>{worker_thread+0}
Mar 1 05:46:04 sw134 kernel: <ffffffff80148525>{keventd_create_kthread+0} <ffffffff80145690>{worker_thread+244}
Mar 1 05:46:04 sw134 kernel: <ffffffff8012cf89>{default_wake_function+0} <ffffffff801487ed>{kthread+236}
Mar 1 05:46:04 sw134 kernel: <ffffffff8010bea6>{child_rip+8} <ffffffff80148525>{keventd_create_kthread+0}
Mar 1 05:46:04 sw134 kernel: <ffffffff80148701>{kthread+0} <ffffffff8010be9e>{child_rip+0}
Mar 1 05:46:04 sw134 kernel:
Mar 1 05:46:04 sw134 kernel: Code: 48 89 01 74 04 48 89 48 08 48 c7 85 b8 01 00 00 00 01 10 00
Mar 1 05:46:04 sw134 kernel: RIP <ffffffff88427b8e>{:rdma_cm:rdma_destroy_id+399} RSP <ffff81000dad7dd8>
Mar 1 05:46:04 sw134 kernel: CR2: 0000000000200200
Mar 1 05:46:09 sw134 kernel: <6>mlx4_en: eth2: Link Down
Mar 1 05:46:20 sw134 kernel: mlx4_en: eth2: Link Up
Mar 1 05:46:20 sw134 kernel: ADDRCONF(NETDEV_CHANGE): eth2: link becomes ready
Mar 1 05:46:20 sw134 kernel: mlx4_en: eth2: Link Down
Mar 1 05:46:20 sw134 kernel: mlx4_en: eth2: Link Up
Mar 1 05:46:21 sw134 kernel: ADDRCONF(NETDEV_CHANGE): eth2: link becomes ready
<end quote>
Examining the dump, I see that the failure results from calling
hlist_del() twice on the same node (I can see that from the poisoned
pointer in RCX: 0000000000200200, which is also the faulting address
in CR2).
Could it be that rds calls rdma_destroy_id() in a way that results in
the described behaviour, for example twice on the same id?
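
To illustrate the reasoning above, here is a minimal user-space sketch of
how hlist_del() poisons a node. It is a model, not the kernel source: the
struct layout and poison constants are assumed to match the RAX and RCX
values in the dump, and hlist_del_checked() is a hypothetical helper that
returns -1 at the point where the real hlist_del() would write through the
poisoned pprev and fault.

```c
#include <stddef.h>
#include <stdint.h>

/* Poison values assumed to match the dump: RAX = 0x100100, RCX = 0x200200. */
#define LIST_POISON1 ((void *)0x00100100)
#define LIST_POISON2 ((void *)0x00200200)

struct hlist_node {
    struct hlist_node *next;
    struct hlist_node **pprev;
};

struct hlist_head {
    struct hlist_node *first;
};

/* Insert n at the front of the list headed by h. */
static void hlist_add_head(struct hlist_node *n, struct hlist_head *h)
{
    struct hlist_node *first = h->first;

    n->next = first;
    if (first)
        first->pprev = &n->next;
    h->first = n;
    n->pprev = &h->first;
}

/*
 * Hypothetical checked variant of hlist_del(): unlink n and poison it.
 * Returns 0 on success, or -1 if n was already deleted. The real
 * hlist_del() has no such check; it would execute "*pprev = next" with
 * pprev == 0x200200 and oops, as in the trace.
 */
static int hlist_del_checked(struct hlist_node *n)
{
    struct hlist_node *next = n->next;
    struct hlist_node **pprev = n->pprev;

    if (pprev == (struct hlist_node **)LIST_POISON2)
        return -1;

    *pprev = next;
    if (next)
        next->pprev = pprev;

    n->next = LIST_POISON1;
    n->pprev = LIST_POISON2;
    return 0;
}
```

After one successful delete the node holds next = 0x100100 and
pprev = 0x200200, so a second delete of the same node would attempt the
store to 0x200200 reported in CR2.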