[ofa-general] RE: [ewg] Not seeing any SDP performance changes in OFED 1.3 beta, and I get Oops when enabling sdp_zcopy_thresh

Scott Weitzenkamp (sweitzen) sweitzen at cisco.com
Wed Dec 12 15:25:56 PST 2007


Jim, when do you plan to enable bzcopy by default?

Scott Weitzenkamp
SQA and Release Manager
Server Virtualization Business Unit
Cisco Systems


 

> -----Original Message-----
> From: general-bounces at lists.openfabrics.org 
> [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Jim Mott
> Sent: Friday, November 30, 2007 12:04 PM
> To: ewg at lists.openfabrics.org
> Cc: general at lists.openfabrics.org
> Subject: [ofa-general] RE: [ewg] Not seeing any SDP 
> performance changes in OFED 1.3 beta, and I get Oops when 
> enabling sdp_zcopy_thresh
> 
> Hi,
>   This kernel Oops is new and I will look at it.  Dotan and the Mellanox
> regression tests have been keeping me busy recently.  There was a problem
> like this, but only in multi-threaded apps using a single socket or when
> doing cleanup after ^C.
> 
>   I will re-enable default bzcopy behavior once all the important Mellanox
> regression tests are passing.  Until then, setting the sdp_zcopy_thresh
> variable by hand (8192 and up should give better performance) and running
> simple tests like netperf should work fine.  You should not be seeing any
> problem here.  [I have only tested locally with x86_64 rhat4u4, rhat5,
> 2.6.23.8, and 2.6.24-rc2.  Mellanox regression tests everything and they
> have not submitted this Oops yet.]
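
[Editorial sketch, not part of the original message.] Setting the threshold
by hand amounts to writing a number into the module parameter file. The
Python helper below is a minimal sketch of that step, assuming the path
/sys/module/ib_sdp/sdp_zcopy_thresh that Scott's report in this thread uses
on RHEL4; the path may differ on other kernels, and the function name
set_zcopy_thresh is hypothetical. It only writes the value and reads it
back; it does not check that SDP actually switches to bzcopy.

```python
from pathlib import Path

# Path as quoted in this thread (RHEL4, 2.6.9 kernel). Treat it as an
# assumption on any other system.
DEFAULT_PARAM = Path("/sys/module/ib_sdp/sdp_zcopy_thresh")


def set_zcopy_thresh(value: int, param: Path = DEFAULT_PARAM) -> int:
    """Write `value` to the parameter file and return the value read back."""
    if value < 0:
        raise ValueError("threshold must be non-negative")
    param.write_text(f"{value}\n")
    # Read back so the caller can see what was actually stored.
    return int(param.read_text().strip())
```

Writing to files under /sys normally requires root. Calling
set_zcopy_thresh(8192) would apply the lower bound suggested above; passing a
different `param` path lets the helper be tried against an ordinary file
first.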
> 
>   I have opened bugs in the openfabrics bugzilla for everything I am
> currently working on.  It is down right now or I would add pointers.
> 
> 
> Here is my work list; additions or priority changes welcome:
> 
> SDP OPEN ISSUES LIST (Priority order)
> =====================================
> 1) DONE: BUG: Unload of mlx4 and ib_sdp fails while SDP active
>   11/6 [PATCH 1/1 V2] SDP - Fix reference count bug ...
> 
> 2) DONE: BUG: Many data corruption failures
>   11/11 [PATCH 1/1] SDP - Fix bug where zcopy bcopy returns ...
> 
> 3) DONE: Bug 793 - kernel BUG at net/core/skbuff.c:95!
>   11/26 [PATCH 1/1] SDP - bug793; skbuff changes ...
> 
> 4) TODO: BUG: kernel oops in SDP regression
>   Replicated problem by hitting ^C during a transfer.  I have created a
> patch that fixes the problem, but it needs more work to move into
> production.  There are some side effects I do not yet understand.
>   This is the one I am working on now.  I hope to drop it soon.
> There is a bug open tracking it.
> 
> 5) TODO: BUG: libsdp returns good RC when it should fail
> 
> 6) TODO: BUG: aio_test fails in SDP regression
> 
> 7) TODO: Bug 779 - Lock ordering problem during accept on 1.2.5
>   After building a 2.6.23.8 kernel with lock checking enabled, I
> can not reproduce this problem.  Looks like I'll need more input
> from the reporter.  (Bug updated to say this).  I will continue to
> code review though.
> 
> 8) DONE: Bug 294 - connect does not allow AF_INET_SDP
>   [fix in bugzilla dropped] 
> 
> 9) DONE: Backport work needed to support 2.6.24
> 
> 10) TODO: Package user space libsdp for Redhat
>   This is supposed to be easy to do, but it will take me some time
> to figure out the detail.  
> 
> 11) DONE: BUG: Memory leak
>   11/20 [PATCH 1/1 v2] SDP - Fix a memory leak in bzcopy
> 
> -----Original Message-----
> From: ewg-bounces at lists.openfabrics.org 
> [mailto:ewg-bounces at lists.openfabrics.org] On Behalf Of Scott 
> Weitzenkamp (sweitzen)
> Sent: Friday, November 30, 2007 12:37 PM
> To: Jim Mott; Scott Weitzenkamp (sweitzen); ewg at lists.openfabrics.org
> Cc: general at lists.openfabrics.org
> Subject: [ewg] Not seeing any SDP performance changes in OFED 
> 1.3 beta, and I get Oops when enabling sdp_zcopy_thresh
> 
> Jim,
> 
> Using netperf with TCP_STREAM and TCP_RR, I'm not seeing any changes in
> SDP throughput or CPU utilization comparing OFED 1.3 beta and OFED
> 1.2.5.  Looks like I need to set a non-zero value in
> /sys/module/ib_sdp/sdp_zcopy_thresh?  Do you plan to enable this by
> default soon?
> 
> I tried "echo 4096 > /sys/module/ib_sdp/sdp_zcopy_thresh" on RHEL4 and
> then tried netperf, and got an Oops.
> 
> Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP:
> <Nov/30 10:33 am><ffffffff80163ff0>{put_page+0}
> <Nov/30 10:33 am>PML4 1a3047067 PGD 1a7a6d067 PMD 0
> <Nov/30 10:33 am>Oops: 0000 [1] SMP
> <Nov/30 10:33 am>CPU 0
> <Nov/30 10:33 am>Modules linked in: parport_pc lp parport autofs4 i2c_dev i2c_core nfs lockd nfs_acl sunrpc rdma_ucm(U) rds(U) ib_sdp(U) rdma_cm(U) iw_cm(U) ib_addr(U) mlx4_ib(U) mlx4_core(U) ds yenta_socket pcmcia_core dm_mirror dm_multipath dm_mod joydev button battery ac uhci_hcd ehci_hcd shpchp ib_mthca(U) ib_ipoib(U) ib_umad(U) ib_ucm(U) ib_uverbs(U) ib_cm(U) ib_sa(U) ib_mad(U) ib_core(U) md5 ipv6 e1000 floppy ata_piix libata sg ext3 jbd mptscsih mptsas mptspi mptscsi mptbase sd_mod scsi_mod
> <Nov/30 10:33 am>Pid: 6802, comm: netperf241 Not tainted 2.6.9-55.ELlargesmp
> <Nov/30 10:33 am>RIP: 0010:[<ffffffff80163ff0>] <ffffffff80163ff0>{put_page+0}
> <Nov/30 10:33 am>RSP: 0018:00000101a7bcbbc0  EFLAGS: 00010203
> <Nov/30 10:33 am>RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000202
> <Nov/30 10:33 am>RDX: 00000101b0b43e80 RSI: 0000000000000202 RDI: 0000000000000000
> <Nov/30 10:33 am>RBP: 00000101b85761c0 R08: 0000000000000000 R09: 0000000000000000
> <Nov/30 10:33 am>R10: 0000000000000246 R11: ffffffffa02e0e36 R12: 00000101a4b33080
> <Nov/30 10:33 am>R13: 00000101a7bcbd58 R14: 0000000000000000 R15: 0000000000010000
> <Nov/30 10:33 am>FS:  0000002a95696940(0000) GS:ffffffff80500380(0000) knlGS:0000000000000000
> <Nov/30 10:33 am>CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> <Nov/30 10:33 am>CR2: 0000000000000000 CR3: 0000000000101000 CR4: 00000000000006e0
> <Nov/30 10:33 am>Process netperf241 (pid: 6802, threadinfo 00000101a7bca000, task 00000101a70df030)
> <Nov/30 10:33 am>Stack: ffffffffa02e110a 0000000000000100 0000000000000000 0000000000529780
> <Nov/30 10:33 am>       0001000000000246 0000000000000246 000000008013feac 00000800ffffffe0
> <Nov/30 10:33 am>       0000000000000000 00000101a7bcbe88
> <Nov/30 10:33 am>Call Trace:<ffffffffa02e110a>{:ib_sdp:sdp_sendmsg+724} <ffffffff801478b2>{queue_delayed_work+101}
> <Nov/30 10:33 am>       <ffffffffa02c6200>{:ib_addr:queue_req+122} <ffffffff802a7ecb>{sock_sendmsg+271}
> <Nov/30 10:33 am>       <ffffffff80169a61>{do_no_page+916} <ffffffff801359a8>{autoremove_wake_function+0}
> <Nov/30 10:33 am>       <ffffffff802a7c53>{sockfd_lookup+16} <ffffffff802a939a>{sys_sendto+195}
> <Nov/30 10:33 am>       <ffffffff801242b9>{do_page_fault+577} <ffffffff801934c8>{dnotify_parent+34}
> <Nov/30 10:33 am>       <ffffffff80179335>{vfs_read+248} <ffffffff8011026a>{system_call+126}
> <Nov/30 10:33 am>
> 
> <Nov/30 10:33 am>Code: 8b 07 48 89 fa f6 c4 80 74 3b 48 8b 57 10 8b 02 48 89 d1 f6
> <Nov/30 10:33 am>RIP <ffffffff80163ff0>{put_page+0} RSP <00000101a7bcbbc0>
> <Nov/30 10:33 am>CR2: 0000000000000000
> <Nov/30 10:33 am> <0>Kernel panic - not syncing: Oops
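
[Editorial sketch, not part of the original message.] Traces pasted into
mail are often re-wrapped, which can split an address or symbol across
lines. The Python snippet below is a rough sketch for pulling the
<address>{symbol+offset} entries out of such a paste. The function name
oops_frames and the regular expression are illustrative assumptions that fit
the 2.6.9-style format shown here, not a general Oops parser.

```python
import re

# Matches entries such as <ffffffffa02e110a>{:ib_sdp:sdp_sendmsg+724}
# and <ffffffff80163ff0>{put_page+0}; the module part is optional.
FRAME = re.compile(r"<([0-9a-f]{16})>\{:?(?:(\w+):)?(\w+)\+(\d+)\}")


def oops_frames(text):
    """Return (module, symbol, offset) tuples found in a pasted trace."""
    # Strip one leading mail quote marker per line, then join the lines
    # with no separator so tokens split by wrapping become whole again.
    joined = "".join(re.sub(r"^> ?", "", line) for line in text.splitlines())
    return [(m.group(2), m.group(3), int(m.group(4)))
            for m in FRAME.finditer(joined)]
```

For kernel symbols that carry no module prefix, the first tuple element is
None. Entries that appear more than once in the paste (for example the RIP
line) are reported each time they occur.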
> 
> Scott Weitzenkamp
> SQA and Release Manager
> Server Virtualization Business Unit
> Cisco Systems
> _______________________________________________
> ewg mailing list
> ewg at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
> 


