PDR Notes (Stage 1)
12/20/04

Matt, Bill
Steve Poole, Mike Boorman, Rich Graham (MPI)
Libor, Yaron, Roland, Sean, Woody

Anafa 1 failure mode (potentially Anafa 2)
  congestion
  credit advertisement issue
  HOQLife setting
  credit update frequency
pathing algorithm(s)
  path underutilization/deadlock
  de facto standard for HPC
  back into IBA spec eventually
round robin scheduling
PF initiative for new HCA and new switch
MPI uses single SL (not using SA to get this)
Host based with LMC > 0 ?
simulator
  error injection as part of traffic patterns
tracer packets which build source route
  OpenIB specific attribute
ibping
  kernel mode server and user mode client right now
worried about route table not being actual path followed
special counters
perf counters on every physical switch port
topology file
  both approaches not just one
  API for topology file
native Arbel
  does not support HCA attached memory currently
  local attached memory
  Tavor mode does not support advanced features
  base memory management features (neither fw version)
  new work request format (better perf ?)
Mellanox issues
  OpenSM coordination
  switch credits
  native Arbel
MWs didn't perform well on Tavor
SRQ useful for scaling issues
  implies CM 1.2
DoE looking for 256K QPs (64K now)
just started on user space verbs
  each user process gets doorbell page
control path (device independent part libibverbs)
  Mellanox uses ioctl currently for this
  big kernel lock
  read/write on different objects concurrently
  file descriptor cleanup
data path device dependent: libmthca
  fast path: post WR, poll CQ
  direct jump into driver with no context switch
  wait on completion with interrupt (limited by how fast interrupt can wake up process)
  futex in 2.6
  pthread mutexes as lock (application thread model (not use clone))
  2.6 and new clib (pthreads now work on 2.6)
  tricks from Mellanox avoid a couple of locks
3 HCA event types/queues
  completion
  async event
  fw command complete event
With MSI-X, separate (3) interrupt vectors
  saves reading interrupt cause register
  1500 Mbps -> 1800 Mbps IPoIB performance (20% improvement)
  PCI-X
  No PCIe yet as MSI-X doesn't work yet on machines available (Lindenhurst developer systems)
  machine check (may be Intel fix for this)
  Needs further investigation
More than 1 interrupt vector for completion events
  cpu affinity for interrupt vectors
  interrupt directly to cpu running application thread
  also needs extension to verbs (in 1.2)
  cpu should be working pretty independently
PCI-X
  Intel only correct MSI
  but AMD 8131 being used with Opteron
  MSI not available
  magic address for MSI
  only one supported in Linux kernel currently
MSI mandatory part of PCIe
  not sure about nVidia
DoE not ruling out any platform
  3 of interest PPC, Intel, AMD
no SA caching (other layer ?)
user space daemon (MPI, DAPL, short lived processes)
TS SDP has additional cache for full PR
IPoIB now supports this
cache was in kernel but now doesn't need to be
swap when in user space
IPoIB naturally implemented
storage discovery in user space (target disc passed into kernel)
Panasas (2 disks have an IP address)
100 PB (2-3 year timeframe)
DHCP server
  merely need client ID configured to 20 bytes
  clients may need work to format IPoIB addresses
  ISC is main one
Port bonding (Nitan)
  if_enslave
  all MACs to be same on ethernet
  more for resiliency than performance
dual Xeon 64 PCIe
  80% CPU 350 MB/sec IPoIB using MPI
profiling with Mellanox driver
  IPoIB 50% network 50% lower
  of 50% lower then 70% Mellanox 30% IPoIB
with mthca
  >50% user to kernel copying
  zero copy TCP
  static web serving only
  10% checksum
Performance Tweaks
  MSI-X
  TCP Performance Parameters (Woody)
cable certification
  not many optical HCAs anymore or optical dongle (lasers burn on)
  optics for ISL (12x)
4th posting to kernel.org
  2.6.10 before xmas
kDAPL in OpenIB not in kernel.org
SDP second priority with kDAPL
Legal issues with SDP (patents, publications)
  Libor thinks easily defensible, thinks their disclosure not needed
  rider with Microsoft ? yes on BSD, no on GPL
IT API (Sun, HP, IBM into OpenIB)
  user space API
DAPL more momentum (Woody)
  no reference implementation
  DAT Collaborative more open than Open Group
  sockets more applicable to kernel.org
  Would OpenIB DAPL be reference for DAT Collaborative ?
Fujitsu not active in OpenIB
mlock for memory pinning for user space
  more difficult for kernel to do this
  call mlock in the kernel
  in user space library
  register in library then use mmap
  mlock fragments VMA
  1:1 mapping with buffer
  reference counting mlocks
  getpage doesn't lock mapping (PTE) only increments ref count (locks page)
  2 buffers in same page is another issue (copy on write)
  can't set copy on write flag from user space (forking)
  Secure Linux issue ?
oversubscription of VM issue (not global, just per user)
2.4 limited to 1GB ?
tested with HT ?
DDR Feb-April timeframe
  ultimately 32x ODR
  PCIe2 spec is out
OpenMPI would like to use SRQ starting in March
  Mellanox SRQ is working
  just RC (not UC) in Arbel
SIDR low priority (Lustre might need this)
  path migration more important
OpenMPI use CM (and PathRecords) ?
  Topspin has done 50K-75K connections with their CM
OpenRDMA.org
OLS participation
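
Sketch for the mlock notes above ("reference counting mlocks", "2 buffers in same page"): one way a user space library might keep a per-page reference count, so a page shared by two registered buffers is only unpinned when the last one goes away. This is illustrative bookkeeping only, not what any existing library does. PinTracker, page_span and the 4096-byte page size are assumptions made for the example, and no actual mlock/munlock call is made.

```python
from collections import Counter

PAGE_SIZE = 4096  # assumed page size; real code would query the system


def page_span(addr, length, page_size=PAGE_SIZE):
    """Return the page numbers touched by the byte range [addr, addr + length)."""
    if length <= 0:
        raise ValueError("length must be positive")
    first = addr // page_size
    last = (addr + length - 1) // page_size
    return range(first, last + 1)


class PinTracker:
    """Hypothetical per-process bookkeeping: one reference count per page."""

    def __init__(self, page_size=PAGE_SIZE):
        self.page_size = page_size
        self.refs = Counter()

    def register(self, addr, length):
        """Record a buffer; return pages that would need a first-time mlock."""
        newly_pinned = []
        for page in page_span(addr, length, self.page_size):
            if self.refs[page] == 0:
                newly_pinned.append(page)
            self.refs[page] += 1
        return newly_pinned

    def deregister(self, addr, length):
        """Drop a buffer; return pages whose count reached zero (safe to munlock)."""
        pages = list(page_span(addr, length, self.page_size))
        if any(self.refs[page] == 0 for page in pages):
            raise ValueError("buffer was not registered")
        released = []
        for page in pages:
            self.refs[page] -= 1
            if self.refs[page] == 0:
                del self.refs[page]
                released.append(page)
        return released


# Two buffers that share page 1: the page stays pinned until both are gone.
tracker = PinTracker()
first_pin = tracker.register(0x1000, 100)     # [1]
second_pin = tracker.register(0x1800, 100)    # []  (page 1 already pinned)
after_first = tracker.deregister(0x1000, 100)   # []  (still in use)
after_second = tracker.deregister(0x1800, 100)  # [1] (last user gone)
```

Counting per page rather than per buffer is what covers the "2 buffers in same page" case. It does not address the copy on write / fork problem noted above.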