InfiniBand

openfabrics.org - OFED source.
mellanox.com - compiled binaries for major distros, lagging about 3 months behind source.
http://tinyurl.com/infiniband - my cache of IB docs.

InfiniBand 101

CA - Channel Adapter
HCA - Host CA (like how FibreChannel has HBA)
LID - Local ID, unique id of an IB end point within a subnet.
port - a port on the HCA, like NIC ports on an ethernet quad card.
GUID - Globally Unique ID
sminfo gives the GUID of the subnet manager; this can be used to determine whether two nodes are on the same fabric or on two separate islands.
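
A minimal sketch of the "same fabric or separate islands" check. It assumes sminfo prints a line containing "sm guid 0x...", which is what the versions I have seen do; the function names sm_guid and same_fabric are made up for this example.

```shell
# Extract the subnet manager GUID from sminfo output read on stdin.
# Assumes a line like:
#   sminfo: sm lid 1 sm guid 0x2c9020025a679, activity count 7042 priority 0 state 3 SMINFO_MASTER
sm_guid() {
    sed -n 's/.*sm guid \(0x[0-9a-fA-F]*\).*/\1/p'
}

# Succeeds only if both GUIDs are non-empty and identical.
same_fabric() {
    [ -n "$1" ] && [ "$1" = "$2" ]
}

# usage (run sminfo on each node, e.g. via ssh or pdsh):
#   a=$(ssh nodeA sminfo | sm_guid)
#   b=$(ssh nodeB sminfo | sm_guid)
#   same_fabric "$a" "$b" && echo "same fabric" || echo "separate islands"
```

If the master SM role moves to a standby between the two queries, the GUIDs will differ even on a single fabric, so run the two queries close together.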

Symbol Error - Think of this as a packet error in the ethernet world. The absolute count matters less than the rate at which errors accumulate.

opensm 	# OpenFabrics subnet manager.  runs on one of the hosts.  
	# additional standby managers can be set up on other hosts.  
	# the daemon process will only bind to the first IB port.  
	# HA fabrics should be joined into a single one.
  	# Edit opensm.conf to bind to an additional guid.  


opensm --create-config /tmp/opensm.conf 
cp -i /tmp/opensm.conf /etc/opensm/

ibstat
  - all ports should be active.  A port becomes active only if a running fabric manager registers it when the port comes alive.
  - node guid
  - port guid
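
A quick way to turn the "all ports should be active" check into something scriptable. This is a sketch that assumes ibstat prints one "State: <value>" line per port (separate from the "Physical state:" line); inactive_ports is a name made up here.

```shell
# Count ports whose logical state is not Active, reading ibstat output on stdin.
# "Physical state: LinkUp" lines are ignored because their first field is "Physical".
inactive_ports() {
    awk '$1 == "State:" && $2 != "Active" { n++ } END { print n + 0 }'
}

# usage:
#   ibstat | inactive_ports      # prints 0 when every port is Active
```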

ibv_devinfo - similar info to ibstat

rds-ping
rds-info

ompi_info --all
ompi-top
ompi-ps

Troubleshooting commands

iblinkinfo 
iblinkinfo --node-name-map fabrics.node-name-map
	# iblinkinfo provides link info for whole fabric, very quick 
	# node-name-map converts GUID to human name
	# 0xe41d2d03004f1b40 "ib000.cf0 (Mellanox SX6025, S54-21)"
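
The same map file can be used for ad-hoc lookups outside the IB tools. A small sketch, assuming each map line is a GUID followed by a double-quoted name as in the example line above; guid_to_name is a hypothetical helper, not part of infiniband-diags.

```shell
# Print the human-readable name for a GUID from a node-name-map file.
#   $1 = GUID exactly as written in the map (e.g. 0xe41d2d03004f1b40)
#   $2 = path to the node-name-map file
guid_to_name() {
    # Appending "" forces a string comparison, so awk never treats the hex GUID as a number.
    awk -v g="$1" '($1 "") == (g "") {
        sub(/^[^ \t]+[ \t]+/, "")   # drop the GUID column
        gsub(/"/, "")               # drop the surrounding quotes
        print
    }' "$2"
}

# usage:
#   guid_to_name 0xe41d2d03004f1b40 fabrics.node-name-map
```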

ibnetdiscover 
ibnetdiscover --node-name-map fabrics.node-name-map > ibnet.topo
	# generate a topology file 
	# the switch LID is listed above all of its ports, see comment field 

# select output of ibnetdiscover

switchguid=0x2c90200410d02(2c90200410d02)
Switch  24 "S-0002c90200410d02"   # "MT47396 Infiniscale-III Mellanox Technologies" base port 0 lid 2 lmc 0
#                                                                               ib switch LID      ^^^

[16]    "H-00188b9097fe8e41"[1](188b9097fe8e42)     # "rac1001 HCA-1" lid 4 4xDDR
[15]    "H-00188b9097fe906d"[1](188b9097fe906e)     # "rac1002 HCA-1" lid 9 4xDDR
^^^^ ib port where host is connected            comment area and host's LID ^^^
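
To pull the switch LID and switch port for a given host out of a saved topology file (the two numbers ibportstate wants), something like the following works on output shaped like the excerpt above. It is a sketch: find_switch_port is a made-up name, and it assumes the host name is the first word of the quoted node description.

```shell
# Read ibnetdiscover output on stdin; print "<switch LID> <switch port>" for host $1.
find_switch_port() {
    awk -v host="$1" '
        /^Switch/ {                      # remember the LID of the switch block we are in
            swlid = ""
            for (i = 1; i < NF; i++) if ($i == "lid") swlid = $(i + 1)
        }
        /^Ca/ { swlid = "" }             # port lines under a Ca block are not switch ports
        swlid != "" && /^\[[0-9]+\]/ && index($0, "\"" host " ") {
            port = $1
            gsub(/\[|\]/, "", port)      # "[16]" -> "16"
            print swlid, port
        }
    '
}

# usage:
#   find_switch_port rac1001 < ibnet.topo     # e.g. "2 16"
```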

ibclearerrors		# reboot will induce errors, this is normal.
ibclearcounters		


hca_self_test.ofed  	# sanity check. ofed-scripts rpm.  /usr/bin
pdsh -g etna0 TERM=vt52 /usr/bin/hca_self_test.ofed | grep -i fail
# the script uses tput, so need to force TERM when used with pdsh


perfquery		# simple to read ib counters.  pay special attention to:
			# SymbolErrorCounter
			# LinkErrorRecoveryCounter
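
Since the rate of accumulation is what matters, two perfquery snapshots taken some time apart can be compared. A sketch, assuming perfquery prints counters as "Name:.....value" lines; symbol_errors and symbol_error_rate are names invented for this example.

```shell
# Extract the SymbolErrorCounter value from perfquery output read on stdin.
symbol_errors() {
    awk -F'[:.]+' '/^SymbolErrorCounter/ { print $NF }'
}

# Errors per hour between two snapshots.
#   $1 = earlier count, $2 = later count, $3 = seconds between the snapshots
symbol_error_rate() {
    awk -v a="$1" -v b="$2" -v s="$3" 'BEGIN { printf "%.2f\n", (b - a) * 3600 / s }'
}

# usage:
#   before=$(perfquery | symbol_errors); sleep 7200; after=$(perfquery | symbol_errors)
#   symbol_error_rate "$before" "$after" 7200
```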

ibqueryerrors -c -s XmWait


ibcheckerrors -b	# very noisy, from infiniband-diags rpm by Mellanox, /sbin



ibdiagnet		# scan fabric, takes a long time

wwibcheck by Yong Qin:
ibcheck -f fabrics.conf -C etna -E  -dd -a -b -O 20 > now.out
ibcheck -f fabrics.conf -C etna -E  -dd -a -b -O 20 > 2hrsLater.out
but can't simply diff to see errors, need to awk out the RcvPkts columns...

per-node errors are listed in the bottom section.
This is the header of the available counters:

   NodeDesc Port               Guid ExcBufOverrunErrors LinkDowned LinkIntegrityErrors LinkRecovers RcvConstraintErrors    RcvData RcvErrors    RcvPkts RcvRemotePhysErrors RcvSwRelayErrors SymbolErrors VL15Dropped XmtConstraintErrors    XmtData XmtDiscards    XmtPkts
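
One way to make the two outputs diff-able is to drop the traffic columns (RcvData, RcvPkts, XmtData, XmtPkts), which always change, and keep only the error counters. This is a sketch under assumptions: the per-node rows are whitespace-separated, line up one-to-one with the header above, and NodeDesc contains no spaces. strip_traffic_cols is a made-up name.

```shell
# Read ibcheck output on stdin; from the counter header onward, print each row
# without the data/packet columns so that only error counters remain.
strip_traffic_cols() {
    awk '
        $1 == "NodeDesc" {               # header row: note which columns to skip
            for (i = 1; i <= NF; i++)
                skip[i] = ($i ~ /^(RcvData|RcvPkts|XmtData|XmtPkts)$/)
            seen = 1
        }
        seen {
            line = ""
            for (i = 1; i <= NF; i++)
                if (!skip[i]) line = line (line == "" ? "" : " ") $i
            print line
        }
    '
}

# usage:
#   strip_traffic_cols < now.out > now.err
#   strip_traffic_cols < 2hrsLater.out > later.err
#   diff now.err later.err
```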




IB port on switch side

ibportstate - manage specific port on the IB switch, eg turn it on/off, etc.

ibportstate [swlid] [swPortNum] [command]	
ibportstate  2        11         disable	# turn off IB switch port a specific host is connected to
ibportstate  6        11         disable
     # can also use enable, status.  Other forms allow changing speed, etc.

Ref
  1. Yong Qin's IB Diag & Troubleshooting
  2. ulowa basic ib troubleshooting



IPoIB with Bonding Config

The OFED 1.4 instructions for configuring bonding for IPoIB interfaces are to create static configs in /etc/sysconfig/network-scripts for the files ifcfg-bond0, ifcfg-ib0 and ifcfg-ib1, as below:
IB-bond for system WITHOUT ethernet bond
$ cat /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
BOOTPROTO=none
IPADDR=10.2.2.98
NETMASK=255.255.255.0
NETWORK=10.2.2.0
BROADCAST=10.2.2.255
ONBOOT=yes
USERCTL=no
TYPE=Bonding

$ cat /etc/sysconfig/network-scripts/ifcfg-ib0  
DEVICE=ib0
BOOTPROTO=none
ONBOOT=yes
USERCTL=no
MASTER=bond0
SLAVE=yes
TYPE=InfiniBand

$ cat /etc/sysconfig/network-scripts/ifcfg-ib1
DEVICE=ib1
BOOTPROTO=none
ONBOOT=yes
USERCTL=no
MASTER=bond0
SLAVE=yes
TYPE=InfiniBand

# relevant entries in  /etc/modprobe.conf ::
## added for Oracle 10/11 (per Cambridge WI)
## make sure that the hangcheck-timer kernel module is set to load when the system boots
options hangcheck-timer hangcheck_tick=1 hangcheck_margin=10 hangcheck_reboot=1
alias ib0 ib_ipoib
alias ib1 ib_ipoib
alias net-pf-27 ib_sdp
##  For IB bonding
alias bond0 bonding
options bond0 miimon=100 mode=1 max_bonds=1

As of OFED 1.4.0 (circa 2009.09), the above bonding config works: bond0 is created correctly, and disabling the IB port for, say, ib0 causes things to fail over.
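
To confirm that a fail over really happened, check which slave the bonding driver currently considers active. A sketch: the kernel bonding driver exposes this in /proc/net/bonding/<bond> as a "Currently Active Slave:" line for active-backup bonds (mode=1 above); active_slave is a name made up here.

```shell
# Print the currently active slave from bonding status text read on stdin.
active_slave() {
    sed -n 's/^Currently Active Slave: *//p'
}

# usage:
#   active_slave < /proc/net/bonding/bond0    # e.g. ib0; after disabling that port, expect ib1
```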

However, fail over won't actually work if the machine also has ethernet bonding configured. The config would successfully create a bond for ib0 and ib1, but the IP would be bound to a specific interface, and when the IB port is disabled from the switch, ping and rds-ping would stop working. Maybe it has to do with some bug in the ifcfg-* scripts in RHEL 5.3 that incorrectly associates the HW "mac address" of the ibX interfaces with the bonding interface. OFED 1.4.x doesn't support the bonding config in /etc/infiniband/openib.conf anymore. Manually creating the ib-bond after system boot does work, and fail over then works correctly. Here is the required config:
IB-bond for system with ethernet bond

$ cat /etc/sysconfig/network-scripts/ifcfg-ib0
DEVICE=ib0
BOOTPROTO=none
STARTMODE=onboot
ONBOOT=yes
USERCTL=no
BROADCAST=192.168.2.255
NETMASK=255.255.255.0
#
IPADDR=0.0.0.0
#SLAVE=yes
#MASTER=bond1
TYPE=InfiniBand


$ cat /etc/sysconfig/network-scripts/ifcfg-ib1
DEVICE=ib1
BOOTPROTO=none
STARTMODE=onboot
ONBOOT=yes
USERCTL=no
BROADCAST=192.168.2.255
NETMASK=255.255.255.0
#
IPADDR=0.0.0.0
#SLAVE=yes
#MASTER=bond1
TYPE=InfiniBand


# relevant entries in  /etc/modprobe.conf ::
options bond0 mode=balance-rr miimon=100    max_bonds=2

# ethernet bonds are configured as in stock RHEL 5.3 config.


#!/bin/sh

# init script to start ib-bond at boot time:
#
# run "manual" ib-bond config
# could not do this in /etc/sysconfig/network-scripts as bond1, ib0, ib1 scripts
# as somehow presence of eth bond would make ib-bond fail over not to work.
# maybe a bug in how the network-scripts are parsed...
#
##  nn = startLevel  Sxx Kxx
##  eg start at rc3 and rc5,   start as S56, kill is omitted so no Kxx script
##  maybe as S28 if need to be before oracleasm
##
# chkconfig: 35 28 -
# description: ib-bond config
#

# source function library
. /etc/rc.d/init.d/functions


RETVAL=0
prog="ib-bond-config"

start() {
        ifconfig ib0 up 0.0.0.0/24
        ifconfig ib1 up 0.0.0.0/24
        ib-bond --bond-name bond1 --bond-ip 192.168.2.101/24 --slaves ib0,ib1 --miimon 100
}

stop() {
        ib-bond --bond-name bond1 --stop
}

status() {
        ib-bond --status-all
}

case "$1" in
        start)
                echo -n $"Starting $prog: "
                start
                RETVAL=$?
                # logger needs -p to take facility.level; without it the priority is logged as message text
                [ "$RETVAL" = 0 ] && logger -p local7.info "ib-bond-config start ok" || logger -p local7.err "ib-bond-config start failed"
                echo
                ;;
        stop)
                echo -n $"Stopping $prog: "
                stop
                # capture the result of stop, otherwise RETVAL is always 0 here
                RETVAL=$?
                [ "$RETVAL" = 0 ] && logger -p local7.info "ib-bond-config stop ok" || logger -p local7.err "ib-bond-config stop failed"
                echo
                ;;
        status)
                status
                echo
                ;;
        *)
                echo $"Usage: $0 {start|stop|status}"
                RETVAL=1
                ;;
esac



exit $RETVAL

## setup aid:
##  sudo cp -p /nfshome/sa/config_backup/lx/conf.test-4000/ib-bond-config /etc/init.d/
##  sudo ln -s /etc/init.d/ib-bond-config /etc/rc.d/rc3.d/S56ib-bond-config
##  sudo ln -s /etc/init.d/ib-bond-config /etc/rc.d/rc5.d/S56ib-bond-config


There is one final catch. eth fail over works for one NIC at a time. If both eth0 and eth1 are ifdown'd, the minute eth0 is ifup'd, the machine resets. Not sure why; there is no log message, so it may not even be a kernel panic... But if both eth interfaces are down, the machine is probably screwed anyway...

RDS and InfiniBand

RDS stands for Reliable Datagram Sockets. It was designed as a UDP replacement that adds the reliability and in-sequence delivery traditionally only available from TCP. It was meant to be lightweight, relying on the InfiniBand hardware to do the path mapping and "virtual channel" config.

The RDS developer mailing list has a doc that indicates RDS would not depend on IP when run on InfiniBand, so IPoIB may not actually need to be configured (eg, how MPICH works with InfiniBand without any IP support). On the flip side, there are also discussions of implementing RDS over TCP! (eg: take 2) At any rate, rds-ping needs an IP address, rds-info is centered around the premise of IP addresses, and Oracle doesn't work unless IPs are assigned to the IB interfaces.

So, for HA (at least for Oracle), it seems that IPoIB fail over via IB-bonding needs to be configured. Oracle cluster/RAC does not have the ability to use multiple interfaces (ib0, ib1) as its private network, so some sort of automatic HCA fail over is needed.

RDMA and InfiniBand

RDMA = Remote DMA, ie Remote Direct Memory Access.
It allows two hosts with IB HCAs to copy data from one host's memory directly into the other's, bypassing the many copies needed with a traditional NIC. Combined with the low latency of IB, it gives very high transfer speed.
However, RDMA is not implemented by RDS as of OFED 1.4. (it is for TCP? but that would be at the IPoIB layer? Or maybe for things that use IB directly like MPICH... don't know....).


[Doc URL: http://tin6150.github.io/psg/ ]
Last Updated: 2018-02-03
(cc) Tin Ho. See main page for copyright info.

