InfiniBand

  • max 48k nodes per subnetwork
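  • GPUDirect: allows a GPU to access the memory of another GPU directly, without staging the data through host memory.
  • SHARP (Scalable Hierarchical Aggregation and Reduction Protocol): offloads collective operations, such as reductions, to the switches.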
  • Fabric: (physical infrastructure) collection of links, switches and routers that connect a set of HCAs.
  • Subnet: (logical infra) ports and associated links that have a common subnet ID and are managed by a common subnet manager.

  • Speeds

  • Technology: Link Rate in Gb/s (Lane Speed x Link Width)
  • SDR: 10 Gb/s (2.5 x 4)
  • QDR: 40 Gb/s (10 x 4)
  • FDR: 56 Gb/s (14 x 4)
  • EDR: 100 Gb/s (25 x 4)
  • HDR: 200 Gb/s (50 x 4)
  • NDR: 400 Gb/s (100 x 4)
  • XDR: 800 Gb/s (200 x 4)
  • GDR: 1600 Gb/s (400 x 4)
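
The rule in the list header (link rate = per-lane speed x link width) can be checked with a few lines of Python. This is a minimal sketch: the per-lane values are the ones listed above, and the function name `link_rate_gbps` is made up for illustration. It computes signalling rates only and ignores encoding overhead.

```python
# Per-lane signalling rates in Gb/s, taken from the list above.
LANE_RATE_GBPS = {
    "SDR": 2.5, "QDR": 10, "FDR": 14, "EDR": 25,
    "HDR": 50, "NDR": 100, "XDR": 200, "GDR": 400,
}


def link_rate_gbps(technology: str, width: int = 4) -> float:
    """Return the link rate: per-lane speed multiplied by the link width."""
    if width < 1:
        raise ValueError("link width must be at least 1 lane")
    return LANE_RATE_GBPS[technology] * width


# Print the 4x rate for every generation in the table.
for tech in LANE_RATE_GBPS:
    print(f"{tech}: {link_rate_gbps(tech):g} Gb/s at 4x")
```

Passing a different `width` (for example 1 or 12) shows how the same lane speed scales with the number of lanes.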

  • Network elements

  • Gateway: bridge between ethernet and infiniband
  • HCA (Host Channel Adapter): NIC. Network Fabric Interconnect.
    • Each HCA can contain one or more nodes. Each node owns one physical port. Each HCA owns a GUID. If the card has multiple nodes, their GUIDs are consecutive.
  • Router: Interconnect multiple infiniband subnets

  • Subnet Manager

  • Nodes and links discovery.
  • Local identifier assignments (LIDs).
  • Routing table calculations and deployments.
  • Configuring nodes and ports parameters like the QoS policy.

Subnet Manager

  • Discover the topology
  • Assign Local IDentifiers (LIDs) to nodes
  • Calculate and program switch forwarding tables
  • Manage all the elements and monitor changes.

It is a software solution and can be set up in a switch, server or specialized device.

It is composed of:

  • Subnet Manager (SM).
  • Nodes: any managed entity, such as a switch, HCA or router.
  • Agent (SMA): responds to manager queries and sends traps to the manager when required. Each node requires an SMA so that the SM can configure it.

Management Datagrams (MADs): message format for manager-agent communication.

Addresses

  • Layer 1: GUID: Globally Unique Identifier. Unique address burned onto the hardware by the vendor. 64 bits.
  • Layer 2: LID: Local Identifier. Address assigned by the SM. Unique within the subnet. Used by the switches for packet forwarding. 16 bits. In LRH.
  • Layer 3: GID: Global Identifier. Used to identify end ports or multicast groups. They are used by routers for packet forwarding between subnets and are unique within those subnetworks.

Monitoring

OFED: OpenFabrics Enterprise Distribution (openfabrics.org)

MLNX_OFED and DOCA_OFED

Check the driver is installed and started

sudo ofed_info -s
/etc/init.d/openibd status

Check if the host was discovered as an infiniband node

ibstat
The port State must be Active and its Physical state LinkUp.

Physical State:
- Polling: no cable connected, port damaged or incompatibility between nodes.
- Disabled: the port is disabled. Enable it with `ibportstate`.
- PortConfigurationTraining: the link is being negotiated.
- LinkUp: connected.
- LinkErrorRecovery: faulty cable.

Verify Layer 2 connectivity (ping). Needs to run on both ends.

# Server
ibping -S

# Client
ibping -L <lid of the server>
ibping -L 18

Trace the path (traceroute). Can be issued from any node.

ibtracert <source LID> <destination LID>
ibtracert 13 18

Port State, link width, link speed, LID

ibportstate

List all the switches and subnet info (guid, name, ports, lid)

ibswitches

List all HCAs in the subnet (guid, name, ports)

ibhosts

List all nodes (HCAs and switches)

ibnodes

Query infiniband switch forwarding tables LFTs

ibroute <LID of switch>
LIDs are shown in hex. An output port of 000 means the packet is not forwarded: it is processed internally by the switch.

All the nodes and info (LID, port, speed and width, state, peer LID, peer port, hostname)

iblinkinfo

Perform a discovery and output a topology file. Also display data (Node, node type, node description, links, port, lids and guids)

ibnetdiscover

Display ib devices installed on local server

ibv_devices

Display the GID and LID information

ibaddr

Performance benchmarks

ib_read_lat
ib_write_lat
ib_send_lat
ib_read_bw
ib_write_bw
ib_send_bw

Master SM (LID, GUID, priority, state)

sminfo

Node description of LID

smpquery nd <LID>

Query all active SMs

saquery -s



GUID types

  • System image GUID: allows multiple GUIDs to be treated as one.
  • Node GUID (HCA, switch or router).
  • Port GUID (HCA port).

Types of packets

  • Link management packets
  • Data packets

Packet Headers

  • LRH (Local Routing Header)
    • 8 bytes.
    • Contains the source and destination LIDs (DLID: Destination LID).
    • Used to route packets within the local subnetwork.
    • Contains the QoS configuration, with the SL and VL fields.
  • Service Level (SL): a mark in the header that lets different applications define their traffic class.
  • Virtual Lanes (VL): allow multiple virtual links on a single physical link. Each VL has its own Tx and Rx buffers, which enables separate flow control. 16 max.
    • VL 15: management only (SM, link control).
    • VL 0: data traffic.
    • Packets are mapped to a VL based on their SL value.
    • Each VL has a weight and a priority.
  • LFT (Linear Forwarding Table): like a MAC address table for IB switches. It maps each DLID to an output port. Switches also hold the SL to VL mappings.
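
The forwarding decision described above (look up the DLID in the LFT, then pick a VL from the SL) can be sketched as a pair of dictionary lookups. This is a toy model under stated assumptions: the table contents, the port numbers and the `forward` function are hypothetical, and real switches keep SL to VL tables per input/output port pair rather than one global table.

```python
# Hypothetical tables for a single switch; on a real fabric the SM programs them.
LFT = {0x0D: 1, 0x12: 7, 0x05: 0}  # DLID -> output port (0 = the switch itself)
SL2VL = {0: 0, 1: 1, 2: 1, 3: 2}   # Service Level -> data Virtual Lane
MANAGEMENT_VL = 15                 # VL 15 is reserved for management traffic


def forward(dlid: int, sl: int, is_management: bool = False):
    """Return (output_port, vl) for a packet, or None if the DLID is unknown."""
    port = LFT.get(dlid)
    if port is None:
        return None  # no LFT entry: the packet cannot be forwarded
    # Management packets always use VL 15; data packets follow the SL mapping.
    vl = MANAGEMENT_VL if is_management else SL2VL.get(sl, 0)
    return port, vl


print(forward(0x12, sl=3))
print(forward(0x0D, sl=0, is_management=True))
```

An output port of 0 in this sketch mirrors the `ibroute` note: the packet is addressed to the switch itself and is not forwarded.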

LID

  • Assigned by the SM at initialization and when the topology changes.
  • HCAs are assigned one LID per port.
  • A single-IC switch is assigned a single LID.
  • Modular switches are assigned one LID per switch module in the chassis.
  • Each subnet can contain up to:
    • 48K unicast LID addresses.
    • 16K multicast LID addresses.
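
The 48K / 16K split follows from how the 16-bit LID space is divided. The sketch below assumes the usual layout (0x0000 reserved, 0x0001 to 0xBFFF unicast, 0xC000 to 0xFFFE multicast, 0xFFFF permissive); the function name `classify_lid` is made up for illustration.

```python
UNICAST_FIRST, UNICAST_LAST = 0x0001, 0xBFFF
MULTICAST_FIRST, MULTICAST_LAST = 0xC000, 0xFFFE
PERMISSIVE_LID = 0xFFFF


def classify_lid(lid: int) -> str:
    """Classify a 16-bit LID by the range it falls in."""
    if not 0 <= lid <= 0xFFFF:
        raise ValueError("a LID is a 16-bit value")
    if lid == 0:
        return "reserved"
    if lid <= UNICAST_LAST:
        return "unicast"
    if lid <= MULTICAST_LAST:
        return "multicast"
    return "permissive"


# Size of each range, which is where the 48K and 16K figures come from.
unicast_count = UNICAST_LAST - UNICAST_FIRST + 1
multicast_count = MULTICAST_LAST - MULTICAST_FIRST + 1
print(unicast_count, multicast_count)
```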

Credit Based Flow Control

  • The receiver grants credits to the sender. A sender that has no credits does not send.
  • This prevents receive buffer overruns, so packets are not dropped because of congestion and do not need to be retransmitted for that reason.
  • It is a credit rather than a debit network.
  • Credits are allocated per VL.
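
A toy model makes the credit mechanism concrete. This is a simplified sketch, not the real link-layer protocol: it counts one credit per packet (real hardware accounts for buffer space in blocks), and the class and method names are hypothetical.

```python
class VirtualLaneLink:
    """One VL of a link: the sender may only transmit while it holds credits."""

    def __init__(self, rx_buffer_slots: int):
        self.credits = rx_buffer_slots  # credits granted by the receiver
        self.rx_buffer = []             # packets waiting in the receiver

    def send(self, packet) -> bool:
        """Send if a credit is available; otherwise hold the packet back."""
        if self.credits == 0:
            return False  # nothing is dropped: the sender simply waits
        self.credits -= 1
        self.rx_buffer.append(packet)
        return True

    def receiver_drain(self, count: int = 1) -> int:
        """The receiver consumes packets and returns the freed credits."""
        drained = min(count, len(self.rx_buffer))
        del self.rx_buffer[:drained]
        self.credits += drained
        return drained


link = VirtualLaneLink(rx_buffer_slots=2)
print(link.send("a"), link.send("b"), link.send("c"))
link.receiver_drain(1)
print(link.send("c"))
```

The third `send` is refused until the receiver drains a slot, which is the behaviour that keeps the fabric lossless.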

CRCs

  • ICRC (Invariant CRC): covers the fields that do not change from source to destination.
  • VCRC (Variant CRC): covers all fields in the packet.
  • Using both allows switches to modify fields and still maintain data integrity.
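
The interplay of the two CRCs can be illustrated with stand-in checksums. This is only a sketch: `zlib.crc32` and `binascii.crc_hqx` are generic 32-bit and 16-bit CRCs used here as placeholders, not the polynomials or field coverage defined by the InfiniBand specification, and the packet is reduced to one variant byte plus an invariant payload.

```python
import binascii
import zlib


def icrc(invariant: bytes) -> int:
    """End-to-end check over the fields that never change in flight."""
    return zlib.crc32(invariant)


def vcrc(variant: bytes, invariant: bytes) -> int:
    """Per-hop check over the whole packet, variant fields included."""
    return binascii.crc_hqx(variant + invariant, 0)


def switch_hop(invariant: bytes, new_variant: bytes):
    """A switch rewrites a variant field and regenerates only the VCRC."""
    return new_variant, vcrc(new_variant, invariant)


payload = b"rdma write payload"
variant = b"\x00"  # stands in for a header field a switch may change
icrc_at_source = icrc(payload)
vcrc_at_source = vcrc(variant, payload)

variant, vcrc_after_hop = switch_hop(payload, b"\x01")
print(icrc(payload) == icrc_at_source)   # end-to-end check still matches
print(vcrc_after_hop == vcrc_at_source)  # per-hop check was regenerated
```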

Network Layer

  • Routing: a router can join multiple subnets. This is done by assigning a GID to each port.

The network layer uses the GRH (Global Routing Header) packet header.

  • The GRH carries 128-bit GIDs.
  • GID = subnet prefix (64 bits) + port GUID (64 bits).
  • With the default subnet prefix the GID starts with fe80:, like an IPv6 link-local address.
  • A GID identifies an end port or a multicast group.
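
Because a GID has the same 128-bit layout as an IPv6 address, Python's `ipaddress` module can be used to build and print one. This is a minimal sketch: `make_gid` is a made-up helper and the port GUID below is a hypothetical value, not one read from real hardware.

```python
import ipaddress

DEFAULT_SUBNET_PREFIX = 0xFE80_0000_0000_0000  # 64-bit default (link-local) prefix


def make_gid(port_guid: int,
             subnet_prefix: int = DEFAULT_SUBNET_PREFIX) -> ipaddress.IPv6Address:
    """Concatenate a 64-bit subnet prefix and a 64-bit port GUID into a GID."""
    for name, value in (("port_guid", port_guid), ("subnet_prefix", subnet_prefix)):
        if not 0 <= value < 2**64:
            raise ValueError(f"{name} must fit in 64 bits")
    return ipaddress.IPv6Address((subnet_prefix << 64) | port_guid)


EXAMPLE_PORT_GUID = 0x0002_C903_00A1_B2C3  # hypothetical port GUID
gid = make_gid(EXAMPLE_PORT_GUID)
print(gid.exploded)
```

The upper 64 bits of the result are the subnet prefix and the lower 64 bits are the port GUID, which is why the GID is unique across subnets while the GUID alone is only a hardware identifier.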

Transport Layer

  • Queue Pairs (QP): allow an application to bypass the kernel and exchange data directly with the HCA.
    • Each QP represents one end of a channel.
    • Two queues per QP: a Send Queue and a Receive Queue.
    • Each QP has a 24-bit QP number.
    • The application's virtual addresses are mapped into the QP to give direct access to the hardware.
    • One QP per connection.
  • QP Workflow
    • A Work Queue is the application's interface to the InfiniBand fabric
      • An application that wishes to send/receive data on the channel, posts a Work Request (WR) to a work queue
      • A WR placed on the work queue is called a Work Queue Element (WQE)
      • When the HCA completes a WQE, a Completion Queue Element (CQE) is placed on a completion queue
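
The WR, WQE and CQE life cycle above can be mimicked with plain queues. This is a toy model only: `QueuePair`, `ToyHCA` and their methods are hypothetical names that loosely echo the verbs terminology, and no real hardware or verbs library is involved.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class QueuePair:
    qp_num: int  # 24-bit QP number
    send_queue: deque = field(default_factory=deque)
    recv_queue: deque = field(default_factory=deque)

    def __post_init__(self):
        if not 0 <= self.qp_num < 2**24:
            raise ValueError("a QP number is a 24-bit value")


class ToyHCA:
    def __init__(self):
        self.completion_queue = deque()

    def post_send(self, qp: QueuePair, wr_id: int, payload: bytes) -> None:
        """The application posts a Work Request; on the queue it is a WQE."""
        qp.send_queue.append({"wr_id": wr_id, "payload": payload})

    def process(self, qp: QueuePair) -> None:
        """The HCA consumes WQEs in order and reports each one with a CQE."""
        while qp.send_queue:
            wqe = qp.send_queue.popleft()
            self.completion_queue.append(
                {"wr_id": wqe["wr_id"], "qp_num": qp.qp_num, "status": "success"}
            )

    def poll_cq(self):
        """Return the next CQE, or None when the completion queue is empty."""
        return self.completion_queue.popleft() if self.completion_queue else None


hca = ToyHCA()
qp = QueuePair(qp_num=0x11)
hca.post_send(qp, wr_id=1, payload=b"hello")
hca.process(qp)
print(hca.poll_cq())
```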

Transport Service Types

  • Connection and Datagram
    • Connection (unicast): a dedicated QP per connection. Uses more kernel memory.
    • Datagram (multicast): a single QP serves multiple connections. Segmentation is not supported. Not as performant, but a single QP can serve thousands of clients.
  • Reliable and Unreliable
    • Reliable (like TCP): retransmits if a packet is dropped. Uses a Packet Sequence Number (PSN), ACKs and NAKs.
    • Unreliable (like UDP): no guarantee of arrival and no guarantee of packet ordering.

Combining these gives four service types:

  • RC: Reliable Connection
  • UC: Unreliable Connection
  • RD: Reliable Datagram
  • UD: Unreliable Datagram

RDMA works by allowing the sender to read from and write to the receiving application's virtual memory buffer directly, instead of going through the network stack.

Partitions (similar to VLANs)

A partition describes a set of end nodes within the subnet that may communicate. Ports may be members of multiple partitions at once. Ports in different partitions are unaware of each other.

Partition ID: P_Key, a 16-bit field in the BTH (Base Transport Header).

Membership types:

  • Full: can communicate with all members.
  • Limited: can only communicate with full members.

Default partition: 0x7FFF (P_Key 0xFFFF once the full-membership high bit is set). All ports are full members of it.
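
The membership rules can be expressed as a small check on two P_Keys. This sketch assumes the usual encoding, where the high bit of the 16-bit P_Key marks full membership and the low 15 bits identify the partition; the function names are made up for illustration.

```python
FULL_MEMBER_BIT = 0x8000
PARTITION_MASK = 0x7FFF


def partition_base(pkey: int) -> int:
    """Low 15 bits: which partition the key belongs to."""
    return pkey & PARTITION_MASK


def is_full_member(pkey: int) -> bool:
    """High bit set means full membership, clear means limited."""
    return bool(pkey & FULL_MEMBER_BIT)


def can_communicate(pkey_a: int, pkey_b: int) -> bool:
    """Same partition, and at least one side must be a full member."""
    same_partition = partition_base(pkey_a) == partition_base(pkey_b)
    return same_partition and (is_full_member(pkey_a) or is_full_member(pkey_b))


print(can_communicate(0xFFFF, 0x7FFF))  # full member with limited member
print(can_communicate(0x7FFF, 0x7FFF))  # two limited members
```

Two limited members of the same partition cannot talk to each other, which is what makes limited membership useful for clients that should only reach a shared server.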


Switch