InfiniBand
- max 48k nodes per subnetwork
- GPU Direct: allows one GPU to access the memory of another.
- SHARP (Scalable Hierarchical Aggregation and Reduction Protocol): offloads collective operations, such as reductions, from the hosts to the switches.
- Fabric: (physical infrastructure) collection of links, switches and routers that connect a set of HCAs.
- Subnet: (logical infrastructure) ports and associated links that have a common subnet ID and are managed by a common subnet manager.
Speeds
- Technology: Link Rate (Link Speed x Link Width)
- SDR: 10gbps (2.5 x 4)
- QDR: 40gbps (10 x 4)
- FDR: 56gbps (14 x 4)
- EDR: 100gbps (25 x 4)
- HDR: 200gbps (50 x 4)
- NDR: 400gbps (100 x 4)
- XDR: 800gbps (200 x 4)
- GDR: 1600gbps (400 x 4)
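The link rate column is just lane speed times link width. A minimal sketch of that arithmetic, using the per-lane figures from the list above; the function name and the default width of 4 are illustrative choices, not part of any tool.

```python
# Per-lane speed in Gb/s for each generation, copied from the list above.
LANE_GBPS = {
    "SDR": 2.5, "QDR": 10, "FDR": 14, "EDR": 25,
    "HDR": 50, "NDR": 100, "XDR": 200, "GDR": 400,
}


def link_rate(generation: str, width: int = 4) -> float:
    """Return the link rate in Gb/s: lane speed multiplied by link width."""
    return LANE_GBPS[generation] * width


for gen in LANE_GBPS:
    print(f"{gen}: {link_rate(gen):g} Gb/s at 4x")
```

Passing a different `width` shows what a narrower or wider link of the same generation would give.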
Network elements
- Gateway: bridge between Ethernet and InfiniBand.
- HCA (Host Channel Adapter): the InfiniBand equivalent of a NIC; it connects a host to the fabric.
- Each HCA can have one or more nodes. Each node owns one physical port. Each HCA has a GUID. If the card has multiple nodes, their GUIDs are consecutive.
- Router: interconnects multiple InfiniBand subnets.
Subnet Manager
- Discovers the topology (nodes and links).
- Assigns Local Identifiers (LIDs) to nodes.
- Calculates the routing tables and programs the switch forwarding tables.
- Configures node and port parameters, such as the QoS policy.
- Manages all the elements and monitors changes.
The SM is software and can run on a switch, a server or a specialized device.
The management model is composed of:
- Subnet Manager (SM)
- Nodes: any managed entity, such as a switch, HCA or router.
- Agent (SMA): responds to manager queries and sends traps to the manager when required. Each node requires an SMA so that the SM can configure it.
Management Datagrams (MADs): message format for manager-agent communication.
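The SM duties (discover, assign LIDs, program forwarding tables) can be pictured with a toy model. This is only a sketch under stated assumptions: the fabric, node names and shortest-path routing are made up for illustration, and a real SM does all of this through MADs rather than a Python dict.

```python
from collections import deque

# Hypothetical fabric: node name -> {local port number: neighbour name}.
FABRIC = {
    "sw1": {1: "hca1", 2: "hca2", 3: "sw2"},
    "sw2": {1: "sw1", 2: "hca3"},
    "hca1": {1: "sw1"},
    "hca2": {1: "sw1"},
    "hca3": {1: "sw2"},
}
SWITCHES = {"sw1", "sw2"}  # only switches forward packets


def discover(start):
    """Breadth-first sweep from the node where the SM runs."""
    order, queue = [start], deque([start])
    while queue:
        node = queue.popleft()
        for port in sorted(FABRIC[node]):
            peer = FABRIC[node][port]
            if peer not in order:
                order.append(peer)
                queue.append(peer)
    return order


def assign_lids(nodes):
    """Hand out LIDs in discovery order, starting at 1."""
    return {node: lid for lid, node in enumerate(nodes, start=1)}


def output_port(switch, target):
    """Port on `switch` that starts a shortest path to `target` (0 = the switch itself)."""
    if switch == target:
        return 0
    seen = {switch}
    queue = deque()
    for port in sorted(FABRIC[switch]):
        peer = FABRIC[switch][port]
        if peer not in seen:
            seen.add(peer)
            queue.append((peer, port))
    while queue:
        node, first_port = queue.popleft()
        if node == target:
            return first_port
        if node not in SWITCHES:
            continue  # end nodes do not forward traffic
        for port in sorted(FABRIC[node]):
            peer = FABRIC[node][port]
            if peer not in seen:
                seen.add(peer)
                queue.append((peer, first_port))
    raise ValueError(f"{target} is unreachable from {switch}")


def build_forwarding_tables(lids):
    """One table per switch: destination LID -> output port."""
    return {
        sw: {lid: output_port(sw, node) for node, lid in lids.items()}
        for sw in sorted(SWITCHES)
    }


lids = assign_lids(discover("hca1"))
tables = build_forwarding_tables(lids)
for sw, table in tables.items():
    print(sw, {hex(lid): port for lid, port in table.items()})
```

Each switch ends up with one entry per LID in the subnet, and its own LID maps to port 0.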
Addresses
- Layer 1: GUID: Globally Unique Identifier. Unique address burned onto the hardware by the vendor. 64 bits.
- Layer 2: LID: Local Identifier. Address assigned by the SM. Unique within the subnet. Used by the switches for packet forwarding. 16 bits. In LRH.
- Layer 3: GID: Global Identifier. Identifies an end port or a multicast group. Routers use GIDs to forward packets between subnets, so a GID is unique across those subnets. 128 bits. In the GRH.
Monitoring
OFED (OpenFabrics Enterprise Distribution): openfabrics.org
MLNX_OFED and DOCA_OFED
Check the driver is installed and started
Check if the host was discovered as an infiniband node
Port must be up and active. Physical state:
- Polling: no cable connected, port damaged or incompatibility between nodes.
- Disabled: the port is disabled. Enable it with `ibportstate`.
- PortConfigurationTraining: the link is being negotiated.
- LinkUp: connected.
- LinkErrorRecovery: faulty cable.
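For quick triage, the physical states above can be kept as a lookup table. A small sketch; the dictionary and function names are hypothetical, and the hints are only the ones written in these notes.

```python
# Hints copied from the list above; keys are the physical state names.
PHYSICAL_STATE_HINTS = {
    "Polling": "no cable connected, port damaged or incompatibility between nodes",
    "Disabled": "port disabled; enable it with ibportstate",
    "PortConfigurationTraining": "link is negotiating",
    "LinkUp": "connected",
    "LinkErrorRecovery": "faulty cable",
}


def explain_physical_state(state: str) -> str:
    """Return the hint for a physical state, ignoring spaces in the name."""
    key = state.replace(" ", "")
    return PHYSICAL_STATE_HINTS.get(key, "state not covered in these notes")


print(explain_physical_state("Polling"))
```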
Verify Layer 2 connectivity (ping). Needs to run on both ends.
Trace the path (traceroute). Can be issued from any node.
Port State, link width, link speed, LID
List all the switches and subnet info (guid, name, ports, lid)
List all HCAs in the subnet (guid, name, ports)
List HCAs and nodes
Query infiniband switch forwarding tables LFTs
LIDs are shown in hex. An output port of 000 means the packet is not forwarded: it is processed internally by the switch.
All the nodes and their info (LID, port, speed and width, state, peer LID, peer port, hostname)
Perform a discovery and output a topology file. Also display data (Node, node type, node description, links, port, lids and guids)
Display ib devices installed on local server
Display the GID and LID information
Performance benchmarks
Master SM (LID, GUID, priority, state)
Node description of LID
Query all active SMs
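The commands that went with the tasks above are not recorded in these notes. The sketch below pairs each task with the diagnostic tool that usually covers it; the pairing is an assumption on my part, not something the notes confirm, and the script only checks which tools are on `PATH` without running any of them.

```python
import shutil

# Assumed pairing of the tasks above with commonly used tools
# (infiniband-diags, libibverbs utilities, perftest). Not taken from the notes.
DIAG_TOOLS = {
    "layer 2 ping": "ibping",
    "trace the path": "ibtracert",
    "port state, width, speed, LID": "ibportstate",
    "list switches": "ibswitches",
    "list HCAs": "ibhosts",
    "list HCAs and switches": "ibnodes",
    "switch forwarding tables": "ibroute",
    "link info for all nodes": "iblinkinfo",
    "topology discovery": "ibnetdiscover",
    "local IB devices": "ibv_devices",
    "GID and LID": "ibaddr",
    "performance benchmark": "ib_send_bw",
    "master SM": "sminfo",
    "node description": "smpquery",
    "active SMs": "saquery",
}


def available_tools():
    """Return the subset of DIAG_TOOLS whose executable is found on PATH."""
    return {task: tool for task, tool in DIAG_TOOLS.items() if shutil.which(tool)}


for task, tool in DIAG_TOOLS.items():
    status = "found" if shutil.which(tool) else "missing"
    print(f"{task:32s} {tool:14s} {status}")
```

Check each tool's man page for the exact arguments before relying on it.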
GUID types:
- System image GUID: allows multiple GUIDs to be treated as one.
- Node GUID (HCA, switch or router).
- Port GUID (HCA port).
Link Layer (Layer 2)
Types of packets:
- Link management packets
- Data packets
Packet headers
- LRH (Local Routing Header)
  - 8 bytes.
  - Contains the LID.
  - Used to route packets within the local subnet.
  - Contains the QoS configuration with SL and VL.
    - Service Level (SL): mark in the header that lets different applications declare their traffic class.
    - Virtual Lanes (VL): allow multiple virtual links on a single physical link. Each VL has its own Tx and Rx buffers, which enables separate flow control. 16 max.
      - VL 15: management only (SM, link control).
      - VL 0: data traffic.
    - Packets are mapped to a VL based on their SL value.
    - Each VL has a weight and priority.
- LFT (Linear Forwarding Table): like a MAC address table for IB switches.
  - Maps each DLID (Destination LID) to an output port.
  - Switches also contain the SL to VL mappings.
LID
- Assigned by the SM at initialization and when the topology changes.
- HCAs are assigned one LID per port.
- A single-IC switch is assigned a single LID.
- Modular switches are assigned one LID per switch module in the chassis.
- Each subnet can contain up to:
  - 48K unicast LID addresses.
  - 16K multicast LID addresses.
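The 48K/16K split comes from how the 16-bit LID space is divided. A minimal sketch, assuming the ranges as I recall them from the InfiniBand specification (0x0001 to 0xBFFF unicast, 0xC000 to 0xFFFE multicast, 0x0000 reserved, 0xFFFF permissive); the function name is hypothetical.

```python
UNICAST_FIRST, UNICAST_LAST = 0x0001, 0xBFFF
MULTICAST_FIRST, MULTICAST_LAST = 0xC000, 0xFFFE


def classify_lid(lid: int) -> str:
    """Classify a 16-bit LID by the range it falls in."""
    if not 0 <= lid <= 0xFFFF:
        raise ValueError("a LID is a 16-bit value")
    if lid == 0x0000:
        return "reserved"
    if lid <= UNICAST_LAST:
        return "unicast"
    if lid <= MULTICAST_LAST:
        return "multicast"
    return "permissive"


unicast_count = UNICAST_LAST - UNICAST_FIRST + 1        # roughly 48K
multicast_count = MULTICAST_LAST - MULTICAST_FIRST + 1  # roughly 16K
print(unicast_count, multicast_count)
```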
Credit Based Flow Control
- The receiver grants credits to the sender. A sender without credits does not send.
- This guarantees that the receive buffer never overflows, so packets are not dropped for lack of buffer space and do not need to be retransmitted.
- It is a credit rather than a debit network.
- Credits are allocated per VL.
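A toy model makes the credit mechanism concrete: the sender spends one credit per packet on a VL and gets it back when the receiver frees the buffer slot. The class and method names are hypothetical, and real credits are counted in buffer blocks rather than whole packets.

```python
class CreditedLink:
    """One direction of a link with independent credits per virtual lane."""

    def __init__(self, buffer_slots):
        # buffer_slots: {vl: number of receive buffer slots advertised as credits}
        self.credits = dict(buffer_slots)
        self.rx_buffers = {vl: [] for vl in buffer_slots}

    def send(self, vl, packet):
        """Transmit only if the sender still holds a credit for this VL."""
        if self.credits[vl] == 0:
            return False  # sender waits; nothing is dropped
        self.credits[vl] -= 1
        self.rx_buffers[vl].append(packet)
        return True

    def receiver_drain(self, vl):
        """Receiver consumes one packet and returns a credit to the sender."""
        if not self.rx_buffers[vl]:
            return None
        packet = self.rx_buffers[vl].pop(0)
        self.credits[vl] += 1
        return packet


link = CreditedLink({0: 2, 15: 1})
print(link.send(0, "a"), link.send(0, "b"), link.send(0, "c"))  # third send has no credit
print(link.send(15, "mad"))  # VL 15 is unaffected by VL 0 being full
```

Because credits are tracked per VL, a stalled data lane does not block management traffic.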
CRCs
- ICRC (Invariant CRC): covers the fields that do not change from source to destination.
- VCRC (Variant CRC): covers all fields in the packet.
- Using both allows switches to modify fields and still maintain data integrity.
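The two-CRC idea can be sketched with a stand-in checksum. This assumes `zlib.crc32` in place of the real ICRC and VCRC algorithms (which use their own polynomials and sizes), and the packet layout is hypothetical: one variant byte that a switch may rewrite, plus invariant bytes.

```python
import zlib


def icrc(invariant_fields: bytes) -> int:
    """Stand-in invariant CRC: covers only fields that never change in flight."""
    return zlib.crc32(invariant_fields)


def vcrc(variant_fields: bytes, invariant_fields: bytes) -> int:
    """Stand-in variant CRC: covers every field, so it is recomputed at each hop."""
    return zlib.crc32(variant_fields + invariant_fields)


payload = b"rdma write payload"
at_source = {"variant": bytes([0]), "invariant": payload}
after_switch = {"variant": bytes([3]), "invariant": payload}  # switch rewrote the variant byte

# The end-to-end check still matches, while the per-hop check had to be regenerated.
print(icrc(at_source["invariant"]) == icrc(after_switch["invariant"]))
print(vcrc(**{"variant_fields": at_source["variant"], "invariant_fields": payload})
      == vcrc(after_switch["variant"], payload))
```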
Network Layer
- Routing: a router can join multiple subnets. This is done by assigning a GID to each port.
- Uses the GRH (Global Routing Header) packet header.
- The GID is 128 bits and is carried inside the GRH.
  - Subnet prefix (64 bits) + port GUID (64 bits).
  - Starts with fe80: and looks like an IPv6 address.
  - A GID identifies an end port or a multicast group.
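Building a GID from its two halves is a shift and an OR. A minimal sketch, assuming the fe80:: prefix mentioned above as the default subnet prefix; the port GUID is a made-up value and the function name is hypothetical.

```python
import ipaddress

DEFAULT_SUBNET_PREFIX = 0xFE80000000000000  # upper 64 bits, the fe80:: prefix


def make_gid(port_guid: int, subnet_prefix: int = DEFAULT_SUBNET_PREFIX) -> ipaddress.IPv6Address:
    """Concatenate a 64-bit subnet prefix and a 64-bit port GUID into a 128-bit GID."""
    if not 0 <= port_guid < 2**64 or not 0 <= subnet_prefix < 2**64:
        raise ValueError("prefix and GUID are both 64-bit values")
    return ipaddress.IPv6Address((subnet_prefix << 64) | port_guid)


gid = make_gid(0x0002C90300A1B2C3)  # hypothetical port GUID
print(gid)
print(gid.exploded)
```

Using `ipaddress.IPv6Address` is only a convenient way to print the value in IPv6 notation; a GID is not an IP address.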
Transport Layer
- Queue Pairs (QP): allow an application to bypass the kernel and exchange data directly with the HCA.
  - Each QP represents one end of a channel.
  - Two queues per QP: Send Queue and Receive Queue.
  - Each QP has a QP number. 24 bits.
  - The application's virtual addresses are mapped into the QP to give direct access to the hardware.
  - One QP per connection.
- QP Workflow
- A Work Queue is the application's interface to the InfiniBand fabric
- An application that wishes to send/receive data on the channel, posts a Work Request (WR) to a work queue
- A WR placed on the work queue is called a Work Queue Element (WQE)
- When the HCA completes a WQE, a Completion Queue Element (CQE) is placed on a completion queue
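The WR, WQE and CQE life cycle can be pictured with a toy queue pair. This is an illustrative model only: the class and method names are hypothetical and loosely echo the verbs vocabulary, and the "HCA" here is just a method call.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class WorkRequest:
    wr_id: int
    payload: bytes


class QueuePair:
    """Toy QP: a send queue, a receive queue and an attached completion queue."""

    def __init__(self, qp_num: int):
        if not 0 <= qp_num < 2**24:
            raise ValueError("QP numbers are 24 bits")
        self.qp_num = qp_num
        self.send_queue = deque()
        self.recv_queue = deque()
        self.completion_queue = deque()

    def post_send(self, wr: WorkRequest) -> None:
        """The application posts a WR; once queued it is a WQE."""
        self.send_queue.append(wr)

    def hca_process_one(self) -> bool:
        """Stand-in for the HCA: consume one WQE and emit a CQE."""
        if not self.send_queue:
            return False
        wqe = self.send_queue.popleft()
        self.completion_queue.append(
            {"wr_id": wqe.wr_id, "status": "success", "byte_len": len(wqe.payload)}
        )
        return True

    def poll_cq(self):
        """The application polls for a completion; None if nothing finished yet."""
        return self.completion_queue.popleft() if self.completion_queue else None


qp = QueuePair(0x11)
qp.post_send(WorkRequest(wr_id=1, payload=b"hello"))
qp.hca_process_one()
print(qp.poll_cq())
```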
Transport Service Types
- Connection and Datagram
  - Connection (unicast): dedicated QP. Uses more kernel memory.
  - Datagram (multicast): a single QP serves multiple connections. Segmentation is not supported. Not as performant, but a single QP can serve thousands of clients.
- Reliable and Unreliable
  - Reliable (like TCP): retransmits if a packet is dropped. Uses a Packet Sequence Number (PSN). Uses ACKs and NACKs.
  - Unreliable (like UDP): no guarantee of arrival. No guarantee of packet order.
Combining them gives four service types:
- RC: Reliable Connection
- UC: Unreliable Connection
- RD: Reliable Datagram
- UD: Unreliable Datagram
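The reliable services depend on the receiver checking each PSN. A simplified sketch of that check, assuming a 24-bit PSN that wraps around; the class name is hypothetical, and real responders also handle duplicates and timeouts, which this ignores.

```python
PSN_BITS = 24  # assumption: PSN width as I recall it from the specification
PSN_MASK = (1 << PSN_BITS) - 1


class ReliableReceiver:
    """Accepts packets only in PSN order and asks for a resend otherwise."""

    def __init__(self, expected_psn: int = 0):
        self.expected_psn = expected_psn
        self.delivered = []

    def receive(self, psn: int, payload: str):
        if psn == self.expected_psn:
            self.delivered.append(payload)
            self.expected_psn = (psn + 1) & PSN_MASK
            return ("ACK", psn)
        # Out of sequence: tell the sender which PSN is still missing.
        return ("NACK", self.expected_psn)


rx = ReliableReceiver()
print(rx.receive(0, "a"))
print(rx.receive(2, "c"))  # PSN 1 was lost on the way
print(rx.receive(1, "b"))  # sender retransmits from the missing PSN
print(rx.receive(2, "c"))
print(rx.delivered)
```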
RDMA works by letting the sender read from and write to the receiving application's virtual buffer directly, bypassing the network stack.
Partitions (vlans)
A partition describes a set of end nodes within the subnet that may communicate. Ports may be members of multiple partitions at once. Ports in different partitions are unaware of each other.
Partition ID (PKEY): 16-bit field in the BTH (Base Transport Header).
Membership types:
- Full: can communicate with all members.
- Limited: can communicate only with full members.
Default partition 0x7FFF. All ports are full members of it.
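The membership rule can be checked with a few bit operations. A minimal sketch, assuming the layout as I recall it from the specification: the top bit of the 16-bit PKEY marks full membership and the low 15 bits are the partition number, so a full member of the default partition carries 0xFFFF. Function names are hypothetical.

```python
DEFAULT_PARTITION = 0x7FFF
FULL_MEMBER_BIT = 0x8000  # assumption: most significant bit marks full membership


def parse_pkey(pkey: int):
    """Split a 16-bit PKEY into (partition number, is_full_member)."""
    if not 0 <= pkey <= 0xFFFF:
        raise ValueError("a PKEY is a 16-bit value")
    return pkey & 0x7FFF, bool(pkey & FULL_MEMBER_BIT)


def can_communicate(pkey_a: int, pkey_b: int) -> bool:
    """Same partition, and at least one side is a full member."""
    part_a, full_a = parse_pkey(pkey_a)
    part_b, full_b = parse_pkey(pkey_b)
    return part_a == part_b and (full_a or full_b)


print(parse_pkey(0xFFFF))               # default partition, full member
print(can_communicate(0xFFFF, 0x7FFF))  # full with limited
print(can_communicate(0x7FFF, 0x7FFF))  # limited with limited
```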