Hardware

The system is built around IBM POWER9 CPUs and NVIDIA Tesla GPUs. Within a node, both the CPUs and GPUs are attached to an NVIDIA NVLink 2.0 bus; between nodes, a dual-rail Mellanox EDR InfiniBand interconnect allows GPUDirect RDMA communications (direct memory transfers to and from GPU memory).
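
As a minimal illustration of what the intra-node NVLink connectivity allows, the CUDA sketch below copies a buffer directly between two GPUs in the same node. It is a hedged example only: it assumes at least two visible, peer-capable devices (IDs 0 and 1) and omits error handling. On NVLink-connected GPUs such peer copies avoid staging through host memory.

    // Hedged sketch: direct GPU-to-GPU copy between two devices in one node.
    // Assumes device IDs 0 and 1 are peer-capable; error handling omitted.
    #include <cuda_runtime.h>
    #include <cstdio>

    int main() {
        int can01 = 0, can10 = 0;
        cudaDeviceCanAccessPeer(&can01, 0, 1);
        cudaDeviceCanAccessPeer(&can10, 1, 0);
        if (!can01 || !can10) {
            std::printf("Peer access between devices 0 and 1 is unavailable\n");
            return 1;
        }

        const size_t bytes = 256u << 20;            // 256 MiB test buffer

        float *src = nullptr, *dst = nullptr;
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);           // device 0 may address device 1
        cudaMalloc(&src, bytes);

        cudaSetDevice(1);
        cudaDeviceEnablePeerAccess(0, 0);           // device 1 may address device 0
        cudaMalloc(&dst, bytes);

        // Direct copy between the two GPUs' memories (no host staging).
        cudaMemcpyPeer(dst, 1, src, 0, bytes);
        cudaDeviceSynchronize();

        cudaFree(dst);
        cudaSetDevice(0);
        cudaFree(src);
        return 0;
    }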

Together with IBM's software stack, the POWER9 architecture is particularly well suited to:

  • Large-memory GPU use: the GPUs can access main system memory via POWER9's large model feature (see the sketch after this list).
  • Multi-node GPU use, via IBM's Distributed Deep Learning (DDL) software.
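
At the CUDA level, a simple way to picture the first point is an allocation larger than a single V100's 32GB: with managed memory on POWER9 the GPU can fault pages in from main system memory. The sketch below is illustrative only (the 64 GiB size is an assumption chosen to exceed one GPU's memory, and error handling is omitted); IBM's large-model support in its deep learning frameworks builds on the same capability.

    // Hedged sketch: a managed allocation larger than one V100's 32GB.
    // On POWER9 the GPU can page data in from main system memory, so the
    // kernel can walk the whole buffer. Size is illustrative; no error checks.
    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void scale(double *data, size_t n, double factor) {
        size_t stride = (size_t)blockDim.x * gridDim.x;
        for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
            data[i] *= factor;
    }

    int main() {
        const size_t n = (64ULL << 30) / sizeof(double);   // ~64 GiB of doubles

        double *data = nullptr;
        cudaMallocManaged(&data, n * sizeof(double));      // visible to CPU and GPU

        for (size_t i = 0; i < n; ++i)                     // first touch on the host
            data[i] = 1.0;

        scale<<<1024, 256>>>(data, n, 2.0);                // GPU traverses the full buffer
        cudaDeviceSynchronize();

        std::printf("data[0] = %f\n", data[0]);
        cudaFree(data);
        return 0;
    }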

There are:

  • 2x “login” nodes, each containing:
    • 2x POWER9 CPUs @ 2.4GHz (40 cores total and 4 hardware threads per core), with NVLink 2.0
    • 512GB DDR4 RAM
    • 4x Tesla V100 32GB NVLink 2.0
    • 1x Mellanox EDR (100Gbit/s) InfiniBand port
  • 32x “gpu” nodes, each containing:
    • 2x POWER9 CPUs @ 2.7GHz (32 cores total and 4 hardware threads per core), with NVLink 2.0
    • 512GB DDR4 RAM
    • 4x Tesla V100 32GB NVLink 2.0
    • 2x Mellanox EDR (100Gbit/s) InfiniBand ports
  • 4x “infer” nodes, each containing:
    • 2x POWER9 CPUs @ 2.9GHz (40 cores total and 4 hardware threads per core)
    • 256GB DDR4 RAM
    • 4x Tesla T4 16GB PCIe
    • 1x Mellanox EDR (100Gbit/s) InfiniBand port

The Mellanox EDR InfiniBand interconnect is organised in a 2:1 blocking fat tree topology. GPUDirect RDMA transfers are supported on the 32 “gpu” nodes only, as they require an InfiniBand port per POWER9 CPU socket.
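
In practice these transfers are usually driven through a CUDA-aware MPI library, which accepts device pointers directly; whether a given message actually uses GPUDirect RDMA depends on how the MPI stack is configured. The sketch below is a hedged example: it assumes a CUDA-aware MPI build, exactly two ranks with one GPU each, and omits error handling.

    // Hedged sketch: exchanging GPU-resident data with a CUDA-aware MPI.
    // Device pointers are passed straight to MPI_Send/MPI_Recv; with
    // GPUDirect RDMA the payload can move between GPU memory and the
    // InfiniBand adapter without staging through host memory.
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int n = 1 << 20;                      // 1M floats (~4 MB)
        float *buf = nullptr;
        cudaSetDevice(0);                           // simplified: real jobs map each rank to its own GPU
        cudaMalloc(&buf, n * sizeof(float));
        cudaMemset(buf, 0, n * sizeof(float));

        if (rank == 0)
            MPI_Send(buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        cudaFree(buf);
        MPI_Finalize();
        return 0;
    }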

Storage is provided by a 2PB Lustre filesystem capable of reaching 10GB/s read or write performance, supplemented by an NFS service for more modest home and project directory needs.