pNFS TESTBEDS
pNFS Scale-up Clusters
Clients
- 40x Gigabyte H62-Z6A-Y9 nodes, 4 blades per node
- 100 blades have 3x Samsung PM9A3 Gen4 x4 M.2 NVMe SSDs (1.92 TB each); each blade with:
- Single-socket AMD EPYC 7313, 24 cores (48 threads with SMT)
- 128 GB DDR4 3200 MT/s
- 1x 200 Gb/s Ethernet (ConnectX-6)
- 1 Gb/s management
Data Servers
- 12x Gigabyte R272-Z32-0 servers, each with
- Single-socket AMD EPYC 7502, 32 cores (64 threads with SMT)
- 16x NVMe Gen3 x4 U.2 SSDs
- 128 GB DDR4 2933 MT/s
- 2x 200 Gb/s Ethernet (ConnectX-6)
- 1 Gb/s management
Mongo Data Servers
- 4x Aeon Eclipse 2U Gen5 servers, each with
- Dual-socket AMD EPYC 9575F, 64 cores per socket at 3.3 GHz (128 cores / 256 threads with SMT)
- 16x 2 TB Samsung 9100 Pro M.2 NVMe SSDs with U.2 adapters
- 24x 32 GB DDR5 5600 MT/s (768 GB total)
- 4x 400 Gb/s Ethernet (ConnectX-7)
- 1 Gb/s Management
Grace-Grace Server
- 1x Supermicro ARS-221GL-NR-01, with
- Dual-socket 72-core 3.4 GHz NVIDIA Arm Neoverse V2
- 2x Samsung PM9A3 Gen4 x4 M.2 3.84 TB
- 2x 240 GB 8532 MT/s LPDDR5X
- 1x 400 Gb/s Ethernet (ConnectX-7) NDR
- 1 Gb/s Management
Networking Hardware
- 1x Arista DCS-7804-CH, with
- 2x Supervisor DCS-7800-SUP1A Modules
- 4x 36-port (400G) Line cards (7800R3-36P-LC)
- 5x Arista DCS-7010TX-48-R (1Gb) Switches
pNFS Scale-out Clusters
64-node 100GbE cluster (2 racks)
- Single Mellanox 100GbE non-blocking 64-port switch
- Each Dell PowerEdge R640 node with:
- 2x Intel Xeon Gold 6244 8C/16T CPUs
- ConnectX-5 VPI NIC in 100 Gigabit Ethernet mode
- 10GbE control network
- 42x "Client" role nodes with 192 GB DDR4 ECC RAM
- 22x "Server" role nodes with 192 GB RAM + 5x Samsung 990 Pro 1 TB SSDs
324-node EDR InfiniBand cluster (8 racks + switches)
- 27x Mellanox EDR IB switches in Fat-Tree topology
- Each Dell PowerEdge R640 node with:
- 2x Intel Xeon Gold 6244 8C/16T CPUs
- ConnectX-4 VPI NIC in 100 Gb/s EDR InfiniBand mode
- 10GbE control network
- 216x "Client" role nodes with 192 GB DDR4 ECC RAM
- 108x "Server" role nodes with 192 GB RAM + 5x Samsung 990 Pro 1 TB SSDs
SOFTWARE/TEST ENVIRONMENTS
EMULAB
Emulab is a software platform that manages the nodes of a testbed cluster. It provides Emulab users with full bare-metal access to nodes, which allows researchers to use a wide range of environments in which to develop, debug, and evaluate their systems. The primary Emulab installation is run by the Flux Group, part of the School of Computing at the University of Utah. There are also installations of the Emulab software at more than two dozen sites around the world, ranging from testbeds with a handful of nodes up to testbeds with hundreds of nodes. Emulab is widely used by computer science researchers in the fields of networking and distributed systems. It is also designed to support education and has been used to teach classes in those fields.
MVPNET
MVPNet is an MPI application that allows users to launch a set of qemu-based virtual machines (VMs) as an MPI job. Users are free to choose the guest operating systems to run and have full root access to the guest. Each mvpnet MPI rank runs its own guest VM under qemu. Guest operating systems communicate with each other over an MPI-based virtual private network managed by mvpnet. Each mvpnet guest VM has a virtual Ethernet interface configured using qemu's -netdev stream or -netdev dgram flags. The qemu program connects this type of virtual Ethernet interface to a Unix domain socket file on the host system. The mvpnet application reads Ethernet frames sent by its guest OS from its Ethernet interface socket file and then uses MPI point-to-point operations to forward each frame to the mvpnet rank running the destination guest VM. The destination mvpnet rank delivers the Ethernet frame to its guest VM by writing it to the VM's socket file. To route Ethernet frames, mvpnet uses a fixed mapping between its MPI rank number, the guest VM IP address, and the guest VM Ethernet hardware address. mvpnet also supports IPv4 ARP and Ethernet broadcast operations.
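The sketch below illustrates the kind of fixed rank-to-address mapping and frame-forwarding logic described above, written with mpi4py for readability. The 10.0.0.0/24 subnet, the 02:00:00:00:00:xx MAC layout, the tag value, and the helper names are assumptions made for this example and are not taken from mvpnet itself.

# Illustrative sketch of mvpnet-style frame routing (not the actual mvpnet code).
# Assumed conventions for this example: rank r hosts guest r, guest r has IP
# 10.0.0.(r+1) and MAC 02:00:00:00:00:rr, and frames travel with tag FRAME_TAG.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
FRAME_TAG = 1  # hypothetical MPI tag reserved for Ethernet frames

def rank_to_ip(r):
    """Fixed mapping: MPI rank -> guest IPv4 address (assumed layout)."""
    return f"10.0.0.{r + 1}"

def rank_to_mac(r):
    """Fixed mapping: MPI rank -> guest Ethernet hardware address (assumed layout)."""
    return bytes([0x02, 0x00, 0x00, 0x00, 0x00, r & 0xFF])

def mac_to_rank(mac):
    """Inverse mapping used to pick the destination rank for a unicast frame."""
    return mac[5]

def forward_frame(frame):
    """Route one Ethernet frame read from the local guest's -netdev socket."""
    dst_mac = frame[0:6]
    if dst_mac == b"\xff\xff\xff\xff\xff\xff":
        # Ethernet broadcast (e.g., IPv4 ARP requests): copy to every other rank.
        for r in range(comm.Get_size()):
            if r != rank:
                comm.send(frame, dest=r, tag=FRAME_TAG)
    else:
        # Unicast: the destination rank follows from the fixed MAC mapping.
        comm.send(frame, dest=mac_to_rank(dst_mac), tag=FRAME_TAG)

def deliver_loop(write_to_guest_socket):
    """Receive frames from peer ranks and write them to the local VM's socket file."""
    while True:
        frame = comm.recv(source=MPI.ANY_SOURCE, tag=FRAME_TAG)
        write_to_guest_socket(frame)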
OpenCHAMI
The LANL testbed uses OpenCHAMI (GitHub) to boot and manage its nodes. OpenCHAMI is an open-source, microservice-based system management platform that adheres to cloud principles. Nodes boot images over the network so that images can be managed centrally; the images are SquashFS archives built using OpenCHAMI's image-builder tool. To reduce image complexity, post-boot configuration is handled by OpenCHAMI's cloud-init server, a replacement for Canonical's upstream cloud-init that organizes post-boot configuration by node group.
Images are built in layers, with each new layer built on top of existing ones. This compartmentalizes image changes so that only the affected components of an image need to be rebuilt. Partners can perform tasks in booted images, such as installing packages, with the expectation that these changes are ephemeral and lost on reboot. If something in the image needs to be changed persistently, partners can request changes to the configuration of their own image layer, which will then be rebuilt using the OpenCHAMI image-builder tool; otherwise, they can submit their own image to be added to the image repository. If modifying an image is not desired, changes to the cloud-init post-boot configuration for the partner's image can be requested instead.
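As a rough illustration of the layered-rebuild idea (this is not OpenCHAMI's image-builder configuration or code), the sketch below rebuilds a layer only when its own recipe, or the recipe of a layer beneath it, has changed; the layer names and package lists are hypothetical.

# Conceptual sketch of layered image builds; hypothetical names throughout.
import hashlib

def layer_id(parent_id, recipe):
    """A layer's identity depends on its own recipe and on every layer below it."""
    return hashlib.sha256((parent_id + recipe).encode()).hexdigest()[:12]

def build_image(layers, cache):
    """Rebuild only the layers whose recipe (or an ancestor's recipe) changed."""
    parent = ""
    for name, recipe in layers:
        lid = layer_id(parent, recipe)
        if lid not in cache:
            print(f"rebuilding layer {name} ({lid})")
            cache[lid] = f"squashfs-for-{name}"  # stand-in for the real build step
        else:
            print(f"reusing cached layer {name} ({lid})")
        parent = lid
    return parent  # identity of the final (topmost) image

# Hypothetical stack: a base OS layer, a site layer, and a partner layer.
stack = [("base-os", "kernel systemd"),
         ("site", "slurm lustre-client"),
         ("partner", "partner-tools")]
cache = {}
build_image(stack, cache)                   # first build: every layer is built
stack[2] = ("partner", "partner-tools v2")  # partner changes their own layer
build_image(stack, cache)                   # only the partner layer is rebuilt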
