History of HPC at UEA

The history of HPC at UEA stretches back to the mid-1990s.

The HPC provision has always been specified as a general-purpose facility for research that requires compute-intensive resources. It has been used to meet a wide variety of research needs within our Science faculties, and is now attracting users from across the University, from the Medical School to Economics.

Over the years HPC provision has been through several iterations.  It has grown from a handful of Unix workstations serving a dozen applications with less than 1GB of RAM and 100GB of storage, to today's Linux cluster with over 3000 CPU cores, more than 6TB of RAM and more than 100TB of storage, providing approximately 350 applications.

[Graph: growth in core count and performance over the years]

The early days

Unix provision at UEA started in the early 1990s with a "farm" of DEC Alpha machines.

The HPCMG (High Performance Computing Management Group) was set up in January 1997 to provide for the growing demand for "High Performance Computing" from researchers in the science schools.  £50,000 of startup funding provided for a new machine and a 0.5 FTE support post.

In June 1998 new hardware was installed: cpca12, a Unix AlphaServer DS20 with dual 533MHz processors and 2GB of memory.  A job distribution system (DQS) was implemented, running 4 batch queues and 4 interactive slots.  Overflow batch queues ran on the newer of the stand-alone Alpha farm machines (cpca4, 5, 6 and 7).  At the time there were approximately 20 Unix machines scattered around the science schools.

Over the next few years further DS20s were added to the pool.  Storage was on local disks, backed up to tape once a week.

Beo1 - 2003-2005

The HPC world had moved on: Unix machines were becoming obsolete, too expensive compared with the fast-developing Linux machines that were becoming available.

Our first Linux Beowulf cluster was intended as a proof of concept - a step into the quickly emerging world of Linux HPC clusters.

It was a small cluster installed by Compusys.  All nodes were 32-bit machines, and the cluster was connected with both Ethernet and Myrinet networks.

  • 8-node dual-processor Xeon 2.4GHz with 2GB RAM per node
  • 5-node dual-processor Xeon 3.0GHz with 2GB RAM per node
  • 16-node Myrinet switch

Storage was provided by local disk on the master node, served to the slave nodes.

Cluster1 - 2004-2010

Cluster1 was our first large-scale production Linux cluster. It was a ClusterVision installation commissioned in spring 2005.  Between April 2004 and March 2006 the project was funded as part of the university's HEFCE SRIF-2 (Science Research Investment Fund) project.

The system comprised:

  • Master node, which controlled management of the cluster
  • Two fileserver nodes connected to the UEA Storage Area Network
  • Gigabit Ethernet backbone, comprising 4 Foundry switches, used for file transfer
  • Myrinet low-latency backbone connected to 32 of the nodes, highly optimised for parallel computing

The Myrinet portion was upgraded to 64 nodes in summer 2008.

Specification

  • 104 AMD Opteron 246 (2.0GHz) dual-processor slave nodes for 64-bit applications, 3GB RAM per node
  • 64 AMD Opteron 2212 (2.0GHz) dual-core dual-processor nodes for 64-bit applications with Myrinet interconnect, 4GB RAM per node
  • a small number of Intel Xeon (2.4GHz) dual-processor slave nodes for 32-bit applications (migrated from beo1)
  • Gigabit Ethernet backbone consisting of 4 Foundry FastIron Edge X448 48-port switches
  • 64-node Myrinet low-latency backbone for parallel/shared memory applications
  • 4 AMD Opteron 2212 (2.0GHz) dual-core dual-processor login nodes
  • Single AMD Opteron 246 (2.0GHz) dual-processor master node used for slave image deployment, cluster-wide management and job scheduling
  • 2 AMD Opteron 246 (2.0GHz) dual-processor fileserver nodes connected to the UEA Storage Area Network
  • SUSE Linux 9.1
  • Sun Grid Engine (SGE) v6.0u7 queueing system

Cluster1 was decommissioned in 2010.

ESCluster - 2008-2012

The UEA ESCluster was the first HPC 'Bright cluster' in Europe that used Dell PowerEdge servers featuring Intel quad-core processor technology.  A Bright cluster is a pre-configured, high-performance computing cluster.

Funded by HEFCE SRIF-3, the Dell and ClusterVision installation began in the first quarter of 2008, coming online to our users in the third quarter.

Hardware - Access was via a login node, and jobs were distributed to the slave nodes using SGE 6.1.

Compute Nodes

  • 56 Intel Harpertown dual quad-core 2.66GHz, with 16GB RAM
  • 48 Intel Harpertown dual quad-core 2.66GHz, with 8GB RAM and low-latency 10Gb/s Infiniband interconnect (in two 24-node sets) for parallel jobs
  • 4 Intel Harpertown dual quad-core 2.66GHz, with 32GB RAM
  • 1 Dell R900 Intel Harpertown quad-processor quad-core 2.4GHz, with 128GB RAM

The cluster ran Scientific Linux 5, with ClusterVision OS 3 and Sun Grid Engine.

GPFS Storage Cluster

  • 4 Dell PowerEdge 1950 Intel Harpertown dual quad-core 2.0GHz 8GB RAM – Storage nodes (two scratch attached, two SAN attached)
  • 2 HP ProLiant DL380 G5 Intel dual quad-core 2.0GHz 16GB RAM – SAN attached nodes
  • 2 HP ProLiant DL380 G5 Intel dual dual-core 2.0GHz 16GB RAM – SAN attached nodes
  • SVC attached IBM Enterprise SAN storage, capacity approx 14TB
  • Dell MD3000/MD1000 Storage array, capacity approx 20TB
  • HSM Tape archive, capacity 30TB

Grace - 2010-2018

Building on the successful provision of High Performance Computing resources to the research community at UEA over a number of years, Research Computing Services tendered for a new High Performance Computing cluster to be installed in 2010, looking to develop an ongoing partnership with an HPC provider. This included meeting a number of challenges:

  • providing an effective and reliable HPC resource fitting the research community's requirements
  • sustainable HPC
  • making HPC more accessible

A partnership was formed with Viglen Ltd, who shared our goals for developing High Performance Computing and were keen to engage in a true collaborative partnership to take Research Computing at UEA into the future.

The new cluster was funded by ISD and was a major advance on the existing resource, providing a significant increase in both core count and performance.

A competition was launched to name the cluster. Toby Richmond's (MAC) suggestion of 'Grace' was selected from almost 800 entries. Toby said: "I thought it would be a good idea to recognise the contribution of female IT pioneers like Grace Hopper". The judges also noted that the name provides a relevant and appropriate acronym in 'Greener Research Computing Environment'.

Grace (the first iteration) ran the Red Hat-compatible CentOS 5.5 and the Platform LSF workload manager, and consisted of:

  • 168 dual-processor, six-core Intel X5650 2.66GHz systems
  • Each system with 24GB of RAM (2GB per core)
  • Quad Data Rate Infiniband on 56 nodes – 672 parallel cores
  • Total of 2016 cores
  • Theoretical peak performance of 21.45TFlops

Grace used the existing GPFS storage cluster:

  • 2 Dell PowerEdge 1950 Intel Harpertown dual quad-core 2.0GHz 8GB RAM – SAN attached nodes
  • 2 HP ProLiant DL380 G5 Intel dual quad-core 2.0GHz 16GB RAM – SAN attached nodes
  • 2 HP ProLiant DL380 G5 Intel dual dual-core 2.0GHz 16GB RAM – SAN attached nodes
  • SVC attached IBM Enterprise SAN storage, current capacity approx 32TB
  • Dell MD3000/MD1000 Storage array, current capacity approx 20TB
  • HSM Tape archive, current capacity 40TB

Over the following years there was an ongoing increase in Grace's capacity, encompassing increased core counts for both sequential and parallel computing, faster processors, and more capacity for large-memory computing.  Storage capacity (backed-up, scratch, and archive) grew substantially to over 100TB.

In 2012, phase 2 added 64 nodes providing an additional 1024 cores based on Intel Sandy Bridge E5-2670 2.6GHz CPUs.

In 2013, phase 3 increased the large-memory machine count and added 68 compute nodes providing a further 1088 Sandy Bridge cores.  Once again these nodes were based on the Intel Xeon E5-2670 2.60GHz 8-core processor, giving each node 16 computation slots and 32GB of RAM.  Additional Infiniband infrastructure was included, with the nodes divided as follows:

  • 48 Infiniband nodes providing an additional 768 parallel slots
  • 20 standard Ethernet nodes providing an additional 320 sequential slots
  • additions to the large-memory resource:
    • 18 nodes with 48GB of memory
    • 8 nodes with 64GB of memory
    • 1 node with 128GB of memory

The upgrade took Grace up to more than 300 computational nodes, increasing the core count to 4148 cores with a theoretical peak performance of nearly 65TFlops.  The new hardware was installed in UEA Data Centre 1, separate from the majority of the existing Grace hardware in UEA Data Centre 2, which helped improve service availability and resilience.

In 2014, phase 4 added a further 640 Infiniband cores and a storage upgrade.  The new Ivy Bridge compute nodes each had 2 x 10-core CPUs (20 cores) running at 2.5GHz, with 64GB of memory.

HPC - 2015-2020

In 2015 we tendered for a new HPC partner after our previous agreement reached the end of its four-year term. We formed a partnership with OCF, specialists in High Performance Computing for HEIs and research institutes in the UK, with Fujitsu as a hardware technology partner.

The first phase created the new hpc.uea.ac.uk cluster with an additional 98 nodes and 1760 computational cores. This included standard nodes, Infiniband parallel nodes and further GPU resource.  Based on the same operating system and Platform LSF scheduler as Grace, the new cluster had a refreshed software stack with newer versions of the OS and scheduler.  It provided a familiar working environment, with the intention over time to migrate the newer existing Grace hardware to this new cluster.

In early 2017 we made further upgrades:

  • upgraded the login node to a newer operating system and kernel, providing many benefits from a security and performance perspective
  • introduced a secondary login node to provide more HPC login resilience
  • upgraded the current (150) servers/compute nodes to a newer operating system and kernel
  • upgraded 4 of the 8 GPU servers/nodes with a newer operating system and kernel
  • installed a new Infiniband 56Gb/s FDR network fabric to provide high-speed interconnect/networking to the IB nodes
  • installed an additional 104 nodes/servers into the HPC environment, split in the following way:
    • 60 x Ethernet servers/nodes running on Broadwell CPU architecture with 64GB of DDR4 memory and 16 CPU cores per node
    • 36 x Infiniband servers/nodes running on Broadwell CPU architecture with 128GB of DDR4 memory and 24 CPU cores per node – these IB nodes used the latest IB interconnect (FDR and associated fabric) providing 56Gb/s performance
    • 2 x huge-memory servers/nodes running on Broadwell CPU architecture with 512GB of DDR4 memory and 16 CPU cores per node
    • 2 x huge-memory servers/nodes running on Broadwell CPU architecture with 512GB of DDR4 memory and 16 CPU cores per node, for a group of HPC/Bio users who invested in HPC equipment for their own dedicated usage

This upgrade provided an additional 1680 CPU cores.

In March 2018 we added an additional 68 Broadwell compute nodes with 64GB RAM, 2 new GPU nodes (24-core Skylake, 384GB, V100 GPU), and 2 new huge-memory nodes (24-core Skylake, 768GB).

In early 2019 we installed 28 Skylake Infiniband nodes (24 cores, 96GB RAM), 16 Skylake Ethernet nodes (24 cores, 96GB RAM) and 2 huge-memory nodes (24 cores, 770GB RAM). This took our total HPC core count from approximately 7000 to 8312 cores.

ADA - 2019-

In 2019 we tendered for a replacement cluster and extended our partnership with OCF.  As with previous moves to new clusters, we ran HPC and ADA alongside each other for a year or so while we migrated software and users to the new setup.
ADA brought a new OS (CentOS 7), a new queueing system (Slurm), new management and monitoring systems, and new hardware:

  • 50 compute nodes - Skylake Intel Xeon Silver 4116 2.1GHz, 24 cores per node, 96GB DDR4 RAM, 10Gb Ethernet
  • 4 GPU nodes - Skylake Intel Xeon Silver 4116 2.1GHz, 24 cores and 2 GPU cards per node, 384GB DDR4 RAM, 10Gb Ethernet
  • 2 visualisation nodes
  • 2 login nodes
  • Proxmox virtualised management nodes

In 2020 we migrated the newer hardware (Broadwell and Skylake, IB, and GPU nodes) from HPC to ADA, and retired the older kit.

In spring 2021 we installed a large GPU "farm" and extended the compute resources:

  • 15 Skylake Intel Xeon Silver 4116 2.1GHz nodes, 24 cores and 2 NVIDIA Quadro RTX6000 GPU cards per node, 384GB DDR4 RAM, 10Gb Ethernet
  • a further 15 Skylake Intel Xeon Silver 4116 2.1GHz nodes, 24 cores per node, 96GB DDR4 RAM, 10Gb Ethernet

In spring 2022 we installed a small test cluster, with its own master and queueing system, which enables us to test storage options, system changes, and application and code development.

In summer 2022 we introduced high-density compute hardware, and retired the older nodes which had been migrated from HPC:

  • 62 Icelake Intel Xeon Platinum 8358 2.6GHz, 64 cores per node, 512GB DDR4, 10Gb Ethernet

In autumn 2022 we introduced an alternative login and working environment based on OpenOnDemand.

In autumn 2023 we replaced the older IB nodes and expanded the high-memory resource:

  • 4 Icelake Intel Xeon Platinum 8358 2.6GHz, 64 cores per node, 4TB DDR4, 10Gb Ethernet, Mellanox Infiniband
  • 21 Icelake Intel Xeon Platinum 8358 2.6GHz, 64 cores per node, 512GB DDR4, 10Gb Ethernet, Mellanox Infiniband