Accelerated opportunities for the research community

The Edinburgh International Data Facility (EIDF) continues to develop its resources to support data-driven discovery. One significant step forward is the formal launch of the EIDF GPU Service.

The EIDF GPU Service is made up of HPE Apollo 6500 GPU servers containing NVIDIA A100 GPUs. The service currently has a total of 160 GPUs, of which 112 are available for general EIDF users.

Individual projects can have access to up to 12 GPUs. The service is accessed through a Virtual Machine (VM) set up for each project within the EIDF and is operated via Kubernetes.

Using Kubernetes, users can submit work directly to one or more GPUs or use a smaller section of a GPU for any individual job. Work using full GPUs can be run in pods of up to eight GPUs per job. By default, each GPU allocated to a user has approximately 100 GB of memory and eight CPU cores associated with it.
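To make the default allocation concrete, the following is a minimal Python sketch of how a full-GPU request could be written as a Kubernetes pod specification. It is illustrative only and is not taken from the EIDF documentation. The per-GPU figures are the ones quoted above; the pod name, the container image and the two helper functions are hypothetical. `nvidia.com/gpu` is the resource name used by NVIDIA's standard Kubernetes device plugin, and it is an assumption that the service exposes GPUs under that name.

```python
import json

# Figures quoted in the article, treated here as illustrative defaults.
MAX_GPUS_PER_POD = 8
MEMORY_GB_PER_GPU = 100  # "approximately 100 GB" per GPU
CPU_CORES_PER_GPU = 8


def gpu_pod_resources(n_gpus: int) -> dict:
    """Return resource limits for a pod using n_gpus full GPUs (hypothetical helper)."""
    if not 1 <= n_gpus <= MAX_GPUS_PER_POD:
        raise ValueError(f"a pod can use between 1 and {MAX_GPUS_PER_POD} GPUs")
    return {
        # Assumption: GPUs are exposed via the NVIDIA device plugin resource name.
        "nvidia.com/gpu": n_gpus,
        "cpu": n_gpus * CPU_CORES_PER_GPU,
        "memory": f"{n_gpus * MEMORY_GB_PER_GPU}G",
    }


def gpu_pod_manifest(name: str, image: str, n_gpus: int) -> dict:
    """Build a minimal single-container pod manifest as a dictionary (hypothetical helper)."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": name,
                    "image": image,
                    "resources": {"limits": gpu_pod_resources(n_gpus)},
                }
            ],
        },
    }


if __name__ == "__main__":
    # Placeholder name and image; replace with real values before use.
    manifest = gpu_pod_manifest("example-job", "example.org/my-image:latest", 2)
    print(json.dumps(manifest, indent=2))
```

In this sketch a two-GPU job asks for 16 CPU cores and 200 GB of memory. On the real service the defaults may be applied for you, so explicit limits like these may be unnecessary.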

Sub-GPU-scale work uses NVIDIA Multi-Instance GPU (MIG) technology, which allows multiple users, or multiple jobs from an individual user, to run work on a single GPU with complete isolation between jobs. MIG can partition a GPU to run up to seven jobs at once, depending on the target workload. By default, these jobs have memory and CPU scaled to the amount of GPU used.
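As a rough illustration of that scaling, the sketch below assumes that memory and CPU scale linearly with the number of MIG slices, out of seven per GPU, and uses the full-GPU defaults of approximately 100 GB and eight cores. The linear rule and the `mig_share` helper are assumptions made for illustration; the service's actual allocations may differ.

```python
from fractions import Fraction

MIG_SLICES_PER_GPU = 7   # up to seven isolated jobs per GPU
MEMORY_GB_PER_GPU = 100  # approximate full-GPU default
CPU_CORES_PER_GPU = 8    # full-GPU default


def mig_share(slices: int) -> dict:
    """Estimate the default memory and CPU for a job using `slices` MIG slices.

    Assumption: resources scale linearly with the fraction of the GPU used.
    """
    if not 1 <= slices <= MIG_SLICES_PER_GPU:
        raise ValueError(f"slices must be between 1 and {MIG_SLICES_PER_GPU}")
    fraction = Fraction(slices, MIG_SLICES_PER_GPU)
    return {
        "gpu_fraction": float(fraction),
        "memory_gb": round(float(MEMORY_GB_PER_GPU * fraction), 1),
        "cpu_cores": round(float(CPU_CORES_PER_GPU * fraction), 2),
    }


if __name__ == "__main__":
    for slices in (1, 2, 3, 4, 7):
        print(slices, mig_share(slices))
```

Under this assumption a one-slice job would receive roughly one seventh of the memory and CPU of a full GPU, and a seven-slice job would match the full-GPU defaults.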

Advanced cooling system

The GPU Service is housed in Hewlett Packard Enterprise (HPE) “Adaptive Rack Cooling System” (ARCS) racks. Rather than cooling an entire room, the racks themselves are cooled, which keeps the cooling of the GPU servers as efficient as possible. This reduces the overall energy required to operate the service for users. (See ACF article on p14 for more information.)

Early use of the Service delivers

EPCC has been working with the University of Edinburgh’s School of Informatics to develop the EIDF GPU Service, and the service is already being used for research.

Two papers have recently been published, based on work performed on the new service:

  • Gema et al. (2023) ‘Parameter-Efficient Fine-Tuning of LLaMA for the Clinical Domain’.
  • Kaddour et al. (2023) ‘No Train No Gain: Revisiting Efficient Training Algorithms for Transformer-based Language Models’.

EPCC is committed to further developing the EIDF GPU Service, and we are looking at NVIDIA’s next-generation H100 GPUs to enhance this resource in 2023–2024.

“The EIDF GPU Service is enabling us to explore new methods for training more explainable, robust, and trustworthy AI systems; to design and experiment with models that can learn to search for the information they need for solving arbitrary knowledge-intensive tasks; and to design statistical models for solving challenging biomedical and clinical problems.”

Pasquale Minervini
Lecturer in Natural Language Processing, School of Informatics, University of Edinburgh