Share this on social media:
Embracing functional programming and a better understanding of serial programming challenges is driving a paradigm shift in HPC coding, writes Eugenia Bahit
The difference between serial and parallel programming is enormous and increases when the objective is to take it to a high-performance ecosystem. The training process cannot focus only on learning specific technologies: scaling for HPC supposes a change in the way of analysing, planning and building software.
Scientists must ask themselves how to analyse and plan a parallel software for HPC. What is needed to build it, and where to learn about it by researching the available bibliography, consulting with qualified experts, and differentiating between the challenges that scientists and researchers may face and those that software engineers should tackle.
Sequential programming seeks to make logical connections along the code instructions. The better connected the code instructions are, the better performance the algorithm has. However, parallel programming requires a change in perspective, forcing researchers and engineers to a paradigm shift.
Dr Peter Pacheco, Professor Emeritus at the University of San Francisco, confirms that a paradigm shift is needed since “the problems that arise in the development of parallel programs are fundamentally different from the problems that arise in serial programming” and offers race conditions as an example.
Focusing on scientists who are just trying to solve their research problem, Dr Christopher Woods, Head of Research Software Engineering at the University of Bristol, indicates that “writing a program needs to stop being seen as writing a recipe” and recommends that researchers “view the process as assembling a data processing factory, having to specify what tasks need completing, and what depends on what”. He concludes that “the paradigm shift is a much greater embrace of functional programming ideas such as map-reduce and the use of collective operations or already-parallelised libraries, such as the automatic parallelisation of Fortran”.
It becomes necessary to understand that in parallelisation, the computational instructions inside a parallel code block remain sequential. Even the main process that manages the parallel blocks is sequential, and only independent tasks can be executed simultaneously. Notwithstanding, it is no longer just a question of creating logical connections among instructions but accurately dividing the proper code blocks. So, a first step for scaling a program from sequential to parallel computing may involve localising the code blocks that can be executed independently. However, this task is not so easy because it is not only about identifying independent blocks. It requires a previous understanding of the problem that is desirable to solve using parallelism to decide how to split and parallelise it.
From a computer science perspective, it is worth asking whether a methodological consensus or a scientific methodology exists to analyse and solve a problem from a parallel computing context. An accepted methodology approach – still in force nowadays – was given by Dr Ian Foster in 1995. This four-step methodology allows for isolating the hardware-dependent aspects from those that are not. So, the first two steps, partitioning and communication, focus on concurrency and scalability. In contrast, the last two, agglomeration and mapping, focus on performance matters that largely depend on the hardware configuration.
Notwithstanding, Dr Woods assures that “only experts with a deep knowledge of a problem have a chance of writing efficient performing code if they manage communication and scheduling themselves”, so a more straightforward methodology for researchers would be “to identify the tasks to perform, describing the data that needs passing from one task to the next, and the dependencies between them” and finally, “delegating communication and scheduling to the computer”.
To identify the problem to solve using parallelism, the Lawrence Livermore National Laboratory (LLNL) suggests using a profiling tool to localise the code blocks where the program spends most of its time or where most of the instructions are executed. It also proposes determining which code blocks are significantly slower.
When the problem is known and the accurate code blocks where the issues reside are identified, it is time to split it. This procedure is known as partitioning and can be achieved by dividing the data, a process called domain decomposition, or splitting the tasks, which is called functional decomposition.
From a parallel computing perspective, domain decomposition is applied by splitting the data among the cores so each core executes the same tasks on different data blocks. This kind of parallelism is known as data parallelism. Functional decomposition, instead, is applied by splitting the tasks among the cores so that each core executes different tasks on the same data blocks. This kind of parallelism is known as task parallelism.
For example, if it was necessary to perform one task on a large quantity of data, it could be split into smaller data blocks so that each could be processed in a different core. If more than one task is needed, one of them could be the more expensive one, where the program spends most of its time. This task could be a bottleneck, and a better solution could be to split the tasks so that one core executes that costly task and another executes the rest. However, what will those independent tasks do to merge their results? This is where more complex aspects of parallelism, such as communication and synchronisation, begin to be necessary.
A common requirement is that two or more tasks need to merge their results. This demands communication and synchronisation between each process.
Most authors and experts consulted seem to agree that a design goal is to minimise communication among tasks. Therefore, a balance in the decomposition degree might be desirable since a higher decomposition degree results in a higher level of tasks, so more communication will be needed among them.
Since synchronisation is not just about making an ordered list of tasks, Dr Pacheco explains that “more powerful parallel programs are written using explicit instructions for parallelism”, so writing effective parallelisation requires indicating – at a low level – what core executes what task and when it must do it. This could be faced by using specific libraries, which will be discussed later
A more familiar matter is identifying the critical sections where a nondeterministic problem could be produced if two or more threads or processes try to modify the same data at—more or less—the same time. Those sections require deciding the best mutual exclusion approach to avoid race conditions.
Grappling with mutual exclusion can include implementing methods such as lock. However, Dr Subodh Kumar, in his book Introduction to parallel programming, suggests using compare and swap (CAS) synchronisation instead of a lock-based one to reduce overheads. Overhead is a common problem involved in parallelism. It refers to the situation where a task demands a high consumption of the available resources. If a thread is locked because another one is still executing, it must wait before it can continue, producing a thread overhead. According to Dr Kumar, it can be avoided by using a CAS synchronisation instead of a lock-based one.
Another complex matter is deciding the better load balance, which will mean choosing how many tasks or data blocks will be sent to each core. In data parallelisation, it might seem easy to determine how much data should be managed by each core. It makes sense to think that perhaps it would be a question of splitting the data blocks into equal parts—such as n-bytes or n-words per data block—but how can it be done to split tasks into equal parts? Unfortunately, the answer is not easy. Consulted on this matter, Dr Pacheco explains: “when there is no clear equivalence between the amount of work associated with two different tasks, one possible approach may be to execute the two tasks sequentially, finding parallelism within each task. A second approach, if there are many inherently sequential tasks, might be to create a repository for the tasks and have the threads or processes select tasks from it.”
Dr Woods emphasises that for non-experts, it is too complex to do it manually, so he recommends that non-experts, “express the tasks and their dependencies, and then let the task-based parallel library do the load balancing.”
A further particular concern is modularity. Dr Pacheco reinforces this fact for a parallel program: “In general, the complexity of parallel program development is much greater than the complexity of serial program development. So it’s essential that techniques such as modular design be employed. It is far more difficult to debug a subroutine that’s being executed by multiple threads or processes. So a programmer will benefit tremendously from the existence of well-defined, independent blocks of code.”
Finally, another matter to consider is choosing between GPU or CPU programming. The selection cannot be random since the difference is substantial. Dr Pacheco explains the performance differences obtained in SIMD and MIMD systems, allowing the programmer to make that choice: “Programming SIMD systems is very different from programming MIMD systems, and since GPUs are composed of SIMD systems, to obtain reasonable performance, it’s essential to be aware of the differences. For example, in MIMD programming, it’s often the case that when different threads or processes execute different conditional branches, there is very little (if any) degradation in performance. However, in a SIMD system, even if the threads or processes are executing tasks requiring similar amounts of work, only one branch can be executed at a time, and the work may be serialised.”
Therefore, that difference could mean CPU programming might be a better alternative for algorithms with heavy logical statements.
HPC infrastructure is mainly constructed over GNU/Linux systems. Most of them implement a RedHat Enterprise Linux distribution. Hence, having a knowledge base about Linux commands and essential GNU tools is indispensable.
Among the chosen programming languages for parallel and HPC are the classic ones such as C/C++ and Fortran, scientific-oriented languages such as Julia, and scripting languages such as Python. Knowing which compilers are present on the HPC infrastructure is crucial for compiled languages. GNU, Intel and Nvidia compilers are the most common.
Some helpful parallel programming middlewares for HPC are:
Some frameworks, such as charm++ (recommended by Dr Woods) for parallel programming in C++, or Apache Hadoop, to perform the map-reduce operations—mentioned by Dr Woods—are eligible for a high-level data parallelisation.
Finally, to run a finished work on an HPC environment, it is necessary to submit and schedule the job, so knowing the basic commands to use a workload manager is desirable. In the last few years, Slurm Workload Manager, formerly known as Simple Linux Utility for Resource Management (SLURM) has increased in popularity and is widely used across academic and enterprise HPC systems.
Beyond the official documentation for each technology, there is a wide range of options to learn and improve HPC skills. The following lines offer a brief from online resources and books to specialised training centres.
Barney, B., & Frederick, D. (n.d.). Introduction to Parallel Computing Tutorial. HPC at Lawrence Livermore National Laboratory (LLNL). Retrieved 26 August 2022, from https://hpc.llnl.gov/documentation/tutorials/introduction-parallel-computing-tutorial
Pacheco, P. (2011). How do we write parallel programs? In An Introduction to Parallel Programming (1st ed., p. 8). Morgan Kaufmann.
A roundup of software tools available to scientists using HPC and AI software
Scientists and researchers are using AI to help accelerate the discovery of new drugs for a wide variety of different medical applications.
Integrators and cloud providers help facilitate access to HPC and deliver additional expertise and support, which helps scientists to effectively use computing resources, finds Robert Roe
Today’s DNA sequencing technologies now make it possible to sequence whole human genomes cost effectively and with speed.
The world of MultiBody Dynamics simulation is changing, writes Gemma Church
Quantum technology is going through a period of rapid development, with several technologies driving the adoption of this emerging computing framework, finds Robert Roe
A roundup of cloud technology providers that support researchers using HPC