Sunday 8 February 2015

Parallel Programming for BigData


Before delving into the details of parallel programming I thought it would first be useful to provide a brief overview of the development and trends in programming, and then illustrate how parallel programming fits into the picture. There are various ways to classify programming languages, but perhaps the simplest way is by their “ease of use” or abstraction from the nitty-gritty processes undertaken at a machine level. The tendency in the development of programming languages has been to hide away more and more of the low level details from the programmer. This makes the language more user friendly, however, in many cases at the cost of performance.
The concept of parallel programming is to develop a code that operates efficiently across multiple central processing units (CPU) and completes the task at hand much faster than what could be done if only one CPU was adopted. This is in many ways independent of the programming language you choose to develop your project. Before delving into the details of parallel programming I thought I would give a brief overview of the available programming languages.

Programming languages are typically referred to as belonging to a particular generation, with each generation more user-friendly than the next. However, this typically comes at the cost of reduced performance. Third generation languages (3GL) including both procedural (C), and object oriented languages (C++FORTRAN) are at present the work horses used for generating data in massively parallel scientific computing. Forth generation languages (4GL) abstract away more of the details with higher level mathematical algorithms built in (eg: linear algebra, optimisation, visualisation). This reduces the human time required to develop the code, but the programs typically run slower in comparison to 3GL codes. Example 4GL languages include PythonMatlab and R, which are often used to write codes to post-process data generated by 3GL codes. As an aside, to combat the reduced performance of 4GL programs one can replace the slow parts with code written in a 3GL (eg: replace slow Python code with FORTRAN compiled using f2py, or with C compiled using cython).


Regardless of the choice of language, the first step in developing a parallel program is to think about how to efficiently split up the problem at hand into separate tasks. The slow part in parallel codes is the communication between each of the processors, so its best to think about how to split up the problem in order to minimise the amount of communication required. The same code is seen by each CPU, with each CPU knowing only the total number of processors and its local processor number. You then build the logic of the code so that it behaves differently depending on the value of the local CPU number.



For example take the code used to calculate the principal components in a previous post. The first stage in calculating the principal components is to multiply two flow fields together at each point in space, then sum together the result at all the points. This process needs to be repeated for every combination of flow fields. There are two ways to split up the problem. Say for example we have 100 flow fields consisting of 1,000,000 points in space and we want to process them over 4 CPUs. The first way to split up the problem would be to send the first 100/4 = 25 flow fields to the first CPU, then the next 25 to the next CPU and so on, as illustrated in the figure below. However, to do the multiplication of the 1st field with the 26th field we would need to send all 1,000,000 points of the 1st field from CPU 1 to CPU 2, and so on for all of the other combinations.




The second way would be to send the first 1,000,000/4=250,000 points of all of the fields to the first CPU, then the next 250,000 points to the next CPU and so on, again as illustrated in the figure below. This way all of the multiplication combinations can be handled without any communication. Each CPU does all of the multiplications and sums the result within the domain it has access to. The local value of this sum is then added together across all of the CPUs. This means for each field pair, only one number needs to be shared, which is a significant reduction in the amount of communication required as compared to the previous approach.


In terms of the technical implementation, the main ways of handling this communication is to use MPI (Message Passing Interface). It is a programming language independent communication protocol. To program with MPI you just need to include the appropriate library, whether you are coding in C, C++, FORTRAN, Python, or any of the other supported languages. In the first parallelisation example discussed above, each of the snapshots are sent to and from each CPU using the commands MPISend and MPIReceive. In the second approach to add together each of the individual sums, you use the MPIReduce command. In fact the “reduce” part of the “map/reduce” algorithm, used in database software such as Hadoop, is in fact the MPIReduce command.


Modern High Performance Computing (HPC) clusters, however, are typically more complex and comprise of many CPUs each with multiple cores. Whilst MPI can also be used to handle the communication between cores on a given CPU, a more efficient and faster way of handling this inter-core communication is to use OpenMP (Open Multi-Processing). OpenMP is newer than MPI, and hence is currently supported in only C, C++ and FORTRAN.



Advanced scientific computing codes are programmed using hybrid MPI and OpenMP parallelisations. An example of a hybrid code is the one I have been using to undertake the numerical simulations of wall bounded turbulence developed by my colleagues from the Universidad de Madrid [1,2]. This particular code solves the fluid flow (Navier-Stokes) equations for the flow over a flat surface in a three-dimensional rectangular volume. For each MPI process the physical domain is split up into a series of blocks oriented in the flow direction. Within each block OpenMP parallelisation is used to split the block up in the wall normal direction. This allows each of the cores to solve the equations in different regions of the flow simultaneously, and communicate with each other when necessary. This code has been benchmarked on various clusters and has been shown to increase in speed proportionally with an increase in the number of cores used to parallelise it [3].



The video below is a fly-through of an example flow field generated by this code which I presented at a recent conference [4]. You are flying in the direction of the flow. The structures represent the individual eddies (or vorticites) in the simulation. It is actually a composite of two different simulations. The green structures on the right are from the simulation of the flow over a flat surface. The red structures on the left are from a simulation in which I implemented a change to the code such that the flow thinks that it is flowing over a curved surface, like that of an aircraft wing, or wind turbine blade. You will notice at the end of the domain the red vortex structures are located further away from the surface. These simulations contain over 30 billion degrees of freedom, and were parallelised over 32,000 cores. This research was funded by the ARC, and the computational resources were provided by iVECNCI and PRACE.


One of the recent trends in HPC has been to develop clusters using graphics processing units (GPU) instead of (or in addition to) CPUs. Whilst a CPU has several cores, a GPU has several hundred cores. GPUs are incredibly fast for linear algebra operations, as they are built to operate on all elements of a vector simultaneously. For example multiplying the RGB (red,green,blue) values of pixels in an image to those of another image. Programming for GPUs also require a different communication protocol (OpenCL). GPUs are not only used for graphics processing, but also for scientific computing more broadly. They are ideally suited for processes that do not require a great deal of communication between cores. They also have less on board memory as compared to a CPU. The types of simulations discussed in this post require a significant amount of communication between cores, so at this stage GPUs are not the ideal tool for these types of problems.


As an aside, a list of the 500 largest HPC clusters in the world can be found here, which are utilised by mainly academic research institutions. Amazon also have their own commercialised solution (EC2) that the general public can access and pay for. 



References:

[1] Simens, M. P., Jimenez, J. Hoyas, S. & Mizuno, Y., 2009, A high-resolution code for turbulent boundary layers, J. Comp. Phys. Vol. 228, pp 4218-4231.
[2] Borrell, G., Sillero, J. & Jimenez, J. 2013, A code for direct numerical simulation of turbulent boundary layers at high Reynolds numbers in BG/P supercomputers, Comp. Fluids, Vol. 80, pp 37-43.
[4] Kitsios, V., Atkinson, C., Sillero, J.A., Borrell, G. Gungor, A.G., JimĂ©nez, J. & Soria, J., Boundary condition development for an equilibrium adverse pressure gradient turbulent boundary layer at the verge of separation, IUTAM Symposium on advances in computation modeling and control of transitional and turbulent flows, Goa, India, 15-18 December, 2014.