Parallel Programming

Parallel programming can be a powerful tool for improving the performance of code and for taking advantage of the capabilities of modern hardware.

Main advantages:

  • performance
  • often cheaper than an equivalently fast sequential implementation
  • basically, the "big problems" can only be solved by parallel algorithms

Brief history of parallel programming

Interest in parallel programming has continued to grow over time, driven by the development of new hardware architectures and the need for more computational power. Some of the main steps in its evolution include:

  • (1980s and 1990s): the development of high-level languages and parallel libraries made it possible to apply parallel programming to a wider range of applications.
  • (2000s): the widespread adoption of multicore microprocessors began to drive the development of new parallel programming techniques and tools.
  • (2010s): the rise of manycore architectures, such as GPUs, led to the development of new parallel programming approaches, such as data parallelism and task parallelism.

TECHNOLOGY     TYPE                      YEAR        AUTHORS
Verilog/VHDL   Languages                 1984/1987   US Government
MPI            Library                   1994        MPI Forum
PThread        Library                   1995        IEEE
OpenMP         C/Fortran Extensions      1997        OpenMP
CUDA           C Extensions              2007        NVIDIA
OpenCL         C/C++ Extensions + API    2008        Apple
Apache Spark   API                       2014        Berkeley

Manycore architectures are becoming increasingly common in modern computing, and are often used in applications that require a lot of computational power, such as data analysis, scientific simulations, and machine learning. A GPU, or graphics processing unit, is a type of manycore architecture that is specifically designed for high-performance graphics and parallel computation. GPUs are typically composed of hundreds or thousands of small, simple cores that are optimized for performing the parallel computations required for rendering graphics.

Independence from architecture and automatic parallelization

The design of parallel algorithms should focus on the logical steps and operations required to solve a particular problem, rather than on the details of how those steps will be executed on a specific hardware platform. However, the performance of a parallel algorithm can vary significantly depending on the hardware on which it runs. Designing a "good" parallel algorithm is therefore not enough: the kind of parallelism available on the target architecture matters just as much, since parallelism that does not suit the architecture can introduce overhead.

TECHNOLOGY     Target-Independent Code?             Development Platforms
Verilog/VHDL   Yes (behavioral) / No (structural)   Mainly Linux
MPI            Yes                                  All
PThread        Yes                                  All - Windows through a wrapper
OpenMP         Yes                                  All - Different compilers
CUDA           Depends on CUDA capabilities         All
OpenCL         Yes                                  All - Different compilers
Apache Spark   Yes                                  Mainly Linux

Automatic Parallelization

Automatic parallelization refers to the process of using a compiler or other tool to automatically transform sequential code into parallel code, without requiring explicit parallelization directives from the programmer. In practice, it is not always possible for the compiler to parallelize code accurately. For example, the compiler cannot infer whether two array pointers reference non-overlapping regions of memory, while the programmer knows how the data is laid out and how the parallel algorithm should be designed. Parallelization by hand is therefore still predominant, and the programmer needs to give hints to the tool: the idea is that the programmer describes the parallelism to the compiler so that it can be exploited.
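
As a minimal sketch of such a hint (the function and variable names are purely illustrative), the C99 `restrict` qualifier lets the programmer promise the compiler that two arrays do not overlap, so the loop iterations are safe to parallelize or vectorize:

```c
#include <stddef.h>

/* Without 'restrict' the compiler must assume a and b may overlap and
   cannot safely parallelize the loop. With 'restrict' the programmer
   guarantees the two regions do not alias. */
void scale(float *restrict a, const float *restrict b, size_t n, float k)
{
    for (size_t i = 0; i < n; ++i)
        a[i] = k * b[i];   /* independent iterations: parallelizable */
}
```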

Dependency Analysis

To determine whether a set of statements can be executed in parallel, we have to perform an analysis, since not everything can be executed in parallel.

In general, to execute statements in parallel:

  • the statement order must not matter
  • the statements must not have dependencies

The Dependency Analysis is performed over IN and OUT sets of the statements. The IN set of a statement is the set of memory locations (variables) that may be used in the statement. The OUT set of a statement is the set of memory locations that may be modified in the statement.
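
A small illustrative example (the variables and statement labels are chosen arbitrarily) of IN and OUT sets and the dependency they reveal:

```c
#include <stdio.h>

int main(void)
{
    int b = 2, c = 3;

    int a = b + c;   /* S1: IN = {b, c}, OUT = {a}                         */
    int d = a * 2;   /* S2: IN = {a}, OUT = {d} -> depends on S1, because  */
                     /*     OUT(S1) ∩ IN(S2) = {a} is not empty            */
    int e = b - c;   /* S3: IN = {b, c}, OUT = {e} -> no set intersects    */
                     /*     with S1 or S2, so S3 could run in parallel     */

    printf("%d %d %d\n", a, d, e);
    return 0;
}
```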

Loops can often be parallelized (by "unrolling" the iterations), but there are also loop-carried dependences, which often prevent loop parallelization. A loop-carried dependence is a dependence between two statements that exists only when the two statements belong to different iterations of the same loop.
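
A short sketch contrasting the two cases (the array names are illustrative):

```c
#include <stddef.h>

void independent(float *a, const float *b, size_t n)
{
    /* No loop-carried dependence: iteration i only reads b[i] and writes
       a[i], so the iterations may be executed in parallel. */
    for (size_t i = 0; i < n; ++i)
        a[i] = b[i] + 1.0f;
}

void carried(float *a, const float *b, size_t n)
{
    /* Loop-carried dependence: iteration i reads a[i-1], which is written
       by the previous iteration, so the loop cannot be parallelized as-is. */
    for (size_t i = 1; i < n; ++i)
        a[i] = a[i - 1] + b[i];
}
```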

Code extensions and languages for parallelism

Using extensions of sequential languages (such as OpenMP or MPI) instead of native parallel languages has several advantages:

  • Familiarity and ease of use
  • Interoperability: this can be especially useful for codebases that use a mix of sequential and parallel code.
  • Portability: Extensions of sequential languages are often designed to be portable across different platforms and architectures.

Overall, the advantages of using extensions of sequential languages instead of native parallel languages depend on the specific needs and goals of the code being developed.

Mainly, the three macro-paradigms of parallelism are the following (a short sketch contrasting them follows the table below):

  • Single instruction, multiple data (SIMD): most modern GPUs
  • Multiple instruction, single data (MISD): mostly experimental
  • Multiple instruction, multiple data (MIMD): threads running in parallel on different data

TECHNOLOGY     SIMD   MISD    MIMD
Verilog/VHDL   Yes    Yes     Yes
MPI            Yes    Yes     Yes
PThread        Yes    (Yes)   Yes
OpenMP         Yes    Yes     Yes
CUDA           Yes    No      (Yes)
OpenCL         Yes    (Yes)   Yes
Apache Spark   Yes    No      No
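
As a rough illustration (the functions below are not tied to any specific example from the course), OpenMP can express both flavors: `#pragma omp simd` applies one instruction stream to many data elements within a core, while `#pragma omp parallel sections` runs different instruction streams on different data (MIMD):

```c
#include <stddef.h>

/* SIMD flavor: one instruction stream applied to many data elements. */
void saxpy(float *x, const float *y, size_t n, float a)
{
    #pragma omp simd
    for (size_t i = 0; i < n; ++i)
        x[i] = a * x[i] + y[i];
}

/* MIMD flavor: independent instruction streams on different data. */
void min_max(const float *v, size_t n, float *min, float *max)
{
    #pragma omp parallel sections
    {
        #pragma omp section
        {
            *min = v[0];
            for (size_t i = 1; i < n; ++i)
                if (v[i] < *min) *min = v[i];
        }
        #pragma omp section
        {
            *max = v[0];
            for (size_t i = 1; i < n; ++i)
                if (v[i] > *max) *max = v[i];
        }
    }
}
```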

We can classify parallelism at different levels:

  • bit level: very relevant in hardware implementations of algorithms.
  • instruction level: different instructions executed at the same time on the same core. This type of parallelism can be easily extracted by compilers.
  • task level: a logically discrete section of computational work.

TECHNOLOGY     Bit     Instruction   Task
Verilog/VHDL   Yes     Yes           No
MPI            (Yes)   (Yes)         Yes
PThread        (Yes)   (Yes)         Yes
OpenMP         (Yes)   (Yes)         Yes
CUDA           (Yes)   No            (Yes)
OpenCL         (Yes)   No            Yes
Apache Spark   (Yes)   No            (Yes)

TECHNOLOGY     Parallelism           Communication
Verilog/VHDL   Explicit              Explicit
MPI            Implicit              Explicit
PThread        Explicit              Implicit
OpenMP         Explicit              Implicit
CUDA           Implicit (Explicit)   Implicit (Explicit)
OpenCL         Explicit/Implicit     Explicit/Implicit
Apache Spark   Implicit              Implicit

Main features of studied languages

PThread

PROS:

  • different architectures
  • explicit parallelism and full control / freedom

CONS:

  • task management overhead
  • low-level API
  • not scalable
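
A minimal Pthreads sketch (the worker function and thread count are arbitrary) showing the explicit, low-level task management the API requires:

```c
#include <pthread.h>
#include <stdio.h>

/* Worker started explicitly by the programmer: thread creation,
   argument passing and joining are all managed by hand. */
static void *worker(void *arg)
{
    int id = *(int *)arg;
    printf("thread %d running\n", id);
    return NULL;
}

int main(void)
{
    pthread_t threads[4];
    int ids[4];

    for (int i = 0; i < 4; ++i) {
        ids[i] = i;
        pthread_create(&threads[i], NULL, worker, &ids[i]);
    }
    for (int i = 0; i < 4; ++i)
        pthread_join(threads[i], NULL);

    return 0;
}
```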

OpenMP

PROS:

  • easy to learn
  • scalable
  • parallel applications can also be executed sequentially without modifications

CONS:

  • mainly focused on shared-memory homogeneous systems
  • requires small interaction between tasks
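
A minimal OpenMP sketch (the summation is just an illustrative workload). The pragma distributes the loop iterations across threads and combines the partial sums; if the code is compiled without OpenMP support, the pragma is ignored and the same program runs sequentially, unmodified:

```c
#include <stdio.h>

int main(void)
{
    double sum = 0.0;

    /* Iterations are divided among threads; 'reduction' merges the
       per-thread partial sums into 'sum'. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 1; i <= 1000000; ++i)
        sum += 1.0 / i;

    printf("sum = %f\n", sum);
    return 0;
}
```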

MPI

PROS:

  • different architectures
  • scalable solutions
  • explicit communication

CONS:

  • communication can introduce overhead
  • more difficult programming paradigm
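
A minimal MPI sketch (the rank-sum computation is illustrative) showing the explicit communication model: each process computes locally, and MPI_Reduce performs the explicit data exchange:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each process contributes its rank; the explicit reduction
       gathers the sum on process 0. */
    int local = rank, total = 0;
    MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum of ranks over %d processes = %d\n", size, total);

    MPI_Finalize();
    return 0;
}
```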