Shared Memory Multiprocessor

HPC Architecture 1

Thomas Sterling , ... Maciej Brodowicz , in High Performance Computing, 2018

2.8.1 Shared-Memory Multiprocessors

A shared-memory multiprocessor is an architecture consisting of a modest number of processors, all of which have direct (hardware) access to all the main memory in the system (Fig. 2.17). This permits any of the system processors to access data that any of the other processors has created or will use. The key to this class of multiprocessor architecture is the interconnection network that directly connects all the processors to the memories. This is complicated by the need to maintain cache coherence across all caches of all processors in the system.

Figure 2.17. The shared-memory multiprocessor architecture.

Cache coherence ensures that any change in the data of one cache is reflected by some change to all other caches that may have a copy of the same global data location. It guarantees that any data load or store to a processor register, if acquired from the local cache, will be correct, even if another processor is using the same data. The interconnection network that provides cache coherence may employ any one of several techniques. One of the earliest is the modified exclusive shared invalid (MESI) protocol, sometimes referred to as a "snooping cache", in which a shared bus is used to connect all processors and memories together. This method permits any write by one processor to memory to be detected by all other processors and checked to see if the same memory location is cached locally. If so, some indication is recorded and the cached copy is either updated or at least invalidated, such that no error occurs.
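As a rough illustration of how a snooping controller might apply MESI state transitions, the following minimal sketch (not from the chapter; the event and function names are assumptions, and real protocols include additional bus transactions) encodes one cache line's state as a function of local accesses and snooped bus traffic:

#include <stdio.h>

/* MESI cache-line states */
typedef enum { MODIFIED, EXCLUSIVE, SHARED, INVALID } mesi_state_t;

/* Events seen by a cache controller: local CPU accesses and bus traffic
   snooped from other processors (illustrative names, not from the text). */
typedef enum { CPU_READ, CPU_WRITE, BUS_READ, BUS_WRITE } mesi_event_t;

/* One possible transition function for a single cache line. */
mesi_state_t mesi_next(mesi_state_t s, mesi_event_t e, int other_copies_exist) {
    switch (e) {
    case CPU_READ:
        if (s == INVALID)               /* read miss: fetch the line over the bus */
            return other_copies_exist ? SHARED : EXCLUSIVE;
        return s;                        /* read hit: state unchanged */
    case CPU_WRITE:
        return MODIFIED;                 /* local write: gain exclusive ownership */
    case BUS_READ:
        if (s == MODIFIED || s == EXCLUSIVE)
            return SHARED;               /* another cache wants the line: downgrade */
        return s;
    case BUS_WRITE:
        return INVALID;                  /* another processor wrote: invalidate copy */
    }
    return s;
}

int main(void) {
    mesi_state_t line = INVALID;
    line = mesi_next(line, CPU_READ, 0);   /* -> EXCLUSIVE */
    line = mesi_next(line, CPU_WRITE, 0);  /* -> MODIFIED  */
    line = mesi_next(line, BUS_READ, 0);   /* -> SHARED    */
    printf("final state: %d\n", line);
    return 0;
}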

Shared-memory multiprocessors are differentiated by the relative time to access the common memory blocks by their processors. An SMP is a system architecture in which all the processors can access each memory block in the same amount of time. This capability is often referred to as "UMA", or uniform memory access. SMPs are controlled by a single operating system across all the processor cores and a network such as a bus or crossbar that gives direct access to the multiple memory banks. Access times can nonetheless vary, as contention between two or more processors for any single memory bank will delay access times of one or more processors. But all processors still have the same chance and equal access. Early SMPs emerged in the 1980s with such systems as the Sequent Balance 8000. Today SMPs serve as enterprise servers, deskside machines, and even laptops using multicore chips, and thus play a major role in the medium-scale computing which is a major part of the commercial market. SMPs also serve as nodes within much larger MPPs.

Nonuniform memory access (NUMA) architectures retain access by all processors to all the main memory blocks within the system (Fig. 2.18). But this does not ensure equal access times to all memory blocks by all processors. This is motivated by the architectural opportunity provided by modern microprocessor designs to exploit high-speed local memory communication channels while providing access to all the memory through external, albeit slower, global interconnection networks. NUMA architectures benefit from scaling, permitting more processor cores to be incorporated into a single shared-memory system than SMPs. However, because of the difference in memory access times, the programmer has to be conscious of the locality of data placement and use it to take best advantage of computing resources. NUMA multiprocessor architectures first emerged with such systems as the BBN Butterfly multiprocessors, including the GP-1000 and the TC-2000.

Figure 2.18. Nonuniform memory access architectures retain access by all processors to all the main memory blocks within a system, but do not ensure equal access times to all memory blocks by all the processors.
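One way a programmer can exercise the locality awareness described above is to place data explicitly near the processors that use it. The sketch below is a minimal illustration (not from the chapter) using the Linux libnuma interface; the array name and the choice of node 0 are assumptions.

/* Minimal NUMA-aware allocation sketch; compile with: gcc numa_example.c -lnuma */
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "libnuma not supported on this system\n");
        return 1;
    }

    size_t n = 1 << 20;
    /* Place the working array on NUMA node 0 (an arbitrary illustrative choice),
       so threads running on that node reach it through fast local memory. */
    double *a = numa_alloc_onnode(n * sizeof(double), 0);
    if (a == NULL) return 1;

    for (size_t i = 0; i < n; i++)      /* pages already reside on node 0 */
        a[i] = (double)i;

    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    printf("sum = %f\n", sum);

    numa_free(a, n * sizeof(double));
    return 0;
}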


URL:

https://www.sciencedirect.com/science/article/pii/B9780124201583000022

Hardware/Software Co-Synthesis with Memory Hierarchies

Yanbing Li , ... Fellow, IEEE, in Readings in Hardware/Software Co-Design, 2002

C Cache Coherency

In a shared-memory multiprocessor architecture, caching of shared data introduces the cache coherency problem. The data dependencies (RAW, WAW, and WAR) in the task graph enforce that two tasks with data dependencies do not execute at the same time. This ensures the correct order of execution. However, a task may nevertheless write a datum in its local cache that has several copies in the caches of other processors.

In our algorithm, we use the write invalidate protocol. A write on one processor will invalidate all other copies of the same data on other processors to ensure this processor has exclusive access to the data. After a task finishes its execution, if the data are "dirty" (have been written), they are written back to the main memory such that the updated data can be used by other tasks. Note that there is no need to write to the main memory during the execution of a task (say a), because any other tasks that are data-dependent on a do not start running until a is finished. When a task accesses its data through its local cache, an invalid flag associated with a block of data indicates the data needs to be reloaded from the main memory.
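The following is a minimal sketch of the bookkeeping described above, not the authors' implementation; the structures, sizes, and function names are illustrative only.

#include <stdbool.h>
#include <string.h>

#define NUM_PROCS 4
#define NUM_BLOCKS 64

/* Per-processor copy of a cached block: valid/dirty flags only, for illustration. */
typedef struct {
    bool valid;
    bool dirty;
} cache_line_t;

static cache_line_t cache[NUM_PROCS][NUM_BLOCKS];

/* A write on processor p invalidates every other processor's copy of the block,
   so p holds the only (exclusive) copy. */
void write_block(int p, int block) {
    for (int q = 0; q < NUM_PROCS; q++)
        if (q != p)
            cache[q][block].valid = false;
    cache[p][block].valid = true;
    cache[p][block].dirty = true;      /* will be written back when the task ends */
}

/* When a task finishes on processor p, dirty blocks are flushed to main memory
   so that data-dependent tasks see the updated values. */
void task_complete(int p) {
    for (int b = 0; b < NUM_BLOCKS; b++) {
        if (cache[p][b].valid && cache[p][b].dirty) {
            /* write_back_to_memory(p, b);  placeholder for the actual copy */
            cache[p][b].dirty = false;
        }
    }
}

int main(void) {
    memset(cache, 0, sizeof cache);
    write_block(0, 5);     /* processor 0 writes block 5; other copies invalidated */
    task_complete(0);      /* dirty block flushed back to main memory */
    return 0;
}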


URL:

https://www.sciencedirect.com/science/article/pii/B9781558607026500235

Parallel Computing

S.C. Allmaier , ... D. Kreische , in Advances in Parallel Computing, 1998

4 SUMMARY AND OUTLOOK

We presented a parallel solution for the problem of generating reachability graphs of GSPNs both for shared and distributed memory multiprocessors. In the case of shared memory multiprocessors we can conclude that it is possible to shorten analysis times for Petri nets with huge state spaces significantly if suitable parallelization methods are applied. For workstation clusters our parallelization enabled us to analyze GSPN models whose reachability graphs would be by far too large to be generated using a single machine of the cluster.

The conclusions above also give rise to a comparative evaluation of the different multiprocessor architectures we used for our implementations: the algorithm for distributed memory machines has to be much more complex to compensate for the missing global data access. Therefore, no real speedup can be gained compared to the shared memory case and analysis times are longer. On the other hand, the distributed memory solution enables the analysis of large models on commodity hardware (as workstation clusters with high-bandwidth connections become more and more widespread), whereas the shared memory implementation needs an expensive high-end server or a supercomputer when large models are to be evaluated. If such a machine is available, parallel generation of the reachability graph may accelerate the analysis process considerably.

We plan to do more measurements on other parallel machines in the near future. Parallel implementations of the GSPN analysis steps following reachability graph generation and the integration of our programs into an automated tool for complete GSPN analysis are underway.


URL:

https://www.sciencedirect.com/science/article/pii/S0927545298800749

Shared Memory Parallelization of an implicit ADI-type CFD code

Th. Hauser , P.G. Huang , in Parallel Computational Fluid Dynamics 1998, 1999

2.3 Parallel Implementation using OpenMP

For the parallelization of the ADI-type solver, the recently developed OpenMP specification for programming shared-memory multiprocessors is used. SGI adopted the OpenMP standard for the ORIGIN series in the version 7.2.1 compiler and HP has promised that their implementation of the OpenMP specification will be released in the upcoming compiler. Because OpenMP is a portable and scalable model that gives shared-memory programmers a simple and flexible interface for developing parallel applications, it is our belief that it will become the equivalent of MPI, the standard for distributed memory programming.

The LESTool code was parallelized by placing OpenMP directives on the outer loop inside the LHS and RHS operations. This involved decomposing the 3D problem into groups of 1D lines, with each group assigned to a dedicated processor. The efficiency of the parallel decomposition was enhanced by the use of the first-touch policy, which is specific to the SGI Origin 2000. This implies that the memory allocated is physically placed in the node that touches the memory location first. All large three-dimensional blocks of memory, initialized in planes of constant k, were distributed into different nodes. This allows an easy parallelization for the i- and j-directions.
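The fragment below is a minimal sketch of this strategy, not the LESTool source: a parallel initialization loop first-touches each k-plane so it lands in the memory of the thread that owns it, and an i-direction sweep then reuses the same outer loop over k. The array name and sizes are assumptions.

#include <omp.h>
#include <stdlib.h>

#define NI 64
#define NJ 64
#define NK 64

int main(void) {
    /* Illustrative 3D field stored as a flat array. */
    double *q = malloc((size_t)NI * NJ * NK * sizeof(double));

    /* First-touch initialization: each thread touches the k-planes it owns,
       so on a ccNUMA machine those pages are allocated in its local memory. */
    #pragma omp parallel for
    for (int k = 0; k < NK; k++)
        for (int j = 0; j < NJ; j++)
            for (int i = 0; i < NI; i++)
                q[(size_t)k * NJ * NI + j * NI + i] = 0.0;

    /* Sweep in the i-direction: the outer loop over k-planes is work-shared,
       so each thread updates the planes that reside in its local memory. */
    #pragma omp parallel for
    for (int k = 0; k < NK; k++)
        for (int j = 0; j < NJ; j++)
            for (int i = 1; i < NI; i++)
                q[(size_t)k * NJ * NI + j * NI + i] +=
                    q[(size_t)k * NJ * NI + j * NI + (i - 1)];

    free(q);
    return 0;
}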

After finishing the computation in the i- and j-directions, the solution in the k-direction is performed. On typical distributed-memory computers this presents a problem because the memory has been distributed in ij-planes and therefore no processor can access data along k-lines. In contrast, the solution in the k-direction poses no difficulty on a shared-memory computer. In the current approach the outer loop was chosen to be in the j-direction, and the 1D partition of the ik-planes was parallelized.


URL:

https://www.sciencedirect.com/science/article/pii/B9780444828507500802

Parallel Computing

E. Bucchignani , ... W. Diurno , in Advances in Parallel Computing, 1998

5 RESULTS

The numerical simulations have been executed on the SGI Power Challenge supercomputer installed at CIRA. It is a shared memory multiprocessor machine with 4 GBytes of physical memory and with 16 R10000 processors. These are RISC superscalar 4-way processors: each of them has two integer units, two floating-point units and up to four outstanding cache misses. The main characteristics of this machine are: peak power: 392 MFLOPS per processor, cache L2: 1600 Mbytes, bandwidth: 1600 Mbyte/s, latency: 0.90 μs. The maximum bandwidth obtained with MPI is 62.6 Mbyte/s, with a latency of 13 μs.

The results obtained with the 2D version of the code have already been presented in [11] and are reported here in Tab. 1 for completeness.

The three-dimensional test case considered is the study of a transonic inviscid flow around an M6 wing, discretized with a tetrahedral mesh. The unstructured grid is made up of 57041 points, 306843 tetrahedral elements and 372102 edges. Results reported in Tab. 2 refer to 1000 iterations. The mesh partitioning has been executed by means of METIS and by means of the RCB algorithm. METIS guarantees an optimal load balancing among the processors: in fact, the same number of elements (more or less 1) is assigned to each processor. RCB does not guarantee an optimal load balancing, but it can be executed in a negligible time. In fact, for example, the partitioning in eight subdomains executed by METIS takes about 14 seconds, while the same partitioning performed by RCB takes about 0.5 seconds.

Table 2. HUGENS - M6 Wing - 306843 tetrahedrons, 57041 points, 372102 edges, 1000 it.: Time and Speed-Up.

Power Challenge
              METIS              RCB
n. pr.     time   sp.-up     time   sp.-up
1          5586    1.00      5586    1.00
2          2757    2.02      3103    1.80
4          1382    4.04      1460    3.82
8           716    7.80       843    6.62
16          482   11.59       658    8.48

Tab. 3 shows the average amount of data to be exchanged by each processor (expressed in Kbytes) and the related communication time (expressed in μseconds) required at each iteration.

Table 3. HUGENS - M6 Wing - 306843 elements, 57041 points, 372102 edges: amount of communications (expressed in Kbytes) and comm. time (expressed in μsec.) required at each iteration.

Power Challenge
              METIS              RCB
n. pr.     comm.   time      comm.   time
1            -       -         -       -
2           198   20780       309   33430
4           205   68305       387   85180
8           233   91400       475   96460
16          232   90730       401   88540

The analysis of these tables highlights that, when METIS is employed, a superlinear speed-up can be obtained with two and four processors, due to cache effects which compensate for the time lost in communications. A very high efficiency is observed also with eight processors, while a loss of efficiency is registered only with sixteen processors, because the daemon process must share one processor with a computational thread. On the other side, when the RCB algorithm is employed, the parallel performances are not so good, due to the unbalancing of the workload and to the larger amount of communications required. Fig. 1 shows the convergence history (residual versus iterations) as a function of the number of processors employed. It is clearly evident that, at least for this test case, the convergence history does not appreciably change with the number of processors, and this is a proof of the robustness of the parallel code.

Figure 1. Convergence history as a function of the number of processors employed

The flow conditions for the 3D test case over the M6 wing are: M = 0.84 and α = 3.06°. The flow develops a supersonic region on the upper surface of the wing delimited by a shock wave that impinges on the wing just after half of the chord in the root wing section, while it touches nearly at the leading edge on the tip. On the wing surface the footprint of the shock wave highlights a typical lambda shape.

Fig. 2 and 3 show respectively the detail of the mesh near the wing and the CFD solution in terms of Mach number isolines. While the grid is adequate to show a correct physical behavior, it is still too coarse to give a good resolution, mainly in the shock region. Nevertheless, for the purpose of the present work a further refinement of the grid has been considered irrelevant. For the 2D case we used a NACA0012 airfoil under the conditions: M = 0.8 and α = 1.25°. We have used a fine mesh of 148292 points with 295786 triangles and the results matched well with those of the AGARD-211 [12] as shown in Fig. 4, which reports the pressure coefficient over the profile. The lift and drag coefficients obtained are respectively 0.3305643 and 2.1088321E-02, which are in good agreement with those reported in the AGARD notes, computed by other codes.

Figure 2. Detail of the mesh

Figure 3. Mach Number isolines

Figure 4. Pressure coefficient distribution around a NACA0012 airfoil


URL:

https://www.sciencedirect.com/science/article/pii/S0927545298800142

Colorless Wait-Free Computation

Maurice Herlihy , ... Sergio Rajsbaum , in Distributed Computing Through Combinatorial Topology, 2014

4.1.1 Overview

A distributed system is a set of communicating state machines called processes. It is convenient to model a process as a sequential automaton with a possibly infinite set of states. Remarkably, the set of computable tasks in a given system does not change if the individual processes are modeled as Turing machines or as even more powerful automata with infinite numbers of states, capable of solving "undecidable" problems that Turing machines cannot. The important questions of distributed computing are concerned with communication and dissemination of knowledge and are largely independent of the computational power of individual processes.

For the time being, we will consider a model of computation in which processes communicate by reading and writing a shared memory. In modern shared-memory multiprocessors, often called multicores, memory is a sequence of individually addressable words. Multicores provide instructions that read or write individual memory words in a single atomic step.

For our purposes, we will use an idealized version of this model, recasting conventional read and write instructions into equivalent forms that have a cleaner combinatorial structure. Superficially, this idealized model may not look like your laptop, but in terms of task solvability, these models are equivalent: any algorithm in the idealized model can be translated to an algorithm for the more realistic model, and vice versa.

Instead of reading an individual memory word, we assume the ability to read an arbitrarily long sequence of contiguous words in a single atomic step, an operation we call a snapshot. We combine writes and snapshots as follows. An immediate snapshot takes place in two contiguous steps. In the first step, a process writes its view to a word in memory, possibly concurrently with other processes. In the very next step, it takes a snapshot of some or all of the memory, possibly concurrently with other processes. It is important to understand that in an immediate snapshot, the snapshot step takes place immediately after the write step.
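The sketch below is a purely sequential illustration (not from the book) of the two-step structure just described: a process first writes its view to its own word, then reads the entire memory in one step. It does not model concurrency or atomicity, and the names are assumptions.

#include <stdio.h>

#define NPROC 3

/* Shared memory: one word (view) per process; 0 means "not yet written". */
static int shared[NPROC];

/* Step 1: process p writes its view to its own word. */
void write_view(int p, int view) {
    shared[p] = view;
}

/* Step 2: immediately afterward, p reads the whole memory in one step. */
void snapshot(int p, int out[NPROC]) {
    for (int q = 0; q < NPROC; q++)
        out[q] = shared[q];
    (void)p;
}

int main(void) {
    int view[NPROC];
    write_view(0, 42);        /* process 0 writes its view...            */
    snapshot(0, view);        /* ...and immediately snapshots all memory */
    for (int q = 0; q < NPROC; q++)
        printf("slot %d = %d\n", q, view[q]);
    return 0;
}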

Superficially, a model based on immediate snapshots may seem unrealistic. As noted, modern multicores do not provide snapshots directly. At best, they provide the ability to atomically read a small, constant number of contiguous memory words. Moreover, in modern multicores, concurrent read and write instructions are typically interleaved in an arbitrary order.2 Nevertheless, the idealized model includes immediate snapshots for two reasons. First, immediate snapshots simplify lower bounds. It is clear that any task that is impossible using immediate snapshots is also impossible using single-word reads and writes. Moreover, we will see that immediate snapshots yield simpler combinatorial structures than reading and writing individual words. Second, perhaps surprisingly, immediate snapshots do not affect task solvability. It is well known (see Section 4.4, "Chapter Notes") that one can construct a wait-free snapshot from single-word reads and writes, and we will see in Chapter 14 how to construct a wait-free immediate snapshot from snapshots and single-word write instructions. It follows that any task that can be solved using immediate snapshots can be solved using single-word reads and writes, although a direct translation may be impractical.

In Chapter 5, we extend our results for shared-memory models to message-passing models.

As many as n of the n + 1 processes may fail. For now, we consider only crash failures, that is, failures in which a faulty process simply halts and falls silent. Later, in Chapter 6, we consider Byzantine failures, where faulty processes may communicate arbitrary, even malicious, information.

Processes execute asynchronously. Each process runs at an arbitrary speed, which may vary over time, independently of the speeds of the other processes. In this model, failures are undetectable: a nonresponsive process may be slow, or it may have crashed, but there is no way for another process to tell. In later chapters, we will consider synchronous models, whereby processes take steps at the same time, and semi-synchronous models, whereby there are bounds on how far their executions can diverge. In those models, failures are detectable.

Recall from Chapter 1 that a task is a distributed problem in which each process starts with a private input value, the processes communicate with one another, and then each process halts with a private output value.

For the next few chapters, we restrict our attention to colorless tasks, in which it does not matter which process is assigned which input or which process chooses which output, but only which sets of input values were assigned and which sets of output values were chosen.

The consensus task studied in Chapter 2 is colorless: all processes agree on a single value that is some process's input, but it is irrelevant which process's input is chosen or how many processes had that input. The colorless tasks encompass many, but not all, of the central problems in distributed computing. Later, we will consider broader classes of tasks.

A protocol is a program that solves a task. For now, we are interested in protocols that are wait-free: each process must complete its computation in a bounded number of steps, implying that it cannot wait for any other process. One might be tempted to consider algorithms whereby one process sends some information to another and waits for a response, but the wait-free requirement rules out this technique, along with other familiar techniques such as barriers and mutual exclusion. The austere severity of the wait-free model helps us uncover basic principles more clearly than less demanding models. Later, we will consider protocols that tolerate fewer failures or even irregular failure patterns.

We are primarily interested in lower bounds and computability: which tasks are computable in which models, and in the communication complexity of computable tasks. For this reason we assume without loss of generality that processes use "full-information" protocols, whereby they communicate to each other everything they "know." For clarity, however, in the specific protocols presented here, processes usually send only the information needed to solve the task at hand.


URL:

https://www.sciencedirect.com/science/article/pii/B9780124045781000048

Dataflow Processing

Zivojin Sustran , ... Veljko Milutinovic , in Advances in Computers, 2015

3.3.1 The Split Data Cache in a Multiprocessor System

Sahuquillo and Pont [7] proposed a few extensions to the STS cache system, presented in Ref. [2], in order to adapt it for a shared memory multiprocessor environment. In such systems, a tag for data is kept until data eviction, and when the tag for data gets changed, the data is relocated to another sub-cache system. As depicted in Fig. 9, a snoop controller is added to the second-level temporal sub-cache system and to the spatial sub-cache system. In conjunction with the STS cache organization, an extension of the Berkeley cache coherence protocol is used, as depicted in Fig. 10. When a hit occurs in one sub-cache system, it sends a signal to the other one to stop it from accessing the bus and passes the requested data to the processor. When a miss occurs, the spatial sub-cache system requests the data on the bus and accepts the received data. Both sub-cache systems snoop on the bus for invalidation signals and, when necessary, invalidate the data. Simulation shows that an STS system in a multiprocessor environment can reach the same performance as a conventional cache, while occupying smaller space on the die. In order to avoid the "cold-start" period for data being read again after it has been evicted, it is possible to enable tag history, especially when eviction of a shared block happened as a result of a write operation by another processor.

Figure 9. Split data cache system in a multiprocessor system environment. Legend: MUX—multiplexer; BUS—system bus; SC—spatial sub-cache with prefetching mechanism; TAG—unit for dynamic tagging/retagging of data; TC L1 and TC L2—the first and the second levels of the temporal sub-cache; SNOOP—snoop controller for the cache coherence protocol. Description: the cache system is divided into spatial and temporal sub-cache systems. Each sub-cache system has an associated snoop controller. Explanation: the divided cache system allows the use of different caching strategies for different types of localities that data exhibits. A data block is cached into a specific sub-cache system based on the associated tag. The tag for each data block is determined by the history of the previous accesses to that data block, the ones that happened after the last time the block was fetched. Implication: data can change the sub-cache system in which it is stored. Only the spatial sub-cache system fetches data.

Figure 10. The modified Berkeley cache coherence protocol. Legend: I—invalid state; US—unmodified-shared state; MS—modified-shared state; ME—modified-exclusive state; PR—processor read; PW—processor write; BR—bus read; BW—bus write; U—requested block is unmodified; M—requested block is modified. Description: the shown cache coherence protocol must be used in the STS system, if it is deployed in a multiprocessor environment. Explanation: a read miss fetches a block in the US state or in the MS state, depending on whether the block is unmodified or modified in other parts of the memory system. A read hit does not alter the state of a block. A write to a block changes the state of the block to ME and sends an invalidation signal to the bus. Implication: if the block is in the ME state when its eviction happens, the block is written back to the memory.

The example in Fig. 11 presents the code for calculating the sum of elements in an array. In a parallelized algorithm, each processor calculates a partial sum of elements for an unshared part of the array. The partial sum is saved in an unshared variable. When the partial sum is calculated, its value is added to the global sum. The global sum is saved in a global variable for which synchronization is necessary. When the calculation is over, each processor switches to another part of the array. An STS cache in a multiprocessor system is used in executing this code and the assumption is that the array is relatively large.

Figure 11. An example of an STS cache in a multiprocessor system. Legend: N—arbitrary constant; i—loop counter; a—array; sum—unshared scalar variable; x—shared scalar variable; lock, unlock—synchronization primitives. Description: a parallel algorithm that calculates the sum of elements in an array. Explanation: the algorithm is executed on a system with an STS cache in a multiprocessor system, with the assumption that the array is relatively large. The loop is typically executed several times at execution time. Implication: the array is cached in the spatial sub-cache system and the scalar variables are cached in the temporal sub-cache system. Only the eviction of shared scalar variables is necessary, because the invalidation cycle is executed after every write.
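Since the figure itself is not reproduced here, the following sketch is an assumption about the shape of the code it describes, using the names from the legend (N, i, a, sum, x, lock/unlock) and OpenMP locks as the synchronization primitive; it is an illustration, not the authors' listing.

#include <omp.h>
#include <stdio.h>

#define N 1000000             /* arbitrary constant, as in the figure legend    */

static double a[N];           /* array: exhibits spatial locality               */
static double x = 0.0;        /* shared scalar (global sum): temporal locality  */
static omp_lock_t x_lock;     /* lock/unlock synchronization primitive          */

int main(void) {
    for (int i = 0; i < N; i++)
        a[i] = 1.0;

    omp_init_lock(&x_lock);

    #pragma omp parallel
    {
        double sum = 0.0;                 /* unshared partial sum per processor */

        #pragma omp for
        for (int i = 0; i < N; i++)       /* each thread sums an unshared part  */
            sum += a[i];

        omp_set_lock(&x_lock);            /* synchronize access to the shared x */
        x += sum;
        omp_unset_lock(&x_lock);
    }

    omp_destroy_lock(&x_lock);
    printf("sum = %f\n", x);
    return 0;
}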

Initially, all accessed data is copied into the spatial sub-cache system. The counter count_X is increased with every access to scalar data (because scalar data resides in the upper half of the block and the lower half of the block does not contain data that is being accessed by the code) until it reaches the upper limit. The blocks with scalar data are retagged as temporal data blocks and transferred to the temporal sub-cache system. Accesses to data in the array are sequential with unit stride. The counter count_X contains the value zero for all data blocks that contain the array, and those blocks will stay in the spatial sub-cache system until evicted. The unshared scalar data will stay in the temporal sub-cache system during all iterations of the code, because blocks in the temporal sub-cache system have the size of one word and the invalidation of the unshared scalar data cannot happen. The shared data will be invalidated after every write. If the array is properly divided, the invalidation of blocks that contain parts of the array will not happen. This example shows that data exhibiting only temporal or only spatial locality is correctly detected by run-time algorithms. It also indicates that the number of invalidation cycles can be reduced by overlapping some of the activities.

The essence of the approach of Sahuquillo et al. is to bring the STS approach into the multiprocessor environment, for utilization in shared memory multiprocessor systems and distributed shared memory systems. The most critical aspect of any concept-extension research is to provide extensibility without compromising the performance benefits that launched the concept in the first place. Sahuquillo et al. managed to solve this problem by applying an approach based on the efficient tagging of the history of data behavior.


URL:

https://www.sciencedirect.com/science/article/pii/S0065245814000060

The Essential OpenMP

Thomas Sterling , ... Maciej Brodowicz , in High Performance Computing, 2018

7.6 Summary and Outcomes of Chapter 7

"OpenMP" stands for "open multiprocessing".

OpenMP is an API for parallel computing that has bindings to programming languages such as Fortran and C.

OpenMP supports programming of shared-memory multiprocessors, including SMP and distributed shared-memory classes of parallel computer systems.

OpenMP supports the fork–join model of parallel computing. At particular points in the execution the master thread spawns a number of threads and with them performs a part of the program in parallel. The point of multiple worker thread initiation is referred to as the fork. Usually all these threads perform their calculations separately, and when they come to their respective completion they wait for the other threads to terminate at the join of the parallel threads.

OpenMP provides environment variables for controlling execution of parallel codes. These can be set from the OS command line or equivalent prior to execution of the application program.

Runtime library routines help manage parallel application execution, including accessing and using environment variables such as those above. The library routines are provided in the omp.h file, which must be included (#include <omp.h>) prior to using any of these routines.

Threads are the principal means of providing parallelism of computation. A thread is an independently schedulable sequence of instructions combined with its private variables and internal control. Usually there are as many threads allocated to the user computation as there are processor cores assigned to the computation, although this is not required.

omp_get_num_threads() returns the total number of threads currently in the group executing the parallel block from where it is called.

omp_get_thread_num() returns a value to each thread executing the parallel code block that is unique to that thread and can be used as a kind of identifier in its calculations. When the master thread calls this function, the value of 0 is always returned, identifying its special role in the computation.
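A minimal sketch (not from the chapter) showing the two routines above inside a parallel region:

#include <omp.h>
#include <stdio.h>

int main(void) {
    /* Fork: the master thread spawns a team of worker threads. */
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();    /* unique id; 0 for the master thread */
        int nth = omp_get_num_threads();   /* size of the current thread team    */
        printf("thread %d of %d\n", tid, nth);
    }   /* Join: all threads synchronize here before the master continues. */
    return 0;
}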

OpenMP directives are a principal class of constructs used to convert initially sequential codes incrementally to parallel programs. They serve a multitude of purposes, primarily concerned with controlling parallelism through delineation and synchronization.

The parallel directive delineates a block of code that will be executed separately by each of the computing threads.

The parallel for directive permits work sharing of an iterative loop among the executing threads, with one or more iterations performed by each thread.

The private clause in a directive establishes that each thread has its own copy of a variable, and when accessing that designated variable it will read or write its own private copy rather than a shared variable.

The sections directive describes separate code blocks, each of a different sequence of instructions, which may be performed concurrently. There is one thread allocated to each code block.

Synchronization directives define the mechanisms that help coordinate the execution of multiple parallel threads that use a shared context (shared memory) in a parallel program, to prevent race conditions.

The critical directive provides mutual exclusion of access to shared variables by permitting only one thread at a time to perform a given code block. When a thread enters the critical code section, all other threads that attempt to do so are deferred until the thread executing it has completed. Other threads are then free to execute the critical section of code themselves, but only one at a time.
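A minimal sketch (not from the chapter) of the critical directive guarding an update to a shared counter:

#include <omp.h>
#include <stdio.h>

int main(void) {
    int hits = 0;                      /* shared variable updated by all threads */

    #pragma omp parallel
    {
        /* Only one thread at a time may execute this block, preventing a
           race condition on the shared counter. */
        #pragma omp critical
        {
            hits += 1;
        }
    }
    printf("threads counted: %d\n", hits);
    return 0;
}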

The master directive delineates a block of code that is executed only by the master thread, with all other threads skipping over it.

The single directive delineates a block of code that is performed by only a single thread, but it can be any of the executing threads, whichever one gets to that code block first. All threads wait until the thread executing that code block completes it.

The barrier directive is a form of synchronization. When encountering a given barrier directive, all threads halt at that location in the code until all other threads have reached the same point of execution. Only when all the threads have reached the barrier can any of them continue beyond it. Once all the threads have performed the barrier operation, they all continue with the computation after it.

Reduction operators combine a large number of values to produce a single result value. A number of operations can be used for this purpose, such as + and | among others.
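A minimal sketch (not from the chapter) combining the parallel for directive with the reduction clause; the array and its contents are illustrative.

#include <omp.h>
#include <stdio.h>

#define N 1000

int main(void) {
    double a[N];
    for (int i = 0; i < N; i++)
        a[i] = 1.0;

    double sum = 0.0;
    /* The iterations are shared among the threads; each thread accumulates a
       private partial sum, and the reduction(+:sum) clause combines the
       partial sums into the single result value. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %f\n", sum);    /* expected: 1000.0 */
    return 0;
}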


URL:

https://www.sciencedirect.com/science/article/pii/B9780124201583000071

Introduction

David B. Kirk , Wen-mei W. Hwu , in Programming Massively Parallel Processors (Second Edition), 2013

1.5 Parallel Programming Languages and Models

Many parallel programming languages and models have been proposed in the past several decades [Mattson2004]. The ones that are the most widely used are Message Passing Interface (MPI) [MPI2009] for scalable cluster computing, and OpenMP [Open2005] for shared-memory multiprocessor systems. Both have become standardized programming interfaces supported by major computer vendors. An OpenMP implementation consists of a compiler and a runtime. A developer specifies directives (commands) and pragmas (hints) about a loop to the OpenMP compiler. With these directives and pragmas, OpenMP compilers generate parallel code. The runtime system supports the execution of the parallel code by managing parallel threads and resources. OpenMP was originally designed for CPU execution. More recently, a variation called OpenACC (see Chapter 15) has been proposed and supported by multiple computer vendors for programming heterogeneous computing systems.

The major advantage of OpenACC is that it provides compiler automation and runtime support for abstracting away many parallel programming details from programmers. Such automation and abstraction can help make the application code more portable across systems produced by different vendors, as well as different generations of systems from the same vendor. This is why we teach OpenACC programming in Chapter 15. However, effective programming in OpenACC still requires the programmers to understand all the detailed parallel programming concepts involved. Because CUDA gives programmers explicit control of these parallel programming details, it is an excellent learning vehicle even for someone who would like to use OpenMP and OpenACC as their primary programming interface. Furthermore, from our experience, OpenACC compilers are still evolving and improving. Many programmers will likely need to use CUDA-style interfaces for parts where OpenACC compilers fall short.

MPI is a model where computing nodes in a cluster do not share memory [MPI2009]. All data sharing and interaction must be done through explicit message passing. MPI has been successful in high-performance computing (HPC). Applications written in MPI have run successfully on cluster computing systems with more than 100,000 nodes. Today, many HPC clusters employ heterogeneous CPU–GPU nodes. While CUDA is an effective interface with each node, most application developers need to use MPI to program at the cluster level. Therefore, it is important that a parallel programmer in HPC understands how to do joint MPI/CUDA programming, which is presented in Chapter 19.

The amount of effort needed to port an application into MPI, however, can be quite high due to the lack of shared memory across computing nodes. The programmer needs to perform domain decomposition to partition the input and output data across cluster nodes. Based on the domain decomposition, the programmer also needs to call message sending and receiving functions to manage the data exchange between nodes. CUDA, on the other hand, provides shared memory for parallel execution in the GPU to address this difficulty. As for CPU and GPU communication, CUDA previously provided very limited shared memory capability between the CPU and the GPU. The programmers needed to manage the data transfer between the CPU and the GPU in a manner similar to "one-sided" message passing. New runtime support for global address space and automated data transfer in heterogeneous computing systems, such as GMAC [GCN2010] and CUDA 4.0, are now available. With GMAC, a CUDA or OpenCL programmer can declare C variables and data structures as shared between the CPU and the GPU. The GMAC runtime maintains coherence and automatically performs optimized data transfer operations on behalf of the programmer on an as-needed basis. Such support significantly reduces the CUDA and OpenCL programming complexity involved in overlapping data transfer with computation and I/O activities.
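As a minimal illustration (not from the book) of the explicit messaging such a port requires, the sketch below gives each rank a block of a 1D domain and exchanges one boundary value with its neighbor; the decomposition, array name, and sizes are assumptions.

/* Compile with: mpicc halo.c -o halo && mpirun -np 4 ./halo */
#include <mpi.h>
#include <stdio.h>

#define LOCAL_N 8

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Domain decomposition: each rank owns LOCAL_N cells plus one halo cell. */
    double u[LOCAL_N + 1];
    for (int i = 0; i < LOCAL_N; i++)
        u[i] = rank;          /* fill the owned cells with some local data */
    u[LOCAL_N] = -1.0;        /* halo cell, to be filled by the right neighbor */

    /* Exchange: send my first owned cell to the left neighbor, receive the
       right neighbor's first cell into my halo cell. */
    int left  = (rank > 0) ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;
    MPI_Sendrecv(&u[0], 1, MPI_DOUBLE, left, 0,
                 &u[LOCAL_N], 1, MPI_DOUBLE, right, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("rank %d received halo value %f\n", rank, u[LOCAL_N]);

    MPI_Finalize();
    return 0;
}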

In 2009, several major industry players, including Apple, Intel, AMD/ATI, and NVIDIA, jointly developed a standardized programming model called Open Computing Language (OpenCL) [Khronos2009]. Similar to CUDA, the OpenCL programming model defines language extensions and runtime APIs to allow programmers to manage parallelism and data delivery in massively parallel processors. In comparison to CUDA, OpenCL relies more on APIs and less on language extensions. This allows vendors to quickly adapt their existing compilers and tools to handle OpenCL programs. OpenCL is a standardized programming model in that applications developed in OpenCL can run correctly without modification on all processors that support the OpenCL language extensions and API. However, one will likely need to modify the applications to achieve high performance for a new processor.

Those who are familiar with both OpenCL and CUDA know that there is a remarkable similarity between the key concepts and features of OpenCL and those of CUDA. That is, a CUDA programmer can learn OpenCL programming with minimal effort. More importantly, virtually all techniques learned using CUDA can be easily applied to OpenCL programming. Therefore, we introduce OpenCL in Chapter 14 and explain how one can apply the key concepts in this book to OpenCL programming.


URL:

https://www.sciencedirect.com/science/article/pii/B9780124159921000018

Concurrent objects

Maurice Herlihy , ... Michael Spear , in The Art of Multiprocessor Programming (Second Edition), 2021

3.4.2 Linearizability versus sequential consistency

Like sequential consistency, linearizability is nonblocking: there is a linearizable response to any pending call of a total method. In this way, linearizability does not limit concurrency.

Threads that communicate only through a single shared object (e.g., the memory of a shared-memory multiprocessor) cannot distinguish between sequential consistency and linearizability. Only an external observer, who can see that one operation precedes another in the real-time order, can tell that a sequentially consistent object is not linearizable. For this reason, the difference between sequential consistency and linearizability is sometimes called external consistency. Sequential consistency is a good way to describe standalone systems, where composition is not an issue. However, if the threads share multiple objects, these objects may be external observers for each other, as we saw in Fig. 3.8.

Unlike sequential consistency, linearizability is compositional: the result of composing linearizable objects is linearizable. For this reason, linearizability is a good way to describe components of large systems, where components must be implemented and verified independently. Because we are interested in systems that compose, most (but not all) data structures considered in this book are linearizable.


URL:

https://www.sciencedirect.com/science/article/pii/B9780124159501000124