HLPP2001: International Workshop on
High-level Parallel Programming and Applications
26-27 March 2001,
Université d'Orléans, France
Château de la Source
- Tagged-token data-flow for skeletons
This paper discusses the use of a tagged-token macro data-flow
execution model for implementing skeletons within the SKIPPER
parallel programming environment. We show that it provides a
suitable implementation model for skeletons involving runtime-bounded
iterations and/or recursion such as data and task farms, especially
in the presence of nesting. The new version of SKIPPER relies on a
custom data-flow interpreter controlling SPMD-based parallelism.
Input data-flow graphs are obtained from a skeletal program
specification written in CAML that makes use of user-defined sequential
functions written in C. Initial evaluation suggests that performance
close to handcrafted C with MPI can be achieved.
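As an illustration of the kind of skeletal specification described above, the following is a minimal OCaml sketch of a farm-style skeleton applied to a sequential worker function. The names farm and worker are hypothetical and do not reflect SKIPPER's actual CAML interface, and the worker is written directly in OCaml rather than imported from C.

(* Hypothetical sketch of a skeletal specification in the style described
   above: a data farm applies a sequential worker to each input item.
   In SKIPPER the worker would be a user-defined C function; here it is
   written in OCaml to keep the example self-contained. *)

(* Sequential stand-in for the farm skeleton: in a real system this would
   be mapped onto a data-flow graph and executed in parallel. *)
let farm (worker : 'a -> 'b) (inputs : 'a list) : 'b list =
  List.map worker inputs

(* User-defined sequential function (stand-in for a C worker). *)
let worker x = x * x

let () =
  farm worker [1; 2; 3; 4]
  |> List.iter (fun r -> Printf.printf "%d " r);
  print_newline ()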
- Skeletons, BSP and performance portability
The skeletal approach to parallel programming combines a high-level
compositional style with efficiency. A second advantage of skeletal
programming is portability, since implementation decisions are
usually taken at compile time. The paper claims that an
intermediate model embedding the main performance features of
the target architecture facilitates performance portability across
parallel architectures. This claim is motivated by a description of the
Skel-BSP framework, which implements a skeleton system on top of a
BSP computer. A prototype compiler based on a set of BSP
templates is presented, together with a set of performance models,
one for each skeleton, which allow local optimization.
The paper also introduces a global optimization strategy
using a set of transformation rules. This local+global approach
seems to be a viable way of writing parallel software in a
machine-independent fashion (write once, compile everywhere).
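For background, such skeleton performance models are typically built on the per-superstep cost formula of the BSP model; the notation below is the standard one (local work w, maximal number h of words sent or received by a processor, per-word communication cost g, barrier latency l) and is quoted here as general context rather than taken from the paper itself:

  T_{superstep} = w + h \cdot g + l,
  \qquad
  T_{program} = \sum_{s=1}^{S} \left( w_s + h_s \cdot g + l \right)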
- Weakest specifunctions for BSP
Yifeng Chen, J.W. Sanders
- Distributed evaluation of functional BSP programs
The BS-LAMBDA(p)-calculus is a calculus of functional
bulk synchronous parallel (BSP) programs. It is the basis for
the design of a bulk synchronous parallel ML language. For
data-parallel languages, there are two points of view: the
programming model where a program is seen as a sequence of
operations on parallel vectors, and the execution model where
the program is a parallel composition of programs run on each
processor of the parallel machine. BSP algorithms are expressed
as data-parallel algorithms with explicit (physical) processes,
so that parallel execution times can be estimated.
We present here a minimally synchronous distributed evaluation
for BSP execution (which corresponds to the execution model).
This distributed evaluation is correct w.r.t. the call-by-value
strategy of the BS-LAMBDA(p)-calculus (which corresponds to
the programming model).
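The two points of view mentioned above can be illustrated by a toy, purely sequential OCaml simulation in which a parallel vector is a function from processor index to value. The names mkpar, apply and proj are used here only as illustrative stand-ins for BSP primitives; this sketch is not the BSML library itself.

(* Toy simulation of the programming-model view: a parallel vector of
   width p is a function from processor id to value. *)
let p = 4                                (* number of processors *)
type 'a par = int -> 'a                  (* parallel vector *)

let mkpar (f : int -> 'a) : 'a par = f   (* build a vector from a function *)
let apply (fs : ('a -> 'b) par) (xs : 'a par) : 'b par =
  fun i -> (fs i) (xs i)                 (* pointwise application *)
let proj (xs : 'a par) : int -> 'a = xs  (* read the vector back *)

(* Programming-model view: one program over whole parallel vectors. *)
let squares = apply (mkpar (fun _ x -> x * x)) (mkpar (fun i -> i + 1))

(* Execution-model view: the same computation seen as p local programs,
   one per processor, each computing only its own component. *)
let () =
  for i = 0 to p - 1 do
    Printf.printf "processor %d: %d\n" i (proj squares i)
  done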
- Component dependence metadata in adaptive parallel applications
Paul H.J. Kelly, Olav Beckmann, Tony Field, Scott B. Baden
This paper describes THEMIS, a programming model and run-time
library being designed to support cross-component performance
optimization through explicit manipulation of the computation's
iteration space at run-time.
Each component is augmented with "component dependence metadata",
which characterizes the constraints on its execution order, data
distribution and memory access order. We show how this supports
dynamic adaptation of each component to exploit the available
resources, the context in which its operands are generated and its
results are used, and the evolution of the problem instance.
Using a computational fluid dynamics visualization example as
motivation, we show how component dependence metadata provides
a framework in which a number of interesting optimizations become
possible. Examples include data placement optimization, loop
fusion, tiling, memoization, checkpointing and incrementalization.
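Purely as an illustration of the kind of information such metadata might record, the following OCaml type sketches a component carrying a description of its iteration space, data distribution and memory access order. All field and type names here are invented for this example and are not part of THEMIS's actual interface.

(* Invented illustration of component dependence metadata: the record
   describes how a component traverses its iteration space, how its
   operands are distributed, and in what order memory is touched.
   None of these names come from THEMIS itself. *)
type distribution = Block | Cyclic | BlockCyclic of int

type dependence_metadata = {
  iteration_space : (int * int) list;  (* bounds of each loop dimension *)
  execution_order : int list;          (* a legal permutation of the loop nest *)
  data_distribution : distribution;    (* how operands are laid out *)
  access_stride : int list;            (* memory access order per dimension *)
}

(* A component is its computation plus the metadata a run-time system
   could use to retile, fuse or redistribute it. *)
type 'a component = {
  run : unit -> 'a;
  meta : dependence_metadata;
}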
- A new way to divide and conquer
Valiant's model of bulk-synchronous parallel (BSP) computation does
not allow the programmer to synchronize a subset of a parallel
computer's processors rather than the complete set. This is
perceived by many to be an obstacle to expressing divide-and-conquer
algorithms in the BSP model. We argue that the divide-and-conquer
paradigm fits naturally into the BSP model, without any need for
subset synchronization. The proposed method of divide-and-conquer
BSP programming is fully compliant with the BSP computation model.
The method is based on sequentially interleaved threads of BSP
computation, called superthreads.
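The interleaving idea can be sketched as follows: each superthread is represented as a sequence of supersteps, and a scheduler runs one superstep of every live superthread between two global barriers, so only full-machine synchronization is ever needed. This OCaml toy is an illustration of the stated principle, not the authors' actual implementation.

(* Toy illustration: a superthread is a sequence of supersteps (thunks).
   The scheduler interleaves them so that every live superthread performs
   exactly one superstep between two global barriers; no subset
   synchronization is ever required. *)
type superthread = (unit -> unit) list

let run_interleaved (threads : superthread list) : unit =
  let rec loop threads round =
    if List.exists (function [] -> false | _ -> true) threads then begin
      (* one superstep of every live superthread *)
      let rest =
        List.map
          (function
            | [] -> []
            | step :: rest -> step (); rest)
          threads
      in
      Printf.printf "-- global barrier after superstep %d --\n" round;
      loop rest (round + 1)
    end
  in
  loop threads 0

(* Two superthreads, e.g. the two halves of a divide-and-conquer step. *)
let () =
  run_interleaved
    [ [ (fun () -> print_endline "left: divide");
        (fun () -> print_endline "left: conquer") ];
      [ (fun () -> print_endline "right: divide");
        (fun () -> print_endline "right: conquer") ] ]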
- Shared memory implementation of constraint satisfaction problem
Zineb Habbas, Michaël Krajecki, Daniel Singer
Many problems in Computer Science, especially in Artificial
Intelligence, can be formulated as Constraint Satisfaction Problems
(CSP). This paper presents a parallel implementation of the
Forward-Checking algorithm for solving a binary CSP over finite
domains. Its main contribution is to use a simple decomposition
strategy in order to dynamically distribute the search tree
among machines. The feasibility and benefit of this approach
are studied for a shared-memory model. An implementation is
sketched using the emerging OpenMP standard for shared memory,
which also handles load balancing. We highlight satisfactory
efficiencies obtained without any elaborate load-balancing
policy. All the experiments were carried out on the
Silicon Graphics Origin 2000 parallel machine.
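The algorithm being parallelized can be sketched sequentially as follows. The paper's implementation uses OpenMP; this OCaml sketch only illustrates the sequential forward-checking search whose tree the decomposition strategy distributes.

(* Sequential forward checking for a binary CSP: variables 0 .. n-1,
   each with a finite domain; [compat i vi j vj] says whether assigning
   vi to variable i is compatible with assigning vj to variable j. *)
let forward_checking ~(n : int) ~(domains : int list array)
    ~(compat : int -> int -> int -> int -> bool) : int list option =
  let rec search i assignment domains =
    if i = n then Some (List.rev assignment)
    else
      (* try each value still in the current variable's domain *)
      List.fold_left
        (fun found v ->
          match found with
          | Some _ -> found
          | None ->
              (* prune the domains of the future variables *)
              let pruned = Array.copy domains in
              let wiped = ref false in
              for j = i + 1 to n - 1 do
                pruned.(j) <- List.filter (fun w -> compat i v j w) pruned.(j);
                if pruned.(j) = [] then wiped := true
              done;
              if !wiped then None
              else search (i + 1) (v :: assignment) pruned)
        None domains.(i)
  in
  search 0 [] domains

(* Example: 4-queens as a binary CSP (variable i = row of the queen
   in column i). *)
let () =
  let n = 4 in
  let domains = Array.make n [0; 1; 2; 3] in
  let compat i vi j vj = vi <> vj && abs (vi - vj) <> abs (i - j) in
  match forward_checking ~n ~domains ~compat with
  | Some sol -> List.iter (Printf.printf "%d ") sol; print_newline ()
  | None -> print_endline "no solution"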
- Tuning task granularity and data locality of data parallel GpH programs
H-W. Loidl, P.W. Trinder, C. Butz
The performance of data parallel programs often hinges on two key
coordination aspects: the computational costs of the parallel
tasks relative to their management overhead -- task granularity;
and the communication costs determined by the proximity of tasks to their
data -- data locality. In data parallel programs both
granularity and locality can be improved by clustering,
i.e. arranging for parallel tasks to operate on related
sub-collections of data.
The GpH parallel functional language automatically manages most
coordination aspects, but also allows some high-level control of
coordination using evaluation strategies. We study the
coordination behaviour of three typical data parallel programs,
and find that while some can be improved by introducing
clustering evaluation strategies, others require that the
program be restructured. Thus, we introduce a new generic
Cluster class that allows clustering to be systematically
introduced, and improved by program transformation. In contrast to
many other parallel program transformation approaches, we
transform large executable programs and report performance
results on a 32-processor Beowulf cluster.
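The clustering idea, i.e. arranging for each task to work on a related sub-collection of the data, can be illustrated independently of GpH by a simple chunking function. This OCaml sketch is only a sequential analogy for the Haskell evaluation strategies and Cluster class discussed in the paper.

(* Illustration of clustering: split the data into contiguous
   sub-collections so that each parallel task performs enough work to
   amortize its management overhead (granularity) and touches data that
   lives together (locality). *)
let cluster (size : int) (xs : 'a list) : 'a list list =
  let rec go acc current k = function
    | [] -> List.rev (if current = [] then acc else List.rev current :: acc)
    | x :: rest ->
        if k = size then go (List.rev current :: acc) [x] 1 rest
        else go acc (x :: current) (k + 1) rest
  in
  go [] [] 0 xs

(* One "task" per cluster: here each task just sums its sub-collection. *)
let () =
  [1; 2; 3; 4; 5; 6; 7]
  |> cluster 3
  |> List.map (List.fold_left ( + ) 0)
  |> List.iter (Printf.printf "%d ");
  print_newline ()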
- Parallel functional programming at two levels of abstraction
Ricardo Peña, Fernando Rubio
The parallel functional language Eden extends Haskell with
expressions to define and instantiate process systems. These
extensions also allow the easy definition of skeletons as higher-order
functions. Parallel programming is possible in Eden at two levels:
recursive programming and higher-order programming. At the lower
level, processes are explicitly created by using recursive
definitions, allowing the definition of skeletons. This is
uncommon, as most skeleton-based languages use an imperative
language to create new skeletons. At the higher level, available
skeletons are used to create applications or to define new
skeletons on top of existing ones. In this paper, we present
five skeletons, most of them well-known, covering a wide range of
parallel structures. For each one, several Eden implementations are
given, together with their cost models. Finally, examples of
application programming are shown, including predicted and actual
results on a Beowulf cluster.