

Challenges Addressed by the Application Representation

As noted in Section 2, three major challenges must be addressed if the application representation is to be used for large, real-world programs: keeping the representation small and scalable, supporting adaptive codes whose parallel behavior depends on runtime results, and computing the representation automatically using parallelizing compiler technology.

Below, we briefly discuss how the design of the representation addresses these challenges. The compiler techniques for computing the representation are described in more detail in [4].

To address size and scalability, our main solution is to use the static task graph as the basis of the application representation. The STG is a concise, symbolic representation whose size is proportional only to the size of the source code, and independent of the degree of parallelism or the size of the program's input data set. The STG can be efficiently computed and stored even for very large programs. Furthermore, we believe the STG can support many modeling approaches, including efficient compiler-driven simulation [2], abstract analytical models such as LogP and LogGP [1], and sophisticated hybrid models, e.g., combining processor simulation with analytical or simulation models of communication performance [1].
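To make the size argument concrete, the following sketch (hypothetical Python, not the actual dHPF data structures) shows one way an STG node can describe all dynamic instances of a task symbolically, so that a single node suffices for a loop over N fine-grain tasks:

    # Hypothetical sketch of a static task graph (STG) node: each node
    # stands for ALL dynamic instances of a task, described symbolically,
    # so the graph's size tracks the source code, not the problem size.
    from dataclasses import dataclass

    @dataclass
    class SymbolicRange:
        """Instance set such as {i : 1 <= i <= N}, N a program symbol."""
        var: str        # index variable, e.g. "i"
        lower: str      # symbolic lower bound, e.g. "1"
        upper: str      # symbolic upper bound, e.g. "N"

    @dataclass
    class STGNode:
        name: str                   # task identifier from the source code
        kind: str                   # "compute", "send", "recv", ...
        instances: SymbolicRange    # symbolic set of dynamic instances

    @dataclass
    class STGEdge:
        src: STGNode
        dst: STGNode
        mapping: str                # symbolic map between instance sets,
                                    # e.g. "send(i) -> recv(i+1)"

    # One node represents a loop over N fine-grain tasks: the STG does
    # not grow with N.
    work = STGNode("update_row", "compute", SymbolicRange("i", "1", "N"))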

Some detailed modeling techniques (e.g., models that capture sophisticated task scheduling strategies [5]), however, require the dynamic task graph. One approach to ensuring that the size of the DTG remains manageable for very large problems is to use the condensed DTG defined earlier. Informally, condensing makes the size of each parallel phase proportional to the number of physical processors instead of a function of the degree of fine-grain parallelism in the problem. For an example, see Section 3.5. The condensed graph seems to be a natural representation for message-passing codes because the communication (which also implies synchronization) is usually written explicitly between processes. Furthermore, message-passing programs typically use static scheduling, so constructing the condensed graph is not difficult.
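The following sketch (hypothetical Python; the function names and the block mapping are illustrative assumptions, not the representation described in the paper) shows the condensing step for one parallel phase under a static schedule, and why the condensed phase has one node per processor:

    # Hypothetical sketch: condensing one parallel phase of a dynamic
    # task graph. Under a static schedule, the tasks mapped to each
    # processor are merged into a single condensed task, so the phase
    # has P nodes instead of one node per fine-grain task.
    def condense_phase(tasks, num_procs, assign):
        """tasks: list of (task_id, cost); assign: task_id -> processor.
        Returns one condensed task (with summed cost) per processor."""
        condensed = [{"proc": p, "tasks": [], "cost": 0.0}
                     for p in range(num_procs)]
        for task_id, cost in tasks:
            c = condensed[assign(task_id)]
            c["tasks"].append(task_id)
            c["cost"] += cost
        return condensed

    # A phase with 1,000,000 unit-cost tasks block-mapped onto 8
    # processors condenses to just 8 nodes.
    N, P = 1_000_000, 8
    phase = condense_phase([(i, 1.0) for i in range(N)], P,
                           assign=lambda i: i * P // N)
    assert len(phase) == P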

For many shared-memory programs, however, there are significant tradeoffs in using the condensed DTG because the condensed graph depends on the actual allocation of fine-grain tasks to threads. Most importantly, the condensed graph would be difficult to construct for dynamic scheduling techniques, where computing the allocation would require a detailed prediction of the execution sequence of all tasks in the program. For shared-memory programs, this could be a significant drawback because such programs sometimes use sophisticated dynamic task scheduling strategies, and there can be important performance tradeoffs to evaluate in choosing a strategy that achieves high performance [5].

An alternative approach for computing the DTG for very large codes is to instantiate the graph "on the fly" during the model solution, instead of precomputing it. (This is similar to the runtime instantiation of tasks in graphical parallel languages such as CODE [14].) This approach may be too expensive, however, for simple analytical modeling, since the cost of instantiating the graph may greatly outweigh the savings from using a simple analytical model.
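As a rough illustration of this alternative, the sketch below (hypothetical Python; the instantiate function and the dictionary encoding are our own illustrative assumptions) generates dynamic task instances lazily as a model solver consumes them, so the full DTG is never materialized in memory:

    # Hypothetical sketch of instantiating the DTG "on the fly": dynamic
    # task instances are generated one at a time as the model solver
    # consumes them, instead of precomputing the whole graph.
    def instantiate(stg_node, bindings):
        """Yield dynamic instances of an STG node, given concrete values
        (bindings) for its symbolic bounds."""
        lo = eval(stg_node["lower"], {}, bindings)   # e.g. "1" -> 1
        hi = eval(stg_node["upper"], {}, bindings)   # e.g. "N" -> 1000
        for i in range(lo, hi + 1):
            yield (stg_node["name"], i)              # one dynamic instance

    node = {"name": "update_row", "lower": "1", "upper": "N"}
    solver_time = 0.0
    for task in instantiate(node, {"N": 1000}):
        solver_time += 1.0   # the model consumes each instance, then drops it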

The second main challenge, namely supporting adaptive codes, arises because the parallel behavior of such codes depends on intermediate computational results of the program. This could mean that a significant part of the computation has to be executed to construct the DTG and to estimate communication and load balancing parameters for intermediate stages of execution. Although the STG is independent of runtime results, any modeling technique that uses the STG would have to account for the runtime behavior in examining and using the STG and would be faced with the same difficulty (e.g., see [2]). There are two possible approaches to this problem. First, for any execution-driven modeling study, we can determine the runtime parallel behavior on the fly from intermediate results of the execution. Alternatively, for models that require the DTG, we can precompute and store the DTG during an actual execution of the program, and measure the values of the above parameters. Both these approaches would require significant additional compiler support to instrument the program for collecting the relevant information at execution time.
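The second approach could be instrumented roughly as in the following sketch (hypothetical Python; the DTGRecorder class and its methods are illustrative assumptions, not the compiler's actual instrumentation), where each task execution is timed and recorded so the measured DTG can be stored for later modeling studies:

    # Hypothetical sketch: recording the DTG and measured parameters
    # (e.g., task durations) during one actual execution of the program.
    import time

    class DTGRecorder:
        def __init__(self):
            self.nodes = []   # (task name, instance, measured duration)
            self.edges = []   # (producer instance, consumer instance)

        def run_task(self, name, instance, fn, *args):
            start = time.perf_counter()
            result = fn(*args)
            self.nodes.append((name, instance,
                               time.perf_counter() - start))
            return result

        def record_dep(self, src, dst):
            self.edges.append((src, dst))

    rec = DTGRecorder()
    rec.run_task("update_row", 1, sum, range(1000))  # instrumented call
    rec.record_dep(("update_row", 1), ("update_row", 2))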

Finally, an important goal in designing the application representation, as mentioned earlier, is to be able to compute the representation automatically using parallelizing compiler technology. The key to achieving this goal is our use of a static task graph that captures the parallel structure of the program using extensive symbolic information and static control-flow information. In particular, the use of symbolic integer sets and mappings is crucial for representing all the dynamic instances of a task node or a task graph edge. The ability to synthesize code directly from this representation is valuable for instantiating task nodes and edges for a given program input.
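To illustrate why symbolic sets and mappings make instantiation mechanical, the sketch below (hypothetical Python; the send/recv naming and the enumerate_edges helper are illustrative assumptions) expands one symbolic edge mapping into all of its dynamic edges for a concrete program input:

    # Hypothetical sketch: a symbolic edge mapping such as
    # {send(i) -> recv(i+1) : 1 <= i <= N-1} can be enumerated for a
    # given program input, producing every dynamic edge without storing
    # them in the static representation.
    def enumerate_edges(lower, upper, mapping, bindings):
        """lower/upper: symbolic bounds; mapping: i -> consumer index."""
        lo = eval(lower, {}, bindings)
        hi = eval(upper, {}, bindings)
        return [(("send", i), ("recv", mapping(i)))
                for i in range(lo, hi + 1)]

    # For input N = 4, one symbolic edge expands to three dynamic edges.
    edges = enumerate_edges("1", "N - 1", lambda i: i + 1, {"N": 4})
    assert edges == [(("send", 1), ("recv", 2)),
                     (("send", 2), ("recv", 3)),
                     (("send", 3), ("recv", 4))]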

We have extended the Rice dHPF compiler infrastructure [3] to construct the static task graph for programs in High Performance Fortran (HPF), and to instantiate the dynamic task graph [4]. This implementation was used in a successful collaboration with the parallel simulation group at UCLA, which has led to a very substantial improvement in the state of the art of parallel simulation of message-passing programs [2]. The compiler implementation and its use for supporting efficient simulation are described briefly in Section 4.

