August 4, 2006
Three Innovative Parallel Discrete Event
Complex discrete event simulations may require too much processing power and memory space to be run sequentially on a single computer. Parallel discrete event simulation (PDES) languages attempt to provide a solution. Although PDES languages have been one of the most active areas of simulation research during the past few years, PDES languages still face challenges in providing a significant performance advantage over sequential simulations, especially for fine-grained simulations.
The purpose of this paper is to investigate three of the most promising and innovative new PDES languages, to determine how successfully each of them solves the major challenges to developing high-performance parallel simulations, and to decide which language might be the best for various types of simulations. The methodology was to review the recent literature (since 2000) on parallel discrete event simulation, and the conclusions are based on the reports provided by the language developers.
The three PDES languages investigated are POSE, μsik, and Aurora—the first developed at the Department of Computer Science, University of Illinois at Urbana-Champaign, and the latter two developed at the College of Computing, Georgia Institute of Technology. The comparisons and research results are presented in the paper's conclusion.
2. POSE: Parallel Object-Oriented Simulation Environment
POSE, Parallel Object-oriented Simulation Environment, is a parallel simulation language designed specifically to improve the performance of fine-grained discrete event simulations. POSE is written in CHARM++, a C++ parallel programming system that is highly portable among parallel computing environments. Each logical process in a simulation is mapped to a C++ object (known as a chare), which CHARM++ automatically allocates to a processor. (Wilmarth and Kale, 2004)
2.1 Time Management in POSE
A parallel discrete event simulation created in POSE consists of a collection of logical processes (LPs) that can be run in parallel. Each LP is essentially a separate discrete event simulator that manages one or more simulation entities in the simulation model, maintains state information, and has a list of pending time-stamped events that it needs to process. An LP can schedule events for itself, receive events from other LPs, or assign events to other LPs.
The POSE time management system synchronizes the LPs so that each LP processes all its events in the correct (i.e. the timestamped) order. The requirement that each LP process its events in timestamped order is known as the local causality constraint. Basically, an LP conforms to the local causality constraint by repeatedly removing and processing the event with the smallest timestamp from its pending event list. The problem, however, is that after an LP begins processing an event it might receive an event from another LP with an even smaller timestamp, resulting in an out-of-order event execution, which is known as a causality error. (Wilmarth and Kale 2004) There are two main strategies for dealing with this problem: conservative synchronization and optimistic synchronization.
With conservative synchronization, before an LP begins processing its earliest pending event, it blocks until that event is safe—that is, until the LP is certain that no new event can be received that has a smaller timestamp. This strategy always maintains proper event execution order, but does so at the cost of periodically blocking and thereby limiting execution parallelism.
With optimistic synchronization, which is used in POSE, an LP proceeds to execute its earliest pending event even if that event is unsafe—that is, even if it is still possible to receive a new event with a smaller timestamp. However, before an LP begins processing an unsafe event, it saves sufficient state information (that is, takes a checkpoint) so that if at a later time it does receive a prior event (a straggler), it can roll back to its previous state and reprocess the events in correct timestamp order. This approach avoids blocking and results in greater execution parallelism, although saving state and rolling back increase overhead (both in time and in the memory required to store checkpoint information).
(Fujimoto, 2003; Fujimoto, 2001; Wilmarth and Kale, 2004; Wang, Turner, Yoke, Low, and Gan, 2004)
2.2 Optimization of Fine-Grained Simulations in POSE
A course-grained simulation consists of relatively large blocks of code (tasks) that can be run somewhat autonomously—that is, with a minimum of interactions with other tasks. Course-grained simulations are easy to parallelize. The tasks can be allocated to separate processors, where they can run with minimal synchronization and communication overhead. A fine-grained simulation consists of relatively small tasks that are fairly interdependent or coupled—that is, the tasks tend to interact frequently. Fine-grained simulations are difficult to run in parallel—synchronization and communication tends to be complex and costly. (Wilmarth and Kale, 2004)
According to POSE developers Wilmarth and Kale (2004), for parallel discrete event simulations, a good measure of granularity is the amount of time required to process an average event and the autonomy of the events. In a course-grained simulation an LP has relatively few, long-running events, which exchange a small number of events with other LPs. Parallel versions of these simulations achieve marked performance improvements. Many simulations, however, are inherently fine grained. In a fine-grained simulation, an LP tends to process many small events, which exchange numerous events with other LPs. Whichever of the two synchronization strategies the simulation uses, the event synchronization (time management) overhead for a fine-grained simulation can be so large that it can be difficult for a parallel simulation to achieve a shorter execution time than its sequential counterpart. (Wilmarth and Kale, 2004)
The POSE optimistic synchronization time-management system enhances fine-grained simulation performance by directly reducing event synchronization overhead. One way in which POSE reduces this overhead is by grouping events. That is, when an LP receives control, it executes an entire group of pending events, effectively increasing the grain size. Group handling of events reduces the per-event overhead. Group handling also reduces scheduling and context switching overhead. (Context switching among LPs is necessary because many LPs may run on a single processor.) And, finally, group handling allows events to run with a "warmed" memory cache. (Events processed by a single LP will tend to share data, and once this data is loaded into the memory cache, cache hits will predominate until another LP replaces this cache data with its own events' data).
Considering just these three performance advantages of group event handling, it would seem to be desirable to process as many events at once as possible. However, if a large group of events has been processed and a straggler event subsequently arrives that is earlier than all these events, the rollback overhead will be severe. Therefore, POSE limits the number of events that may be processed in a group. This limit is known as the speculative window.
A second way in which POSE reduces synchronization overhead in fine-grained simulations is to use an adaptive approach to optimistic synchronization that adjusts the size of the speculative window on a per-LP basis. If a particular LP has been receiving most of its events in order and has therefore had few rollbacks, POSE increases the size of that LP's speculative window, letting the LP process large groups of events. If, however, an LP has been receiving many straggler events and has therefore been invoking many rollbacks, POSE reduces the size of the LP's speculative window, thereby restricting the number of events the LP can process in a group. Overall, the adaptive approach permits the largest set of events to execute in a group without increasing rollback overhead.
How well does the POSE approach work? Wilmarth et al. (2005) successfully used POSE to create a very fine-grained simulation (BigNetSim) designed to predict the performance of parallel programs on large parallel machines (such as the IBM BlueGene). In the words of Wilmarth et al., "BigNetSim achieves excellent problem-size scaling, particularly for larger problem instances on greater numbers of processors."
(Wilmarth and Kale, 2004; Wilmarth et al., 2005)
3. μsik: A Micro-Kernel for Parallel/Distributed Simulation Systems
μsik (pronounced "mew-seek") is a language based on a library of C++ classes that provide a general-purpose application programming interface for creating parallel, distributed discrete event simulations. The language can be used on UNIX, Linux, and Windows systems, and the software can be downloaded from www.cc.gatech.edu/fac/kalyan/musik.htm.
μsik uses a micro-kernel approach. Traditionally, to develop a PDES language that uses a particular technique or set of techniques (for example, a particular synchronization strategy), an entire system is built from the ground up. Changing the techniques in a substantial way can require building a completely new system. In contrast, the micro-kernel approach isolates the invariant core that is common to different PDES techniques. This core is implemented as a micro-kernel, which is a generalized framework that can be used as a basis for building a variety of different PDES systems that use various techniques.
The micro-kernel provides a small set of core services: naming (providing a uniform way for LPs to identify each other), routing (forwarding events from one LP to another), and scheduling (allocating CPU cycles among LPs).
The μsik layer above the micro-kernel consists of system services, which use the core services and provide event scheduling support, communication support, timing, random number generation, and other types of support. It is straightforward to add a system service or to optimize one, without having to reconstruct the entire system. μsik is thus similar in design to an extensible operating system, where a service such as a particular file system can be added to an underlying core.
The μsik micro-kernel provides only a very basic, flexible event scheduling algorithm that does not dictate a particular event scheduling strategy. The precise way that events are scheduled is controlled by parameters that are set at the system service level. A system service can provide classical conservative event synchronization, classical optimistic event synchronization, or an entirely different event synchronization strategy (perhaps an innovative or experimental strategy tuned to a specific type of simulation). As a general optimization tactic, if the system services provide both conservative and optimistic scheduling, the micro-kernel will invoke optimistic event processing only when all conservative processing is currently blocking, thus maximizing the time spent executing events in proper order and minimizing rollbacks.
In most ways, however, the micro-kernel leaves the scheduling details to the system services. μsik is thus ideal for writing applications for challenging fine-grained simulations, where a custom synchronization strategy might be necessary to achieve high performance. The particular system service currently providing the synchronization strategy can even be changed during a simulation run to adapt to fluctuating conditions.
How successful is μsik? The developer, Perumalla (2005), has used it effectively for creating fine-grained physics simulations that are run on large cluster systems. In the developer's own words, "It is being successfully used in non-trivial applications with both conservative as well as optimistic modes."
4. Aurora: A Parallel and Distributed Simulation System
The Aurora Parallel and Distributed Simulation System is a discrete event simulation language designed for developing large-scale, web-based, parallel discrete event simulations. According to the developers, Park and Fujimoto (2006), a traditional PDES system is typically built as a monolithic application that runs on a specific parallel computer or grid, or on a federation of such computers. The resulting simulations tend to be highly structured, and thus intolerant of "dynamic" client behavior and node failures.
In contrast, Aurora uses a server/client model in which the server (also known as the master) distributes work to clients (also known as workers) as they request it. The interaction between the server and the clients uses open web service standards and conforms to the grid model that has been used in recent years for such general-purpose distributed computing projects as SETI@home. Because web services employ open protocols and universally readable data files, and because a client needs to follow a specific protocol only when contacting the server to download data or to return results, client computer languages, operating systems, and hardware architectures can be quite diverse.
An Aurora server can easily handle a client failure by simply reissuing the task to another client. Clients can readily be added or removed during a simulation run as their availability changes. And an Aurora simulation can be run as a background process on a client. Thus, in contrast to a more conventional PDES system, the server places few demands on the client computers and consequently has little reliance on dedicated or reserved computing resources.
Because of the large number of machines potentially available on a web grid, Aurora focuses on achieving high throughput by maximizing the number of machines that it can use, rather than by optimizing performance on a specific dedicated parallel computer system.
(Park & Fujimoto, 2006; Valchanov, Ruskov, and Ruskova, 2004)
4.1 Aurora Server and Client Architecture
In Aurora, the logical processes (LPs) are grouped into independent work units. An Aurora work unit is the atomic unit that is transferred between server and client. The Aurora server provides a communications interface, LP management, and time management. The current version of Aurora provides only a traditional conservative synchronization mechanism.
An Aurora client also provides a communications interface. A client repeatedly requests and downloads one work unit at a time from the server using a web service request, runs the specified simulation code conforming to the time management scheme, and then loads the results back to the server.
Park and Fujimoto (2006) report good performance for the Aurora language for large-grained simulations when Aurora is used with gSOAP, an open source web services toolkit. However, the technology for master/worker PDES systems is yet in its infancy, and Aurora is still largely in the prototype stage. Its developers plan to add many enhancements, such as optimistic synchronization and more robust server and client fault tolerance.
(Park & Fujimoto, 2006; Valchanov, Ruskov, and Ruskova, 2004)
All three of the discrete event simulation languages investigated in this paper are designed for creating high performance, parallel discrete event simulations. The languages differ primarily in how distributed the resulting simulations are, how the languages implement time management, and how well the languages work for developing fine-grained simulations.
The language most suited for creating highly distributed simulations is clearly Aurora. Aurora is designed to achieve high performance by using large numbers of independent, heterogeneous client computers connected by an Internet grid. It can create strongly fault tolerant, flexible, and extensible distributed systems. In contrast, POSE and μsik, are designed primarily for creating simulation systems that run on more tightly coupled systems, such as parallel processing computers and clusters.
Regarding time management, Aurora achieves its performance through the sheer number of computers that it can reliably connect, rather than by optimizing its synchronization algorithms (which are critical primarily for fine-grained simulations run on coupled systems). Accordingly, the current version of Aurora provides only a traditional conservative synchronization mechanism. POSE provides only an optimistic synchronization algorithm, although it is a highly optimized one, which achieves high performance by grouping events and by adapting the algorithm to the needs of each logical process. μsik provides the most flexible time management system, allowing the developer to add virtually any type of conservative, optimistic, or innovative synchronization service; and even permitting the synchronization service to be changed during a simulation run to meet the fluctuating needs of the logical processes.
Regarding the ability to create fine-grained simulations, Aurora is simply not appropriate for this type of simulation. A typical fine-grained simulation with its extensive logical process interdependence and communication needs would not be suitable for the highly distributed environment of an Aurora simulation. In sharp contrast, both POSE and μsik have been designed specifically for creating high performance fine-grained simulations. POSE achieves its fine-grained performance by providing a highly optimized optimistic synchronization algorithm. At least in theory, a μsik simulation could achieve even higher fine-grained performance since it could be programmed to use an algorithm similar to POSE's, or to switch if appropriate to an algorithm that is even more suitable for the current simulation or simulation phase.
Thus, the best choice among these three languages depends upon the type of parallel simulation that is being created. For a large-grained simulation that can be divided into fairly sizeable, semi-autonomous tasks, which can be parceled out to independent clients, Aurora would clearly be the language of choice. For a typical fine-grained simulation that can benefit from a highly optimized, traditional, optimistic synchronization algorithm, POSE might be the best language and it could probably be used with little customization. For a "problem" fine-grained simulation—that is, one that requires a finely tuned, customized, or variable synchronization algorithm—μsik might be the best choice, even though it could require developing new extensions to the micro-kernel.
6. Reference List
Fujimoto, R.M. (2001). Advanced tutorials: Parallel simulation: parallel and distributed simulation systems. In B.A. Peters, J. S. Smith, D.J. Medeiros, and M.W. Rohrer (Eds.), Proceedings of the 2001 Winter Simulation Conference. (pp. 147-157). Winter Simulation Conference.
Fujimoto, R.M. (2003). Advanced tutorials: Parallel simulation: distributed simulation systems. In S. Chick, P. J. Sánchez, D. Ferrin, and D. J. Morrice (Eds.), Proceedings of the 2003 Winter Simulation Conference (pp. 124-134). Winter Simulation Conference.
Park, A. & Fujimoto, R.M. (2006). Aurora: an approach to high throughput parallel simulation. In Proceedings of the 20th Workshop on Principles of Advanced and Distributed Simulation. Los Alamitos, CA: IEEE Computer Society.
Perumalla, K.S. (2005). μsik—a micro-kernel for parallel/distributed simulation systems. In Proceedings of the 19th Workshop on Principles of Advanced and Distributed Simulation. Los Alamitos, CA: IEEE Computer Society.
Valchanov, H., Ruskov, I., & Ruskova, N. (2004). Application aspects of computer systems and technologies: A communication kernel of a cluster system for distributed simulation. In Proceedings of the 5th International Conference on Computer Systems and Technologies (pp. 1-5). New York: ACM Press.
Wang, X., Turner, S.J., Yoke, M., Low, H., Gan, B.P. (2004). Applications: Optimistic synchronization in HLA based distributed simulation. In Proceedings of the 18th Workshop on Parallel and Distributed Simulation (pp. 123-130). New York: ACM Press.
Wilmarth, T.L. & Kale, L.V. (2004). POSE: Getting Over Grainsize in Parallel Discrete Event Simulation. In Proceedings of the 2004 International Conference on Parallel Processing. Los Alamitos, CA: IEEE Computer Society.
Wilmarth, T.L., Zheng, G., Bohm, E.J., Mehta, Y., Choudhury, N., Jagadishprasad, P., Kale, L.V. (2005). Performance prediction using simulation of large-scale interconnection networks in POSE. In Proceedings of the 19th Workshop on Principles of Advanced and Distributed Simulation. Los Alamitos, CA: IEEE Computer Society.