Recent Developments in Mutual Exclusion for Multiprocessor Systems
1. Introduction
Multiprocessor systems are becoming increasingly prevalent, and yet providing mutual exclusion for the potentially large number of cooperating processes found on today's multiprocessors can be problematic. Mutual exclusion on multiprocessors can reduce overall computational efficiency due to contention for shared lock variables and other problems. It can also increase power consumption, which is becoming an increasingly important consideration.
The purpose of this paper is to examine how these multiprocessor mutual-exclusion difficulties can be overcome by using methods that have been proposed in the research literature. The paper focuses on recent developments, but also discusses some of the seminal work that has led up to more current concepts. The paper concludes with a brief investigation of the relatively new idea of eliminating mutual exclusion altogether in favor of various forms of lock-free synchronization.
2. Computational Efficiency of Mutual Exclusion on Multiprocessor Systems
If multiple processes or threads share data or another resource, the traditional way to prevent race conditions that can result in inconsistent data is to enforce mutual exclusion within the critical sections of code that manipulate the shared data or resource. User-level programs typically establish mutual exclusion simply by using synchronization facilities, such as semaphores, that are provided by the operating system kernel or thread library. Therefore, the task of actually implementing mutual exclusion normally falls on the kernel.
In a uniprocessor system, the kernel might be able to create mutual exclusion by temporarily turning off the preemption of processes or by momentarily disabling interrupts, so that only the current process can continue to run. On a multiprocessor system, however, even if process preemption or interrupts are disabled on every processor, multiple concurrent processes could still run and mutual exclusion would not be achieved. On multiprocessor systems, therefore, enforcing mutual exclusion normally requires using some type of lock. To enter a critical section, a process must acquire the corresponding lock, and no other process can acquire the lock and enter its critical section until the original process exits the section and releases the lock. (One type of multiprocessor that might not require using a lock to create mutual exclusion in kernel code is an asymmetric, master-slave multiprocessor system, where a single master processor runs all kernel code and can therefore use standard uniprocessor methods for creating mutual exclusion [Goble and Marsh, 1981].)
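As a concrete illustration of the acquire-and-release pattern described above, a minimal test-and-set spin lock can be sketched with C11 atomics. This is a generic textbook-style sketch, not code from any of the cited papers; the function and type names are our own.

```c
#include <stdatomic.h>
#include <pthread.h>

/* Minimal test-and-set spin lock: a process acquires the lock before its
 * critical section and releases it afterward, so at most one process is
 * inside the section at any time. */
typedef struct { atomic_flag locked; } spinlock_t;

static void spin_lock(spinlock_t *l) {
    while (atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire))
        ;  /* busy-wait until the current holder releases the lock */
}

static void spin_unlock(spinlock_t *l) {
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}

/* Demo: two threads increment a shared counter inside the critical section. */
static spinlock_t lock = { ATOMIC_FLAG_INIT };
static long counter = 0;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        spin_lock(&lock);
        counter++;            /* critical section */
        spin_unlock(&lock);
    }
    return NULL;
}

long run_demo(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, worker, NULL);
    pthread_create(&b, NULL, worker, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return counter;           /* 200000 if mutual exclusion held */
}
```

Without the lock, the two unsynchronized increments would race and the final count would typically fall short of 200000.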
Using locks on a multiprocessor, however, can be problematic. If the process holding the lock is preempted, the processes that are waiting for the lock can be unduly delayed. Also, multiple processors accessing a single shared lock variable can cause significant memory contention as well as traffic over the shared bus or interconnection network, which can degrade performance and limit the scalability of the system. And finally, most locks do not give precedence to high priority tasks and can thus impede real-time processes.
The following sections discuss various solutions to these problems, which have emerged from recent research. Most of the approaches apply primarily to spin locks (as opposed to locks in which waiting processes sleep). Although spin locks tend to waste processor cycles, they can be quite useful and are commonly used on multiprocessors for several reasons. First, for a very short critical section, the waiting time for a lock might be less than the time required to put a process to sleep and add it to a waiting list, and therefore a low-overhead spin lock can be more efficient than a sleeping lock. To garner this advantage, the critical section must be confined to the kernel, where its shortness can be guaranteed. Second, on a multiprocessor system, while a process spins on one processor, other processes can continue to accomplish useful work on other processors, including the work by the lock-holding process of completing the critical section and releasing the lock! (In contrast, on a uniprocessor, while one process spins, no other process can accomplish work or release the lock.) Of course, on a multiprocessor other processes on the same processor as the spinning process can be held up, so it is still important to minimize the length of the critical section. (Johnson and Harathi, 1997)
2.1 Preemption-Safe Locking
To avoid delaying processes waiting for a mutual-exclusion lock, the critical section should of course be kept as short as possible. Holman and Anderson (2002) point out that if the task holding the lock is preempted, the preemption can effectively lengthen the critical section by a significant amount of time. This effect is especially severe under certain scheduling algorithms (such as Pfair) if the lock-holding preempted process has a relatively low priority and is therefore likely to be preempted for a long time. The lengthening of the critical section tends to be particularly harmful for spin locks, which can waste processor cycles. An obvious solution to the preemption problem is to suppress the preemption of any process that holds a lock—that is, to provide a preemption-safe lock.
Holman and Anderson (2002) propose a lock freezing approach to providing preemption-safe locks for short critical sections. They define a short critical section as one that can easily be executed within a single scheduling time quantum, and claim that most critical sections that manipulate common data objects such as queues or linked lists are in fact quite short relative to the length of the scheduling quantum. Therefore, to prevent the preemption of a process while it executes a short critical section, we need only ensure that the process begins executing the critical section sufficiently early in its time slice. To achieve this assurance, the lock is frozen, so that it cannot be acquired during the last portion of the time slice. The frozen portion of the time slice must be at least as long as the critical section. (The start of the frozen period is signaled by the triggering of a timer.) Although freezing the lock does delay processes from entering the critical section, the delays occur only during the tail-end of each time slice and are much less significant than the delay that would occur if a process acquired the lock so late in its time slice that it is unable to finish executing the critical section and to release the lock before the time slice expires and the process is preempted.
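Under these assumptions (a known quantum length and a known worst-case critical-section length), the freezing test itself reduces to a simple comparison. The sketch below is our own illustration of that test; the type and field names are assumptions, not Holman and Anderson's implementation.

```c
#include <stdbool.h>

/* Illustrative lock-freezing policy: a lock request is granted only if it
 * arrives early enough in the time slice that the critical section can
 * finish before the slice expires. All names here are hypothetical. */
typedef struct {
    long quantum_us;  /* length of the scheduling quantum, in microseconds */
    long freeze_us;   /* frozen tail, at least the worst-case critical-section length */
} freeze_policy;

/* Returns true if a request arriving 'elapsed_us' microseconds into the
 * current time slice may acquire the lock; during the frozen tail of the
 * slice the request must wait for the next quantum. */
bool may_acquire(const freeze_policy *p, long elapsed_us) {
    return elapsed_us < p->quantum_us - p->freeze_us;
}
```

For example, with a 10 ms quantum and a 0.5 ms frozen tail, requests arriving in the final 0.5 ms of the slice are deferred, which is the brief delay the scheme trades for never preempting a lock holder.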
An important feature of Holman and Anderson's (2002) lock freezing approach for providing preemption-safe locks is that the procedure is transparent to processes. A process either gets the lock or—if the lock is frozen or owned by another process—waits for it. A process does not need to incur the overhead and complexity of explicitly notifying the kernel not to preempt it. In an earlier paper, Rhee and Lee (1996) proposed an alternative scheme for avoiding the need for processes to request non-preemption in multiprocessor systems. Their approach is a recovery-based protocol for spin locks that allows preemptions to occur freely, but in the relatively rare case that a lock-holding process is actually preempted, "recovers" the process from preemption. The kernel accomplishes the recovery by immediately rescheduling the preempted process.
Other studies that predate the work of Holman and Anderson (2002) include the experiments of Michael and Scott (1997), which confirm the efficiency of preemption-safe locks on time-sliced multiprocessor systems. Their tests show that when used to protect specific data structures—namely, stacks, FIFO queues, priority queues, and counters—preemption-safe locks significantly outperform conventional (non-preemption-safe) locks. Interestingly, their study also shows that for protecting these specific data structures, lock-free synchronization is even more efficient than the use of preemption-safe locks. (Lock-free synchronization is discussed later in the paper, in "Lock-Free Alternatives To Mutual Exclusion On Multiprocessor Systems.")
2.2 Controlling Lock Variable Contention
The use of spin locks on shared-memory multiprocessor systems can cause a system-wide performance degradation due to memory contention for the single shared lock variable (known as a hot spot) and the concomitant heavy traffic over the shared bus or interconnection network. Entering a critical section through a spin lock normally occurs in two phases: the waiting phase and the lock acquisition (arbitration) phase. All processes that are in the waiting phase repeatedly test the value of—that is, spin on—the lock variable. On a cache-coherent system, because every processor has its own local cache, each waiting process actually spins on a local cached copy of the lock variable. During the waiting phase, the cache protocol thus prevents memory contention and traffic over the bus or network. However, when the lock-holding process releases the lock by resetting the value of the shared lock variable in memory, all waiting processes detect the reset value and simultaneously enter the lock acquisition phase, in which each process uses an atomic read-modify-write instruction (such as test-and-set) to attempt to acquire the lock. This simultaneous attempt by multiple processes to modify the value of the single shared lock variable causes the cache coherence protocol to generate a high level of bus or network traffic. The effect of caching during the lock acquisition phase is thus the opposite of its effect during the waiting phase, and caching during lock acquisition can restrict application scalability—that is, limit the number of processes that can be run in parallel (Nikolopoulos and Papatheodorou, 2000).
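The two phases are easy to see in the classic test-and-test-and-set lock, sketched here with C11 atomics as a generic illustration (the names are ours):

```c
#include <stdatomic.h>

/* Test-and-test-and-set lock: the waiting phase uses plain loads, which
 * are satisfied from the local cached copy of the lock variable; only the
 * acquisition phase issues an atomic read-modify-write, which triggers
 * coherence traffic. */
typedef struct { atomic_int held; } ttas_lock;

void ttas_acquire(ttas_lock *l) {
    for (;;) {
        /* waiting phase: spin on the cached copy, no bus/network traffic */
        while (atomic_load_explicit(&l->held, memory_order_relaxed))
            ;
        /* acquisition phase: atomic exchange arbitrates among the waiters */
        if (!atomic_exchange_explicit(&l->held, 1, memory_order_acquire))
            return;  /* exchange returned 0: this process won the lock */
    }
}

void ttas_release(ttas_lock *l) {
    atomic_store_explicit(&l->held, 0, memory_order_release);
}
```

The burst of coherence traffic described above occurs precisely when the release store makes every waiter's cached copy invalid and all of them attempt the exchange at once.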
Nikolopoulos and Papatheodorou (2000) propose the use of a hybrid primitive to reduce memory contention and interconnection network traffic problems in distributed shared-memory multiprocessors with directory-based cache coherence. During the waiting phase (and also during the final lock release phase), the hybrid primitive uses a normal cached lock variable, which—as explained above—minimizes memory contention and network traffic. However, during the lock acquisition phase, when the cache coherence protocol would normally generate a spate of network messages, a hybrid primitive uses hardware support to atomically modify a separate uncached lock variable, which bypasses the cache coherence protocol and requires sending only a single round-trip network message. The term hybrid refers to the use of both cached (software managed) and uncached (hardware managed) lock variables and memory instructions.
An earlier approach for minimizing the memory contention and network traffic that results on a distributed shared-memory multiprocessor when multiple processors access a single shared lock variable is to use a set of lock variables. In a scheme presented in an often-cited paper by Mellor-Crummey and Scott (1991), all processes initially test a single shared lock variable in a remote memory module. Although this testing causes initial contention and network traffic and the shared lock is considered a hot spot, every process that finds the lock set, and must therefore wait, then appends itself to a linked list of waiting processes (maintained in FIFO order) and spins on a separate lock variable in its local memory, avoiding further contention and network traffic during the waiting period. Assuming that other processes are waiting when the lock-holding process exits the critical section, the exiting process resets the value of only the private, contention-free lock variable belonging to the process at the head of the linked waiting list, effectively waking up that process and passing it the lock.
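The local-spinning idea can be sketched as a simplified rendering of the MCS list-based queue lock. This is our own condensed sketch using C11 atomics, with memory orderings left at the conservative defaults; it follows the published algorithm's structure but is not the authors' code.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* MCS-style queue lock: each waiter spins on a flag in its OWN node, so
 * after the single enqueue operation there is no further traffic on the
 * shared lock word while waiting. */
typedef struct mcs_node {
    _Atomic(struct mcs_node *) next;
    atomic_bool locked;            /* true while this waiter must keep spinning */
} mcs_node;

typedef struct { _Atomic(mcs_node *) tail; } mcs_lock;

void mcs_acquire(mcs_lock *l, mcs_node *me) {
    atomic_store(&me->next, NULL);
    atomic_store(&me->locked, true);
    /* the only operation on shared state: atomically append self to the queue */
    mcs_node *pred = atomic_exchange(&l->tail, me);
    if (pred) {                               /* lock held: link in and spin locally */
        atomic_store(&pred->next, me);
        while (atomic_load(&me->locked))      /* local spinning, contention-free */
            ;
    }
}

void mcs_release(mcs_lock *l, mcs_node *me) {
    mcs_node *succ = atomic_load(&me->next);
    if (!succ) {
        mcs_node *expected = me;
        /* no known successor: try to swing the tail back to empty */
        if (atomic_compare_exchange_strong(&l->tail, &expected, NULL))
            return;
        while (!(succ = atomic_load(&me->next)))  /* successor is mid-enqueue */
            ;
    }
    atomic_store(&succ->locked, false);       /* pass the lock to the next waiter */
}
```

The release operation writes only to the successor's private flag, which is exactly the "waking up" by resetting a contention-free local variable described above.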
Fu and Tzeng (1997) refined the Mellor-Crummey and Scott approach for a distributed shared-memory multiprocessor by providing an entire tree-structure of shared locks (each lock in a different remote memory module) rather than a single shared lock, so that when multiple processes initially test the lock, the lock accesses are spread out over a group of shared locks rather than being concentrated on a single lock variable, thereby eliminating the hot spot and significantly reducing initial lock-testing contention and the resulting network traffic. Although the details of their lock tree are beyond the scope of this paper, it works somewhat like a tournament, where the winner of the lock at one tree level is eligible to compete for the lock at the next higher level, and only the process that acquires the lock at the root of the tree can enter the critical section. The lock tree also serves as the list of processes that are waiting to acquire the lock, which eliminates the need for a process to explicitly add itself to a linked waiting list (an activity that causes further network traffic). As with the Mellor-Crummey and Scott algorithm, a process that fails to acquire the lock then waits by spinning on a contention-free private lock in its local memory. The Fu and Tzeng algorithm thus minimizes memory contention and network traffic during the entire waiting phase, and simulations indicate that their approach does in fact achieve significantly better performance than the Mellor-Crummey and Scott algorithm and other earlier methods.
2.3 Using Prioritized Multiprocessor Spin Locks
Another potential problem with spin locks used to create mutual exclusion on multiprocessors is that a spin lock is often not fair. That is, it typically does not permit a high-priority process to precede a low-priority process into the critical section, and might even cause a high-priority process to perform work, such as releasing and passing the lock, on behalf of a low-priority process (thereby effectively speeding up the low-priority process at the expense of the high-priority process). As an example, the Mellor-Crummey and Scott algorithm discussed in the previous section grants lock requests in simple FIFO order rather than according to process priority.
To deal with this problem, Johnson and Harathi (1997) propose a prioritized multiprocessor spin lock that not only minimizes contention like the algorithms described previously, but also guarantees that the highest priority waiting process always receives the lock. To achieve this guarantee, the list of waiting processes is maintained in process priority order (except possibly for the record at the head of the queue, which stores the current lock holder). The priority order of the queue is maintained by the acquire-lock operation (when the acquiring process is blocked anyway), and the algorithm assumes that the relative priorities of waiting processes do not change. A prioritized lock, such as the one proposed by Johnson and Harathi, is especially important for systems that perform critical tasks and for real-time multiprocessor kernels.
3. Energy Efficiency of Mutual Exclusion on Multiprocessor Systems
In recent years, making computer systems more energy efficient has become extremely important, not only for conserving battery life in mobile devices but also for alleviating heat dissipation problems and energy consumption in large computing centers. Most of the initial research focused on the energy performance of uniprocessor systems or of individual processors in multiprocessor systems. However, a recent trend in synchronization research has been to examine the energy implications of multiprocessor synchronization methods in an attempt to reduce the overall energy consumption of cooperating processes. (Li, Martinez, and Huang, 2004)
In an ideal multiprocessor situation where each processor runs a single process, what a process does while it waits for a lock (for example, whether it spins or sleeps) does not affect the processes that are still executing code and therefore does not influence the overall computational efficiency of the system. However, a spinning process (or a sleeping process if the scheduler spins while the process sleeps) does waste power. Accordingly, Li et al. (2004) propose a synchronization system in which a waiting process switches its processor to a low-power sleep state until the process is ready to resume running. The primary challenge of such a system is that it must be able to estimate how long a process will need to wait (to enter a barrier or acquire a spin lock, for example), so that its processor can be woken up in time to avoid an execution delay. The estimated waiting time will also influence the choice of the particular processor sleep state that is activated. According to the Advanced Configuration and Power Interface (ACPI) guidelines, a processor may have several low-power sleep states, which vary in the degree of power saving. A lower-power sleep state will normally have a higher transition time for switching the state on and off. A particular sleep state is suitable only if the process wait time is longer than the total state transition time by a sufficient margin. (For a short process wait, there might not be a suitable sleep state, so the processor should be left in its normal mode.) The work of Li et al. applies specifically to synchronization barriers and they call their proposed system a thrifty barrier. However, they plan to extend their work to include mutual-exclusion locks, where the same principles apply.
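The state-selection step described above can be sketched as a simple comparison between the predicted wait and each state's transition cost. The state table, field names, and margin below are illustrative assumptions of ours, not values from Li et al. (2004) or from the ACPI specification.

```c
/* Hypothetical sleep-state descriptor; the fields mirror the trade-off in
 * the text: deeper states save more power but cost more to enter and exit. */
typedef struct {
    const char *name;
    long save_per_us;     /* power saved per microsecond (illustrative units) */
    long transition_us;   /* round-trip cost of entering and leaving the state */
} sleep_state;

/* Pick the deepest state whose transition cost fits within the predicted
 * wait time by a safety margin; return -1 to stay in the normal running
 * state when no state qualifies (i.e., the wait is too short). States are
 * assumed ordered from shallowest to deepest. */
int pick_state(const sleep_state *states, int n,
               long predicted_wait_us, long margin_us) {
    int best = -1;
    for (int i = 0; i < n; i++)
        if (states[i].transition_us + margin_us <= predicted_wait_us)
            best = i;
    return best;
}
```

With this policy, a short predicted wait selects no sleep state at all, a moderate wait selects a shallow state, and only a long wait justifies the deepest state, matching the suitability condition stated above.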
In a more recent study, Moreshet, Bahar, and Herlihy (2006) discovered that another way to reduce power consumption due to synchronization on multiprocessors is to synchronize data-sharing processes by using transactional memory rather than locks and mutual exclusion. Transactional memory is briefly discussed in the next section. Moreshet et al. discovered that as long as the transactions (the sections of code that manipulate shared data) are reasonable in size, the use of transactional memory can result in lower power consumption than the use of locks. The computing power saved by eliminating waiting for locks is apparently greater than the extra power that is consumed by an occasional transaction rollback when a conflict occurs. The proportion of energy consumed by synchronization, and therefore the power savings that can be realized by using transactional memory synchronization, are highly application dependent.
4. Lock-Free Alternatives to Mutual Exclusion on Multiprocessor Systems
The previous sections have discussed some of the important problems inherent in using lock-based mutual exclusion to protect shared data or other resources on multiprocessor systems. One general approach to solving these problems, which has been prevalent in research since around the year 2000, is to abandon mutual exclusion and locks altogether. This alternative approach is known as lock-free or non-blocking synchronization, and it allows multiple processes to enter a critical section simultaneously. According to Holman and Anderson (2002), lock-free synchronization works best for protecting relatively simple data objects such as queues and buffers, while lock-based synchronization remains the method of choice for synchronizing access to complex data objects or external devices. A study by Tsigas and Zhang (2001) indicates that for a wide range of applications, non-blocking synchronization performs as well as or better than lock-based synchronization, and that for certain applications (such as the Radiosity benchmark) it performs significantly better. The essential problem, however, is how to synchronize access to shared data structures in order to prevent race conditions and protect the integrity of the data, without mutual exclusion. Several approaches have been suggested.
A basic method for implementing lock-free synchronization is the use of a retry loop. When a process reaches a critical section, it simply enters the section and begins accessing the shared resource optimistically, without acquiring a lock. After it has completed the critical section, it tests the shared data for evidence of interference from another process. If interference is found, the critical section code has failed and is executed again; otherwise, the code is successful and the process exits the critical section. No kernel support is required, but a challenge on real-time systems is to ensure that the number of retries has a reasonable bound. (Devi, Leontyev, and Anderson, 2006)
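A minimal instance of such a retry loop can be sketched with a C11 compare-and-swap; the function name is ours, and a shared counter stands in for the shared data structure.

```c
#include <stdatomic.h>

/* Retry-loop pattern: read the shared value, compute the update
 * optimistically, then attempt to publish it with compare-and-swap. A
 * failed CAS means another process interfered, so the "critical section"
 * is simply re-executed with the freshly observed value. */
long lockfree_add(atomic_long *shared, long delta) {
    long old = atomic_load(shared);
    while (!atomic_compare_exchange_weak(shared, &old, old + delta))
        ;  /* on failure, CAS reloads 'old' with the current value; retry */
    return old + delta;
}
```

No lock is ever held, so a preempted process can never delay others; the cost is that under heavy interference a process may retry many times, which is why bounding the number of retries matters on real-time systems.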
One variety of the retry loop method is the use of transactional memory, which is similar to the technique that databases employ for serializing database transactions. With this method, a thread announces the start of an atomic transaction, attempts a sequence of operations on the shared data (the transaction), and then either commits the transaction so that the operations take effect, or discards the transaction and tries again, depending on whether it detects interference from another process. Researchers have proposed both hardware and software approaches to implementing transactional memory. (Moreshet et al., 2006; Herlihy, Luchangco, Moir, and Scherer, 2003)
5. Conclusion
How then can recent innovations be used to solve the problems inherent in enforcing mutual exclusion on multiprocessor systems?
The excessive process delays that can result when a process that holds a lock is preempted can be controlled either by making the lock unavailable to a process (that is, by freezing the lock) during the last phase of the process time slice (when the process would not have time to complete its critical section before preemption), or by allowing a lock-holding process to be preempted but then immediately rescheduling it (that is, recovering the process). The memory contention and network traffic that results from multiple processes accessing a shared lock variable can be reduced either by using a hybrid primitive that employs both a cached and an uncached lock variable to minimize network traffic, or by using an entire set of lock variables to disperse the contention. And the problem of honoring the priorities of processes waiting for a lock (that is, maintaining fairness) can be solved by storing waiting processes in a list that is sorted in priority order.
Also, the computing power that is consumed during synchronization activities can be reduced either by switching a waiting processor to a low-power sleep state rather than allowing processes to waste energy while waiting, or by using the transactional-memory synchronization method rather than having processes wait for a lock.
Although contemporary research has provided solutions to many of the problems in mutual exclusion, the recent popularity of lock-free alternatives is a testimony to the persistence of at least some of the lock-based mutual-exclusion challenges.
6. Reference List
Devi, U.C., Leontyev, H., & Anderson, J.H. (2006). Efficient synchronization under global EDF scheduling on multiprocessors. In Proceedings of the 18th Euromicro Conference on Real-Time Systems (pp. 75-84). Washington, DC: IEEE Computer Society.
Fu, S.S. & Tzeng, N.F. (1997). A circular list-based mutual exclusion scheme for large shared-memory multiprocessors. IEEE Transactions on Parallel and Distributed Systems, 8(6), 628-639.
Goble, G.H. & Marsh, M.H. (1981). A dual processor VAX 11/780 (Technical Report TR-EE 81-31). West Lafayette, IN: Purdue University, School of Electrical Engineering.
Herlihy, M., Luchangco, V., Moir, M., & Scherer, W.N. (2003). Software transactional memory for dynamic-sized data structures. In Proceedings of the Twenty-Second Annual ACM Symposium on Principles of Distributed Computing (PODC) (pp. 92-101). New York: ACM Press.
Holman, R. & Anderson, J.H. (2002). Locking in Pfair-scheduled multiprocessor systems. In Proceedings of the 23rd IEEE Real-Time Systems Symposium (RTSS'02). Washington, DC: IEEE Computer Society.
Huang, T.L. & Shann, C.H. (1998). A comment on "A circular list-based mutual exclusion scheme for large shared-memory multiprocessors." IEEE Transactions on Parallel and Distributed Systems, 9(4), 414-416.
Johnson, T. & Harathi, K. (1997). A prioritized multiprocessor spin lock. IEEE Transactions on Parallel and Distributed Systems, 8(9), 926-933.
Li, J., Martinez, J.F., & Huang, M.C. (2004). The Thrifty Barrier: energy-aware synchronization in shared-memory multiprocessors. In Proceedings of the 10th International Symposium on High Performance Computer Architecture (pp. 14-23). Washington, DC: IEEE Computer Society.
Mellor-Crummey, J.M. & Scott, M.L. (1991). Algorithms for scalable synchronization on shared-memory multiprocessors. ACM Transactions on Computer Systems, 9(1), 21-65.
Michael, M.M. & Scott, M.L. (1997). Relative performance of preemption-safe locking and non-blocking synchronization on multiprogrammed shared memory multiprocessors. In Proceedings of the 11th International Symposium on Parallel Processing (pp. 267-273). Washington, DC: IEEE Computer Society.
Moreshet, T., Bahar, R.I., & Herlihy, M. (2006). Energy implications of multiprocessor synchronization. In Proceedings of the Eighteenth Annual ACM Symposium on Parallelism in Algorithms and Architecture (p. 329). New York: ACM Press.
Nikolopoulos, D.S. & Papatheodorou, T.S. (2000). Fast synchronization on scalable cache-coherent multiprocessors using hybrid primitives. In Proceedings of the 14th International Symposium on Parallel and Distributed Processing (pp. 711-719). Washington, DC: IEEE Computer Society.
Rhee, I. & Lee, C.Y. (1996). An efficient recovery-based spin lock protocol for preemptive shared-memory multiprocessors. In Proceedings of the Fifteenth Annual ACM Symposium on Principles of Distributed Computing (pp. 77-86). New York: ACM Press.
Tsigas, P. & Zhang, Y. (2001). Evaluating the performance of non-blocking synchronization on shared-memory multiprocessors. In Proceedings of the 2001 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems (pp. 320-321). New York: ACM Press.
Notes
1. This approach could be used to protect critical sections of kernel code, including critical sections within the kernel implementation of semaphore functions called by user-level programs. The kernel would certainly not allow preemption or interrupts to be disabled during the execution of an entire user-level critical section.
2. In a follow-up study, Huang and Shann (1998) discovered that the Fu and Tzeng algorithm is subject to a race condition that can lead to deadlock in rare instances. Huang and Shann provide a modified version of the Fu and Tzeng algorithm, but confirm Fu and Tzeng's (1997) original performance measurements.