CSE548: Project Proposal

Contrasting Hardware and Software Solutions to False Sharing on
Bus-Based Shared Memory Multiprocessors


Stefan Berg, Matthai Philipose, Frédéric Pighin and Ralf Schimkat


Department of Computer Science
University of Washington
Seattle, WA 98105 USA

{sgberg,matthai,pighin,schimkat}@cs.washington.edu

The issue under study

Contemporary architectures adopt a hierarchial view of memory to fulfil the abstraction of infinite memory: main memory is supplemented by smaller, faster caches. A primary concern then becomes to maximize the number of cache hits: as much of the working set of the program as possible should reside in the cache. In the case of shared memory multiprocessors, each processor has a local cache, leading to the additional requirement that the local caches should be coherent: two physical cache blocks having the same mapping onto virtual memory should not have contradicting contents.

Maximizing cache hits while maintaining coherency has been identified as a central problem in memory hierarchy design for shared-memory multiprocessors. Where cache blocks are more than one word in size, and blocks are the units in which cache space is manipulated (i.e. invalidated or read in from memory), false sharing may account for a significant fraction of unnecessary cache misses if naive cache coherency protocols are used. We compare three approaches to avoid the effects of false sharing:

The first scheme is a partial block invalidation scheme, with hardware support for maintaining invalidation information on a per-word basis. The second scheme does not require the extra one-bit-per-word cache hardware support, but still represents savings over a naive invalidation scheduling scheme because it defers non-local invalidation to the last possible moment allowed by the release-consistency model. The third scheme takes the viewpoint that the majority of false-sharing can be eliminated by static compile-time analysis followed by code transformation, and thus obviates the need for hardware support.

We plan to study under what conditions (i.e. for what kinds of programs) each of these approaches does the best, and then identify some optimal approach based on the two extreme approaches (i.e. one software-based and the other hardware-based).

Some questions are: How effective is the code-transformation approach in general? Specifically, is there a class of applications which does not transform well? How much would these applications benefit from the hardware support demanded by Dubois' methods? Does code transformations combined with hardware support buy significant added performance? Do any of these schemes solve false-sharing to the extent that significantly larger cache-block sizes are allowed?

Methodology of study

Our work will be based on a set of simulations.

We will simulate the execution of a workload (benchmarks) on a bus based shared memory multiprocessor. Our model will be a composed of a set of MIPS processors, each with a single level cache connected by a single bus. Since we believe cache size is an issue, we won't make the assumption of infinite cache size as is usually the case. Two versions of the benchmarks (unoptimized or optimized by Tor's source code transformations) will be executed while simulating the three hardware cache coherency protocols (WBWI, delayed consistency, and vanilla) in a row. As of now, we expect the vanilla protocol to be the On the Fly (OTF) protocol. These sets of simulations should allow us to:

and this under various assumption of the cache size, block size, memory latency and program memory access patterns.

Our comparisons will be based on the following metrics:

To carry out these simulations, we need several tools:

Our hypothesis

We conjecture that a clever hardware protocol should outperform a software solution. By definition, the software optimization is restricted to a static analysis of the program. We would expect a static analysis to be overly conservative, since the exact pattern of sharing cannot be analyzed at compile time. A concrete example would be the quick sort algorithm which splits its input array on various boundaries depending on its input. It seems that the only solution to avoid all possible cases of false sharing statically in this particular example would be to devote a cache line to each array element, decreasing spatial locality and wasting cache space. These two unintended side-effects result in more cache misses.

The software optimization eliminates false sharing partly by changing the placement of data in the cache and by padding memory cells to the boundaries of the cache blocks. This can result in decreased spatial locality, which in turn would decrease the hit ratio of a cache. Also, padding wastes valuable cache space, resulting in a smaller usable cache that can yield more replacement cache misses.

We conjecture further that the hardware schemes are so effective in reducing the number of false sharing misses that larger cache lines could be used. The size of a cache line is usually the result of a tradeoff. On the one hand a larger line allows to exploit spatial locality much better, resulting in fewer cold misses. On the other hand, the larger the cache line, the higher the probability that false sharing will occur. Since we think false sharing misses are almost completely eliminated by these protocols , the cache line should be driven by spatial locality towards larger size.

Distribution of Work

We have only decided on a tentative division of the workload as yet. Here it is: