% ECE 757 (Multiprocessor Computer Architecture), Univ. of Wisconsin - Madison
% 1998-05-01
% -*- latex -*-
\documentclass[12pt]{report}
% \setlength{\hoffset}{-1in}
% \setlength{\voffset}{0in}
% \setlength{\topmargin}{0pt}
% \setlength{\oddsidemargin}{0in}
% \setlength{\headheight}{0pt}
% \setlength{\headsep}{0pt}
% \setlength{\textheight}{9.00in}
% \setlength{\textwidth}{6in}
\usepackage{psfig}
\begin{document}
\begin{titlepage}
\title{Study of Relative Performance Impact: Coherence Protocol vs. Network Speed
\footnote{Completed as semester project for CS/ECE 757, Prof. Mark Hill}}
\author{Zak Smith \and Mostafa Arifin}
\maketitle
\begin{abstract}
Architects and implementors must know design trade-offs to
design successful systems. Two important aspects of
multiprocessor performance are {\it network quality} and
{\it coherence protocol}. We study the performance impact of
coherence protocol choice (MSI vs.\ MESI) as compared to the
performance impact of the high-speed network. A
higher-quality network is wider, has lower latency, or both.
We find that in the majority of cases, {\it network quality
dominates the effects of coherence protocol} --- that is,
lower-quality networks with the MESI protocol do not yield
faster runs than higher-quality networks with the MSI
protocol. We also note that program characteristics can
affect which protocol performs better.
\end{abstract}
\end{titlepage}
\newpage
\tableofcontents
\newpage
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% \baselineskip = 22pt
\chapter{Introduction}
\section{Topic Importance}
In order for a machine to achieve high performance within a
cost constraint, architects and implementors must know the
benefits and trade-offs for different design decisions.
The performance of the interconnection network is key in
multiprocessor systems. In all but ``embarrassingly
parallel'' applications, communication between processing
elements provides those elements with the data they need to
continue processing.
Our study examines three parameters which affect network
performance: coherence protocol, flit delay, and flit size.
We believe this type of study is important to verify
theoretical models with real benchmark results.
\section{Hypothesis}
During lecture, it was noted that a good implementation of the
interconnection network often dominates other factors such as
network topology. Since significant time was spent
discussing coherence protocols [1], we decided to determine
the relative performance impact of coherence protocol as
compared to ``quality of network implementation.'' Thus, our
hypothesis:
{\it A good network implementation dominates the effects of
choosing a specific coherence protocol. }
\section{Definitions}
\subsection{Flit Delay}
The flit delay is the time it takes for a flit to pass through
a network switch, measured in processor clock cycles. For
example, with a flit delay of 2 and a processor running at
300 MHz, a flit would take about 6.7 ns to propagate through a
switch. This primarily affects latency.
\subsection{Flit Width}
The flit width specifies the number of bytes in each network
flit, which is equivalent to the width of the network. This
primarily affects bandwidth.
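To make these two definitions concrete, the combined effect of flit delay and
flit width on a single message can be related with simple wormhole-routing
arithmetic. The following is an illustrative sketch with assumed parameters
(message size, hop count); it is not the timing model RSIM itself uses.

```python
# Illustrative wormhole-routing arithmetic: flit delay sets per-switch
# latency, flit width sets how many flits a message is serialized into.
# All parameter values below are assumptions for illustration only.

def flit_latency_ns(flit_delay_cycles, clock_mhz):
    """Time for one flit to pass through one switch, in nanoseconds."""
    return flit_delay_cycles * 1000.0 / clock_mhz

def message_latency_ns(msg_bytes, flit_width_bytes, flit_delay_cycles,
                       clock_mhz, hops):
    """Head flit crosses `hops` switches; remaining flits pipeline behind."""
    n_flits = -(-msg_bytes // flit_width_bytes)   # ceiling division
    per_flit = flit_latency_ns(flit_delay_cycles, clock_mhz)
    return (hops + n_flits - 1) * per_flit

# The example from the text: flit delay 2 at 300 MHz is ~6.7 ns per switch.
print(round(flit_latency_ns(2, 300), 1))          # 6.7

# Assumed 64-byte cache block over 4 hops: an 8-byte flit beats a
# 2-byte flit because fewer flits must be serialized onto the network.
print(message_latency_ns(64, 8, 2, 300, 4))
print(message_latency_ns(64, 2, 2, 300, 4))
```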
\subsection{Coherence Protocol}
The two coherence protocols we look at are the MSI and MESI
protocols discussed in lecture and in [1].
\begin{tabbing}
MESI \= ~~~~~~~~~~~~~~ \= ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ \\
M \> Modified \> This is the only copy, dirty. \\
E \> Exclusive \> Ensured to be the only RO copy. \\
S \> Shared \> One of two or more RO copies. \\
I \> Invalid \> Block not present. \\
\end{tabbing}
\begin{tabbing}
MSI \= ~~~~~~~~~~~~~~ \= ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ \\
M \> Modified \> This is the only copy, dirty. \\
S \> Shared \> One of one or more RO copies. \\
I \> Invalid \> Block not present. \\
\end{tabbing}
Thus the primary difference between the two is that, for MSI,
if a processor reads a block known to no other processor and
then writes it, it must notify the directory (S to M). For
MESI, this is not the case: a processor that reads the block
first has it in Exclusive and can move to Modified without any
traffic to the directory (E to M).
At first glance, it would appear that MESI would perform
better, although as we will see later, there can be
implementation issues which make this not the case.
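The E-to-M advantage can be sketched as a toy message counter. This is a
deliberate simplification (one block, no sharer downgrades, invalidations
folded into a single directory message), not RSIM's protocol engine:

```python
# Toy single-block directory model contrasting MSI and MESI: count how
# many messages reach the directory for a given access pattern.
# Simplified for illustration; not RSIM's actual protocol implementation.

def directory_messages(protocol, accesses):
    """accesses: list of (cpu, 'R' or 'W'). Returns directory message count."""
    state = {}        # cpu -> 'M', 'E', or 'S'; absent means Invalid
    messages = 0
    for cpu, op in accesses:
        cur = state.get(cpu, 'I')
        if op == 'R':
            if cur == 'I':
                messages += 1                  # read miss: ask the directory
                sole_copy = not state
                state[cpu] = 'E' if (protocol == 'MESI' and sole_copy) else 'S'
        else:  # 'W'
            if cur in ('E', 'M'):
                state = {cpu: 'M'}             # MESI: silent E -> M upgrade
            else:
                messages += 1                  # S -> M or I -> M: notify directory
                state = {cpu: 'M'}             # directory invalidates sharers
    return messages

# A processor reads a private block, then writes it:
pattern = [(0, 'R'), (0, 'W')]
print(directory_messages('MSI', pattern))      # 2: read miss, then S -> M upgrade
print(directory_messages('MESI', pattern))     # 1: read miss only; E -> M is silent
```

The pattern above (read-then-write of private data) is exactly the case where
MESI saves traffic; once a second reader is involved, both protocols behave alike.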
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter{Methods}
\section{Simulator Model}
We chose to use RSIM [7,3], an ILP multiprocessor simulator
developed at Rice University. The system simulates a
collection of workstation-like nodes connected in a 2D mesh
interconnection network, with CC-NUMA based on a distributed
directory.
We simulated a 16-processor system to keep the runtime of our
simulations manageable.
\section{Search-space}
One of the parameters we want to change is ``quality of
network implementation.'' We treat this as a
combination of flit delay (related to the clock speed of the
network) and flit size (the width of the network). In other
words, the ``best'' network has the fastest clock speed
and the widest data-path; the ``worst'' network has the
slowest clock speed and the narrowest data-path.
To cover a range of network qualities, we chose flit
widths of $\{2, 4, 8\}$ bytes and flit delays of $\{2, 4, 8\}$
processor clocks. With two protocols, each benchmark thus has 18
data-points:
\vskip .1in
\begin{tabular}{|r||c|c|c|} %%%%%%%%%%%
\hline
& FD2 & FD4 & FD8 \\ \hline \hline
FS8 & MESI, MSI & MESI, MSI & MESI, MSI \\ \hline
FS4 & MESI, MSI & MESI, MSI & MESI, MSI \\ \hline
FS2 & MESI, MSI & MESI, MSI & MESI, MSI \\ \hline
\end{tabular}
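The table above is just the cross-product of the three parameters; a minimal
enumeration sketch (the dictionary keys are our own naming, not RSIM's
configuration syntax):

```python
# Enumerate the simulated configurations: 3 flit widths x 3 flit delays
# x 2 protocols = 18 data-points per benchmark, matching the table above.
from itertools import product

flit_widths = [2, 4, 8]       # bytes (FS2, FS4, FS8)
flit_delays = [2, 4, 8]       # processor clocks (FD2, FD4, FD8)
protocols = ['MESI', 'MSI']

configs = [
    {'flit_width': w, 'flit_delay': d, 'protocol': p}
    for w, d, p in product(flit_widths, flit_delays, protocols)
]
print(len(configs))           # 18 data-points per benchmark
```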
\section{Benchmarks}
There are 5 benchmark programs available for RSIM from the
SPLASH [5] and SPLASH-2 [6] sets: mp3d, fft, radix, lu, and
water. These will be described in the results section.
\section{Data-sets}
Condor [4] is a system for high-throughput computing
available here at the UW. It is a great tool for architects
because simulations typically take days to finish, and Condor
allows many such jobs to be distributed to idle machines.
Normal Condor jobs are check-pointed and ``migrated'' off a
machine when a user logs on or the machine is to be powered
off, allowing the job to continue where it left off when
another machine becomes available.
Because of some tricky code in RSIM, it would not work
correctly when linked against the Condor libraries. Thus we
were limited to running RSIM as a ``vanilla'' job --- that is,
when the job is interrupted by a user or the machine being
powered down, instead of check-pointing the job so it could
continue on another machine, the vanilla jobs are killed and
must be restarted. Because of this, we had to make sure our
simulations would finish in less than about 12 hours! This
limited the number of processors we could simulate and the
size of the data-sets run in each application.
We have no results for water because none of the nontrivial
data-sets could finish in the average idle time of a machine
in the Condor pool.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter{Results}
\include{fft}
\include{mp3d}
\include{lu}
\include{radix}
% \section{Summary}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter{Conclusions}
\section{Hypothesis}
We found that for the majority of the cases examined, network
quality {\it does} dominate coherence protocol choice.
This was especially true for benchmarks that are not
communication-intensive. The only benchmark which failed the
hypothesis was RADIX, which requires a large amount of
many-to-one and one-to-many communication every iteration.
The effect of protocol choice can be overcome by network
quality. We found that increasing network width always
dominated the protocol differences, while decreasing the
network delay dominated protocol differences only ``most of
the time.''
Further, it could be argued that designers {\it should} let
network quality dominate the protocol differences, because all
programs benefit from a better network, while not all
benchmarks do better with MESI; as we found, some do much worse.
\section{Protocol Importance}
The choice of coherence protocol can be an important factor in
system design. For the majority of the benchmarks
we examined, protocol choice had little effect on performance,
while network quality had the primary effect. However, for
the one benchmark with the worst slow-down, the protocol
also had a major impact on performance. Protocol choice
therefore deserves special attention during design: which
protocol will perform better for the expected workloads?
There is room here for future work.
\section{Network Quality Importance}
Network quality has the largest impact on performance. A
faster network improves every program that communicates,
with no performance cost --- the costs are engineering
and materials.
\section{Application Dependence}
Different applications have different communication
requirements. Some, like RADIX, are very network-intensive,
while others, like FFT, tolerate low-quality networks with
little performance degradation.
\section{Future Work}
We were surprised to see such a wide variance in maximum
runtime difference between the benchmarks. This suggests that a
larger set of programs should be studied to get a better idea
of how a typical workload behaves.
This study has given us some idea of the relative
importance of network quality as compared to protocol choice.
The next step would be to compare the hardware and
engineering costs to determine how trade-offs should be made.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter{References}
\parindent 0pt
[0] {\it This Document.}\\
{\tt http://www.cae.wisc.edu/\~{}smithz/texts/ece757-report/report.ps}
\par
\vskip .2in
[1] P. SWEAZEY and A. J. SMITH, ``A Class of Compatible Cache
Consistency Protocols and their Support by the IEEE Futurebus'', {\it
Proc. Thirteenth International Symposium on Computer Architecture},
Tokyo, Japan (June 1986), 414-423.
\par
\vskip .2in
[2] S. V. ADVE and K. GHARACHORLOO, ``Shared Memory Consistency Models:
A Tutorial'', {\it IEEE Computer}, 29, 12 (December 1996), 66-76.
\par
\vskip .2in
[3] V. S. PAI, P. RANGANATHAN, and S. V. ADVE, ``RSIM Reference Manual
--- Version 1.0'', Rice University, Dept of ECE, Technical Report
9705, August 1997.
\par
\vskip .2in
[4] Condor HTC Homepage. {\tt http://www.cs.wisc.edu/condor/}, April 1998.
\par
\vskip .2in
[5] J. P. SINGH, W. WEBER, and A. GUPTA, ``SPLASH: Stanford Parallel
Applications for Shared--Memory'', Stanford University, Computer
Systems Laboratory.
\par
\vskip .2in
[6] S. C. WOO, M. OHARA, E. TORRIE, J. P. SINGH, and A. GUPTA, ``The
SPLASH-2 Programs: Characterization and Methodological
Considerations'', ISCA 1995.
\par
\vskip .2in
[7] PAI, RANGANATHAN, ADVE, and HARTON, ``An Evaluation of Memory
Consistency Models for Shared Memory Systems with ILP Processors'', ASPLOS 1996.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\end{document}
% LocalWords: Arifin MSI MESI Mhz ns RO RSIM ILP CC NUMA FD FS mp fft lu vs CS
% LocalWords: ECE Zak Mostafa pt Condor UW condor http www cae wisc edu smithz
% LocalWords: ece ps SWEAZEY Futurebus Proc ADVE GHARACHORLOO PAI RANGANATHAN
% LocalWords: HTC Homepage cs SINGH GUPTA OHARA TORRIE ISCA HARTON ASPLOS
% LocalWords: LocalWords
|