Study of Relative Performance Impact: Coherence Protocol vs. Network Speed college
topic: ECE757 (Multiprocessor Computer Architecture), Univ. of Wisconsin - Madison
Architects and implementors must know design trade-offs to design successful systems. Two important aspects of multiprocessor performance are network quality and coherence protocol. We study the performance impact of coherence protocol choice (MSI vs. MESI) as compared to the performance impact of the high-speed network. A higher quality network is either wider, or has less latency, or both.

We find that in the majority of cases, network quality dominates the effects of coherence protocol - that is, lower-quality networks with the MESI protocol do not yield faster runs that higher-quality networks with the MSI protocol. We also note that program characteristics can affect which protocol performs better.

formats: Adobe PDF (158.9kB), PostScript (233.6kB), TeX (11.5kB) 1998-05-01 quality 5

% -*- latex -*-

\documentclass[12pt]{report}

% \setlength{\hoffset}{-1in}
% \setlength{\voffset}{0in}
% \setlength{\topmargin}{0pt}
% \setlength{\oddsidemargin}{0in}
% \setlength{\headheight}{0pt}
% \setlength{\headsep}{0pt}

% \setlength{\textheight}{9.00in}
% \setlength{\textwidth}{6in}

\usepackage{psfig}


\begin{document}


\begin{titlepage}

\title{Study of Relative Performance Impact: Coherence Protocol vs. Network Speed
\footnote{Completed as semester project for CS/ECE 757, Prof. Mark Hill}}
\author{Zak Smith \and Mostafa Arifin}
\maketitle

\begin{abstract}
	Architects and implementors must know design trade--offs to
	design successful systems.  Two important aspects of
	multiprocessor performance are {\it network quality} and
	{\it coherence protocol}.  We study the performance impact of
	coherence protocol choice (MSI vs. MESI) as compared to the
	performance impact of the high--speed network.  A higher
	quality network is either wider, or has less latency, or both.

	We find that in the majority of cases, {\it network quality
	dominates the effects of coherence protocol} --- that is,
	lower--quality networks with the MESI protocol do not yield
	faster runs that higher--quality networks with the MSI
	protocol.  We also note that program characteristics can
	affect which protocol performs better.




\end{abstract}

\end{titlepage}

\newpage
\tableofcontents
\newpage


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% \baselineskip = 22pt

\chapter{Introduction}

\section{Topic Importance}
	In order for a machine to achieve high performance within a
	cost constraint, architects and implementors must know the
	benefits and trade-offs for different design decisions.

	The performance of the interconnection network is key in
	multiprocessor systems.   In all but ``embarrassingly
	parallel'' applications, communication between processing
	elements provides those elements with the data they need to
	continue processing.

	Our study examines three parameters which affect network
	performance: coherence protocol, flit delay, and flit size.
	We believe this type of study is important to verify
	theoretical models with real benchmark results.

\section{Hypothesis}
	During lecture, it was noted that a good implementation of the
	interconnection network often dominates other factors such as
	network topology.   Since significant time was spent
	discussing coherence protocols [1], we decided to determine
	the relative performance impact of coherence protocol as
	compared to ``quality of network implementation.''  Thus, our
	hypothesis: 
	
	{\it A good network implementation dominates the effects of
	choosing a specific coherence protocol. }

\section{Definitions}

\subsection{Flit Delay}
	The flit delay is the time it takes for a flit to pass through
	a network switch, relative to the clock speed of the
	processor.  For example, if the flit delay was 2, and the
	processor was running at 300 Mhz, it would take 6.6 ns for a
	flit to propagate through a switch.  This primarily affects latency.
	

\subsection{Flit Width}
	The flit width specifies the number of bytes in each network
	flit, which is equivalent to the width of the network.  This
	primarily affects bandwidth.

\subsection{Coherence Protocol}
	The two coherence protocols we look at are the MSI and MESI
	protocols discussed in lecture and in [1].	

	\begin{tabbing}
	MESI 	\= ~~~~~~~~~~~~~~ \= ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ \\
	M 	\> Modified	\> This is the only copy, dirty. \\
	E	\> Exclusive	\> Ensured to be the only RO copy. \\
	S	\> Shared	\> One of two or more RO copies. \\
	I	\> Invalid	\> Block not present. \\
	\end{tabbing}


	\begin{tabbing}
	MSI 	\= ~~~~~~~~~~~~~~ \= ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ \\
	M 	\> Modified	\> This is the only copy, dirty. \\
	S	\> Shared	\> One of one or more RO copies. \\
	I	\> Invalid	\> Block not present. \\
	\end{tabbing}
	

	Thus the primary difference between the two is that, for MSI,
	if a processor reads a block known to no other processor and
	then writes it, it must notify the directory (S to M).  For
	MESI, this is not the case: a processor that reads the block
	first has it in Exclusive and can move to Modified without any
	traffic to the directory (E to M).

	At first glance, it would appear that MESI would perform
	better, although as we will see later, there can be
	implementation issues which make this not the case.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\chapter{Methods}

\section{Simulator Model}
	We chose to use RSIM [7,3], an ILP multiprocessor simulator
	developed at Rice University.   The system simulates a
	collection of workstation--like nodes connected in a 2d mesh
	interconnection network, with CC-NUMA based on a distributed
	directory.

	We simulated a 16 processor system to control runtime of our
	simulations.

\section{Search-space}
	One of the parameters we want to change is ``quality of
	network implementation.''  We decided this is basically a
	combination of flit delay (related to the clock speed of the
	network), and the flit size (width of the network).  In other
	words, the ``best'' network will have the fastest clock speed
	and the widest data-path.  The ``worst'' network will have the
	slowest clock speed and the the narrowest data-path.

	To cover a range of network qualities, we chose to use flit
	widths of {2, 4, 8} bytes, and flit delays of {2, 4, 8}
	processor clocks.  Thus for each benchmark, we have 18
	data-points:
	
	\vskip .1in

	\begin{tabular}{|r||c|c|c|} %%%%%%%%%%%
	\hline
	    & FD2 	& FD4 		& FD8 		\\ \hline \hline
	FS8 & MESI, MSI & MESI, MSI	& MESI, MSI	\\ \hline
	FS4 & MESI, MSI & MESI, MSI	& MESI, MSI	\\ \hline
	FS2 & MESI, MSI & MESI, MSI	& MESI, MSI	\\ \hline
	\end{tabular}
	
	

\section{Benchmarks}
	There are 5 benchmark programs available for RSIM from the
	SPLASH [5] and SPLASH-2 [6] sets: mp3d, fft, radix, lu, and
	water.  These will be described in the results section.


\section{Data--sets}
	Condor [4], is a system for high--throughput computing
	available here at the UW.  It is a great tool for architects
	because simulations typically take days to finish.  Condor
	allows many of these jobs to be distributed to idle machines.
	Normal condor jobs are check-pointed and ``migrated'' off
	machines when a user logs on or the machine is to be powered
	off.  This allows the job continue where it left off when
	another machine becomes available.

	Because of some tricky code in RSIM, it would not work
	correctly when linked against the Condor libraries.  Thus we
	were limited to running RSIM as a ``vanilla'' job --- that is,
	when the job is interrupted by a user or the machine being
	powered down, instead of check-pointing the job so it could
	continue on another machine, the vanilla jobs are killed and
	must be restarted.   Because of this, we had to make sure our
	simulations would finish in less than about 12 hours!    This
	limited the number of processors we could simulate and the
	size of the data-sets run in each application.

	We have no results for water because none of the nontrivial
	data-sets could finish in the average idle time of a machine
	in the Condor pool.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\chapter{Results}

\include{fft}

\include{mp3d}

\include{lu}

\include{radix}

% \section{Summary}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\chapter{Conclusions}

\section{Hypothesis}
	We found that for the majority of the cases examined, network
	quality {\it does} dominate coherence protocol choice.  

	It was more strongly true for benchmarks which are not
	communication--intensive.  The only benchmark which failed the
	hypothesis was RADIX, which requires a large amount of
	many--to--one and one--to--many communication every iteration.

	The effect of protocol choice can be overcome by network
	quality.  We found that increasing network width always
	dominated the protocol differences, while decreasing the
	network delay dominated protocol differences only ``most of
	the time.''

	Further, it could be said that designers {\it should} dominate
	the protocol differences by the network quality because all
	programs will benefit from a better network, and not all
	benchmarks do better with MESI.  As we found, some do much worse.

\section{Protocol Importance}
	The choice of coherence protocol can be an important factor in
	system design.  We saw that for the majority of the benchmarks
	we examined, protocol choice had little effect on performance,
	while network quality had the primary effect.   However, for
	the one benchmark which had the worst slow--down, the protocol
	also had major impact on performance.  Thus protocol
	choice must be paid special attention to during design: which
	protocol will perform better for the expected workloads?
	There is room here for future work.

\section{Network Quality Importance}
	Network quality has the largest impact on performance.  A
	faster network will improve all programs which communicate,
	with no performance cost --- the cost involved is engineering
	and materials cost.

\section{Application Dependence}
	Different applications have difference communication
	requirements.   Some, like RADIX, are very network--intensive,
	while others tolerate low--quality networks with little
	performance degradation (FFT).

\section{Future Work}
	We were surprised to see such a wide variance in maximum
	runtime difference between the benchmarks.   This suggests a
	larger set of programs should be studied to get a better idea
	how a typical workload behaves.

	This study has given us some idea about the relative
	importance of network quality as compared to protocol choice.
	The next step would be to compare the hardware and
	engineering costs to determine how trade-offs should be made.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


\chapter{References}
\parindent 0pt

[0] {\it This Document.}\\
{\tt http://www.cae.wisc.edu/$\tilde{\ } $smithz/texts/ece757-report/report.ps}

\par
\vskip .2in

[1] P. SWEAZEY and A. J. SMITH, ``A Class of Compatible Cache
Consistency Protocols and their Support by the IEEE Futurebus'', {\it
Proc. Thirteenth International Symposium on Computer Architecture},
Tokyo, Japan (June 1986), 414-423.

\par
\vskip .2in

[2] S. V. ADVE and K. GHARACHORLOO, ``Shared Memory Consistency Models:
A Tutorial'', {\it IEEE Computer}, 29, 12 (December 1996), 66-76.


\par
\vskip .2in
[3] V. S. PAI, P. RANGANATHAN, and S. V. ADVE, ``RSIM Reference Manual
--- Version 1.0'', Rice University, Dept of ECE, Technical Report
9705, August 1997.

\par
\vskip .2in
[4] Condor HTC Homepage. {\tt http://www.cs.wisc.edu/condor/}, April 1998.
	
\par
\vskip .2in
[5] J. P. SINGH, W. WEBER, and A. GUPTA, ``SPLASH: Stanford Parallel
Applications for Shared--Memory'', Stanford University, Computer
Systems Laboratory.


\par
\vskip .2in
[6] S. C. WOO, M. OHARA, E. TORRIE, J. P. SINGH, and A. GUPTA, ``The
SPLASH-2 Programs: Characterization and Methodological
Considerations'', ISCA 1995.

\par
\vskip .2in
[7] PAI, RANGANATHAN, ADVE, and HARTON, ``An Evaluation of Memory
Consistency Models for Shared Memory Systems with ILP Processors'', ASPLOS 1996.



%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%




\end{document}


% LocalWords:  Arifin MSI MESI Mhz ns RO RSIM ILP CC NUMA FD FS mp fft lu vs CS
% LocalWords:  ECE Zak Mostafa pt Condor UW condor http www cae wisc edu smithz
% LocalWords:  ece ps SWEAZEY Futurebus Proc ADVE GHARACHORLOO PAI RANGANATHAN
% LocalWords:  HTC Homepage cs SINGH GUPTA OHARA TORRIE ISCA HARTON ASPLOS
% LocalWords:  LocalWords

[Zak Smith] [zak@computer.org] [/~zak/documents/college/ece757-report/tex]
$Id: documents,v 1.5 2000/09/28 21:20:39 zak Exp zak $
documents was last modified Mon 07 Apr 2014 0:16:32
All text and photographs © copyright 1997-2009 Zak Smith, all rights reserved, unless otherwise noted.
documents took 1.47 msec and 1 queries to generate, at Fri 26 Apr 2024 10:22:19.