diff --git a/fig/tikz/first_iteration.tikz b/fig/tikz/first_iteration.tikz
index 13338e5..9ef5e52 100644
--- a/fig/tikz/first_iteration.tikz
+++ b/fig/tikz/first_iteration.tikz
@@ -120,7 +120,7 @@
         |[draw=none]| ncDegs: & $0$ & $3$ & $6$ & |[draw=none]| \\
         |[draw=none]| n: & \color{red}{$2$} & \color{teal}{$5$} & \color{red}{$1$} & \color{red}{$3$} & \color{red}{$4$} & \color{red}{$2$} & \color{red}{$4$} & \color{red}{$2$} & \color{red}{$3$} & \color{teal}{$5$} & \color{teal}{$6$} & \color{red}{$1$} & \color{red}{$4$} & \color{teal}{$6$} & \color{red}{$4$} & \color{teal}{$5$} & |[draw=none]| \\
         |[draw=none]| w: & $3$ & $5$ & $3$ & $2$ & $4$ & $2$ & $1$ & $4$ & $1$ & $3$ & $6$ & $5$ & $3$ & $2$ & $6$ & $2$ & |[draw=none]| \\
-        |[draw=none]| nn: & $5$ & $5$ & $6$ & $1$ & $4$ & $4$ & |[draw=none]| \\
+        |[draw=none]| nn: & $2$ & $2$ & $2$ & $1$ & $1$ & $1$ & |[draw=none]| \\
         |[draw=none]| nw: & $5$ & $3$ & $6$ & $5$ & $3$ & $6$ & |[draw=none]| \\
     };
     \node[above=1pt of alignedArrays-1-3] {\tiny 1};
@@ -131,4 +131,4 @@
     \node[above=1pt of alignedArrays-5-5] {\tiny 2};
     \end{tikzpicture}
     \caption{\texttt{scan} operation on ncDegs}\label{tikz:graph-contraction}
-    \end{subfigure}
\ No newline at end of file
+    \end{subfigure}
diff --git a/main.tex b/main.tex
index 1bddcd1..ebd4a2f 100644
--- a/main.tex
+++ b/main.tex
@@ -115,19 +115,23 @@
 \noindent
 \textbf{GRAPH STRUCTURE:} \\
-Before discussing the solution I want to introduce the memorization strategy for the data structure. Normally, when working with graphs, the two data structures that come to mind are:
+Before discussing the solution I want to briefly introduce the data structure. Normally, when
+working with graphs, the two data structures that come to mind are:
 \begin{itemize}
-    \item The \emph{adjacency matrix}, which represents any connection between two vertices, $i, j$ in the graph as a $1$ or a $0$ in a matrix\footnote{For simplicity's sale I consider an unweighted graph}. Let's suppose that the $|V|$ is $n$, then the amount of space occupied by a matrix in memory is $\O(n^2)$ while the cost of accessing element $(i, j)$, independently from the values of $i$ and $j$, is just $\O(1)$.
-    \item The \emph{adjacency list}, which stores only a reference to the neighbour for every node, usually memorized as an array of lists, therefore if $j$ is neighbour for vertex $i$ a reference to $j$ will be placed in the list of neighbours of $i$. Supposing that $|E|$ is $m$, then the cost of keeping such a structure in memory is $O(m + n)$ and the cost of accessing any neighbour for a generic vertex $i$ is going to be $\O(n)$, because in the worst case we could be accessing a graph in which all of the vertices are linked to $i$.
+    \item The \emph{adjacency matrix}, which represents any connection between two vertices $i,
+          j$ in the graph as a $1$ or a $0$ in position $(i, j)$ of a matrix\footnote{For
+          simplicity's sake I consider an unweighted graph}. Supposing that $|V|$ is $n$, the
+          amount of space occupied by the matrix in memory is $\O(n^2)$, while the cost of
+          accessing element $(i, j)$, independently of the values of $i$ and $j$, is just
+          $\O(1)$.
+    \item The \emph{adjacency list}, which stores a reference to every neighbour of every vertex.
+          It is usually memorized as an array of lists: if $j$ is a neighbour of vertex $i$, a
+          reference to $j$ is placed in the list of neighbours of $i$. Supposing that $|E|$ is
+          $m$, the cost of keeping such a structure in memory is $\O(m + n)$, while the cost of
+          accessing a given neighbour of a generic vertex $i$ is $\O(n)$, because in the worst
+          case all of the other vertices could be linked to $i$.
 \end{itemize}
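+To make the trade-off concrete, here is a minimal sketch of the two classic representations
+(illustrative types only, not taken from the actual implementation):
+\begin{verbatim}
+#include <cstdint>
+#include <vector>
+
+// Adjacency matrix: O(n^2) space, O(1) lookup of edge (i, j).
+using AdjMatrix = std::vector<std::vector<uint8_t>>;
+
+// Adjacency list: O(n + m) space; finding a given neighbour of i
+// means scanning adj[i], which is O(n) in the worst case.
+using AdjList = std::vector<std::vector<int>>;
+\end{verbatim}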
-The approach I followed for the implementation of the algorithm is, to quote the authors of~\cite{generic-he-boruvka} "a compromise between adjacency list and adjacency matrix". The \csr format is a form of encoding linearizing the structure of adjacency lists, to save space, using arrays to keep the adjacency lists in memory, to save time. The number of additional arrays implemented in \csr vary slightly in the literature~\cite{csr-kelly}~\cite{csr-wheatman}~\cite{generic-he-boruvka}, for my implementation I chose to stray from the 4-vector solution used in~\cite{generic-he-boruvka}, since it would have only complicated the structure of the graph without holding any additional information.
+The approach I followed for the implementation of the algorithm is, to quote the authors
+of~\cite{generic-he-boruvka}, "a compromise between adjacency list and adjacency matrix". The \csr
+encoding linearizes the adjacency lists into contiguous arrays, saving space with respect to the
+matrix while keeping neighbourhood accesses fast. The number of additional arrays used in \csr
+varies slightly in the literature~\cite{csr-kelly}~\cite{csr-wheatman}~\cite{generic-he-boruvka};
+for my implementation I chose to stray from the 4-vector solution used
+in~\cite{generic-he-boruvka}, since it would have only complicated the structure of the graph
+without holding any additional information.
 The array structure for my implementation consists of:
 \begin{itemize}
     \item A \emph{cumulated degrees} array, which stores the running sum of the vertices'
           degrees and is also used to compute the degree of every vertex.
-    \item An \emph{neighbours} array, which is a linearization of the graph's adjacency list.
+    \item A \emph{neighbours} array, which is a linearization of the graph's adjacency lists.
     \item A \emph{weights} array, containing the weight of every edge $(i, j)$ in the graph.
 \end{itemize}
-An example of the CSR structure is shown in Figure~\ref{tikz:csr-struct}.
+An example of the \csr structure is shown in Figure~\ref{tikz:csr-struct}.
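+For illustration, the three arrays could be laid out as in the following sketch (names and exact
+types are mine and may differ from the actual source):
+\begin{verbatim}
+#include <vector>
+
+// Hedged sketch of the CSR layout described above.
+struct CsrGraph {
+    std::vector<int> cumDegs;    // offsets; deg(v) = cumDegs[v + 1] - cumDegs[v]
+    std::vector<int> neighbours; // m entries: linearized adjacency lists
+    std::vector<int> weights;    // m entries: weights[k] = weight of edge k
+};
+
+// The neighbourhood of vertex v is the slice [cumDegs[v], cumDegs[v + 1]):
+// for (int k = g.cumDegs[v]; k < g.cumDegs[v + 1]; ++k)
+//     visit(g.neighbours[k], g.weights[k]);
+\end{verbatim}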
 \begin{figure}
     \centering
@@ -143,12 +147,11 @@
 \textbf{CPU IMPLEMENTATION:} \\
 I will not cover at length the CPU implementation since it's quite straightforward: at first I opted
-for an extremely na\"ive implementation of Prim's algorithm, consisting of two \texttt{for} cycles.
-Since the implementation worked but was in fact extremely slow compared to finer solvers I
-decided to scrap it in favour of a more efficient implementation of Prim's algorithm based on
-the use of a simple heap to sort the edges needed to build the \mst.
+for an extremely na\"ive version of Prim's algorithm, consisting of two \texttt{for} cycles.
+Since the algorithm proved to be far too slow I decided to move to a faster solver based on heaps,
+which has a time complexity of $\O(m \cdot \log{n})$, as was shown in~\ref{sec:intro}.
-I additionally implemented \brkas algortithm which proves to have better performance, as shown in Section \ref{sec:performance-analysis}, when working with sparse graphs, if compared to the original Prim's solver, which has an edge on denser graphs.
+I additionally implemented \brkas algorithm which, as shown in
+Section~\ref{sec:performance-analysis}, performs better when working with sparse graphs.
 \bigskip
 \phantomsection
@@ -162,21 +165,45 @@
 deploying a thread per vertex (such an approach is referred to as \textit{topologic} in the
 literature). To better understand every step, I took the example graph shown
 in~\ref{tikz:csr-struct} and computed the first iteration of the algorithm step-by-step.
-Following Algorithm~\ref{algo:boruvka-parallel} one iteration of the outer loops consists of:
+Following Algorithm~\ref{algo:boruvka-parallel}, one iteration of the outer loop consists of:
 \begin{enumerate}
     \item\label{item:choose-lightest} (\textit{choosing the lightest}), as shown in~\ref{tikz:find-cheapest}, this operation picks the lightest outgoing edge in every vertex neighbourhood in parallel. The result is going to be a single reference to the outgoing edge which will be stored in the \texttt{candidates} array.
-          Weight ties are broken by picking the neighbour with the smallest vertex-id.
+          Weight ties are broken by picking the neighbour with the smallest vertex-id (a
+          simplified sketch of this kernel is given below). Considering vertex $4$ as shown
+          in~\ref{tikz:find-cheapest}, the cheapest edge has weight $1$ and links $4$ to $3$;
+          therefore, in position $4$ of the new array we put a reference to
+          $3$\footnote{In the actual implementation the offset from the beginning of the
+          vertex's neighbourhood is saved}.
     \item\label{item:mirror-removal} (\textit{removing loops}), as shown in~\ref{tikz:loop-removal}, this operation is meant to remove cycles from the graph. Whenever a cycle between vertices $i$ and $j$ is identified, only the copy of the edge associated to the vertex with the highest vertex-id is kept, while the other is set to the default value of \texttt{UINT\_MAX}.
+
+          Considering vertex $4$ as shown in~\ref{tikz:loop-removal}, we see that the candidate
+          edge for vertex $3$ leads back to $4$: a loop has been found, so the candidate for $3$
+          is replaced with the default value, indicated as $\mathcal{U}$.
    \item\label{item:coloration} (\textit{Initializing and propagating colors}), as shown in~\ref{tikz:coloration}, this operation is meant to identify connected components inside the graph and can be implemented recursively using a kernel as recursion head and a series of \texttt{\_\_device\_\_} function calls to propagate the colors. The resulting \texttt{colors} array will contain the vertex-id of vertices with undefined candidate value and, for all other vertices, the color value will be computed recursively by using the color of the neighbour pointed by the \texttt{candidates} array.
+
+          If we consider vertex $4$ as shown in~\ref{tikz:coloration}, its candidate value
+          points to $3$, therefore $4$ will belong to the same component as $3$. Since the
+          candidate value for $3$ is $\mathcal{U}$, $3$ is going to be the root of the
+          connected component and its color will be set to its own vertex-id.
    \item\label{item:vertex-rename} (\textit{Renaming vertices}), as shown in~\ref{tikz:sv-rename}, to rename the vertices a new array is created (in my implementation, \texttt{flag}) that contains a $0$ in every position where the color of the vertex is different from the vertex-id and a $1$ otherwise. Afterwards an exclusive scan procedure is computed on \texttt{flag} to compute the new vertex-ids.
+
+          Since during the previous step the algorithm already looks recursively for the root
+          of the component, the two steps can easily be merged together, reducing the kernel
+          launch overhead.
    \item\label{item:graph-contraction} (\textit{Counting, assigning and inserting new edges}), as shown in the last part of Figure~\ref{tikz:first-iteration}, this bigger operation has to be split into three different steps:
    \begin{itemize}
        \item\label{item:count-edges} (\textit{Count edges}), as shown in~\ref{tikz:edge-count}, a simple kernel checks every edge in a vertex's neighbourhood, comparing the color of the source with the color of the destination (effectively comparing the connected components they belong to). If they belong to different connected components the result will be added to the \texttt{newCumDegs} array that is being constructed.
+
+          If we consider the neighbourhood of vertex $4$, the first edge leads to $2$, which
+          is colored red; since $4$ is colored red as well, we do not count it as an outgoing
+          edge for the new supervertex (the edge stays within the component). Vertex $4$ also
+          has an edge leading to $5$, which is colored teal: since its color differs, the edge
+          reaches another component and is therefore counted as one of the outgoing edges of
+          supervertex $1$.
        \item\label{item:scan-ncd} (\textit{Cumulated degrees computation}), as shown in~\ref{tikz:scan-ncdegs}, the resulting \texttt{newCumDegs}, which contains the outdegrees for every connected component, undergoes a round of scan to finalize the computation of the cumulated degrees vector for the contracted graph.
        \item\label{item:graph-regen} (\textit{Graph contraction}), as shown in~\ref{tikz:graph-contraction}, this step generates the new neighbour and weight arrays using a logic that is very similar to the one used for the calculation of the new number of outgoing edges.
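+For illustration, step~\ref{item:choose-lightest} could look like the following kernel: a
+simplified sketch under assumed names, not the actual implementation (per the footnote above,
+the offset inside the neighbourhood is what gets saved).
+\begin{verbatim}
+#include <climits>
+
+// One thread per vertex ("topologic"): pick the lightest outgoing edge,
+// breaking weight ties towards the smallest neighbour vertex-id.
+__global__ void findCheapest(const int *cumDegs, const int *neighbours,
+                             const int *weights, unsigned *candidates,
+                             int n) {
+    int v = blockIdx.x * blockDim.x + threadIdx.x;
+    if (v >= n) return;
+    unsigned best = UINT_MAX;          // default: no candidate edge
+    int bestW = INT_MAX, bestN = INT_MAX;
+    for (int k = cumDegs[v]; k < cumDegs[v + 1]; ++k) {
+        int w = weights[k], nb = neighbours[k];
+        if (w < bestW || (w == bestW && nb < bestN)) {
+            bestW = w;
+            bestN = nb;
+            best = k - cumDegs[v];     // offset inside v's neighbourhood
+        }
+    }
+    candidates[v] = best;
+}
+\end{verbatim}
+A kernel like this would be launched with one thread per vertex, e.g.
+\texttt{findCheapest<<<(n + 1023) / 1024, 1024>>>(...)}, matching the $1024$-thread blocks used
+in my implementation.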
@@ -188,7 +215,14 @@
     \input{fig/tikz/first_iteration.tikz}
     \caption{Example of the first iteration}\label{tikz:first-iteration}
 \end{figure}
-In the following section I will be comparing two very similar approaches, GPU-u, which follows closely the implementation guidelines of the original paper, and GPU-e, which is more experimental.GPU-e differs essentially in how it computes the new cumulated degrees vector and the graph contraction; instead of using a topological approach for both kernels I use one thread per edge and then proceed to do a binary search for each edge to find the source of the edge. The result is that I am moving the cost of doing a \texttt{for} loop on a single processor to visiting the neighbourhood with an extreme degree of parallelism.
+In the following section I will be comparing two very similar approaches: GPU-u, which follows
+closely the implementation guidelines of the original paper, and GPU-e, which is more experimental.
+GPU-e differs essentially in how it computes the new cumulated degrees vector and the graph
+contraction; instead of using a topologic approach for both kernels, I use one thread per edge and
+then perform a binary search for each edge to find its source vertex. While this approach does not
+scale as well, since the number of threads it requires grows with the number of edges, it bears
+the germ of an idea that might prove interesting if more effort were put into it; more on this in
+Section~\ref{sec:final-thoughts}.
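+The lookup at the heart of GPU-e could be sketched as follows (assumed names, not the actual
+implementation): each thread owns one edge with index $k$ and binary-searches the cumulated
+degrees array for the vertex $v$ whose neighbourhood slice contains $k$.
+\begin{verbatim}
+// Recover the source vertex of edge k in O(log n) from the CSR offsets:
+// find v such that cumDegs[v] <= k < cumDegs[v + 1].
+__device__ int edgeSource(const int *cumDegs, int n, int k) {
+    int lo = 0, hi = n;       // invariant: cumDegs[lo] <= k < cumDegs[hi]
+    while (hi - lo > 1) {
+        int mid = lo + (hi - lo) / 2;
+        if (cumDegs[mid] <= k) lo = mid;
+        else                   hi = mid;
+    }
+    return lo;
+}
+\end{verbatim}
+This assumes \texttt{cumDegs} holds $n + 1$ offsets with \texttt{cumDegs[0] = 0} and
+\texttt{cumDegs[n] = m}, as in the \csr sketch above.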
 \bigskip
 \phantomsection
@@ -215,7 +249,11 @@
 \textbf{Architecture} & N.A.
 & Turing
 \end{longtable}
 \end{center}
-All of the benchmarks conducted are part of the tests provided for DIMACS\footnote{DIscrete MAthematics \& theoretical Computer Science center (\url{https://www.diag.uniroma1.it/challenge9/download.shtml})} center 9th challenge, the tests I used to get the results shown in~\ref{fig:results} are listed in table~\ref{tbl:benchmarks}.
+All of the benchmarks conducted are part of the tests provided for the DIMACS\footnote{DIscrete
+MAthematics \& theoretical Computer Science center
+(\url{https://www.diag.uniroma1.it/challenge9/download.shtml})} 9th challenge. The benchmark
+instances are listed in Table~\ref{tbl:benchmarks} and the results from the original paper have
+been plotted alongside mine, for reference, in~\ref{fig:results}.
 \begin{longtable}{|c|c|c|}
     \caption{Benchmark dimensions}\label{tbl:benchmarks} \\\hline\textbf{Benchmark} & \textbf{Node size} & \textbf{Edge size} \\\hline\hline
@@ -229,7 +267,7 @@
     \textit{Great Lakes} & $\num{2758119}$ & $\num{6885658}$ \\\hline
     \textit{Eastern USA} & $\num{3598623}$ & $\num{8778114}$
 \end{longtable}
-The testing results plotted in~\ref{fig:results} was obtained by running the same tests listed in~\ref{tbl:benchmarks} $5$ times for both GPU approaches, average and standard deviation for the timings were then computed and plotted. I decided to include also some performance measures taken from the original paper, specifically the GPU and CPU (single thread) average elapsed times for reference.
+The testing results plotted in~\ref{fig:results} were obtained by running every benchmark $5$
+times for both GPU approaches; average and standard deviation of the timings were then computed
+and plotted.
 \begin{figure}
     \centering
     \includegraphics[scale=0.4]{fig/benchmarks.png}
@@ -238,10 +276,25 @@
 \end{figure}
 Considering the various CPU implementations plotted in Figure~\ref{fig:results}, the speedup for GPU-e, with respect to \brka implementation, ranges between $\sim42\times$ and $\sim92\times$, while the speedup for GPU-u ranges between $\sim33\times$ and $\sim83\times$. Considering Prim's implementation, the speedup for GPU-e ranges between $\sim66\times$ and $\sim150\times$, while the speedup for GPU-u ranges between $\sim52\times$ and $\sim136\times$.
-Comparing my GPU implementations with the CPU approaches proposed in\cite{generic-he-boruvka}, based on OpenMP, we can see that both of my solutions consistently beat the single-threaded version (seen in Figure~\ref{fig:results}) with an estimated speedup between $\sim5\times$ and $\sim10\times$ while the speedup for the multithreaded implementation is closer to $1$ (which can only be found in the source material).
+Comparing my GPU implementations with the CPU approaches proposed in~\cite{generic-he-boruvka},
+based on OpenMP, we can see that both of my solutions consistently beat the single-threaded
+version (seen in Figure~\ref{fig:results}) with an estimated speedup between $\sim5\times$ and
+$\sim10\times$, while the speedup over the multithreaded implementation is closer to $1$ (the
+latter timings can only be found in the source material)\footnote{All comparisons with the
+original paper have to be taken with a grain of salt: the algorithm was not rerun on my machine,
+and I included those results just for reference, not as an actual benchmark for my solution.}.
 \medskip
-The \emph{occupancy} metric for a kernel $\mathcal{K}$ on a GPU is defined as the ratio of active warps on an SM to the maximum number of active warps supported by the SM~\cite{def-occupancy}. To test for it Nvidia provides the tool Nsight Compute (\texttt{ncu}). The results have been plotted in figure~\ref{fig:occupancy}, as we can see the general occupancy is inversely proportional to the number of iterations, that was to be expected because I followed a topologic approach, which means that the number of threads used by each kernel is going to be just as many as it's required to occupy every single thred with the computation for one vertex. Furthermore the number of threads per block is $1024$, thus the occupancy falls towards the end of the execution.
+The \emph{occupancy} metric for a kernel $\mathcal{K}$ on a GPU is defined as the ratio of active
+warps on an SM to the maximum number of active warps supported by the SM~\cite{def-occupancy}. To
+measure it, Nvidia provides the tool Nsight Compute (\texttt{ncu}). The results are plotted in
+Figure~\ref{fig:occupancy}: as we can see, occupancy decreases as the iterations progress. That
+was to be expected because I followed a topologic approach, which means that each kernel uses
+only as many threads as are required to assign one vertex to every thread (plus some extras).
+Since I arranged the threads in static one-dimensional blocks of $1024$, the number of active
+threads in the last couple of iterations is really low compared to the total threads available.
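+As a worked example with assumed figures (Turing SMs support at most $32$ resident warps): if a
+late iteration is left with $64$ supervertices, a single $1024$-thread block keeps only about
+$\lceil 64 / 32 \rceil = 2$ warps doing useful work, so
+\[
+    \text{occupancy} = \frac{\text{active warps}}{\text{max warps per SM}}
+                     \approx \frac{2}{32} \approx 6\%.
+\]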
 \begin{figure}[!h]
     \centering
     \includegraphics[scale=0.5]{fig/occupancy.png}
@@ -261,9 +314,12 @@
 Although the original paper was published under the title "A generic and highly efficient parallel variant of \brkas algorithm", it's immediately clear, by taking a look at what kind of tests have been done on the solver (all similar to~\ref{tbl:benchmarks}), that the solution is not as "generic" as the authors would like it to be, and it's highly likely it will keep performing well only on graphs with limited density.
-As shown in Section~\ref{sec:performance-analysis} the results obtained by trying a different approach yielded mixed results on average, therefore it's not entirely clear whether an approach entirely based on visiting edges (which would work through segmented scans and similar) instead of vertices would be beneficial, and lead to better results, or detrimental, because of the heavy use of atomic operations and synchronization.
+As shown in Section~\ref{sec:performance-analysis}, trying a different approach yielded mixed
+results on average, therefore it's not entirely clear whether an approach entirely based on
+visiting edges (which would rely on segmented scans) instead of vertices would be beneficial,
+and lead to better results, or detrimental, because of the heavy use of atomic operations and
+synchronization.
-It would be interesting, to keep experimenting and finding different ways of computing \mst in parallel like~\cite{mst-bipartite} targetting especially less sparse trees.
+It would be interesting to keep experimenting and to find different ways of computing the \mst in
+parallel, like~\cite{mst-bipartite}, especially targeting graphs with higher densities.
 \clearpage