This is the README file for csrspmv and ellspmv, programs for
benchmarking sparse matrix-vector multiplication (SpMV) for matrices
in CSR and ELLPACK format.

Copyright (C) 2023 James D. Trotter

Copying and distribution of this file, with or without modification,
are permitted in any medium without royalty provided the copyright
notice and this notice are preserved.


Building
--------

The csrspmv and ellspmv programs can be built with `make'.
Compilation and linking may be configured through the environment
variable `CC', which is used to choose a compiler, and `CFLAGS' and
`LDFLAGS', which are used to set compiler flags and linker flags,
respectively. Here is an example:

    make CC=gcc CFLAGS="-O3 -march=native"

Furthermore, the following preprocessor macros can be used to control
various options:

 - If IDXTYPEWIDTH is set to 32 or 64, then 32-bit or 64-bit
   integers, respectively, are used for indexing matrix rows and
   columns. By default, IDXTYPEWIDTH is not set and values of type
   `int' are used. As a result, the maximum number of matrix rows or
   columns is typically limited to 2^31-1 or 2 147 483 647 unless
   64-bit integers are explicitly requested. (A sketch of how this
   macro typically takes effect is given after this list.)

 - If HAVE_ALIGNED_ALLOC is set, then memory allocations are aligned
   to a page.

 - If HAVE_LIBZ is set, then support for reading gzip-compressed
   Matrix Market files is enabled. Note that zlib header files must
   be available (i.e., an include path for zlib header files may need
   to be added to CFLAGS), and it is also necessary to link with
   libz, e.g., by adding `-lz' to LDFLAGS.

 - If HAVE_PAPI is set, then support for hardware performance
   monitoring using the PAPI library (https://icl.utk.edu/papi/) is
   enabled. Note that PAPI header files must be available (i.e., an
   include path for PAPI header files may need to be added to
   CFLAGS), and it is also necessary to link with libpapi, e.g., by
   adding `-lpapi' to LDFLAGS. To monitor hardware performance events
   and obtain performance metrics, an event file must be provided
   using the option `--papi-event-file'. See
   papi_util_a64fx_memdp.txt for an example event file for measuring
   various metrics related to cache and memory bandwidth utilization
   on Fujitsu A64FX.

 - If USE_A64FX_SECTOR_CACHE is set and the Fujitsu C compiler is
   used, then the sector cache feature of the Fujitsu A64FX processor
   is configured to isolate streaming and non-streaming data to
   different cache partitions. The number of ways in the L1 and L2
   caches that are allocated to streaming data may be configured by
   setting A64FX_SECTOR_CACHE_L1_WAYS and A64FX_SECTOR_CACHE_L2_WAYS,
   respectively. By default, A64FX_SECTOR_CACHE_L1_WAYS is not set,
   and the L1 cache partitioning is not used, whereas
   A64FX_SECTOR_CACHE_L2_WAYS defaults to 4 out of 16 ways. To use
   the sector cache feature, it is also necessary to set the
   environment variables FLIB_HPCFUNC=TRUE and FLIB_SCCR_CNTL=TRUE at
   runtime. It is also recommended to set FLIB_L1_SCCR_CNTL=FALSE,
   which will result in an error if the L2 sector cache is
   unavailable.

 - If the Fujitsu C compiler is used, then it is possible to control
   the prefetch distance of the L1 and L2 hardware prefetchers with
   the options --l1-prefetch-distance=N and --l2-prefetch-distance=N.
   The argument N must be an integer from 0 to 15, which specifies
   the distance for the L1 or L2 prefetcher as a multiple of 256
   bytes or 1 KiB, respectively.
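As an illustration of how a macro such as IDXTYPEWIDTH typically
takes effect when it is passed via CFLAGS (e.g., `make CFLAGS="-O3
-DIDXTYPEWIDTH=64"'), here is a minimal sketch of the preprocessor
pattern involved. The type name `idx_t' is hypothetical and used here
for illustration only; the actual definitions in csrspmv.c and
ellspmv.c may differ.

    #include <stdint.h>

    /* Illustrative sketch only: select an index type for matrix rows
     * and columns based on IDXTYPEWIDTH. The name `idx_t' is
     * hypothetical and need not match the real source code. */
    #if defined(IDXTYPEWIDTH) && IDXTYPEWIDTH == 64
    typedef int64_t idx_t;   /* 64-bit row/column indices */
    #elif defined(IDXTYPEWIDTH) && IDXTYPEWIDTH == 32
    typedef int32_t idx_t;   /* 32-bit row/column indices */
    #else
    typedef int idx_t;       /* default: plain `int' */
    #endif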
Usage
-----

csrspmv and ellspmv are used to load matrices from files in Matrix
Market format (see https://math.nist.gov/MatrixMarket/formats.html),
convert them to CSR or ELLPACK representation, and then multiply them
with a dense vector. The vector may be loaded from another file in
Matrix Market format. Otherwise, a vector with every element set to
one is used.

Since OpenMP is used for shared-memory parallel computations, the
environment variable `OMP_NUM_THREADS' can be set to control the
number of threads that are used. In addition, `OMP_PROC_BIND' can be
set to bind threads to particular cores.

If the option `--verbose' is supplied, then some information about
the matrix is printed, as well as information about the matrix-vector
multiplication, such as the time spent and the number of arithmetic
operations performed.

Here is an example from a dual-socket Intel Xeon Gold 6130 CPU
system, where AVX-512 is used for vectorisation. First, we compile
the code with GCC 11.2.0. Using the option `-fopt-info-vec', we get
some extra information that shows where vectorisation is applied:

    $ make CC=gcc-11 CFLAGS="-march=native -O3 -fopt-info-vec -ffast-math -mprefer-vector-width=512 -fopenmp"
    gcc-11 -c -march=native -O3 -fopt-info-vec -ffast-math -mprefer-vector-width=512 -fopenmp ellspmv.c -o ellspmv.o
    ellspmv.c:773:44: optimized: loop vectorized using 64 byte vectors
    ellspmv.c:773:44: optimized: loop vectorized using 32 byte vectors
    ellspmv.c:548:14: optimized: loop vectorized using 64 byte vectors
    ellspmv.c:548:14: optimized: loop vectorized using 32 byte vectors
    ellspmv.c:525:27: optimized: loop vectorized using 64 byte vectors
    ellspmv.c:525:27: optimized: loop vectorized using 32 byte vectors
    ellspmv.c:501:35: optimized: loop vectorized using 64 byte vectors
    ellspmv.c:501:35: optimized: loop vectorized using 32 byte vectors
    ellspmv.c:710:13: optimized: basic block part vectorized using 64 byte vectors
    ellspmv.c:735:13: optimized: basic block part vectorized using 64 byte vectors
    ellspmv.c:735:13: optimized: basic block part vectorized using 64 byte vectors
    ellspmv.c:791:13: optimized: basic block part vectorized using 64 byte vectors
    gcc-11 -march=native -O3 -fopt-info-vec -ffast-math -mprefer-vector-width=512 -fopenmp ellspmv.o -o ellspmv

Next, we test the matrix-vector multiplication on a large matrix with
16 nonzeros per row. With standard OpenMP environment variables, we
use 32 threads and pin each thread to its own CPU core:

    $ OMP_PROC_BIND=true OMP_PLACES=cores OMP_NUM_THREADS=32 ./ellspmv --repeat=10 --verbose Lynx68_reordered/Lynx68_reordered.mtx
    mtxfile_read: 32.034946 seconds (99.7 MB/s)
    ell_from_coo: 0.772920 seconds, 6,810,586 rows, 115,779,962 nonzeros, 16 nonzeros per row
    gemv16: 0.011727 seconds (9.5 Gnz/s, 19.7 Gflop/s, 125.4 GB/s)
    gemv16: 0.009932 seconds (11.2 Gnz/s, 23.3 Gflop/s, 148.1 GB/s)
    gemv16: 0.009946 seconds (11.2 Gnz/s, 23.3 Gflop/s, 147.9 GB/s)
    gemv16: 0.010174 seconds (11.0 Gnz/s, 22.8 Gflop/s, 144.6 GB/s)
    gemv16: 0.009940 seconds (11.2 Gnz/s, 23.3 Gflop/s, 148.0 GB/s)
    gemv16: 0.009886 seconds (11.3 Gnz/s, 23.4 Gflop/s, 148.8 GB/s)
    gemv16: 0.009950 seconds (11.2 Gnz/s, 23.3 Gflop/s, 147.9 GB/s)
    gemv16: 0.009953 seconds (11.2 Gnz/s, 23.3 Gflop/s, 147.8 GB/s)
    gemv16: 0.010049 seconds (11.1 Gnz/s, 23.0 Gflop/s, 146.4 GB/s)
    gemv16: 0.010111 seconds (11.0 Gnz/s, 22.9 Gflop/s, 145.5 GB/s)

The matrix-vector multiplication kernel achieves a throughput of
about 148 GB/s, which is about 58% of the 256 GB/s theoretical memory
bandwidth of this system.
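As a rough illustration of the kind of loop being timed above (and
reported as vectorised by the compiler), here is a minimal sketch of
an ELLPACK SpMV kernel with a fixed number of nonzeros per row,
parallelised with OpenMP. The function and variable names are
assumptions made for illustration and do not necessarily match the
kernels in ellspmv.c.

    /* Illustrative sketch of y = A*x for a matrix in ELLPACK format,
     * stored as `num_rows' rows of exactly `rowsize' (column index,
     * value) pairs, padded with explicit zeros where needed.
     * Compile with OpenMP enabled, e.g., -fopenmp. */
    void ellspmv_sketch(
        int num_rows, int rowsize,
        const int *colidx,   /* column indices, row-major, num_rows*rowsize */
        const double *a,     /* matrix values, row-major, num_rows*rowsize */
        const double *x,     /* input vector */
        double *y)           /* output vector */
    {
        #pragma omp parallel for
        for (int i = 0; i < num_rows; i++) {
            double yi = 0.0;
            /* With rowsize fixed (e.g., 16), the inner loop has a known
             * trip count, which helps the compiler vectorise it. */
            for (int l = 0; l < rowsize; l++)
                yi += a[i*rowsize+l] * x[colidx[i*rowsize+l]];
            y[i] = yi;
        }
    }

Assuming 8-byte matrix values and 4-byte column indices, each nonzero
requires reading about 12 bytes of matrix data, so the 11.2 Gnz/s
rates above correspond to roughly 134 GB/s of matrix traffic alone,
with the remainder of the reported ~148 GB/s attributable to reads of
x and writes of y (the program's exact accounting of memory traffic
may differ).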
Copying
-------

csrspmv and ellspmv are free software. See the file COPYING for
copying conditions.