This is the README file for csrspmv and ellspmv, programs for
benchmarking sparse matrix-vector multiplication (SpMV) for matrices
in CSR and ELLPACK format.

Copyright (C) 2023 James D. Trotter

Copying and distribution of this file, with or without modification,
are permitted in any medium without royalty provided the copyright
notice and this notice are preserved.


Building
--------

The csrspmv and ellspmv programs can be built with `make'.
Compilation and linking may be configured through the environment
variable `CC', which is used to choose a compiler, and `CFLAGS' and
`LDFLAGS', which are used to set compiler flags and linker flags,
respectively. Here is an example:

    make CC=gcc CFLAGS="-O3 -march=native"

Furthermore, the following preprocessor macros can be used to control
various options:

 - If IDXTYPEWIDTH is set to 32 or 64, then 32-bit or 64-bit
   integers, respectively, are used for indexing matrix rows and
   columns. By default, IDXTYPEWIDTH is not set and values of type
   `int' are used. As a result, the maximum number of matrix rows or
   columns is typically limited to 2^31-1 or 2 147 483 647 unless
   64-bit integers are explicitly requested. (A sketch of how this
   macro typically takes effect is given after this list.)

 - If HAVE_ALIGNED_ALLOC is set, then memory allocations are aligned
   to a page.

 - If HAVE_LIBZ is set, then support for reading gzip-compressed
   Matrix Market files is enabled. Note that zlib header files must
   be available (i.e., an include path for zlib header files may need
   to be added to CFLAGS), and it is also necessary to link with
   libz, e.g., by adding `-lz' to LDFLAGS.

 - If HAVE_PAPI is set, then support for hardware performance
   monitoring using the PAPI library (https://icl.utk.edu/papi/) is
   enabled. Note that PAPI header files must be available (i.e., an
   include path for PAPI header files may need to be added to
   CFLAGS), and it is also necessary to link with libpapi, e.g., by
   adding `-lpapi' to LDFLAGS. To monitor hardware performance events
   and obtain performance metrics, an event file must be provided
   using the option `--papi-event-file'. See
   papi_util_a64fx_memdp.txt for an example event file for measuring
   various metrics related to cache and memory bandwidth utilization
   on Fujitsu A64FX.

 - If USE_A64FX_SECTOR_CACHE is set and the Fujitsu C compiler is
   used, then the sector cache feature of the Fujitsu A64FX processor
   is configured to isolate streaming and non-streaming data to
   different cache partitions. The number of ways in the L1 and L2
   caches that are allocated to streaming data may be configured by
   setting A64FX_SECTOR_CACHE_L1_WAYS and A64FX_SECTOR_CACHE_L2_WAYS,
   respectively. By default, A64FX_SECTOR_CACHE_L1_WAYS is not set,
   and the L1 cache partitioning is not used, whereas
   A64FX_SECTOR_CACHE_L2_WAYS defaults to 4 out of 16 ways. To use
   the sector cache feature, it is also necessary to set the
   environment variables FLIB_HPCFUNC=TRUE and FLIB_SCCR_CNTL=TRUE at
   runtime. It is also recommended to set FLIB_L1_SCCR_CNTL=FALSE,
   which will result in an error if the L2 sector cache is
   unavailable.

 - If the Fujitsu C compiler is used, then it is possible to control
   the prefetch distance of the L1 and L2 hardware prefetchers with
   the options --l1-prefetch-distance=N and --l2-prefetch-distance=N.
   The argument N must be an integer from 0 to 15, which specifies
   the distance for the L1 or L2 prefetcher as a multiple of 256
   bytes or 1 KiB, respectively.
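As an illustration of how a macro such as IDXTYPEWIDTH typically
takes effect when it is passed via CFLAGS (e.g., `make CFLAGS="-O3
-DIDXTYPEWIDTH=64"'), here is a minimal sketch of the preprocessor
pattern involved. The type name `idx_t' is hypothetical and used here
for illustration only; the actual definitions in csrspmv.c and
ellspmv.c may differ.

    #include <stdint.h>

    /* Illustrative sketch only: select an index type for matrix rows
     * and columns based on IDXTYPEWIDTH. The name `idx_t' is
     * hypothetical and need not match the real source code. */
    #if defined(IDXTYPEWIDTH) && IDXTYPEWIDTH == 64
    typedef int64_t idx_t;   /* 64-bit row/column indices */
    #elif defined(IDXTYPEWIDTH) && IDXTYPEWIDTH == 32
    typedef int32_t idx_t;   /* 32-bit row/column indices */
    #else
    typedef int idx_t;       /* default: plain `int' */
    #endif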
Usage
-----

csrspmv and ellspmv are used to load matrices from files in Matrix
Market format (see https://math.nist.gov/MatrixMarket/formats.html),
convert them to CSR or ELLPACK representation, and then multiply them
with a dense vector. The vector may be loaded from another file in
Matrix Market format. Otherwise, a vector with every element set to
one is used.

Since OpenMP is used for shared-memory parallel computations, the
environment variable `OMP_NUM_THREADS' can be set to control the
number of threads that are used. In addition, `OMP_PROC_BIND' can be
set to bind threads to particular cores.

If the option `--verbose' is supplied, then some information about
the matrix is printed, as well as information about the matrix-vector
multiplication, such as the time spent and the number of arithmetic
operations performed.

Here is an example from a dual-socket Intel Xeon Gold 6130 CPU
system, where AVX-512 is used for vectorisation. First, we compile
the code with GCC 11.2.0. Using the option `-fopt-info-vec', we get
some extra information that shows where vectorisation is applied:

    $ make CC=gcc-11 CFLAGS="-march=native -O3 -fopt-info-vec -ffast-math -mprefer-vector-width=512 -fopenmp"
    gcc-11 -c -march=native -O3 -fopt-info-vec -ffast-math -mprefer-vector-width=512 -fopenmp ellspmv.c -o ellspmv.o
    ellspmv.c:773:44: optimized: loop vectorized using 64 byte vectors
    ellspmv.c:773:44: optimized: loop vectorized using 32 byte vectors
    ellspmv.c:548:14: optimized: loop vectorized using 64 byte vectors
    ellspmv.c:548:14: optimized: loop vectorized using 32 byte vectors
    ellspmv.c:525:27: optimized: loop vectorized using 64 byte vectors
    ellspmv.c:525:27: optimized: loop vectorized using 32 byte vectors
    ellspmv.c:501:35: optimized: loop vectorized using 64 byte vectors
    ellspmv.c:501:35: optimized: loop vectorized using 32 byte vectors
    ellspmv.c:710:13: optimized: basic block part vectorized using 64 byte vectors
    ellspmv.c:735:13: optimized: basic block part vectorized using 64 byte vectors
    ellspmv.c:735:13: optimized: basic block part vectorized using 64 byte vectors
    ellspmv.c:791:13: optimized: basic block part vectorized using 64 byte vectors
    gcc-11 -march=native -O3 -fopt-info-vec -ffast-math -mprefer-vector-width=512 -fopenmp ellspmv.o -o ellspmv

Next, we test the matrix-vector multiplication on a large matrix with
16 nonzeros per row. With standard OpenMP environment variables, we
use 32 threads and pin each thread to its own CPU core:

    $ OMP_PROC_BIND=true OMP_PLACES=cores OMP_NUM_THREADS=32 ./ellspmv --repeat=10 --verbose Lynx68_reordered/Lynx68_reordered.mtx
    mtxfile_read: 32.034946 seconds (99.7 MB/s)
    ell_from_coo: 0.772920 seconds, 6,810,586 rows, 115,779,962 nonzeros, 16 nonzeros per row
    gemv16: 0.011727 seconds (9.5 Gnz/s, 19.7 Gflop/s, 125.4 GB/s)
    gemv16: 0.009932 seconds (11.2 Gnz/s, 23.3 Gflop/s, 148.1 GB/s)
    gemv16: 0.009946 seconds (11.2 Gnz/s, 23.3 Gflop/s, 147.9 GB/s)
    gemv16: 0.010174 seconds (11.0 Gnz/s, 22.8 Gflop/s, 144.6 GB/s)
    gemv16: 0.009940 seconds (11.2 Gnz/s, 23.3 Gflop/s, 148.0 GB/s)
    gemv16: 0.009886 seconds (11.3 Gnz/s, 23.4 Gflop/s, 148.8 GB/s)
    gemv16: 0.009950 seconds (11.2 Gnz/s, 23.3 Gflop/s, 147.9 GB/s)
    gemv16: 0.009953 seconds (11.2 Gnz/s, 23.3 Gflop/s, 147.8 GB/s)
    gemv16: 0.010049 seconds (11.1 Gnz/s, 23.0 Gflop/s, 146.4 GB/s)
    gemv16: 0.010111 seconds (11.0 Gnz/s, 22.9 Gflop/s, 145.5 GB/s)

The matrix-vector multiplication kernel achieves a throughput of
about 148 GB/s, which is about 58% of the 256 GB/s theoretical memory
bandwidth of this system.
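As a rough illustration of the kind of loop being timed above (and
reported as vectorised by the compiler), here is a minimal sketch of
an ELLPACK SpMV kernel with a fixed number of nonzeros per row,
parallelised with OpenMP. The function and variable names are
assumptions made for illustration and do not necessarily match the
kernels in ellspmv.c.

    /* Illustrative sketch of y = A*x for a matrix in ELLPACK format,
     * stored as `num_rows' rows of exactly `rowsize' (column index,
     * value) pairs, padded with explicit zeros where needed.
     * Compile with OpenMP enabled, e.g., -fopenmp. */
    void ellspmv_sketch(
        int num_rows, int rowsize,
        const int *colidx,   /* column indices, row-major, num_rows*rowsize */
        const double *a,     /* matrix values, row-major, num_rows*rowsize */
        const double *x,     /* input vector */
        double *y)           /* output vector */
    {
        #pragma omp parallel for
        for (int i = 0; i < num_rows; i++) {
            double yi = 0.0;
            /* With rowsize fixed (e.g., 16), the inner loop has a known
             * trip count, which helps the compiler vectorise it. */
            for (int l = 0; l < rowsize; l++)
                yi += a[i*rowsize+l] * x[colidx[i*rowsize+l]];
            y[i] = yi;
        }
    }

Assuming 8-byte matrix values and 4-byte column indices, each nonzero
requires reading about 12 bytes of matrix data, so the 11.2 Gnz/s
rates above correspond to roughly 134 GB/s of matrix traffic alone,
with the remainder of the reported ~148 GB/s attributable to reads of
x and writes of y (the program's exact accounting of memory traffic
may differ).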
Copying
-------

csrspmv and ellspmv are free software. See the file COPYING for
copying conditions.