HPL and HPCG: Glossary

Key Points

Introduction
  • Linear systems in both dense and sparse form are a universal theme in scientific computing.

  • Dense and sparse matrices have different optimal algorithms.
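
The contrast above can be sketched in a few lines. This is a minimal illustration (not from the lesson; all names are illustrative): a dense product must touch every entry, while a sparse product over stored non-zeros skips the zeros entirely, which is why the two forms favor different algorithms.

```python
def dense_matvec(A, x):
    """Dense matrix-vector product: touches every entry, O(rows*cols) work."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]

def sparse_matvec(A_coo, shape, x):
    """Sparse product over (row, col, value) triples: work scales with
    the number of non-zeros only."""
    n_rows, _ = shape
    y = [0.0] * n_rows
    for i, j, v in A_coo:
        y[i] += v * x[j]
    return y

# The same tridiagonal matrix stored both ways:
A_dense = [[ 2.0, -1.0,  0.0],
           [-1.0,  2.0, -1.0],
           [ 0.0, -1.0,  2.0]]
A_sparse = [(0, 0, 2.0), (0, 1, -1.0),
            (1, 0, -1.0), (1, 1, 2.0), (1, 2, -1.0),
            (2, 1, -1.0), (2, 2, 2.0)]
x = [1.0, 1.0, 1.0]

print(dense_matvec(A_dense, x))           # same result either way,
print(sparse_matvec(A_sparse, (3, 3), x)) # but very different work
```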

Computational Kernels
  • Most applications are memory-bound, which complicates parallelization.

  • Compute-bound applications depend on peak theoretical flops.

  • Getting good performance from parallel solvers is hard.
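
The memory-bound/compute-bound distinction can be made concrete with a back-of-the-envelope roofline estimate. This sketch (illustrative machine numbers, not from the lesson) compares each kernel's arithmetic intensity, in flops per byte moved, to the machine balance, peak flops divided by peak memory bandwidth:

```python
def gemm_intensity(n):
    """n x n double-precision matrix multiply: ~2*n**3 flops over
    ~3 matrices of n*n doubles (8 bytes each)."""
    return (2 * n**3) / (3 * n * n * 8)

def spmv_intensity(nnz):
    """Sparse matrix-vector product: 2 flops per non-zero, roughly
    12 bytes per non-zero (value + column index) in a CSR-like format."""
    return (2 * nnz) / (12 * nnz)

# Hypothetical machine: 1 TF/s peak, 100 GB/s memory bandwidth.
machine_balance = 1e12 / 100e9  # flops per byte needed to keep the cores busy

for name, ai in [("GEMM n=4096", gemm_intensity(4096)),
                 ("SpMV", spmv_intensity(10**6))]:
    bound = "compute-bound" if ai > machine_balance else "memory-bound"
    print(f"{name}: {ai:.2f} flop/byte -> {bound}")
```

Dense GEMM (the heart of HPL) lands far above the balance point and is limited by peak flops; sparse matrix-vector products (the heart of HPCG) land far below it and are limited by memory bandwidth.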

BLAS
  • Linking with vendor-optimized libraries can be cumbersome and error-prone.

  • The reference BLAS/LAPACK implementations do not use co-processors such as GPUs.

HPL
  • HPL requires tuning matrix and tile sizes to achieve peak performance.

  • Weak scaling requires very large problem sizes for large supercomputers.
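
A common rule of thumb (a sketch, not a prescription from the lesson) for sizing an HPL run is to fill roughly 80% of total memory with the N x N double-precision matrix, then round N down to a multiple of the block size NB:

```python
import math

def hpl_problem_size(nodes, mem_per_node_gib, nb=192, fill=0.80):
    """Estimate the HPL matrix dimension N for a given cluster size.

    nb and fill are illustrative defaults; both typically need tuning.
    """
    total_bytes = nodes * mem_per_node_gib * 2**30
    n = int(math.sqrt(fill * total_bytes / 8))  # 8 bytes per double
    return (n // nb) * nb                       # align N to the block size

# e.g. 4 nodes with 128 GiB of RAM each:
print(hpl_problem_size(4, 128))
```

The quadratic relationship between memory and N is why weak scaling on a large machine quickly pushes N into the hundreds of thousands.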

HPCG
  • HPCG requires tuning job launch configurations (like NUMA) to achieve peak communication bandwidth.

  • Few-node performance should be an indicator of full-scale performance.

Summary
  • Cluster configuration poses reproducibility challenges.

  • Spend extra time to plan well-defined performance tests.

Using GPUs
  • Using GPUs requires modifications to the standard HPL and HPCG programs.

Using InfiniBand
  • InfiniBand networks do not carry standard Ethernet traffic, so they require special configuration.

Scripting OpenStack

Glossary

References