HPL and HPCG

Introduction

Overview

Teaching: 10 min
Exercises: 0 min
Questions
  • Why are HPL and HPCG used as benchmarks?

  • What is a linear solve used for?

Objectives
  • Explain dense and sparse matrices.

  • List some applications that generate these types of problems.

High-Performance Linpack (HPL) and High-Performance Conjugate-Gradient (HPCG) are two of the most famous benchmarks for supercomputers. As benchmarks, they are intended to show the maximum achievable performance of the parallel machine on an important problem with real-world applications.

Where does the problem come from?

Both benchmarks solve for $X$ given $A$ and $B$ in the matrix-vector problem(s),

$A X = B, A \in \mathcal M_{n\times n}, X,B \in \mathcal M_{n\times k}$

The notation $\mathcal M_{n \times k}$ stands for an n by k matrix. The matrix $A$ is square, and is sometimes referred to as the “operator”. Its properties are very important in determining the algorithm used to solve for $X$. Each of the $k$ columns of the matrix $B$ is called a “right-hand side vector”.

Because of the way matrix multiplication is defined, each “right-hand side vector” corresponds to a column of $X$. This column of $X$ is also called the “unknown vector” or “solution vector”. Thus, $AX=B$ represents $k$ independent matrix-vector problems. We use lower-case $x$, $b$ to refer to a single $\mathcal M_{n\times 1}$ vector.

Matrix-vector problems like the above come from trying to solve linear differential equations, like the Helmholtz equation,

$(\nabla^2 + k^2) X(r) = B(r)$

which represents the scattering of a sound-wave off of a surface, or a quantum wavefunction at a specific energy. Discretizing the field $X(r)$ by storing only some points, $X_i = X(r_i)$, leads to the matrix-vector form.

In this case, the matrix size (also known as dimension), $N$, is the number of sampled real-space points, $r_i$. In many applications, linear equations appear at each time-step, or as a Newton step toward solving a non-linear equation. It’s not difficult to see that larger problem domains and more dimensions lead to very large linear systems, with $N^2$ being as large as an entire supercomputer’s working memory!
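
For a sense of scale: storing a dense, double-precision matrix with $N = 10^6$ already takes

$N^2 \times 8 \text{ bytes} = 8 \times 10^{12} \text{ bytes} = 8 \text{ TB},$

far more than a single node’s memory.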

HPCG and Sparse Matrices

The properties of the operator, $A$, determine the most efficient algorithm to use when solving a linear system. High-Performance Linpack solves dense linear systems and High-Performance Conjugate Gradient solves sparse ones.

A sparse matrix is one in which only order N ($O(N)$) elements are non-zero. The Helmholtz equation above is sparse because multiplication by $k^2$ and applying the Laplacian, $\nabla^2$, are local operations. This means only a constant number of elements of $X$ are needed to compute $(A X)_i$. In one dimension (with periodic boundary conditions, which produce the corner elements below),

$A = \begin{pmatrix} (k\Delta x)^2-2 & 1 & 0 & 0 & 1 \\ 1 & (k\Delta x)^2-2 & 1 & 0 & 0 \\ 0 & 1 & (k\Delta x)^2-2 & 1 & 0 \\ 0 & 0 & 1 & (k\Delta x)^2-2 & 1 \\ 1 & 0 & 0 & 1 & (k\Delta x)^2-2 \end{pmatrix} \Delta x^{-2}$

If processor one holds the first two elements of $x$ and processor two holds the last three, then the two only need to exchange the few boundary elements of $x$ (rather than the whole vector) in order for both to compute their part of the product, $A x$.

The method of conjugate gradients uses the fact that $Ax - b = 0$ at the exact solution to iteratively adjust $x$ by computing the residual $A\tilde x - b$ ($\tilde x$ is an estimate of $x$). Thus, the HPCG benchmark focuses on this communicate-and-multiply-rows strategy.
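
For reference, the standard conjugate-gradient recurrences (textbook form; HPCG adds a multigrid preconditioner on top of this) are

$r_0 = b - A x_0, \quad p_0 = r_0$

$\alpha_k = \frac{r_k^T r_k}{p_k^T A p_k}, \quad x_{k+1} = x_k + \alpha_k p_k, \quad r_{k+1} = r_k - \alpha_k A p_k$

$\beta_k = \frac{r_{k+1}^T r_{k+1}}{r_k^T r_k}, \quad p_{k+1} = r_{k+1} + \beta_k p_k$

so each iteration needs one sparse matrix-vector product, two dot products, and a few vector updates.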

Applying the operator $A$ to the current estimate, $\tilde x$, is done by a sparse matrix-vector multiplication. Since each row has a constant number of non-zero elements, that sparse multiplication takes $O(N)$ work instead of the $O(N^2)$ needed for a dense matrix. On a “perfect” parallel machine with $P$ processors, the multiplication can be done in $O(N/P)$ time.

However, the cost of using an iterative method is that the number of iterations needed to obtain convergence to machine precision is unknown. The convergence of conjugate gradients has been extensively studied, with the result that the number of iterations is somewhere between $O(\sqrt \kappa)$ and $O(N)$, with some algorithms guaranteeing fewer than $O(\sqrt \kappa)$ iterations. Here, $\kappa$ is the condition number of the matrix, equal to the ratio of the largest to smallest matrix eigenvalue. There is theoretical evidence that (for matrices with Gaussian random elements) the condition number is proportional to the matrix dimension, $N$. This means in practice that the number of iterations required for convergence is on the order of $\sqrt N$, and that conjugate gradients on a “perfect” parallel machine scales as $O(N^{3/2} / P)$.
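
Concretely, if $\kappa \propto N$, then a system with $N = 10^6$ unknowns needs roughly

$\sqrt \kappa \approx \sqrt N = 10^3$

iterations, each costing $O(N)$ work, which is where the $O(N^{3/2})$ total above comes from.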

HPL and Dense Matrices

Dense matrices, on the other hand, are matrices in which $O(N^2)$ elements are nonzero. In this case, much more communication would be needed to multiply $Ax$. Rather than use conjugate gradients, dense matrix solves are more effectively accomplished by performing row-operations on the matrix, $A$.

The computational method, however, is a bit different from doing Gaussian elimination by hand. Instead, we just need to apply enough row operations to make $A$ upper-triangular (turning $A$ into $U$). The operations can be collected into a lower-triangular matrix, $L$, and a permutation matrix, $P$. Each permutation swaps two rows, so $P$ is a matrix containing exactly $N$ ones, one in each row and column. The complete process is summarized as,

$ A = P L U $

where all matrices are $N \times N$. This is called a “factorization” of $A$ into an “L-U” decomposition. The computational work required for L-U decomposition is $O(N^3)$.

Given the factorization, $A x = b$ can be solved by step-wise solution:

$ x = (PLU)^{-1} b = U^{-1} (L^{-1} (P^{-1} b)) $

Permutation is just a communication step. Each of the two triangular solves takes only $O(N^2)$ work per right-hand side, so the $O(N^3)$ factorization dominates the total cost.
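
For example, the lower-triangular stage solves $L y = P^{-1} b$ by forward substitution,

$y_i = \left( (P^{-1} b)_i - \sum_{j<i} L_{ij} y_j \right) / L_{ii}, \quad i = 1, \ldots, N$

so row $i$ costs $i$ multiply-adds and the whole solve costs $O(N^2)$. The upper-triangular solve works the same way, in reverse order.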

On a “perfect” parallel machine with $P$ processors, this should take $O(N^3/P)$ time.

Key Points

  • Linear systems in both dense and sparse form are a universal theme in scientific computing.

  • Dense and sparse matrices have different optimal algorithms.


Computational Kernels

Overview

Teaching: 10 min
Exercises: 30 min
Questions
  • What are memory and compute-bound kernels?

Objectives
  • Run the OSU MPI and Mixbench GPU benchmarks.

  • Understand bottlenecks to peak performance.

High-performance compute clusters make use of specialized network hardware in order to maximize message bandwidth and minimize latency. In most cases, this means specialized “network fabric” libraries from the hardware vendors need to be installed. Then, some version of the MPI library should be built “on top” of these vendor communication libraries.

This linking process can be confusing - so it’s important to read the vendor and MPI library instructions carefully. Then benchmark your MPI by using it to exchange data and timing the exchanges. If the timings don’t match the vendor’s advertised numbers, it probably means you are not compiling MPI or your application with the right options.

It could also mean you are launching your MPI process incorrectly, causing the wrong assignment of MPI ranks to processors and other machine resources.

Run the OSU MPI Benchmarks

Download the latest version of the OSU Microbenchmarks. Extract, compile and link with your version of MPI, and run the one-sided osu_get_latency and osu_get_bw benchmarks using 2 MPI ranks on the same host.

Plot the results to show latency and bandwidth as a function of packet size.

Solution

wget http://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-5.7.1.tgz
tar xzf osu-micro-benchmarks-5.7.1.tgz
mkdir osu.default && cd osu.default
../osu-micro-benchmarks-*/configure CC=mpicc CXX=mpicxx CFLAGS=-I$PWD/../osu-micro-benchmarks-*/util --prefix=$PWD
make -j && make install
cd libexec/osu-micro-benchmarks/mpi/one-sided
mpirun -n 2 ./osu_get_latency >latency.dat
mpirun -n 2 ./osu_get_bw >bw.dat
gnuplot <<.
set term eps enh color lw 2
set out "bw.eps"
set logscale x
set logscale y
set xlabel "Message Size / bytes"
set ylabel "Bandwidth / MB/sec"
plot "bw.dat" u 1:2 w l

set out "latency.eps"
set ylabel "Latency / us"
plot "latency.dat" u 1:2 w l
.

Now that we’ve covered how to test communication speed, it’s time to turn to local computation speed. Graphics processing units (GPUs) are constructed with matrix (image) processing in mind. They achieve this by using very large vector sizes and optimizing for single instruction, multiple data (SIMD) operations while pre-fetching data from memory.

Reading and writing data to GPU memory takes many more clock cycles than performing operations like add, multiply, divide, on data that’s already present. This means that you have to do lots of operations on data that’s already sitting in the SIMD units in order for your kernel to be compute-bound.

The number of floating point operations done on every byte read is called “Arithmetic Intensity.” It controls the trade-off between reading a new batch of data and operating on the data that’s present.

If you do just one operation for every byte read, then the run-time will depend on the GPU’s maximum memory bandwidth. At some value of arithmetic intensity, however, you’ll be doing so many operations on every byte of data that the memory access will be completely hidden. In this case, the performance flattens to a “roofline”, at the peak floating point operations the processor is capable of performing.
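
The roofline model captures this behavior in a single formula for the achievable throughput,

$\text{FLOP/s} \approx \min\left(\text{peak FLOP/s},\ \text{AI} \times \text{peak bandwidth}\right)$

For a hypothetical GPU with 10 TFLOP/s of peak compute and 1 TB/s of memory bandwidth, the “ridge point” where a kernel stops being memory-bound sits at $10^{13} / 10^{12} = 10$ FLOP/byte.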

You can see this behavior by running an invented kernel with an adjustable number of flops per byte, and plotting flops vs. arithmetic intensity.

Run Mixbench

Download the latest version of the Mixbench Microbenchmark and produce a “roofline” plot of achieved flops vs arithmetic intensity.

Solution

cd $HOME
git clone https://github.com/ekondis/mixbench.git
mkdir mixbench/build && cd mixbench/build
cmake -DCMAKE_CUDA_ARCHITECTURES=80 ../mixbench-cuda
make -j
./mixbench-cuda-alt | tee alt.log
grep '^ *[0-9]' alt.log | sed -e 's/,//g' >alt.dat
gnuplot <<.
set term eps enh color lw 2
set out "roofline.eps"
set logscale x
set xlabel "Arithmetic Intensity / FLOP/BYTE"
set ylabel "Compute throughput / GFLOPS"
plot "alt.dat" u 2:4 w l title "single", \
     "alt.dat" u 6:8 w l title "double"
.

Peak performance

Both of the micro-benchmarks above show how communicating data can easily become a bottleneck for algorithm run-time. Even the very fast memory movement between GPU memory and GPU processors can occupy the time of hundreds of floating point operations.

When part of a matrix or vector needs to be communicated between MPI ranks before a solution step can proceed, then the whole computation runs the risk of running at the same speed as the network bandwidth. Even small-data operations like broadcasting the result of a local vector-vector dot product incur a latency cost.

To write an effective parallel algorithm, data movement must be overlapped with computation as much as possible. Further, scattering and gathering data between MPI ranks should make use of faster links between ranks on the same node as much as possible.

Finally, all these operations should be done as explicitly as possible so that the processors access local memory. Asking two processors to implicitly be aware of memory updates from one another causes the processes to interrupt each other. Assigning every MPI rank to a specific processor and bank of memory (using system NUMA controls) prevents these hidden interruptions.
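
For example (a sketch only; binding flags differ between MPI implementations, so check your mpirun's manual):

# Open MPI: one rank per NUMA domain, with memory pinned alongside the cores
mpirun -np 8 --map-by numa --bind-to numa ./xhpl

# or pin one process and its memory allocations explicitly
numactl --cpunodebind=0 --membind=0 ./xhpl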

Check for NUMA Issues

Add MPI init, send/receive, and finalize calls to mixbench.

Create a batch script that launches this MPI job over your entire cluster. Check the point-to-point bandwidth achieved between your send/receive calls and make sure that every rank is getting full use of the GPU.

If you have issues at this stage, it’s likely that your mpirun launch configuration is incorrect.

Key Points

  • Most applications are memory-bound, which complicates parallelization.

  • Compute-bound applications depend on peak theoretical flops.

  • Getting good performance from parallel solvers is hard.


BLAS

Overview

Teaching: 10 min
Exercises: 20 min
Questions
  • Which BLAS should I link with?

  • How can I get all these libraries installed and working together?

  • What are the limitations of BLAS and LAPACK?

Objectives
  • Understand the differences between BLAS implementations.

  • Correctly compile, link, and test linear algebra libraries.

The compilers, GPU, and network are working well. Now it’s time to run a real high-intensity problem.

The following code, shamelessly adapted from Stack Overflow, uses the LAPACK library to solve a small matrix-vector problem. It’s the single-node, CPU version of the problem HPL solves.

// https://stackoverflow.com/questions/36063993/lapack-dgetrs-vs-dgesv
#include <stdio.h>
#include <vector>
#include <stdlib.h>

using namespace std;

extern "C" {
  void daxpy_(int* n,double* alpha,double* dx,int* incx,double* dy,int* incy);
  double dnrm2_(int* n,double* x, int* incx);

  void dgetrf_(int* M, int *N, double* A, int* lda, int* IPIV, int* INFO);
  void dgetrs_(char* C, int* N, int* NRHS, double* A, int* LDA, int* IPIV, double* B, int* LDB, int* INFO);
  void dgesv_(int *n, int *nrhs, double *a, int *lda, int *ipiv, double *b, int *ldb, int *info);
}

void print(const char* head, int m, int n, double* a){
  printf("\n%s\n---------\n", head);
  for(int i=0; i<m; ++i){
    for(int j=0; j<n; ++j){
      printf("%f ", a[i+j*m]);
    }
    printf("\n");
  }
  printf("---------\n");
}

int main(int argc, char *argv[]) {
  int dim = 5;
  if(argc > 1)
    dim = atoi(argv[1]);

  vector<double> a(dim * dim);
  vector<double> b(dim);
  srand(1);              // seed the random # generator with a known value
  double maxr = (double)RAND_MAX;
  for (int r = 0; r < dim; r++) {  // fill a and b with uniform random values in [0,1)
    for (int i = 0; i < dim; i++) {
      a[r + i*dim] = rand() / maxr;
    }
    b[r] = rand() / maxr;
  }

  int info;
  int one = 1;
  char N = 'N';
  vector<int> ipiv(dim);

  // Method 1: factor and solve in one call with dgesv
  vector<double> a1(a);
  vector<double> b1(b);
  dgesv_(&dim, &one, a1.data(), &dim, ipiv.data(), b1.data(), &dim, &info);
  print("B1",dim < 5 ? dim : 5,1,b1.data());

  // Method 2: factor once with dgetrf, then solve with dgetrs
  vector<double> a2(a);
  vector<double> b2(b);
  dgetrf_(&dim, &dim, a2.data(), &dim, ipiv.data(), &info);
  dgetrs_(&N, &dim, &one, a2.data(), &dim, ipiv.data(), b2.data(), &dim, &info);
  print("B2",dim < 5 ? dim : 5,1,b2.data());

  // The two answers should agree: print ||b1 - b2||_2
  double d_m_one = -1e0;
  daxpy_(&dim,&d_m_one,b2.data(),&one,b1.data(),&one);
  printf("\ndiff = %e\n", dnrm2_(&dim,b1.data(),&one));

  return 0;
}

To actually compile and use this, however, you’ll need to install a BLAS and LAPACK library. This is where the actual fun begins. There are many, many different implementations of BLAS available.

Information about the Linux packages can be obtained on Debian-based systems by querying the package manager with commands like aptitude search blas, and aptitude show liblapack3. On Redhat-based systems, use the corresponding yum search and yum info.
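
For example (package names vary between distributions and releases):

aptitude search blas        # list BLAS-related packages
aptitude show liblapack3    # details for one package
yum search blas             # the RedHat equivalents
yum info lapack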

Blas Options

Not only do different implementations exist, but there are different compilation options within the same BLAS package that can change its performance. Here’s a table explaining each variation:

Option         Help
ilp64          Changes integer arguments, like matrix dimensions and tile
               sizes, to 64-bit integers. Many systems default to 32-bit
               integers, which limits the maximum size a matrix can grow to.
threads        pthreads and OpenMP are common ways to use multiple processor
               cores for a single matrix operation. With no threading, only
               one processor core will be used for each BLAS call. That can
               be good if your code uses threads to make multiple BLAS calls
               simultaneously. If your code uses only one thread, though, you
               will want to have BLAS use threads.
static/shared  Shared-object linking reduces the size of your code, but costs
               a tiny bit more for every call to BLAS, and introduces the
               potential for missing libraries if you ever reinstall anything.
               If your code makes millions of BLAS calls, you may want to try
               static linking.

Compile Your Own Blas/Lapack

To get the most performance, it is recommended not to use a Linux package, but to use a vendor package or to compile your own BLAS.

Compiling your own can be extremely frustrating. Because of this, I recommend installing BLAS using spack. In fact, you can compile MPI, HPL, and HPCG using spack too. The search and info commands for spack are spack list hpl and spack info hpl.

Install HPL on OpenBlas and OpenMPI

Use spack to install the openblas and openmpi libraries. Then, instruct spack to use those two as dependencies when installing HPL.

Solution

cd $HOME
git clone --depth=1 https://github.com/spack/spack.git
source spack/share/spack/setup-env.sh
spack install openblas threads=openmp
spack install openmpi +cuda fabrics=auto
spack spec -lINt hpl ^openblas threads=openmp ^openmpi +cuda fabrics=auto
spack install -v hpl ^openblas threads=openmp ^openmpi +cuda fabrics=auto

Pay special attention to the configure and compile lines printed by spack install -v. These can be helpful guides if you later decide to run the compile and install process yourself rather than let spack handle it.

Run The Solve Test Program

Now that everything is installed, you should be able to run HPL. But first, let’s try to compile the lapack-dgetrs-vs-dgesv program from the start of this lesson.

The key idea is to provide the directory containing the BLAS/LAPACK library to the compiler during linking.

Solution

BLAS=$(spack find --format '{prefix}' openblas)
g++ -o solve solve.cc -L$BLAS/lib -Wl,-rpath,$BLAS/lib -lopenblas
time ./solve 2048

You should explore the effect of varying OMP_NUM_THREADS and launching the above using MPI. It’s also important to note that this program is not using the GPU. Thus, a much faster solve could have been achieved if cublas were being called instead of openblas.
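
A minimal sketch of such a thread sweep (assuming the solve binary built above and the OpenMP-threaded OpenBLAS installed earlier):

for t in 1 2 4 8; do
    echo "== $t threads =="
    time OMP_NUM_THREADS=$t ./solve 2048 >/dev/null
done

Wall-clock time should drop as the thread count increases, until the solve saturates memory bandwidth or runs out of cores.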

Key Points

  • Linking with vendor-optimized libraries is a pain in the neck.

  • Standard BLAS/LAPACK doesn’t use co-processors.


HPL

Overview

Teaching: 10 min
Exercises: 20 min
Questions
  • How do dense matrix algorithms divide work using MPI?

  • What are the key factors affecting their performance?

  • Why is peak performance a reasonable target for HPL?

Objectives
  • Setup and run the HPL benchmark.

  • Explain matrix tiling and scalapack layout.

We’ve covered BLAS and LAPACK, but have carefully avoided talking about BLACS and ScaLAPACK. What’s the difference? Well, there’s a big difference. BLACS is the communication layer (running over MPI), and ScaLAPACK builds on it: ScaLAPACK is synonymous with linear algebra over dense matrices, distributed over MPI.

To divide up the work, scalapack divides up the matrix among processors. It does that using a very specific “scalapack layout”, that you should understand if you want to tune HPL. First, scalapack thinks of a matrix as a collection of blocks of size MB by NB. To picture this, imagine the map of a city laid out on a regular grid - like Albuquerque. The matrix elements are the buildings, and blocks are, well, entire city blocks. So every matrix element is assigned to one block, and we stop worrying about the elements and only talk about the blocks.

Next, the blocks get assigned to processors. This is done in round-robin fashion, processor 1 gets block 1, 2 gets block 2, and so on until processor P gets block P. Then we start at the beginning and block P+1 goes to processor 1 again. Think of this as a circular mapping. Matrices have 2 dimensions, so the processor layout is actually P by Q. The assignment happens independently in each dimension, so block (m,n) gets assigned to (m mod P, n mod Q).
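
The mapping is easy to compute. Here is a small sketch that prints the (p,q) owner of every block in a 6-by-6 block grid distributed over a P=2 by Q=3 processor grid:

P=2; Q=3
for m in 0 1 2 3 4 5; do
    row=""
    for n in 0 1 2 3 4 5; do
        row="$row ($((m % P)),$((n % Q)))"
    done
    echo "$row"
done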

This block division makes sense because now data exchange and element-wise operations are done on entire blocks at a time - so processors can leverage vectorization and economies of scale.

Run HPL

Running HPL isn’t all that different from the above. Consult advancedclustering and the HPL calculator to create an input file for HPL, and modify the processor layout (p,q), block size (nb), and total number of blocks (n/nb and m/nb).

Next, create a node-file listing the nodes available to run, and launch HPL with mpirun.

Solution

export OMP_NUM_THREADS=4
mpirun --hostfile nodes -np 16 ./xhpl | tee HPL.log    # xhpl reads HPL.dat from the current directory

At this point, you’ve gone through the basics of installing MPI and BLAS, testing both for local speed, and invoking HPL on top of those libraries.

However, there is still a lot of work left. We haven’t covered using the GPU with HPL. To do that, you’ll want to look for NVIDIA’s optimized HPL code.

In general, you will want to arrive at a tuned benchmark run by following these steps:

  1. Find the best MPI and CPU BLAS configuration by compiling many different ways and testing with micro-benchmarks.
  2. Find the best block-size parameter (NB) for HPL by running with P=Q=1 and N=8192 (single MPI rank); see the sweep sketch after this list.
  3. Find the best matrix size (N) using P=Q=1 (single MPI rank), and keeping N a multiple of NB. The total matrix memory should fill about 80-95% of the memory available to each MPI rank.
  4. Use all available processors, with one MPI rank per processor. Try a range of P and Q values where P * Q = (total MPI ranks).
  5. Investigate hardware-specific methods for overclocking while monitoring power consumption.
  6. Iterate by trying small changes and looking for performance gains.
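
As a sketch of step 2, a block-size sweep might look like the following (the template HPL.dat.in and its @NB@ placeholder are inventions for this example; any way of editing HPL.dat between runs works):

for nb in 96 128 192 232 256; do
    sed "s/@NB@/$nb/" HPL.dat.in > HPL.dat    # template sets '# of NBs' to 1
    mpirun -np 1 ./xhpl | tee hpl-nb$nb.log
done
grep ^WR hpl-nb*.log    # the WR... result lines report N, NB, P, Q, time, and Gflops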

HPL Parameters

An example HPL.dat parameter file is below:

HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
24576        Ns
2            # of NBs
2048 4096    NBs
0            PMAP process mapping (0=Row-,1=Column-major)
2            # of process grids (P x Q)
1 3          Ps
6 2          Qs
16.0         threshold
1            # of panel fact
2            PFACTs (0=left, 1=Crout, 2=Right)
1            # of recursive stopping criterium
4            NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
1            # of recursive panel fact.
2            RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
0            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
1            DEPTHs (>=0)
2            SWAP (0=bin-exch,1=long,2=mix)
64           swapping threshold
0            L1 in (0=transposed,1=no-transposed) form
0            U  in (0=transposed,1=no-transposed) form
0            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)

The syntax of the file is line-based, with editable parameters on the left, followed by explanatory comments on the right. HPL only uses square matrices and tiles, so you only set N and NB (respectively). It’s fine to leave output going to stdout and just use xhpl | tee HPL.log to save a copy to file. Obviously, the processor layout has to use all MPI ranks.

The remaining parameters deal with properties of the LU factorization algorithm used. PFACT and RFACT select the panel and recursive factorization variants (left-looking, Crout, or right-looking). BCASTs and DEPTHs change the patterns used to communicate tiles between ranks. The transpose options change the storage (and thus access order) of the LU decomposition products, L1 and U. Equilibration toggles a scaling step applied by the long and mixed row-swapping variants. Memory alignments larger than 8 (but still a power of 2) might help.

To understand the parameters completely, you can refer to the following reference: Dongarra, Faverge, Ltaief and Luszczek, “Achieving numerical accuracy and high performance using recursive tile LU factorization with partial pivoting” Concurrency Computat.: Pract. Exper. (2013).

Key Points

  • HPL requires tuning matrix and tile sizes to achieve peak performance.

  • Weak scaling requires very large problem sizes for large supercomputers.


HPCG

Overview

Teaching: 8 min
Exercises: 10 min
Questions
  • How do sparse matrix algorithms divide work using MPI?

  • What are the key factors affecting their performance?

  • Why is peak network bandwidth a reasonable target for HPCG?

Objectives
  • Setup and run the HPCG benchmark.

  • Demonstrate the effect of task placement during job launch.

Now that you’ve completed the run of HPL, running HPCG should be a walk in the park.

spack install -v hpcg

You might want to experiment with different compilers. Some examples are provided by amd.

If you change into the bin directory where spack installed HPCG, you can run xhpcg using the default parameter file. It writes two summary files; the one whose name starts with HPCG-Benchmark contains the most detail about the run.

Key summary information is contained in the sections marked Performance Summary, which include run-time, bandwidth (GB/s), and compute throughput (GFLOP/s).
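
For example, to pull the bandwidth and throughput lines out of the results file (the file-name pattern is an assumption; match whatever your run produced):

grep "GB/s" HPCG-Benchmark*.txt
grep "GFLOP/s" HPCG-Benchmark*.txt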

What Happened?

The HPCG input file doesn’t say a lot about the underlying algorithm, but what’s happening underneath is a series of memory-bound computational tasks.

Memory-bound tasks are things like “scale a vector by 4” or “add two vectors.” These require relatively little computation, but much more data-movement.

When solving a sparse linear system, matrix-vector multiplications can be computed with the help of just a few elements from neighboring processors. The conjugate gradient algorithm is just a series of matrix multiplies and dot products that bring the estimate closer to the solution. Thus, the timings for the overall solve reflect this memory bottleneck happening at each step.

HPCG Input File

The HPCG input file, hpcg.dat, is comparatively simple. After two free-form comment lines, its two parameter lines provide the problem size (3D) and the run-time (seconds). The run-time is approximate, and running longer usually provides higher performance because slow steps during startup contribute less to the average.

The configuration file sets problem size as number of points along x, y, and z-dimensions. The memory requirement for HPCG should scale proportionately to their product, nx*ny*nz. The second line contains the approximate target run-time in seconds.
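
A minimal hpcg.dat therefore looks something like this (a sketch of the stock format; 104-cubed local grid, 60-second run):

HPCG benchmark input file
Sandia National Laboratories; University of Tennessee, Knoxville
104 104 104
60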

For your final reported HPCG time, your output file will need to report that the memory used was at least 25% of the system’s total memory and the run-time was at least 1800 seconds.

NUMA

Non-uniform memory access is an architecture for CPU/memory connection used in modern compute platforms: each CPU has a “fast” connection to its own associated memory region, and slower paths to the rest. In order to achieve peak performance, MPI ranks and processing threads should be assigned to processors consistently (this assignment is called process and thread affinity). This is necessary because the Linux kernel, by default, moves processes between CPUs to achieve load balance.

When using a batch job scheduler like SLURM, the scheduler is able to call the hwloc library and use flags to srun in order to setup processor binding for every MPI rank.

On a simple cluster without SLURM, you can launch jobs directly using mpirun, which can use ssh and a host-file. Some versions of mpirun, like openmpi’s, help when dealing with thread affinity. If you decide to tinker with calling numactl directly, you can replace mpirun xhpl with mpirun xhpl.sh, and write a script, xhpl.sh, to set up environment variables and numa options. Some useful guides are here: http://www.hpc.acad.bg/numactl/, https://glennklockwood.com/hpc-howtos/process-affinity.html.
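
A minimal sketch of such a wrapper, assuming Open MPI (which exports OMPI_COMM_WORLD_LOCAL_RANK to every rank) and two NUMA domains per node (check with numactl --hardware):

#!/bin/bash
# xhpl.sh: pin each local MPI rank, and its memory, to one NUMA domain
rank=${OMPI_COMM_WORLD_LOCAL_RANK:-0}
numa=$((rank % 2))
exec numactl --cpunodebind=$numa --membind=$numa ./xhpl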

Key Points

  • HPCG requires tuning job launch configurations (like NUMA) to achieve peak communication bandwidth.

  • Few-node performance should be an indicator of full-scale performance.


Summary

Overview

Teaching: 14 min
Exercises: 0 min
Questions
  • How can I demonstrate my awesome cluster configuration?

Objectives
  • Start a log of configuration steps and performance.

  • Organize team investigations using the scientific method.

TL;DR

A lot of what we communicate has an answer-bias. However, trying to jump directly to the final answer makes it hard to learn why the answer works.

As a competition team, it’s extremely important to keep an organized record of your efforts. This helps both to keep team members informed about what everyone is working on, and also to document your team’s activities for writing your project report.

Implement Kanban

Watch the “Agile Methodologies” presentation and follow the associated hands-on tutorial section of the BSSW Tutorial.

Setting up Kanban

Spend some time setting up a team Kanban board. Either Trello or a project board on github will work just as well.

Create an Initial Issue List

By now, your team should have a lot of ideas to discuss: virtual machine setup, networking and hardware configuration, test benchmarks, and ideas for tweaking performance. Everyone should write down a TODO list locally, and bring it for a team discussion. As a team, you should interactively create a few tasks and assign them to the most energetic and enthusiastic members.

Intermezzo

Watch 12 Ways to Fool the Masses with Irreproducible Results for a fun look into the history of HPC performance reporting. You should come away from the video with a good understanding of the target audience for your report. The emphasis on documenting software versions, compile flags, and extra information needed for reproducing your work is especially helpful for doing a self-review on your work.

Follow-Up

Now your team is off to a great start. As you work through the competition, try and do the following:

  1. Keep creating new issues for work you don’t have time to address at the moment.

  2. Post status updates to the issues you are working on. Don’t be afraid to modify their descriptions to reflect what you’re really doing, or close issues with short explanations of why they are resolved (or, alternately, no longer important).

  3. Discuss recently completed tasks at regular team meetings. Every completed task should have a paragraph or two of text explaining why the task is important, what you did, and what results you obtained.

  4. Maintain a separate document with all these paragraph descriptions. Present this as part of your final report.

Reporting

The scientific method contains hypothesis generation, carrying out experiments, collecting data, and writing up results. Your progress on the HPL and HPCG benchmarks will be graded based on the optimization ideas you generated and tested. The final report should give some context for the HPL/HPCG benchmarks and system hardware, then have an organized discussion of the key steps you took to arrive at your final configuration. Things like initial “unoptimized” performance, steps to set up MPI, selection of BLAS libraries, threading, job launch and NUMA configuration, and tile-size optimization should be reported. You may choose to write these as a list of trials with separate “initial” and “final” timings for each, or as a sequential improvement after each configuration change.

Every controlled test is important! Even tests that result in decreased performance will contribute well to your overall evaluation as long as they are well explained and the results are documented.

The report should have a table listing the computational throughput (“GFLOPS”) reported by the HPL and HPCG benchmarks for key steps along your configuration journey. It should also include the “theoretical peak GFLOPS” estimated from the hardware documentation.

Key Points

  • Cluster configuration has reproducibility challenges.

  • Spend extra time to plan well-defined performance tests.


Using GPUs

Overview

Teaching: 2 min
Exercises: 25 min
Questions
  • What modifications to the benchmarks are required to utilize GPUs?

Objectives
  • Demonstrate methods for using GPUs with HPL and HPCG.

Using GPUs is more difficult than writing CPU-only code because separate compilers are needed, and because computations run on the GPU are “offloaded” from the CPU. An offloaded computation is a function that is packed up and sent to the GPU, executed there (usually while the CPU does something else), and then eventually waited on by the CPU.

Importantly for this exercise, the standard HPL code distributed by netlib does not include support for running computations on GPUs. Instead, hardware vendors have supplied these themselves. NVIDIA offers custom programs that run HPL through its (free) developer program. More recently, they have also created a container running HPL and HPCG.

Rather than work with these, I suggest trying the free alternatives HPL-GPU or the (older) HPL-CUDA. The latter was easier to compile. In addition to following the instructions, you should also set the TOPdir variable to the top directory.

To make the most performant binary, it is important to enable (or at least check) what hardware optimizations exist. For CPUs, it’s well-known that you can check the processor features in /proc/cpuinfo. For GPUs, the most important property is the compute capability. A quick reference chart is here.
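
For example (the compute_cap query needs a reasonably recent NVIDIA driver; deviceQuery from the CUDA samples is an alternative):

grep -m1 flags /proc/cpuinfo                           # CPU features: sse4_2, avx2, avx512f, ...
nvidia-smi --query-gpu=name,compute_cap --format=csv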

Key Points

  • Using GPUs requires modifications to the standard HPL and HPCG programs.


Using Infiniband

Overview

Teaching: 10 min
Exercises: 0 min
Questions
  • How does infiniband hardware differ from ethernet?

Objectives
  • Provide tools, diagnostics, and references on setting up infiniband communication.

Infiniband hardware devices were created to support high-performance computing applications by “short-cutting” around the TCP/IP network stack. In traditional messaging, small packets are sent between two end-points that perform a complicated connection handshake. This involves a lot of buffering and queue wait times, which slow things down. Infiniband hardware provides a remote-direct memory access mechanism for hosts to communicate with one another with fewer requirements around acknowledgments and message sizes.

This comes at a cost. Infiniband devices have their own network addressing mechanism distinct from (but similar to) ethernet MAC addresses. Also, they require special drivers, network configuration utilities, and API calls.

The API calls are the simplest part, since they are usually handled for you by an MPI library. Those MPI libraries mostly use either the ibverbs interface or else something like UCX built on top of ibverbs. You’ll run into these keywords as you install MPI.

Installing Infiniband

The other two steps are complicated. Here are some notes, built by following the RedHat Infiniband Guide and rdmamojo, that could be helpful.

# allow users to pin unlimited memory
sudo sh -c "cat >/etc/security/limits.d/rdma.conf" <<.
@users    soft    memlock     unlimited
@users    hard    memlock     unlimited
.
# add self to the users group
sudo usermod -a -G users $USER

# setup the IB manager
sudo yum install -y opensm libibverbs-utils infiniband-diags
sudo systemctl enable --now opensm

# Testing
ibstat # locate Port GUID-s
sudo ibswitches # list switches
sudo sminfo # list sm communication info

ibv_devices
ibv_devinfo mlx4_0

Next, test communication between two hosts. For example, 10.52.3.83 and 10.52.3.178.

On one host (the server, at 10.52.3.178 in this example), run

ibv_rc_pingpong

On the other, connect to the server’s address:

ibv_rc_pingpong 10.52.3.178

Installing MPI

Next, installing MPI with infiniband support is easy using spack:

git clone --depth=1 https://github.com/spack/spack.git
source spack/share/spack/setup-env.sh
spack install -v mpich netmod=ucx

Create a hostfile listing your nodes, one per line:

# mpi hostfile
10.52.3.83
10.52.3.178

Then run a test application, selecting the UCX network device:

BIN=$(spack location -i mpich)
$BIN/bin/mpirun -n 48 -env UCX_NET_DEVICES mlx4_0:1 -f hostfile hostname

Next, run HPL and check speed:

spack install -v hpl ^mpich netmod=ucx # doesn't seem to activate infiniband
spack install -v hpl ^openmpi fabrics=verbs

Remember, for each run, you should create a separately named input file and run-script:

MPIRUN=/home/cc/spack/opt/spack/linux-centos8-haswell/gcc-8.3.1/openmpi-4.1.1-6ykqelh2aqoxk6bbmdp7o63xi33puppg/bin/mpirun
HPL=/home/cc/spack/opt/spack/linux-centos8-haswell/gcc-8.3.1/hpl-2.3-cce3avecwhxksw6eysaklvuipvb2cdo4/bin/xhpl

export OMPI_MCA_btl_openib_allow_ib=true
$MPIRUN -n 48 --mca btl openib,self,vader --mca btl_openib_allow_ib true --hostfile ../hostfile $HPL | tee HPL.out

Key Points

  • Infiniband networks do not carry standard ethernet traffic, requiring special drivers, configuration utilities, and API calls.


Scripting OpenStack

Overview

Teaching: 5 min
Exercises: 20 min
Questions
Objectives
  • Provide tools for launching OpenStack instances on TACC.

For reference, here is a set of scripts useful for spinning up servers on Chameleon Cloud from the command-line.

There are no scripts to tear down your instances. Instead, this can be done on the web UI. Don’t forget to also delete your leases under the “Reservations” tab.

The scripts rely on setting up the environment as follows:

Initial setup

  1. install Chameleon’s blazar (pip3 install git+https://github.com/ChameleonCloud/python-blazarclient.git@chameleoncloud/stable/train)
  2. install openstack (aptitude install python3-openstackclient)
  3. Login to the CHI@TACC website. The following steps refer to this site.
  4. Generate an “Application Credential” using the “Identity” section of the web GUI
    • Click the “Unrestricted Access” checkbox (needed for creating reservations from the CLI).
    • Save the credential as both openrc.sh and clouds.yaml
    • run chmod 700 openrc.sh and chmod 600 clouds.yaml
  5. Create a reservation using the Reservations section of the web GUI.
  6. Create an “Instance” for that reservation using the “Compute” section of the web GUI.
    • This will prompt you to generate an SSH keypair. Save the private key as ssh-id.rsa and run chmod 600 ssh-id.rsa
    • Name the keypair “ChameleonSSH” so that you know the name to use when accessing from scripts.
  7. Create a “Floating IP” using the “Network” section of the web GUI
  8. Use the same section to “Associate” the IP with your instance.

  9. Connect to the instance via ssh: ssh -i ssh-id.rsa cc@<floating IP address>

At this point, you can use the web interface to close down all your servers and leases. Alternately, you could follow the side-track below to explore some of the setup tasks coming up.

Side-track, exploring single-node setup

  1. follow the OSU MPI Benchmarks Guide:
     $ mpirun -n 48 ~/inst/libexec/osu-micro-benchmarks/mpi/collective/osu_gather
    
     # OSU MPI Gather Latency Test v5.7.1
     # Size       Avg Latency(us)
     1                       1.48
     2                       1.51
     4                       1.54
     8                       1.55
     16                      1.60
     32                      1.69
     64                      1.74
     128                     1.86
     256                     2.06
     512                     2.30
     1024                    3.87
     2048                    4.87
     4096                    7.45
     8192                   12.16
     16384                  23.92
     32768                  49.21
     65536                  97.32
     131072                180.46
     262144                394.33
     524288               1097.02
     1048576              2186.11
    
     $ mpirun -n 24 ~/inst/libexec/osu-micro-benchmarks/mpi/collective/osu_gather
    
     # OSU MPI Gather Latency Test v5.7.1
     # Size       Avg Latency(us)
     1                       1.14
     2                       1.18
     4                       1.20
     8                       1.21
     16                      1.27
     32                      1.28
     64                      1.33
     128                     1.41
     256                     1.56
     512                     1.73
     1024                    3.10
     2048                    3.84
     4096                    5.52
     8192                    9.71
     16384                  17.74
     32768                  34.82
     65536                  64.14
     131072                116.71
     262144                227.19
     524288                705.59
     1048576              1580.68
    
  2. follow the HP Linpack guide:
  3. give up and use spack:

     spack install gcc@11.1.0
     spack install hpl ^blis

     Then follow https://www.advancedclustering.com/act_kb/tune-hpl-dat-file/ to write HPL.dat.

  4. HPL result:

      T/V                N    NB     P     Q               Time                 Gflops
     --------------------------------------------------------------------------------
     WR11C2R4      115200   192     6     8            3417.23             2.9826e+02
     HPL_pdgesv() start time Mon May 24 01:41:36 2021
    
     HPL_pdgesv() end time   Mon May 24 02:38:33 2021
    
     --------------------------------------------------------------------------------
     ||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   2.22081836e-03 ...... PASSED
     ================================================================================
    
     Finished      1 tests with the following results:
                   1 tests completed and passed residual checks,
                   0 tests completed and failed residual checks,
                   0 tests skipped because of illegal input values.
     --------------------------------------------------------------------------------
    
     End of Tests.
     ================================================================================
    
  5. Next, try www.hpcg-benchmark.org.

  6. IMPORTANT: create a password-less ssh key and add it to your authorized_keys file

    ssh-keygen -t ed25519
    cp ~/.ssh/id_ed25519.pub ~/.ssh/authorized_keys
    

    this will allow you to login to other nodes when you start this image.

    Also important - keep a log of the setup steps you used to create this image in order to share with your team, reproduce, and change later.

  7. Follow the documentation to save a snapshot of your image when done (using sudo cc-snapshot <image_name>).

Makefile

This Makefile stores the series of steps needed to start all the infrastructure for launching a set of servers.

It is intended to be used for launching a single node on which software is installed. Later, this node can be packed into a disk image and launched across several server machines.

Obviously, running the commands here requires that the bin/ directory containing these scripts is in your PATH.

# Makefile
NAME = install
MINUTES = 120 # 2 hours

default: $(NAME)-server.ip


$(NAME)-server.ip: $(NAME)-server.json $(NAME)-ip.json
        associate_ip.sh $(NAME)-server

$(NAME)-server.json: $(NAME)-lease.json
        start_server.sh $(NAME)-lease $(NAME)-server

$(NAME)-ip.json:
        lease_ip.sh $(NAME)-ip $(MINUTES)

$(NAME)-lease.json:
        lease_server.sh $(NAME)-lease 1 $(MINUTES)

clean:
        rm -f trial-*.json

Note: if you have an existing lease, create its json using bin/activate_lease.sh <name of lease>. Then, following the Makefile above, rename it to install-lease.json. Then you can use the above Makefile with your existing lease.

Individual commands

Each of the commands below sets up one piece of server infrastructure by launching the appropriate OpenStack command. Several make use of the following python script to read values from a json-file:

#!/usr/bin/env python3
# read_val.py

import json

def first(x):
    def run(test):
        for xi in x:
            if test(xi):
                return xi
    return run

def main(argv):
    assert len(argv) == 3, f"Usage: {argv[0]} <file.json> <key>"
    with open(argv[1], 'r', encoding='utf-8') as f:
        x = json.loads( f.read() )

    if isinstance(x, dict):
        local = x
    else:
        local = {'x': x, 'first': first(x)}
    print( eval(argv[2], {'json':json}, local) )

if __name__=="__main__":
    import sys
    main(sys.argv)
#!/bin/bash
# ../bin/activate_lease.sh
# Read the status of a lease object until it is active.
# Store the result at <lease name>.json
# Exits with an error-code if the lease object is missing
# or not active within 30 seconds.

set -e

if [ $# -ne 1 ]; then
    echo "Usage: $0 <name>"
    exit 1
fi

name="$1"

# openstack server show -f json interactive-server

# poll until lease is active
for((i=0;i<30;i++)); do
    blazar lease-show -f json "$name" >"$name.json"
    status=$(read_val.py "$name.json" "status")
    [[ x"$status" == x"ACTIVE" ]] && break
    sleep 1
done
if [[ x"$status" != x"ACTIVE" ]]; then
    echo "Lease not activated."
    exit 1
fi
#!/bin/bash
# ../bin/activate_server.sh
# Read the status of a server object until it is active.
# Store the result at <server name>.json
# Exits with an error-code if the server object is missing
# or not active within 600 seconds (10 minutes).

set -e

if [ $# -ne 1 ]; then
    echo "Usage: $0 <name>"
    exit 1
fi

name="$1"

# poll until active
for((i=0;i<60;i++)); do
    openstack server show -f json "$name" >"$name.json"
    status=$(read_val.py "$name.json" "status")
    echo "Server Status = $status"
    [[ x"$status" == x"ACTIVE" ]] && break
    sleep 10
done
if [[ x"$status" != x"ACTIVE" ]]; then
    echo "Server not activated."
    exit 1
fi
#!/bin/bash
# ../bin/associate_ip.sh
# Associate a leased, floating IP with a server

set -e

# FIXME: there's no way to go from an IP reservation
# name to an actual IP address!
if [ $# -ne 1 ]; then
    echo "Usage: $0 <server-name>"
    exit 1
fi

#lease="$1"
server="$1"

if [ ! -s "$server.json" ]; then
    if [ -s "$server-1.json" ]; then
        server="$server-1"
    else
        echo "File not found: $server.json"
        exit 1
    fi
fi
server_id=$(read_val.py "$server.json" id)

iplist=`mktemp`
openstack floating ip list -f json >>$iplist
ip_addr=$(read_val.py "$iplist" 'first(lambda z: z["Port"] is None)["Floating IP Address"]')
rm -f $iplist

echo "Associating $ip_addr to $server"
#openstack floating ip add "$ip_addr" "$server_id"
#openstack floating ip set --port "$server_id" "$ip_addr"
openstack server add floating ip "$server_id" "$ip_addr"

echo "$ip_addr" >"$server.ip"
#!/bin/bash
# ../bin/lease_ip.sh
# Create a floating IP lease for the next <n> minutes.

set -e

if [ $# -ne 2 ]; then
    echo "Usage: $0 <name> <minutes>"
    exit 1
fi

name="$1"
n="$2"

start=$(TZ="UTC" date --date 'now' "+%Y-%m-%d %H:%M")
end=$(TZ="UTC" date --date "now+$n minutes" "+%Y-%m-%d %H:%M")
echo "Creating IP lease from $start to $end (UTC)"

PUBLIC_NETWORK_ID=$(openstack network show public -c id -f value)
blazar lease-create \
  --reservation resource_type=virtual:floatingip,network_id=${PUBLIC_NETWORK_ID},amount=1 \
  --start-date "$start" \
  --end-date "$end" \
  "$name"

activate_lease.sh "$name"
#!/bin/bash
# ../bin/lease_server.sh
# Create a node lease for the next <n> minutes.

set -e

if [ $# -ne 3 ]; then
    echo "Usage: $0 <name> <number> <minutes>"
    exit 1
fi

name="$1"
n="$2"
t="$3"

if [ $n -le 0 ]; then
    echo "Invalid number: $n"
    exit 1
fi
if [ $t -le 0 ]; then
    echo "Invalid time: $t"
    exit 1
fi

start=$(TZ="UTC" date --date 'now' "+%Y-%m-%d %H:%M")
end=$(TZ="UTC" date --date "now+$t minutes" "+%Y-%m-%d %H:%M")
echo "Creating node lease from $start to $end (UTC)"

# --physical-reservation min=$n,max=$n,resource_properties='["and", ["=", "$infiniband", "True"], [">=", "$gpu.gpu_count", "1"]]' \
# --physical-reservation min=$n,max=$n,resource_properties='["==","$node_name","c01-18"]' \
blazar lease-create \
  --start-date "$start" \
  --physical-reservation min=$n,max=$n,resource_properties='["=", "$node_type", "compute_haswell"]' \
  --end-date "$end" \
  "$name"

# Notes:
# - directly read status from lease
#   status=$(blazar lease-show "$name" -c status -f value)
# - extend an existing lease (TODO)
#   blazar lease-update --end-date "$end" "$name"
# - list all floating IP-s
#   openstack floating ip list
# - list all servers
#   openstack server list

activate_lease.sh "$name"
#!/bin/bash
# ../bin/start_server.sh
# Start a server on the given reservation.

set -e

if [ $# -ne 2 ]; then
    echo "Usage: $0 <lease-name> <insance-name>"
    exit 1
fi

lease="$1"
instance="$2"

res_id=$(read_val.py "$lease.json" "json.loads(reservations)['id']")
min=$(read_val.py "$lease.json" "json.loads(reservations)['min']")
max=$(read_val.py "$lease.json" "json.loads(reservations)['max']")
#blazar lease-show interactive-lease

net_id=$(openstack network list --name sharednet1 -c ID -f value)

openstack server create \
  --image CC-CentOS8 \
  --flavor baremetal \
  --key-name ChameleonSSH \
  --nic net-id="$net_id" \
  --hint reservation="$res_id" \
  --min $min \
  --max $max \
  "$instance"

if [ $min -eq 1 ]; then
  activate_server.sh "$instance"
else
  for((i=1;i<=min;i++)); do
    activate_server.sh "$instance-$i"
  done
  touch "$instance.json"
fi

Example Configurations

#!/usr/bin/env bash
# openrc.sh, mostly generated by Chameleon Cloud.

export OS_AUTH_TYPE=v3applicationcredential
export OS_AUTH_URL=https://chi.tacc.chameleoncloud.org:5000/v3
export OS_IDENTITY_API_VERSION=3
export OS_REGION_NAME="CHI@TACC"
export OS_INTERFACE=public
export OS_APPLICATION_CREDENTIAL_ID=long-string-with-the-credential
export OS_APPLICATION_CREDENTIAL_SECRET=long-string-with-the-secret
export PATH=$PATH:$PWD/bin

Key Points