Introduction
Overview
Teaching: 10 min
Exercises: 0 minQuestions
Why are HPL and HPCG used as benchmarks?
What is a linear solve used for?
Objectives
Explain dense and sparse matrices.
List some applications that generate these types of problems.
High-Performance Linpack (HPL) and High-Performance Conjugate-Gradient (HPCG) are two of the most famous benchmarks for supercomputers. As a benchmark, they are intended to show the maximum achievable performance of the parallel machine on an important problem with real-world applications.
Where does the problem come from?
Both benchmarks solve for $X$ given $A$ and $B$ in the matrix-vector problem(s),
$A X = B, A \in \mathcal M_{n\times n}, X,B \in \mathcal M_{n\times k}$
The notation $\mathcal M_{n \times k}$ stands for an n by k matrix. The matrix $A$ is square, and is sometimes referred to as the “operator”. Its properties are very important in determining the algorithm used to solve for $X$. Each of the $k$ columns of the matrix $B$ are called “right-hand side vectors”.
Because of the way matrix multiplication is defined, each “right-hand side vector” corresponds to a column of $X$. This column of $X$ is also called the “unknown vector” or “solution vector”. Thus, $AX=B$ represents $k$ independent matrix-vector problems. We use lower-case $x$,$b$ to refer to a single $\mathcal M_{n\times 1}$ vector.
Matrix-vector problems like the above come from trying to solve linear systems of equations, like the Helmholtz equation,
$(\nabla^2 + k^2) X(r) = B(r)$
which represents the scattering of a sound-wave off of a surface, or a quantum wavefunction at a specific energy. Discretizing the field $X(r)$ by storing only some points, $X_i = X(r_i)$, leads to the matrix-vector form.
In this case, the matrix size (also known as dimension), $N$, is the number of sampled real-space points, $r_i$. In many applications, linear equations appear at each time-step, or as a Newton step toward solving a non-linear equation. It’s not difficult to see larger problem domains and more dimensions lead to very large linear systems with $N^2$ being as large as an entire supercomputer’s working memory!
HPCG and Sparse Matrices
The properties of the operator, $A$, determine the most efficient algorithm to use when solving a linear system. High-Performance Linpack solves dense linear systems and High-Performance Conjugate Gradient solves sparse ones.
A sparse matrix is one in which only order N ($O(N)$) elements are non-zero. The Helmholtz equation above is sparse because multiplication by $k^2$ and taking the derivative, $\nabla^2$ are only local operations. This means only a constant number of elements in $X$ are needed to compute $(A X)_i$. In one-dimension,
[A = \begin{pmatrix}
(k\Delta x)^2-2 & 1 & 0 & 0 & 1
1 & (k\Delta x)^2-2 & 1 & 0 & 0
0 & 1 & (k\Delta x)^2-2 & 1 & 0
0 & 0 & 1 & (k\Delta x)^2-2 & 1
1 & 0 & 0 & 1 & (k\Delta x)^2-2
\end{pmatrix} \Delta x^{-2}]
If processor one holds the first two elements of $x$ and processor two holds the last three elements of $x$, then the two only need to communicate one element of $x$ in order for both to compute their part of the product, $A x$.
The method of conjuate gradients uses the fact that $Ax - b = 0$ in the final solution to iteratively adjust $x$ by computing $A\tilde x - b$ ($\tilde x$ is an estimate of $x$). Thus, the HPCG benchmark focuses on this communicate and multiply rows strategy.
Applying the $A$ operator to the right-hand side, $\tilde x$ is done by a sparse matrix multiplication. Since each row has a constant number of non-zero elements, that sparse multiplication takes $O(N)$ work instead of $O(N^2)$ for a dense matrix. In a “perfect” parallel machine with $P$ processors, the multiplication can be done in $O(N/P)$ time.
However, the cost of using an iterative method is that the number of iterations needed to obtain convergence to machine precision is unknown. The convergence of conjugate gradients has been extensively studied, with the result that the number of iterations is somewhere between $O(\sqrt \kappa)$ and $O(N)$, with some algorithms guaranteeing less then $O(\sqrt \kappa)$ iterations. Here, $\kappa$ is the condition number of the matrix, equal to the ratio of the largest to smallest matrix eigenvalue. There is theoretical evidence that (for matrices with Gaussian random elements) the condition number is proportional to the matrix dimension, $N$. This means in practice that the number of iterations required for convergence is on the order of $\sqrt N$, and that conjugate gradients on a “perfect” parallel machine scales as $O(N^{3/2} / P)$.
HPL and Dense Matrices
Dense matrices, on the other hand, are matrices in which $O(N^2)$ elements are nonzero. In this case, much more communication would be needed to multiply $Ax$. Rather than use conjugate gradients, dense matrix solves are more effectively accomplished by performing row-operations on the matrix, $A$.
The computational method, however, is a bit different than doing Gaussian elimination by hand. Instead, we just need to apply enough row operations to make $A$ upper-diagonal (turning $A$ into $U$). The operations can be collected into a lower-diagonal matrix, $L$, and a permutation matrix, $P$. Each permutation swaps two rows, so $P$ is a matrix containing $N$ ones. The complete process is summarized as,
$ A = P L U $
where all matrices are $N \times N$. This is called a “factorization” of $A$ into an “L-U” decomposition. The computational work required for L-U decomposition is $O(N^3)$.
Given the factorization, $A x = b$ can be solved by step-wise solution:
$ x = (PLU)^-1 b = U^{-1} (L^{-1} (P^{-1} b)) $
Permutation is just a communication step. Each of the next two multiplications take just about the same amount of work as a matrix-multiplication, $O(N^3)$.
On a “perfect” parallel machine with $P$ processors, this should take $O(N^3/P)$ time.
Key Points
Linear systems in both dense and sparse form are a universal theme in scientific computing.
Dense and sparse matrices have different optimal algorithms.
Computational Kernels
Overview
Teaching: 10 min
Exercises: 30 minQuestions
What are memory and compute-bound kernels?
Objectives
Run the OSU MPI and Mixbench GPU benchmarks.
Understand bottlenecks to peak performance.
High-performance compute clusters make use of specialized network hardware in order to maximize message bandwidth and minimize latency. In most cases, this means specialized “network fabric” libraries from the hardware vendors need to be installed. Then, some version of the MPI library should be built “on top” of these vendor communication libraries.
This linking process can be confusing - so it’s important to read the vendor and MPI library instructions carefully. Then benchmark your MPI by using it to exchange data and timing it. If the timings don’t match what the vendor says can be done, it probably means you are not compiling MPI or your application with the right options.
It could also mean you are launching your MPI process incorrectly, causing the wrong assignment of MPI ranks to processors and other machine resources.
Run the OSU MPI Benchmarks
Download the latest version of the OSU Microbenchmarks. Extract, compile and link with your version of MPI, and run the one-sided osu_get_latency and osu_get_bw benchmarks using 2 MPI ranks on the same host.
Plot the results to show latency and bandwidth as a function of packet size.
Solution
wget http://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-5.7.1.tgz tar xzf osu-micro-benchmarks-5.7.1.tgz mkdir osu.default && cd osu.default ../osu-micro-benchmarks-*/configure CC=mpicc CXX=mpicxx CFLAGS=-I$(PWD)/../src/osu-micro-benchmarks-*/util --prefix=$PWD cd libexec/osu-micro-benchmarks/mpi/one-sided mpirun -n 2 ./osu_get_latency >latency.dat mpirun -n 2 ./osu_get_bw >bw.dat gnuplot <<. set term eps enh color lw 2 set out "bw.eps" set logscale x set logscale y set xlabel "Message Size / bytes" set ylabel "Bandwidth / MB/sec" plot "bw.dat" u 1:2 w l set out "latency.eps" set ylabel "Latency / us" plot "latency.dat" u 1:2 w l .
Now that we’ve covered how to test communication speed, it’s time to turn to local computation speed. Graphics processing units (GPUs) are constructed with matrix (image) processing in-mind. They achieve this by using very large vector-sizes and optimizing for performing single instruction multiple data (SIMD) operations while pre-fetching data from memory.
Reading and writing data to GPU memory takes many more clock cycles than performing operations like add, multiply, divide, on data that’s already present. This means that you have to do lots of operations on data that’s already sitting in the SIMD units in order for your kernel to be compute-bound.
The number of floating point operations done on every byte read is called “Arithmetic Intensity.” It controls the trade-off between reading a new batch of data and operating on the data that’s present.
If you do just one operation for every byte read, then the run-time will depend on the GPU’s maximum memory bandwidth. At some value of arithmetic intensity, however, you’ll be doing so many operations on every byte of data that the memory access will be completely hidden. In this case, the performance flattens to a “roofline”, at the peak floating point operations the processor is capable of performing.
You can see this behavior by running an invented kernel with an adjustable number of flops per byte, and plotting flops vs. arithmetic intensity.
Run Mixbench
Download the latest version of the Mixbench Microbenchmark and produce a “roofline” plot of achieved flops vs arithmetic intensity.
Solution
cd $HOME git clone https://github.com/ekondis/mixbench.git mkdir mixbench/build && cd mixbench/build cmake -DCMAKE_CUDA_ARCHITECTURES=80 ../mixbench-cuda make -j ./mixbench-cuda-alt | tee alt.log grep ^ *[0-9] alt.log* | sed -e s/,//g >alt.dat gnuplot <<. set term eps enh color lw 2 set out "roofline.eps" set xlabel "Arithmetic Intensity / FLOP/BYTE" set ylabel "Bandwidth / MB/sec" plot "bw.dat" u 2:4 w l title "single" plot "bw.dat" u 6:8 w l title "double" .
Peak performance
Both of the micro-benchmarks above show how communicating data can easily become a bottleneck for algorithm run-time. Even the very fast memory movement between GPU memory and GPU processors can occupy the time of hundreds of floating point operations.
When part of a matrix or vector needs to be communicated between MPI ranks before a solution step can proceed, then the whole computation runs the risk of running at the same speed as the network bandwidth. Even small-data operations like broadcasting the result of a local vector-vector dot product incur a latency cost.
To write an effective parallel algorithm, data movement must be overlapped with computation as much as possible. Further, scattering and gathering data between MPI ranks should make use of faster links between ranks on the same node as much as possible.
Finally, all these operations should be done as explicitly as possible so that the processors access local memory. Asking two processors to implicitly be aware of memory updates from one another causes the processes to interrupt each other. Assigning every MPI rank to a specific processor and bank of memory (using system NUMA controls) prevents these hidden interruptions.
Check for NUMA Issues
Add MPI init, send/receive, and finalize calls to mixbench.
Create a batch script that launches this MPI job over your entire cluster. Check the point-to-point bandwidth achieved between your send/receive calls and make sure that every rank is getting full use of the GPU.
If you have issues at this stage, it’s likely that your mpirun launch configuration is incorrect.
Key Points
Most applications are memory-bound, which complicates parallelization.
Compute-bound applications depend on peak theoretical flops.
Getting good performance from parallel solvers is hard.
BLAS
Overview
Teaching: 10 min
Exercises: 20 minQuestions
What blas should I link with?
How can I get all these libraries installed and working together?
What are the limitations of blas and lapack?
Objectives
Understand the differences between blas implementations.
Correctly compile, link, and test linear algebra libraries.
The compilers, GPU, and network are working well. Now it’s time to run a real high-intensity problem.
The following code shamelessly adapted from stack-overflow uses the lapack library to solve a small matrix-vector problem. It’s the single-node, CPU version of the problem HPL solves.
// https://stackoverflow.com/questions/36063993/lapack-dgetrs-vs-dgesv
#include <stdio.h>
#include <vector>
#include <stdlib.h>
using namespace std;
extern "C" {
void daxpy_(int* n,double* alpha,double* dx,int* incx,double* dy,int* incy);
double dnrm2_(int* n,double* x, int* incx);
void dgetrf_(int* M, int *N, double* A, int* lda, int* IPIV, int* INFO);
void dgetrs_(char* C, int* N, int* NRHS, double* A, int* LDA, int* IPIV, double* B, int* LDB, int* INFO);
void dgesv_(int *n, int *nrhs, double *a, int *lda, int *ipiv, double *b, int *ldb, int *info);
}
void print(const char* head, int m, int n, double* a){
printf("\n%s\n---------\n", head);
for(int i=0; i<m; ++i){
for(int j=0; j<n; ++j){
printf("%f ", a[i+j*m]);
}
printf("\n");
}
printf("---------\n");
}
int main(int argc, char *argv[]) {
int dim = 5;
if(argc > 1)
dim = atoi(argv[1]);
vector<double> a(dim * dim);
vector<double> b(dim);
srand(1); // seed the random # generator with a known value
double maxr = (double)RAND_MAX;
for (int r = 0; r < dim; r++) { // set a to a random matrix, i to the identity
for (int i = 0; i < dim; i++) {
a[r + i*dim] = rand() / maxr;
}
b[r] = rand() / maxr;
}
int info;
int one = 1;
char N = 'N';
vector<int> ipiv(dim);
vector<double> a1(a);
vector<double> b1(b);
dgesv_(&dim, &one, a1.data(), &dim, ipiv.data(), b1.data(), &dim, &info);
print("B1",dim < 5 ? dim : 5,1,b1.data());
vector<double> a2(a);
vector<double> b2(b);
dgetrf_(&dim, &dim, a2.data(), &dim, ipiv.data(), &info);
dgetrs_(&N, &dim, &one, a2.data(), &dim, ipiv.data(), b2.data(), &dim, &info);
print("B2",dim < 5 ? dim : 5,1,b2.data());
double d_m_one = -1e0;
daxpy_(&dim,&d_m_one,b2.data(),&one,b1.data(),&one);
printf("\ndiff = %e\n", dnrm2_(&dim,b1.data(),&one));
return 0;
}
To actually compile and use this, however, you’ll need to install a BLAS and LAPACK library. This is where the actual fun begins. There are many, many different implementations of BLAS available.
-
On Mac OSX, you can use the system-wide installation by adding the compile flag
-framework Accelerate
. -
On Linux, you’ll find libopenblas, libblas, libatlas, and libgslcblas. All three provide different implementations of the same BLAS functions. Both openblas and atlas include lapack. But, if you install libblas or libgslcblas, you don’t get the lapack functions. Instead, liblapack is a separate Linux package compatible only with libblas.
-
Intel offers BLAS/LAPACK through their Math Kernel Library (MKL). There are so many different variations provided by MKL that Intel provides a link line advisor to find the exact library names to link with depending on your variation. Note, the most common variation is for 64-bit processors with 32-bit integers and openmp.
-
AMD has written its own BLAS optimized for AMD platforms called blis. Blis has a lapack implementation called libFLAME.
-
Co-processor accelerators (GPUs and Xeon Phi cards) use separate blas libraries. For GPUs, these are cublas and rocblas. For Xeon Phi cards, the Intel MKL will work but special run-time setup is needed.
-
Cray systems provide BLAS and LAPACK through libsci. Their libsci-acc includes support for co-processor accelerators.
-
Both Intel MKL and Cray Libsci-Acc provide scalapack, which is an MPI-distributed version of lapack.
-
Rather than using blas and lapack, I recommend that new C++ codes target blaspp and lapackpp. These packages wrap the old Fortran blas and lapack interfaces into a nice templated C++ design. They are only wrappers, however, so installing them requires pointing them to an already-installed version of blas and lapack.
Information about the Linux packages can be obtained
on Debian-based systems by querying the package manager
with commands like aptitude search blas
, and
aptitude show liblapack3
.
On Redhat-based systems, use the corresponding
yum search
and yum info
.
Blas Options
Not only do different implementation exist, but there are different compilation options within the same BLAS package that can change its performance. Here’s a table explaining each variation:
Option | Help |
---|---|
ilp64 | This changes the integer sizes on things like matrix dimensions and tile sizes to use 64-bit integers. Many systems have 32-bit integers by default, which limits the maximum size a matrix can grow to |
threads | pthreads and OpenMP are common ways to use multiple processor cores for a single matrix operation. With no threading, blas doesn’t use threads. Thus, only one processor core will be used for each blas call. That can be good if your code uses threads to make multiple blas calls simultaneously. If your code uses only one thread, though, you will want to have blas use threads. |
static/shared | Shared object linking reduces the size of your code, but costs a tiny bit more for every call to BLAS, and introduces the potential for missing libraries if you ever reinstall anything. If your code makes millions of blas calls, you may want to try static linking. |
Compile Your Own Blas/Lapack
For getting the most performance, it is recommended not to use a Linux package, but to use a vendor package or to compile your own BLAS.
Compiling your own can be extremely frustrating.
Because of this, I recommend installing blas
using spack.
In fact, you can compile MPI, HPL, and HPCG using
spack too. The search and info commands
for spack are spack list hpl
and spack info hpl
.
Install HPL on OpenBlas and OpenMPI
Use spack to install the openblas and openmpi libraries. Then, instruct spack to use those two as dependencies when installing HPL.
Solution
cd $HOME git clone --depth=1 https://github.com/spack/spack.git source spack/share/spack/setup-env.sh spack install openblas threads=openmp spack install openmpi +cuda fabrics=auto spack spec -lINt hpl ^openblas threads=openmp ^openmpi +cuda fabrics=auto spack install -v hpl ^openblas threads=openmp ^openmpi +cuda fabrics=auto
Pay special attention to the configure and compile lines spit out
using spack install -v
. These can be helpful guides if you later
decide to run the compile and install process yourself
rather than let spack handle it.
Run The Solve Test Program
Now that everything is installed, you should be able to run HPL. But first, let’s try and compile the
lapack-dgetrs-vs-dgesv
program starting out this lesson.The key idea is to provide the directory containing lapack to the compiler during linking.
Solution
BLAS=$(spack find --format '{prefix}' openblas) g++ -o solve solve.cc -L$BLAS/lib -lopenblas time ./solve 2048
You should explore the effect of varying OMP_NUM_THREADS
and
launching the above using MPI. It’s also important
to note that this program is not using the GPU.
Thus, a much faster solve could have been achieved
if cublas were being called instead of openblas.
Key Points
Linking with vendor-optimized libraries is a pain in the neck.
Standard BLAS/LAPACK doesn’t use co-processors.
HPL
Overview
Teaching: 10 min
Exercises: 20 minQuestions
How do dense matrix algorithms divide work using MPI?
What are the key factors affecting their performance?
Why is peak performance a reasonable target for HPL?
Objectives
Setup and run the HPL benchmark.
Explain matrix tiling and scalapack layout.
We’ve covered blas and lapack, but have carefully avoided talking about blacs and scalapack. What’s the difference? Well, there’s a big difference. Scalapack is synonymous with linear algebra over dense matrices, distributed over MPI.
To divide up the work, scalapack divides up the matrix among processors. It does that using a very specific “scalapack layout”, that you should understand if you want to tune HPL. First, scalapack thinks of a matrix as a collection of blocks of size MB by NB. To picture this, imagine the map of a city laid out on a regular grid - like Albuquerque. The matrix elements are the buildings, and blocks are, well, entire city blocks. So every matrix element is assigned to one block, and we stop worrying about the elements and only talk about the blocks.
Next, the blocks get assigned to processors. This is done in round-robin fashion, processor 1 gets block 1, 2 gets block 2, and so on until processor P gets block P. Then we start at the beginning and block P+1 goes to processor 1 again. Think of this as a circular mapping. Matrices have 2 dimensions, so the processor layout is actually P by Q. The assignment happens independently in each dimension, so block (m,n) gets assigned to (m mod P, n mod Q).
This block division makes sense because now data exchange and element-wise operations are done on entire blocks at a time - so processors can leverage vectorization and economies of scale.
Run HPL
Running HPL isn’t all that different than the above. Consult advancedclustering and the HPL calculator to create an input file for HPL and modify the processor layout (p,q), block size (nb), and total number of blocks (n/nb and m/nb).
Next, create a node-file listing the nodes available to run, and launch HPL with mpirun.
Solution
export OMP_NUM_THREADS=4 mpirun -nodefile nodes -np 16 hpl hpl.dat
At this point, you’ve gone through the basics of installing mpi and blas, and testing both for local speed, and invoking HPL on-top of those libraries.
However, there is still a lot of work left. We haven’t covered using the GPU with HPL. To do that, you’ll want to look for NVIDIA’s optimized HPL code.
In general, you will want to arrive at a tuned benchmark run by following these steps:
- Find the best MPI and CPU BLAS configuration by compiling many different ways and testing with micro-benchmarks.
- Find the best block-size parameter (NB) for HPL by running with P=Q=1 and N=8192 (single MPI rank).
- Find the best matrix size (N) using P=Q=1 (single MPI rank), and keeping N a multiple of NB. The total matrix memory should fill about 80-95% of the memory available to each MPI rank.
- Use all available processors, with one MPI rank per processor. Try a range of P and Q values where P * Q = (total MPI ranks).
- Investigate hardware-specific methods for overclocking while monitoring power consumption.
- Iterate by trying small changes and looking for performance gains.
HPL Parameters
An example HPL.dat
parameter file is below:
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out output file name (if any)
6 device out (6=stdout,7=stderr,file)
1 # of problems sizes (N)
24576 Ns
2 # of NBs
2048 4096 NBs
0 PMAP process mapping (0=Row-,1=Column-major)
2 # of process grids (P x Q)
1 3 Ps
6 2 Qs
16.0 threshold
1 # of panel fact
2 PFACTs (0=left, 1=Crout, 2=Right)
1 # of recursive stopping criterium
4 NBMINs (>= 1)
1 # of panels in recursion
2 NDIVs
1 # of recursive panel fact.
2 RFACTs (0=left, 1=Crout, 2=Right)
1 # of broadcast
0 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1 # of lookahead depth
1 DEPTHs (>=0)
2 SWAP (0=bin-exch,1=long,2=mix)
64 swapping threshold
0 L1 in (0=transposed,1=no-transposed) form
0 U in (0=transposed,1=no-transposed) form
0 Equilibration (0=no,1=yes)
8 memory alignment in double (> 0)
The syntax of the file is line-based, with editable parameters on the left,
followed by explanatory comments on the right. HPL only uses square matrixes
and tiles, so you only set N and NB (respectively).
It’s fine to leave output going to stdout and just use xhpl | tee HPL.log
to save a copy to file. Obviously the processor layout has to
use all MPI ranks.
The remaining parameters deal with properties of the LU factorization algorithm used. PFACT and RFACT deal with the direction of progress through the matrix (right- or left-moving). BCASTs and DEPTHs change the patterns used to communicate tiles between ranks. The transpose options change the storage (and thus access order) of the LU decomposition products, L1 and U. Equilibration refers to running an initial computation to “wake up” the hardware. Memory alignments larger than 8 (but still a power of 2) might help.
To understand the parameters completely, you can refer to the following reference: Dongarra, Faverge, Ltaief and Luszczek, “Achieving numerical accuracy and high performance using recursive tile LU factorization with partial pivoting” Concurrency Computat.: Pract. Exper. (2013).
Key Points
HPL requires tuning matrix and tile sizes to achieve peak performance.
Weak scaling requires very large problem sizes for large supercomputers.
HPCG
Overview
Teaching: 8 min
Exercises: 10 minQuestions
How do sparse matrix algorithms divide work using MPI?
What are the key factors affecting their performance?
Why is peak network bandwidth a reasonable target for HPCG?
Objectives
Setup and run the HPCG benchmark.
Demonstrate the effect of task placement during job launch.
Now that you’ve completed the run of HPL, running HPCG should be a walk in the park.
spack install -v hpcg
You might want to experiment with different compilers. Some examples are provided by amd.
If you change into the bin directory spack installed HPCG into,
you can run xhpcg
using the default parameter file.
It outputs two summary files. The one starting HPCG-Benchmark
contains the most run details.
Key summary information is contained in the sections
marked Performance Summary
, which include run-time,
bandwidth (GB/s), and compute throughput (GFLOP/s).
What Happened?
The HPCG input file doesn’t say a lot about the underlying algorithm, but what’s happening underneath are a series of computational tasks that are memory-bound.
Memory-bound tasks are things like “scale a vector by 4” or “add two vectors.” These require relatively little computation, but much more data-movement.
When solving a sparse linear system, matrix-vector multiplications can be computed with the help of just a few elements from neighboring processors. The conjugate gradient algorithm is just a series of matrix multiplies and dot products that bring the answer closer to the solution. Thus, the timings for the overall solve are reflecting this memory bottleneck happening at each step.
HPCG Input File
The HPCG input file, hpcg.dat
is comparatively simple.
The two lines provide problem size (3D) and run-time (seconds).
The run-time is approximate, and running longer usually
provides higher performance because slow steps during
startup contribute less to the average.
The configuration file sets problem size as number of points
along x, y, and z-dimensions. The memory requirement for HPCG
should scale proportionately to their product, nx*ny*nz
.
The second line contains the approximate target run-time in seconds.
For your final reported HPCG time, your output file will need to report that the memory used was at least 25% of the system’s total memory and the run-time was at least 1800 seconds.
NUMA
Non-uniform memory access is an architecture for cpu/memory connection used in modern compute platforms. It uses “fast” connections between CPUs and associated memory regions. In order to achieve peak performance, MPI ranks and processing threads should be assigned to processors consistently (this assignment is called process and thread affinity). This is necessary because the Linux kernel, by default, moves processes between CPUs to achieve load balance.
When using a batch job scheduler like SLURM,
the scheduler is able to call the hwloc
library
and use flags to srun
in order to setup processor
binding for every MPI rank.
On a simple cluster without SLURM, you can launch jobs
directly using mpirun, which can use ssh and a host-file.
Some versions of mpirun, like openmpi, help when deailing with thread affinity.
If you decide to tinker with calling numactl directly,
you can replace mpirun xhpl
with mpirun xhpl.sh
,
and write a script, xhpl.sh
to setup environment
variables and numa options. Some useful guides are here:
http://www.hpc.acad.bg/numactl/, https://glennklockwood.com/hpc-howtos/process-affinity.html.
Key Points
HPCG requires tuning job launch configurations (like NUMA) to achieve peak communication bandwidth.
Few-node performance should be an indicator of full-scale performance.
Summary
Overview
Teaching: 14 min
Exercises: 0 minQuestions
How can I demonstrate my awesome cluster configuration?
Objectives
Start a log of configuration steps and performance.
Organize team investigations using the scientific method.
TL; DR;
A lot of what we communicate has an answer-bias. However, trying to jump directly to the final answer makes it hard to learn why the answer works.
As a competition team, it’s extremely important to keep an organized record of your efforts. This helps both to keep team members informed about what everyone is working on, and also to document your team’s activities for writing your project report.
Implement Kanban
Watch the “Agile Methodologies” presentation and follow the associated hands-on tutorial section of the BSSW Tutorial.
Setting up Kanban
Spend some time setting up a team Kanban board. Either Trello or a project board on github will work just as well.
Create an Initial Issue List
By now, your team should have a lot of ideas to discuss: virtual machine setup, networking and hardware configuration, test benchmarks, and ideas for tweaking performance. Everyone should write down a TODO list locally, and bring it for a team discussion. As a team, you should interactively create a few tasks and assign them to the most energetic and enthusiastic members.
Intermezzo
Watch 12 Ways to Fool the Masses with Irreproducible Results for a fun look into the history of HPC performance reporting. You should come away from the video with a good understanding of the target audience for your report. The emphasis on documenting software versions, compile flags, and extra information needed for reproducing your work is especially helpful for doing a self-review on your work.
Follow-Up
Now your team is off to a great start. As you work through the competition, try and do the following:
-
Keep creating new issues for work you don’t have time to address at the moment.
-
Post status updates to the issues you are working on. Don’t be afraid to modify their descriptions to reflect what you’re really doing, or close issues with short explanations of why they are resolved (or, alternately, no longer important).
-
Discuss recently completed tasks at regular team meetings. Every completed task should have a paragraph or two of text explaining why the task is important, what you did, and what results you obtained.
-
Maintain a separate document with all these paragraph descriptions. Present this as part of your final report.
Reporting
The scientific method contains hypothesis generation, carrying out experiments, collecting data, and writing up results. Your progress on HPL and HPCG benchmarks will be graded based on the optimization ideas you generated and tested. The final report should give some context for the HPL/HPCG benchmarks and system hardware. Then have an organized discussion of the key steps you took to arrive at your final configuration. Things like initial “unoptimized” performance, steps to setup MPI, select blas libraries, threading, job launch, and NUMA configuration, and tile-size optimization should be reported. You may choose to write these as a list of trials with with separate “initial” and “final” timings for each or as a sequential improvement after each configuration change.
Every controlled test is important! Even tests that result in decreased performance will contribute well to your overall evaluation as long as they are well explained and the results are documented.
The report should have a table listing the compuational throughput (“GFLOPS”) reported by HPL and HPCG benchmarks for key steps along your configuration journey. It should also include “theoretical peak GFLOPS” estimated based on the hardware documentation.
Key Points
Cluster configuration has reproducibility challenges.
Spend extra time to plan well-defined performance tests.
Using GPUs
Overview
Teaching: 2 min
Exercises: 25 minQuestions
What modifications to the benchmarks are required to utilize GPUs?
Objectives
Demonstrate methods for using GPUs with HPL and HPCG.
Using GPUs is more difficult than CPU-only codes because separate compilers are needed, and because computations run on the GPU are “offloaded” from the CPU. An offloaded computation is a function that is packed up and sent to the GPU, executed there (usually while the CPU does something else), and then eventually waited on by the CPU.
Importantly for this exercise, the standard HPL code distributed by netlib does not include support for running computations on GPUs. Instead, hardware vendors have supplied these themselves. NVIDIA offers custom programs that run HPL through its (free) developer program. More recently, they have also created a container running HPL and HPCG.
Rather than work with these, I suggest trying the free alternative
HPL-GPU or (older)
HPL-CUDA.
The second was easier to compile. In addition to following
the instructions, you should also set the TOPdir
variable
to the top directory.
To make the most performant binary, it is important to
enable (or at least check) what hardware-optimizations exist.
For CPU-s it’s well-known that you can check the processor
options from /dev/cpuinfo
. For GPUs, the most important
property is the compute capability.
A quick reference chart is here.
Key Points
Using GPUs requires modifications to the standard HPL and HPCG programs.
Using Infiniband
Overview
Teaching: 10 min
Exercises: 0 minQuestions
How does infiniband hardware differ from ethernet?
Objectives
Provide tools, diagnostics, and references on setting up infiniband communication.
Infiniband hardware devices were created to support high-performance computing applications by “short-cutting” around the TCP/IP network stack. In traditional messaging, small packets are sent between two end-points that perform a complicated connection handshake. This involves a lot of buffering and queue wait times, which slow things down. Infiniband hardware provides a remote-direct memory access mechanism for hosts to communicate with one another with fewer requirements around acknowledgments and message sizes.
This comes at a cost. Infiniband devices have their own network addressing mechanism distinct from (but similar to) ethernet MAC addresses. Also, they require special drivers, network configuration utilities, and API calls.
The API calls are simplest, since they are
usually handled for you by an MPI library.
Those MPI libraries mostly use either
the ibverbs
interface or else something like
UCX built on top of ibverbs. You’ll run
into those keywords as you install MPI.
Installing Infiniband
The other two steps are complicated. Here are some notes that could be helpful built by following the RedHat Infiniband Guide and rdmamojo.
# allow users to pin unlimited memory
sudo sh -c "cat >/etc/security/limits.d/rdma.conf" <<.
@users soft memlock unlimited
@users hard memlock unlimited
.
# add self to the users group
sudo usermod -a -G users $USER
# setup the IB manager
sudo yum install -y opensm libibverbs-utils infiniband-diags
sudo systemctl enable --now opensm
# Testing
ibstat # locate Port GUID-s
sudo ibswitches # list switches
sudo sminfo # list sm communication info
ibv_devices
ibv_devinfo mlx4_0
Next, test communication between two hosts.
For example, 10.52.3.83
and 10.52.3.178
.
On one side, run
ibv_rc_pingpong
On abother,
ibv_rc_pingpong 10.52.3.178
Installing MPI
Next, installing MPI using infiniband is easy using spack:
git clone --depth=1 https://github.com/spack/spack.git
spack install -v mpich netmod=ucx
create a hostfile
listing nodes, 1 per line:
# mpi hostfile
10.52.3.83
10.52.3.178
Then run a test application enabling the ucx fabric
BIN=$BIN/bin/mpirun -n 48 -env UCX_NET_DEVICES=mlx4_0:1 -f hostfile hostname
Next, run HPL and check speed:
spack install -v hpl ^mpich netmod=ucx # doesn't seem to activate infiniband
spack install -v hpl ^openmpi fabrics=verbs
Remember, for each run, you should create a separately named input and run-script,
MPIRUN=/home/cc/spack/opt/spack/linux-centos8-haswell/gcc-8.3.1/openmpi-4.1.1-6ykqelh2aqoxk6bbmdp7o63xi33puppg/bin/mpirun
HPL=/home/cc/spack/opt/spack/linux-centos8-haswell/gcc-8.3.1/hpl-2.3-cce3avecwhxksw6eysaklvuipvb2cdo4/bin/xhpl
export btl_openib_allow_ib=true
$MPIRUN -n 48 --mca btl openib,self,vader --mca btl_openib_allow_ib true --hostfile ../hostfile $HPL | tee HPL.out
Key Points
Infiniband networks do not carry standard ethernet traffic, requiring special .
Scripting OpenStack
Overview
Teaching: 5 min
Exercises: 20 minQuestions
Objectives
Provide tools for launching OpenStack instances on TACC.
For reference, here is a set of scripts useful for spinning up servers on Chameleon Cloud from the command-line.
There are no scripts to tear-down your instances. Instead, this can be done on the web-UI Don’t forget to also delete your leases under the “reservations” tab.
It relies on setting up the environment as follows:
Initial setup
- install Chameleon’s blazar (pip3 install git+https://github.com/ChameleonCloud/python-blazarclient.git@chameleoncloud/stable/train)
- install openstack (aptitude install python3-openstackclient)
- Login to the CHI@TACC website. The following steps refer to this site.
- Generate an “Application Credential” using the “Identity” section of the web GUI
- Click the “Unrestricted Access” checkbox (needed for creating reservations from the CLI).
- Save the credential as both
openrc.sh
andclouds.yaml
- run
chmod 700 openrc.sh
andchmod 600 clouds.yaml
- Create a reservation using the Reservations section of the web GUI.
- Create an “Instance” for that reservation using the “Compute” section of the web GUI.
- This will prompt you to generate an SSH keypair.
Save the private key as
ssh-id.rsa
and runchmod 600 ssh-id.rsa
- Name the keypair “ChameleonSSH” so that you know the name to use when accessing from scripts.
- This will prompt you to generate an SSH keypair.
Save the private key as
- Create a “Floating IP” using the “Network” section of the web GUI
-
Use the same section to “Associate” the IP with your instance.
- Connect to the instance via ssh:
ssh -i ssh-id.rsa cc@
IP address
At this point, you can use the web interface to close down all your servers and leases. Alternately, you could follow the side-track below to explore some of the setup tasks coming up.
Side-track, exploring single-node setup
- follow the OSE MPI Benchmarks Guide:
$ mpirun -n 48 ~/inst/libexec/osu-micro-benchmarks/mpi/collective/osu_gather # OSU MPI Gather Latency Test v5.7.1 # Size Avg Latency(us) 1 1.48 2 1.51 4 1.54 8 1.55 16 1.60 32 1.69 64 1.74 128 1.86 256 2.06 512 2.30 1024 3.87 2048 4.87 4096 7.45 8192 12.16 16384 23.92 32768 49.21 65536 97.32 131072 180.46 262144 394.33 524288 1097.02 1048576 2186.11 $ mpirun -n 24 ~/inst/libexec/osu-micro-benchmarks/mpi/collective/osu_gather # OSU MPI Gather Latency Test v5.7.1 # Size Avg Latency(us) 1 1.14 2 1.18 4 1.20 8 1.21 16 1.27 32 1.28 64 1.33 128 1.41 256 1.56 512 1.73 1024 3.10 2048 3.84 4096 5.52 8192 9.71 16384 17.74 32768 34.82 65536 64.14 131072 116.71 262144 227.19 524288 705.59 1048576 1580.68
- follow the HP Linpack guide:
-
give up and use spack spack install gcc@11.1.0 spack install hpl ^blis https://www.advancedclustering.com/act_kb/tune-hpl-dat-file/ ~> HPL.dat
-
HPL result:
T/V N NB P Q Time Gflops -------------------------------------------------------------------------------- WR11C2R4 115200 192 6 8 3417.23 2.9826e+02 HPL_pdgesv() start time Mon May 24 01:41:36 2021 HPL_pdgesv() end time Mon May 24 02:38:33 2021 -------------------------------------------------------------------------------- ||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 2.22081836e-03 ...... PASSED ================================================================================ Finished 1 tests with the following results: 1 tests completed and passed residual checks, 0 tests completed and failed residual checks, 0 tests skipped because of illegal input values. -------------------------------------------------------------------------------- End of Tests. ================================================================================
-
Next, try www.hpcg-benchmark.org.
-
IMPORTANT: create an password-less ssh key and add it to your allowed-hosts file
ssh-keygen -t ed25519 cp ~/.ssh/id_ed25519.pub ~/.ssh/authorized_keys
this will allow you to login to other nodes when you start this image.
Also important - keep a log of the setup steps you used to create this image in order to share with your team, reproduce, and change later.
- Follow the documentation
to save a snapshot of your image when done (using
sudo cc-snapshot <image_name>
).
Makefile
This Makefile stores the series of steps needed to start all the infrastructure needed to launch a set of servers.
It is intended to be used for launching a single node on which software is installed. Later, this node can be packed into a disk image and launched across several server machines.
Obviously, running the commands here requires that their directory is in your PATH.
# Makefile
NAME = install
MINUTES = 120 # 2 hours
default: $(NAME)-server.ip
$(NAME)-server.ip: $(NAME)-server.json $(NAME)-ip.json
associate_ip.sh $(NAME)-server
$(NAME)-server.json: $(NAME)-lease.json
start_server.sh $(NAME)-lease $(NAME)-server
$(NAME)-ip.json:
lease_ip.sh $(NAME)-ip $(MINUTES)
$(NAME)-lease.json:
lease_server.sh $(NAME)-lease 1 $(MINUTES)
clean:
rm -f trial-*.json
Note: if you have an existing lease, create its json using
bin/activate_lease.sh <name of lease>
. Then, following
the Makefile above, rename it to install-lease.json
.
Then you can use the above Makefile with your existing lease.
Individual commands
Each of the commands below sets up one piece of server infrastructure by launching the appropriate OpenStack command. Several make use of the following python script to read values from a json-file:
#!/usr/bin/env python3
# read_val.py
import json
def first(x):
def run(test):
for xi in x:
if test(xi):
return xi
return run
def main(argv):
assert len(argv) == 3, f"Usage: {argv[0]} <file.json> <key>"
with open(argv[1], 'r', encoding='utf-8') as f:
x = json.loads( f.read() )
if isinstance(x, dict):
local = x
else:
local = {'x': x, 'first': first(x)}
print( eval(argv[2], {'json':json}, local) )
if __name__=="__main__":
import sys
main(sys.argv)
#!/bin/bash
# ../bin/activate_lease.sh
# Read the status of a lease object until it is active.
# Store the result at <lease name>.json
# Exits with an error-code if the lease object is missing
# or not active within 30 seconds.
set -e
if [ $# -ne 1 ]; then
echo "Usage: $0 <name>"
exit 1
fi
name="$1"
# openstack server show -f json interactive-server
# poll until lease is active
for((i=0;i<30;i++)); do
blazar lease-show -f json "$name" >"$name.json"
status=$(read_val.py "$name.json" "status")
[[ x"$status" == x"ACTIVE" ]] && break
sleep 1
done
if [[ x"$status" != x"ACTIVE" ]]; then
echo "Lease not activated."
exit 1
fi
#!/bin/bash
# ../bin/activate_server.sh
# Read the status of a server object until it is active.
# Store the result at <server name>.json
# Exits with an error-code if the server object is missing
# or not active within 600 seconds (10 minutes).
set -e
if [ $# -ne 1 ]; then
echo "Usage: $0 <name>"
exit 1
fi
name="$1"
# poll until active
for((i=0;i<60;i++)); do
openstack server show -f json "$name" >"$name.json"
status=$(read_val.py "$name.json" "status")
echo "Server Status = $status"
[[ x"$status" == x"ACTIVE" ]] && break
sleep 10
done
if [[ x"$status" != x"ACTIVE" ]]; then
echo "Server not activated."
exit 1
fi
#!/bin/bash
# ../bin/associate_ip.sh
# Associate a leased, floating IP with a server
set -e
# FIXME: there's no way to go from an IP reservation
# name to an actual IP address!
if [ $# -ne 1 ]; then
echo "Usage: $0 <server-name>"
exit 1
fi
#lease="$1"
server="$1"
if [ ! -s "$server.json" ]; then
if [ -s "$server-1.json" ]; then
server="$server-1"
else
echo "File not found: $server.json"
exit 1
fi
fi
server_id=$(read_val.py "$server.json" id)
iplist=`mktemp`
openstack floating ip list -f json >>$iplist
ip_addr=$(read_val.py "$iplist" 'first(lambda z: z["Port"] is None)["Floating IP Address"]')
rm -f $iplist
echo "Associating $ip_addr to $server"
#openstack floating ip add "$ip_addr" "$server_id"
#openstack floating ip set --port "$server_id" "$ip_addr"
openstack server add floating ip "$server_id" "$ip_addr"
echo "$ip_addr" >"$server.ip"
#!/bin/bash
# ../bin/lease_ip.sh
# Create a floating IP lease for the next <n> minutes.
set -e
if [ $# -ne 2 ]; then
echo "Usage: $0 <name> <minutes>"
exit 1
fi
name="$1"
n="$2"
start=$(TZ="UTC" date --date 'now' "+%Y-%m-%d %H:%M")
end=$(TZ="UTC" date --date "now+$n minutes" "+%Y-%m-%d %H:%M")
echo "Creating IP lease from $start to $end (UTC)"
PUBLIC_NETWORK_ID=$(openstack network show public -c id -f value)
blazar lease-create \
--reservation resource_type=virtual:floatingip,network_id=${PUBLIC_NETWORK_ID},amount=1 \
--start-date "$start" \
--end-date "$end" \
"$name"
activate_lease.sh "$name"
#!/bin/bash
# ../bin/lease_server.sh
# Create a node lease for the next <n> minutes.
set -e
if [ $# -ne 3 ]; then
echo "Usage: $0 <name> <number> <minutes>"
exit 1
fi
name="$1"
n="$2"
t="$3"
if [ $n -le 0 ]; then
echo "Invalid number: $n"
exit 1
fi
if [ $t -le 0 ]; then
echo "Invalid time: $t"
exit 1
fi
start=$(TZ="UTC" date --date 'now' "+%Y-%m-%d %H:%M")
end=$(TZ="UTC" date --date "now+$t minutes" "+%Y-%m-%d %H:%M")
echo "Creating node lease from $start to $end (UTC)"
# --physical-reservation min=$n,max=$n,resource_properties='["and", ["=", "$infiniband", "True"], [">=", "$gpu.gpu_count", "1"]]' \
# --physical-reservation min=$n,max=$n,resource_properties='["==","$node_name","c01-18"]' \
blazar lease-create \
--start-date "$start" \
--physical-reservation min=$n,max=$n,resource_properties='["=", "$node_type", "compute_haswell"]' \
--end-date "$end" \
"$name"
# Notes:
# - directly read status from lease
# status=$(blazar lease-show "$name" -c status -f value)
# - extend an existing lease (TODO)
# blazar lease-update --end-date "$end" "$name"
# - list all floating IP-s
# openstack floating ip list
# - list all servers
# openstack server list
activate_lease.sh "$name"
#!/usr/bin/env python3
# ../bin/read_val.py
import json
def first(x):
def run(test):
for xi in x:
if test(xi):
return xi
return run
def main(argv):
assert len(argv) == 3, f"Usage: {argv[0]} <file.json> <key>"
with open(argv[1], 'r', encoding='utf-8') as f:
x = json.loads( f.read() )
if isinstance(x, dict):
local = x
else:
local = {'x': x, 'first': first(x)}
print( eval(argv[2], {'json':json}, local) )
if __name__=="__main__":
import sys
main(sys.argv)
#!/bin/bash
# ../bin/start_server.sh
# Start a server on the given reservation.
set -e
if [ $# -ne 2 ]; then
echo "Usage: $0 <lease-name> <insance-name>"
exit 1
fi
lease="$1"
instance="$2"
res_id=$(read_val.py "$lease.json" "json.loads(reservations)['id']")
min=$(read_val.py "$lease.json" "json.loads(reservations)['min']")
max=$(read_val.py "$lease.json" "json.loads(reservations)['max']")
#blazar lease-show interactive-lease
net_id=$(openstack network list --name sharednet1 -c ID -f value)
openstack server create \
--image CC-CentOS8 \
--flavor baremetal \
--key-name ChameleonSSH \
--nic net-id="$net_id" \
--hint reservation="$res_id" \
--min $min \
--max $max \
"$instance"
if [ $min -eq 1 ]; then
activate_server.sh "$instance"
else
for((i=1;i<=min;i++)); do
activate_server.sh "$instance-$i"
done
touch "$instance.json"
fi
Example Configurations
#!/usr/bin/env bash
# openrc.sh, mostly generated by Chameleon Cloud.
export OS_AUTH_TYPE=v3applicationcredential
export OS_AUTH_URL=https://chi.tacc.chameleoncloud.org:5000/v3
export OS_IDENTITY_API_VERSION=3
export OS_REGION_NAME="CHI@TACC"
export OS_INTERFACE=public
export OS_APPLICATION_CREDENTIAL_ID=long-string-with-the-credential
export OS_APPLICATION_CREDENTIAL_SECRET=long-string-with-the-secret
export PATH=$PATH:$PWD/bin
Key Points