BLAS
Overview
Teaching: 10 min
Exercises: 20 min
Questions
Which BLAS should I link with?
How can I get all these libraries installed and working together?
What are the limitations of BLAS and LAPACK?
Objectives
Understand the differences between BLAS implementations.
Correctly compile, link, and test linear algebra libraries.
The compilers, GPU, and network are working well. Now it’s time to run a real high-intensity problem.
The following code, shamelessly adapted from Stack Overflow, uses the lapack library to solve a small dense linear system. It’s the single-node, CPU version of the problem HPL solves.
// https://stackoverflow.com/questions/36063993/lapack-dgetrs-vs-dgesv
#include <stdio.h>
#include <vector>
#include <stdlib.h>
using namespace std;

extern "C" {
  void daxpy_(int* n, double* alpha, double* dx, int* incx, double* dy, int* incy);
  double dnrm2_(int* n, double* x, int* incx);
  void dgetrf_(int* M, int* N, double* A, int* lda, int* IPIV, int* INFO);
  void dgetrs_(char* C, int* N, int* NRHS, double* A, int* LDA, int* IPIV, double* B, int* LDB, int* INFO);
  void dgesv_(int* n, int* nrhs, double* a, int* lda, int* ipiv, double* b, int* ldb, int* info);
}

// print the first m rows and n columns of the column-major matrix a
void print(const char* head, int m, int n, double* a) {
  printf("\n%s\n---------\n", head);
  for (int i = 0; i < m; ++i) {
    for (int j = 0; j < n; ++j) {
      printf("%f ", a[i + j*m]);
    }
    printf("\n");
  }
  printf("---------\n");
}

int main(int argc, char *argv[]) {
  int dim = 5;
  if (argc > 1)
    dim = atoi(argv[1]);

  vector<double> a(dim * dim);
  vector<double> b(dim);
  srand(1); // seed the random # generator with a known value
  double maxr = (double)RAND_MAX;
  for (int r = 0; r < dim; r++) { // fill a and b with random values in [0,1]
    for (int i = 0; i < dim; i++) {
      a[r + i*dim] = rand() / maxr;
    }
    b[r] = rand() / maxr;
  }

  int info;
  int one = 1;
  char N = 'N';
  vector<int> ipiv(dim);

  // solve a*x = b in a single call with dgesv
  vector<double> a1(a);
  vector<double> b1(b);
  dgesv_(&dim, &one, a1.data(), &dim, ipiv.data(), b1.data(), &dim, &info);
  print("B1", dim < 5 ? dim : 5, 1, b1.data());

  // solve the same system in two steps: LU factorization, then triangular solves
  vector<double> a2(a);
  vector<double> b2(b);
  dgetrf_(&dim, &dim, a2.data(), &dim, ipiv.data(), &info);
  dgetrs_(&N, &dim, &one, a2.data(), &dim, ipiv.data(), b2.data(), &dim, &info);
  print("B2", dim < 5 ? dim : 5, 1, b2.data());

  // b1 = b1 - b2, then print its 2-norm: the two answers should agree
  double d_m_one = -1e0;
  daxpy_(&dim, &d_m_one, b2.data(), &one, b1.data(), &one);
  printf("\ndiff = %e\n", dnrm2_(&dim, b1.data(), &one));
  return 0;
}
To actually compile and use this, however, you’ll need to install a BLAS and LAPACK library. This is where the actual fun begins. There are many, many different implementations of BLAS available; example link lines for several of them follow the list below.
- On Mac OS X, you can use the system-wide Accelerate framework by adding the compile flag -framework Accelerate.
- On Linux, you’ll find libopenblas, libblas, libatlas, and libgslcblas. All four provide different implementations of the same BLAS functions. Both openblas and atlas include lapack, but if you install libblas or libgslcblas you don’t get the lapack functions; instead, liblapack is a separate Linux package compatible only with libblas.
- Intel offers BLAS/LAPACK through its Math Kernel Library (MKL). There are so many different variations provided by MKL that Intel provides a link line advisor to find the exact library names to link with. Note that the most common variation is for 64-bit processors with 32-bit integers and OpenMP.
- AMD has written its own BLAS, optimized for AMD platforms, called blis. Blis has a companion lapack implementation called libFLAME.
- Co-processor accelerators (GPUs and Xeon Phi cards) use separate blas libraries. For GPUs these are cublas and rocblas. For Xeon Phi cards, the Intel MKL will work, but special run-time setup is needed.
- Cray systems provide BLAS and LAPACK through libsci. Their libsci-acc adds support for co-processor accelerators.
- Both Intel MKL and Cray libsci-acc provide scalapack, an MPI-distributed version of lapack.
- Rather than using blas and lapack directly, I recommend that new C++ codes target blaspp and lapackpp. These packages wrap the old Fortran blas and lapack interfaces in a nice templated C++ design. They are only wrappers, however, so installing them requires pointing them at an already-installed blas and lapack.
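As a rough sketch of what those link lines can look like (exact library names, install paths, and extra flags such as -lpthread or -lgfortran depend on how each package was built, so treat these as a starting point rather than a recipe):

# macOS, system-wide Accelerate framework
g++ -o solve solve.cc -framework Accelerate

# Linux, OpenBLAS (includes lapack)
g++ -o solve solve.cc -lopenblas

# Linux, reference BLAS plus the separate liblapack package
g++ -o solve solve.cc -llapack -lblas

# AMD blis plus libFLAME
g++ -o solve solve.cc -lflame -lblis

# Intel MKL: copy the exact line from the link line advisor; a typical
# 64-bit, 32-bit-integer, OpenMP variation with g++ looks roughly like
g++ -o solve solve.cc -L${MKLROOT}/lib/intel64 -lmkl_intel_lp64 -lmkl_gnu_thread -lmkl_core -lgomp -lpthread -lm -ldl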
Information about the Linux packages can be obtained on Debian-based systems by querying the package manager with commands like aptitude search blas and aptitude show liblapack3. On Redhat-based systems, use the corresponding yum search and yum info.
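For example (a sketch; exact package names vary between distributions and releases):

# Debian / Ubuntu
aptitude search blas
aptitude show liblapack3

# RedHat / CentOS
yum search blas
yum info lapack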
BLAS Options
Not only do different implementations exist, but there are also different compilation options within the same BLAS package that can change its performance. Here’s a table explaining each option:
Option | Help |
---|---|
ilp64 | Changes the integers used for things like matrix dimensions and tile sizes to 64-bit. Many builds use 32-bit integers by default, which limits how large a matrix can grow. |
threads | pthreads and OpenMP are common ways to use multiple processor cores for a single matrix operation. A non-threaded blas uses only one core per call, which can be good if your code uses its own threads to make multiple blas calls simultaneously. If your code uses only one thread, though, you will want blas itself to use threads. |
static/shared | Shared-object linking reduces the size of your executable, but costs a tiny bit more for every call to BLAS and introduces the potential for missing libraries if you ever reinstall anything. If your code makes millions of blas calls, you may want to try static linking. |
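For example, once you have built the solve program below, a threaded blas lets you control how many cores each call uses through an environment variable, and ldd shows whether you linked shared or static (a sketch; which variable matters depends on how your blas was built):

# threaded blas: choose how many cores each call may use
OMP_NUM_THREADS=1 ./solve 2048        # OpenMP build, one core per call
OMP_NUM_THREADS=8 ./solve 2048        # OpenMP build, eight cores per call
OPENBLAS_NUM_THREADS=8 ./solve 2048   # pthreads build of OpenBLAS

# static vs. shared: a statically linked blas will not show up here
ldd ./solve | grep -i -e blas -e lapack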
Compile Your Own BLAS/LAPACK
To get the most performance, it is recommended not to use a Linux package, but to use a vendor package or to compile your own BLAS. Compiling your own can be extremely frustrating. Because of this, I recommend installing blas using spack. In fact, you can compile MPI, HPL, and HPCG with spack too. The search and info commands for spack are spack list hpl and spack info hpl.
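For example (a sketch; the variants reported by spack info can differ between spack versions):

spack list blas        # search for packages whose names mention blas
spack info openblas    # show available variants, e.g. threads= and ilp64
spack find             # list what is already installed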
Install HPL with OpenBLAS and OpenMPI
Use spack to install the openblas and openmpi libraries. Then, instruct spack to use those two as dependencies when installing HPL.
Solution
cd $HOME
git clone --depth=1 https://github.com/spack/spack.git
source spack/share/spack/setup-env.sh
spack install openblas threads=openmp
spack install openmpi +cuda fabrics=auto
spack spec -lINt hpl ^openblas threads=openmp ^openmpi +cuda fabrics=auto
spack install -v hpl ^openblas threads=openmp ^openmpi +cuda fabrics=auto
Pay special attention to the configure and compile lines spit out by spack install -v. These can be helpful guides if you later decide to run the compile and install process yourself rather than letting spack handle it.
Run The Solve Test Program
Now that everything is installed, you should be able to run HPL. But first, let’s try to compile the lapack-dgetrs-vs-dgesv program that starts out this lesson. The key idea is to provide the directory containing lapack to the compiler during linking.
Solution
BLAS=$(spack find --format '{prefix}' openblas)
g++ -o solve solve.cc -L$BLAS/lib -lopenblas
time ./solve 2048
You should explore the effect of varying OMP_NUM_THREADS and of launching the above using MPI. It’s also important to note that this program does not use the GPU, so a much faster solve could be achieved if cublas were called instead of openblas.
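For instance (a sketch, assuming the OpenMP-threaded openblas installed above and that mpirun is on your PATH, e.g. after spack load openmpi):

# compare thread counts
export OMP_NUM_THREADS=1
time ./solve 2048
export OMP_NUM_THREADS=4
time ./solve 2048

# launch several independent copies under MPI (each rank solves its own system)
mpirun -np 2 ./solve 2048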
Key Points
Linking with vendor-optimized libraries is a pain in the neck.
Standard BLAS/LAPACK doesn’t use co-processors.