BLAS

Overview

Teaching: 10 min
Exercises: 20 min
Questions
  • Which BLAS should I link with?

  • How can I get all these libraries installed and working together?

  • What are the limitations of BLAS and LAPACK?

Objectives
  • Understand the differences between BLAS implementations.

  • Correctly compile, link, and test linear algebra libraries.

The compilers, GPU, and network are working well. Now it’s time to run a real high-intensity problem.

The following code, shamelessly adapted from Stack Overflow, uses the LAPACK library to solve a small matrix-vector problem. It's the single-node, CPU version of the problem HPL solves.

// https://stackoverflow.com/questions/36063993/lapack-dgetrs-vs-dgesv
#include <stdio.h>
#include <vector>
#include <stdlib.h>

using namespace std;

extern "C" {
  void daxpy_(int* n,double* alpha,double* dx,int* incx,double* dy,int* incy);
  double dnrm2_(int* n,double* x, int* incx);

  void dgetrf_(int* M, int *N, double* A, int* lda, int* IPIV, int* INFO);
  void dgetrs_(char* C, int* N, int* NRHS, double* A, int* LDA, int* IPIV, double* B, int* LDB, int* INFO);
  void dgesv_(int *n, int *nrhs, double *a, int *lda, int *ipiv, double *b, int *ldb, int *info);
}

void print(const char* head, int m, int n, double* a){
  printf("\n%s\n---------\n", head);
  for(int i=0; i<m; ++i){
    for(int j=0; j<n; ++j){
      printf("%f ", a[i+j*m]);
    }
    printf("\n");
  }
  printf("---------\n");
}

int main(int argc, char *argv[]) {
  int dim = 5;
  if(argc > 1)
    dim = atoi(argv[1]);

  vector<double> a(dim * dim);
  vector<double> b(dim);
  srand(1);              // seed the random # generator with a known value
  double maxr = (double)RAND_MAX;
  for (int r = 0; r < dim; r++) {  // fill a with a random matrix and b with a random vector
    for (int i = 0; i < dim; i++) {
      a[r + i*dim] = rand() / maxr;
    }
    b[r] = rand() / maxr;
  }

  int info;
  int one = 1;
  char N = 'N';
  vector<int> ipiv(dim);

  vector<double> a1(a);
  vector<double> b1(b);
  dgesv_(&dim, &one, a1.data(), &dim, ipiv.data(), b1.data(), &dim, &info);
  print("B1",dim < 5 ? dim : 5,1,b1.data());

  vector<double> a2(a);
  vector<double> b2(b);
  dgetrf_(&dim, &dim, a2.data(), &dim, ipiv.data(), &info);
  dgetrs_(&N, &dim, &one, a2.data(), &dim, ipiv.data(), b2.data(), &dim, &info);
  print("B2",dim < 5 ? dim : 5,1,b2.data());

  double d_m_one = -1e0;
  daxpy_(&dim,&d_m_one,b2.data(),&one,b1.data(),&one);
  printf("\ndiff = %e\n", dnrm2_(&dim,b1.data(),&one));

  return 0;
}

To compile and use this, however, you'll need to install BLAS and LAPACK libraries. This is where the fun really begins: there are many, many different implementations of BLAS available.

Information about the Linux packages can be obtained on Debian-based systems by querying the package manager with commands like aptitude search blas and aptitude show liblapack3. On Red Hat-based systems, use the corresponding yum search and yum info.
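
For example, the queries might look like the following. Package names differ between distributions and releases, so treat the exact names here as illustrative rather than definitive:

aptitude search blas        # Debian/Ubuntu: list BLAS-related packages
aptitude show liblapack3    # details for the reference LAPACK runtime
# Red Hat / CentOS / Fedora equivalents:
yum search blas
yum info lapack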

BLAS Options

Not only do different implementations exist, but there are also different compilation options within the same BLAS package that can change its performance. Here's a table explaining each variation:

Option          Help
ilp64           Uses 64-bit integers for things like matrix dimensions and
                tile sizes. Many builds default to 32-bit integers, which
                limits how large a matrix can grow.
threads         pthreads and OpenMP are common ways to use multiple processor
                cores for a single matrix operation. Without threading, only
                one processor core is used for each BLAS call. That can be
                good if your code already uses threads to make multiple BLAS
                calls simultaneously; if your code uses only one thread,
                though, you will want BLAS itself to use threads.
static/shared   Shared-object linking reduces the size of your executable, but
                costs a tiny bit more for every call into BLAS and introduces
                the potential for missing libraries if you ever reinstall
                anything. If your code makes millions of BLAS calls, you may
                want to try static linking. The sketch after this table shows
                how the two link lines differ.
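
As a rough illustration of the static/shared choice, here is how the two link lines might differ for a program that calls OpenBLAS. The program name, the $BLAS install prefix, and the extra libraries are assumptions; a static OpenBLAS archive often needs dependencies spelled out that a shared build resolves for you:

# Shared: small executable, libopenblas.so is resolved at run time
g++ -o myprog myprog.cc -L$BLAS/lib -lopenblas

# Static: embed the archive in the executable; depending on how the
# library was built, you may also need -lgfortran, -lpthread, or -fopenmp
g++ -o myprog myprog.cc $BLAS/lib/libopenblas.a -lpthread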

Compile Your Own BLAS/LAPACK

To get the most performance, it is recommended not to use a Linux distribution package, but to use a vendor package or to compile your own BLAS.

Compiling your own can be extremely frustrating. Because of this, I recommend installing BLAS using spack. In fact, you can compile MPI, HPL, and HPCG with spack too. The corresponding search and info commands for spack are spack list hpl and spack info hpl.

Install HPL on OpenBlas and OpenMPI

Use spack to install the openblas and openmpi libraries. Then, instruct spack to use those two as dependencies when installing HPL.

Solution

cd $HOME
git clone --depth=1 https://github.com/spack/spack.git
source spack/share/spack/setup-env.sh
spack install openblas threads=openmp
spack install openmpi +cuda fabrics=auto
spack spec -lINt hpl ^openblas threads=openmp ^openmpi +cuda fabrics=auto
spack install -v hpl ^openblas threads=openmp ^openmpi +cuda fabrics=auto

Pay special attention to the configure and compile lines printed by spack install -v. These can be a helpful guide if you later decide to run the compile and install process yourself rather than let spack handle it.
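
For reference, a hand-rolled OpenBLAS build roughly equivalent to the spack spec above might look like the sketch below. The clone URL, install prefix, and build variables are assumptions to adapt, not the exact steps spack runs:

git clone https://github.com/OpenMathLib/OpenBLAS.git
cd OpenBLAS
make USE_OPENMP=1 -j              # corresponds to threads=openmp
make PREFIX=$HOME/openblas install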

Run The Solve Test Program

Now that everything is installed, you should be able to run HPL. But first, let's try to compile the lapack-dgetrs-vs-dgesv program that opened this lesson.

The key idea is to give the compiler the directory containing the BLAS/LAPACK library at link time.

Solution

BLAS=$(spack find --format '{prefix}' openblas)
g++ -o solve solve.cc -L$BLAS/lib -lopenblas
time ./solve 2048
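
Because libopenblas here is a shared library living in a non-standard prefix, the run-time loader also has to be able to find it. Two common fixes, assuming the $BLAS variable from the solution above:

export LD_LIBRARY_PATH=$BLAS/lib:$LD_LIBRARY_PATH
# or bake the search path into the executable at link time:
g++ -o solve solve.cc -L$BLAS/lib -Wl,-rpath,$BLAS/lib -lopenblas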

It's important to note that this program does not use the GPU; a much faster solve could be achieved by calling cuBLAS instead of OpenBLAS. On the CPU side, you should explore the effect of varying OMP_NUM_THREADS and of launching the program with MPI, for example as sketched below.
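
A sketch of that exploration, assuming the OpenMPI installed earlier provides mpirun (spack load puts it on your PATH) and that several cores are available on the node:

spack load openmpi                    # puts mpirun on the PATH
time OMP_NUM_THREADS=1 ./solve 4096
time OMP_NUM_THREADS=8 ./solve 4096
# solve itself is not MPI-parallel, so mpirun just starts independent
# copies -- still useful for seeing how concurrent solves contend for cores
mpirun -np 2 ./solve 4096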

Key Points

  • Linking with vendor-optimized libraries is a pain in the neck.

  • Standard BLAS/LAPACK doesn’t use co-processors.