Assembly Line
#############

The practice of forming an assembly line dates back at least to the
ancient Romans. In computational work, you sometimes want each CPU and
GPU to perform the same task on different work items -- handing them
off to the next worker after each is done.

.. image:: pipeline.svg
   :alt: A1A2A3 in parallel with B1B2B3 pipeline
   :align: center

CUDA accomplishes this with its ``streams`` API, while a fully-CPU
implementation would use locks to wait on ``ready`` events. Both
methods are combined in the ``Event`` class below::

    #include <cstdio>
    #include <cstdlib>
    #include <mutex>
    #include <unistd.h>
    #include <omp.h>

    //#define ENABLE_CUDA
    #ifdef ENABLE_CUDA
    #include <cuda_runtime.h>
    #define CHECKCUDA(cmd) do { \
        cudaError_t e = cmd; \
        if( e != cudaSuccess ) { \
            printf("Failed: Cuda error %s:%d '%s'\n", \
                   __FILE__,__LINE__,cudaGetErrorString(e)); \
            exit(EXIT_FAILURE); \
        } \
    } while(0)
    #else
    // skip CUDA commands in non-CUDA mode
    #define CHECKCUDA(cmd)
    typedef void* cudaEvent_t;
    typedef void* cudaStream_t;
    #endif

    struct Event {
        Event() {
            CHECKCUDA( cudaEventCreateWithFlags(&ev, cudaEventDisableTiming) );
            sync.lock();
        }
        Event(const Event &) = delete;
        Event &operator=(const Event &) = delete;
        ~Event() {
            CHECKCUDA( cudaEventDestroy(ev) );
        }
        void record(cudaStream_t stream) {
            CHECKCUDA( cudaEventRecord(ev, stream) );
            sync.unlock();
        }
        void wait(cudaStream_t stream) {
            sync.lock(); // will not succeed until the record() call has completed
            CHECKCUDA( cudaEventSynchronize(ev) );
            //CHECKCUDA( cudaStreamWaitEvent(stream, ev, 0) );
            // leave in the locked state so that another record() is needed
        }
        std::mutex sync;
        cudaEvent_t ev;
    };

    // do work
    void work(cudaStream_t stream, const char *name, int i, int time) {
        //CHECKCUDA( cudaStreamSynchronize(stream) );
        usleep(time*1000); // do some work on the stream
        printf("%s item %d\n", name, i);
    }

    int main(int argc, char *argv[]) {
        cudaStream_t sA, sB;
        CHECKCUDA( cudaStreamCreateWithFlags(&sA, cudaStreamNonBlocking) );
        CHECKCUDA( cudaStreamCreateWithFlags(&sB, cudaStreamNonBlocking) );
        Event a[2], b[2]; // events, named after the recording worker

        #pragma omp parallel num_threads(2)
        {
            int thread  = omp_get_thread_num();
            int threads = omp_get_num_threads();
            printf("thread %d/%d\n", thread, threads);
            for(int i=0; i<10; i++) {
                if(threads != 2 || thread == 0) { // A
                    if(i > 1)
                        b[i%2].wait(sA); // reuse slot i%2 once B has released it
                    work(sA, "A", i, 100);
                    a[i%2].record(sA);
                }
                if(threads != 2 || thread == 1) { // B
                    a[i%2].wait(sB);
                    work(sB, "B", i, 300);
                    b[i%2].record(sB);
                }
            }
        }
        CHECKCUDA( cudaStreamDestroy(sB) );
        CHECKCUDA( cudaStreamDestroy(sA) );
        return 0;
    }

Worker ``A`` works on item ``A1``, then signals (records) ``Event A1``
and moves on to ``A2``. Worker ``B`` waits for ``Event A1``, processes
it into item ``B1``, then signals (records) ``Event B1``. Eventually
worker ``A`` completes ``A2`` and moves on to ``A3``. Here, however, it
can simply re-use the resources from ``A1``, as long as it first waits
for ``Event B1``.

In actual application code, workers ``A`` and ``B`` communicate via a
circular buffer, reading or writing ``buf[i%2]`` at step ``i``. This
strategy is known as double-buffering (a CPU-only sketch of the handoff
follows below).

Note also how the code above works even when only one thread is
available. This regrettable situation might occur if the code is
compiled without OpenMP, or if the user has requested only one thread.
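To make the double-buffered handoff concrete, here is a CPU-only sketch
of the same pipeline moving actual data. The payload (``i*i``) and the
worker bodies are invented for illustration, and a condition variable
stands in for the locked ``std::mutex`` used by ``Event`` (strictly
speaking, ``std::mutex`` may not be unlocked from a thread other than
the one that locked it, so this is the portable spelling of the same
idea)::

    #include <condition_variable>
    #include <cstdio>
    #include <mutex>
    #include <thread>

    struct CpuEvent {
        void record() {
            { std::lock_guard<std::mutex> g(m); ready = true; }
            cv.notify_one();
        }
        void wait() { // blocks until record(), then re-arms for the next round
            std::unique_lock<std::mutex> g(m);
            cv.wait(g, [this]{ return ready; });
            ready = false;
        }
        std::mutex m;
        std::condition_variable cv;
        bool ready = false;
    };

    int buf[2];          // the circular buffer: one slot per in-flight item
    CpuEvent a[2], b[2]; // events, named after the recording worker

    int main() {
        std::thread A([]{
            for(int i = 0; i < 10; i++) {
                if(i > 1) b[i%2].wait(); // has B released this slot yet?
                buf[i%2] = i * i;        // produce item i
                a[i%2].record();
            }
        });
        std::thread B([]{
            for(int i = 0; i < 10; i++) {
                a[i%2].wait();           // is item i ready?
                printf("B consumed item %d = %d\n", i, buf[i%2]);
                b[i%2].record();         // hand the slot back to A
            }
        });
        A.join();
        B.join();
        return 0;
    }

The ordering is exactly that of the OpenMP version above: slot ``i%2``
is written by ``A``, read by ``B``, and only then reused.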
CUDA events have a quirk: they don't take effect until they have been
recorded, so a wait on a never-recorded event returns immediately. We
get around this by requiring a wait on a CPU lock, so that
``cudaEventSynchronize`` can't be called until after
``cudaEventRecord`` has been called. This CPU lock has the side effect
of providing all the synchronization needed for a CPU-only
implementation. Unfortunately, ``cudaStreamWaitEvent`` can't be used
for the wait, since it doesn't block the CPU. That would make both
GPU-only and mixed CPU/GPU implementations incorrect: in the GPU-only
case, subsequent events would get recorded too early, and the mixed
CPU/GPU case would get into trouble when starting a CPU operation that
required prior GPU events to have completed. The trade-off is a few
milliseconds of kernel launch latency, as well as the synchronization
latency of ``cudaEventSynchronize`` itself. It's possible to eke out
more performance by using a vector of events (getting around the "too
early recording" problem) and uncommenting the ``cudaStreamSynchronize``
call at the start of every CPU work function.
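What that variant might look like is sketched below; this is an
interpretation of the suggestion, not code from the program above.
``wait_async`` and ``N`` are names introduced here, one event is
allocated per iteration, the driver loop is single-threaded for
brevity, and the ``cudaStreamSynchronize`` call at the top of
``work()`` is assumed to be uncommented so that CPU-side work still
waits for the data it reads::

    // Hypothetical changes only -- requires #include <vector>.

    // Inside Event: queue the wait on the GPU instead of blocking the CPU.
    // The CPU lock is still required so this cannot run before
    // cudaEventRecord (a wait on a never-recorded event completes
    // immediately, as described above).
    void wait_async(cudaStream_t stream) {
        sync.lock();
        CHECKCUDA( cudaStreamWaitEvent(stream, ev, 0) );
    }

    // In main(): one Event per loop iteration, so no event is ever
    // re-recorded "too early". std::vector's count constructor builds
    // the non-copyable Events in place.
    const int N = 10;
    std::vector<Event> a(N), b(N);
    for(int i = 0; i < N; i++) {
        // worker A: the CPU now only waits for B's record() call,
        // not for the GPU work behind it
        if(i > 1) b[i-2].wait_async(sA);
        work(sA, "A", i, 100); // work() begins with cudaStreamSynchronize(stream)
        a[i].record(sA);
        // worker B
        a[i].wait_async(sB);
        work(sB, "B", i, 300);
        b[i].record(sB);
    }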