How Futhark manages GPU memory

Posted on January 28, 2018 by Troels Henriksen

We recently changed how code generated by the Futhark compiler allocates and frees GPU memory. In this post, we will discuss what we used to do, what we do now, why Futhark cannot immediately use a standard malloc() approach, and the performance impact of the new allocator.

Background

When we first implemented the GPU/OpenCL backend for Futhark, we did not pay much attention to memory management. Whenever we needed some memory, we would simply call clCreateBuffer() and get back a cl_mem. This leaked memory, of course, so we eventually added a simple reference counting scheme to free unused memory with clReleaseMemObject(). The design of the Futhark language ensures the absence of reference cycles, and a typical Futhark program performs few but large allocations. This allowed us to avoid the usual problems with reference counting, and it worked well for a long time.

However, we were still calling the GPU driver for every allocation. Metaphorically, we were essentially using mmap() and munmap() all the time instead of using malloc(). As long as every allocation was followed by a sufficiently expensive operation, this did not have much impact on overall program performance. Recently, two factors conspired to make this approach non-viable:

The quality of the code emitted by the Futhark compiler improved to the point where memory allocation was sometimes the bottleneck, even for non-contrived programs. As an example, we had a GPU kernel computing an FFT that took 500µs to run, of which 300µs were spent allocating memory for storing the result.

Recent GPU drivers seem to have significantly slowed down allocation. Previously, we observed near-constant time allocation times - for example, 100µs on an NVIDIA GTX 780 with CUDA 8.0. But with the CUDA 9.0 drivers, allocation times are now linearly proportional to the size of the allocation, and rather slow too: approximately 15GiB/s (OpenCL measuring program, CUDA measuring program). This change significantly increases how much computation is needed to amortise the cost of allocation. We have observed almost exactly the same allocation times on an AMD Vega 64, a completely unrelated GPU, so we suspect that the cause is some security mechanism and not merely a bug, so we cannot simply expect things to go back the way they used to be. (We also reported the issue to NVIDIA, and the issue is either still open, or marked it as WONTFIX - it’s a bit hard to tell from their bug tracker.)

We had to do better. The only reasonable solution is to reduce the amount of times we pass memory back and forth between the Futhark program and the GPU driver. Instead, when the Futhark program “frees” a piece of memory, it now adds it to a free list, from which future allocations are served. We then ask for GPU memory only when an allocation request cannot be served from the free list. This is pretty much exactly how malloc() works.

Problems and Solutions

The largest potential problem with the free list approach is risk of (external) fragmentation, in which sufficient free memory is in principle available for an incoming allocation request, but it is split among multiple free blocks, each of which is individually too small. Consider for example a case where the GPU has 1024MiB of total memory, the free list contains three blocks of 256MiB each, and we wish to allocate 512MiB. Even though the free list contains a total of 768MiB, we cannot satisfy this request with any single block. From the GPUs point of view, the 768MiB in the free list are still in use, and hence we will fail if we ask the GPU for an allocation of 512MiB. Our only recourse is to start taking memory out of the free list and handing it back to the GPU, until the GPU driver is willing to satisfy the 512MiB allocation request. This is potentially a time-consuming process.

Traditional malloc() implementations deal with this issue by a variety of means. Unfortunately, we cannot simply use the existing literature on good malloc() implementations. The basic technique used by malloc() is the ability to coalesce neighbouring blocks of free memory into a single larger block. This helps reduce fragmentation. However, this technique is not available to us. The problem is that the OpenCL API does not make GPU memory available to us as numeric pointers that we can manipulate. Rather, we interact with opaque cl_mem objects. There is no way to combine two cl_mem objects into a single larger cl_mem object, nor to split a cl_mem object in two. This sharply limits how clever we can be about managing the free list.

In the absence of additional information about allocation patterns, the best solution is probably the straightforward approach we sketched above - let the free list grow without bound, and hand memory back to the GPU once allocations start failing. This is the strategy used by the MemoryPool in PyOpenCL, and it really does work well.

But in Futhark, we do have more information, and we can do better. The trick is that we can write an allocator specifically tailored to the allocation patterns of code generated by the Futhark compiler. When the Futhark compiler generates an allocation, it takes the following form in the imperative IR used in the later parts of the compiler pipeline:

m_i = alloc(size)

Here, the i is a statically unique label that identifies the location of the allocation. The trick now is to tag all memory blocks with the label of the allocation where they were produced and limit the free list to one block per distinct label; handing excess blocks back to the GPU. The approach is simple enough that we can show the actual implementation we use in Futhark. First, a struct for describing an entry in the free list:

struct opencl_free_list_entry {
  size_t size;
  cl_mem mem;
  const char *tag;
  unsigned char valid;
}

We represent the labels as const char* values. Since distinct string literals will have distinct memory locations, we can just compare pointers, and need not use strcmp(). The valid field is used to avoid having to physically remove elements from the free list itself, which is implemented as an array:

struct opencl_free_list {
  struct opencl_free_list_entry *entries; // Pointer to entries.
  int capacity;                           // Number of entries.
  int used;                               // Number of valid entries.
};

The free list is initialised as follows:

void free_list_init(struct opencl_free_list *l) {
  l->capacity = 30; // Picked arbitrarily.
  l->used = 0;
  l->entries = malloc(sizeof(struct opencl_free_list_entry) * l->capacity);
  for (int i = 0; i < l->capacity; i++) {
    l->entries[i].valid = 0;
  }
}

There are also some utility functions for destroying the free list and compacting it by removing invalid elements, but we will skip those. We have a function for inserting an entry into the free list:

void free_list_insert(struct opencl_free_list *l, size_t size, cl_mem mem, const char *tag);

We are eliding the implementation, because it is not very interesting; most of the logic is about enlarging the array if there are no empty (“invalid”) spots left. A more interesting function is free_list_find(), which finds a valid entry with a given tag:

/* Find and remove a memory block of at least the desired size and
   tag.  Returns 0 on success.  */
int free_list_find(struct opencl_free_list *l,
                   const char *tag,
                   size_t *size_out, cl_mem *mem_out) {
  int i;
  for (i = 0; i < l->capacity; i++) {
    if (l->entries[i].valid && l->entries[i].tag == tag) {
      l->entries[i].valid = 0;
      *size_out = l->entries[i].size;
      *mem_out = l->entries[i].mem;
      l->used--;
      return 0;
    }
  }

  return 1;
}

We use this in our implementation if opencl_alloc(), which is the actual function called by code generated by the Futhark compiler. We will not show the definition of the opencl_context structure; for now we only need to know that it contains the actual free list (in the free_list field).

int opencl_alloc(struct opencl_context *ctx,
                 size_t min_size, const char *tag, cl_mem *mem_out) {
  assert(min_size >= 0);
  if (min_size < sizeof(int)) {
    min_size = sizeof(int);
  }

  size_t size;

  if (free_list_find(&ctx->free_list, tag, &size, mem_out) == 0) {
    // Successfully found a free block.  Is it big enough, but not too big?
    if (size >= min_size && size <= min_size*2) {
      return CL_SUCCESS;
    } else {
      // Not just right - free it.
      int error = clReleaseMemObject(*mem_out);
      if (error != CL_SUCCESS) {
        return error;
      }
    }
  }

  // We have to allocate a new block from the driver.
  int error;
  *mem_out = clCreateBuffer(ctx->ctx, CL_MEM_READ_WRITE, min_size, NULL, &error);
  return error;
}

The main heuristic here is that we refuse to re-use an existing block if it is more than twice as big as needed. Otherwise we run the risk of internal fragmentation, where we are only using a fraction of each allocated block. A normal malloc() implementation would just split the block, but that’s not an option with OpenCL.

Generated code frees memory by calling opencl_free():

int opencl_free(struct opencl_context *ctx, cl_mem mem, const char *tag) {
  size_t size;
  cl_mem existing_mem;

  // If there is already a block with this tag, then remove it.
  if (free_list_find(&ctx->free_list, tag, &size, &existing_mem) == 0) {
    int error = clReleaseMemObject(existing_mem);
    if (error != CL_SUCCESS) {
      return error;
    }
  }

  int error = clGetMemObjectInfo(mem, CL_MEM_SIZE, sizeof(size_t), &size, NULL);

  if (error == CL_SUCCESS) {
    free_list_insert(&ctx->free_list, size, mem, tag);
  }

  return error;
}

That’s really all there is to it. While this implementation still has some weaknesses (such as not removing elements from the free list of clCreateBuffer() calls fail), and some heuristics to tune (maybe we can steal an allocation with another label rather than going straight to clCreateBuffer()), it works well. Importantly, it ensures that the free list can contain more elements than there are distinct points of allocations in the program. Dynamically, the effect of this allocator is that an allocation inside a Futhark function will tend to be serviced by re-using the same memory that was used last time the function was called.

The reason this very simple allocator works is because we can assume that the label for an allocation is a reasonable identifying characteristic. In a C program, this would not fly: not only are many malloc()s found within utility functions called from a wide variety of locations, but the number of allocations is also significantly higher. Due to the aggressive inlining performed by the Futhark compiler, and in general the very simple dynamic behaviour of a compiled Futhark program, we can get away with policies that would not work in a less constrained environment.

Impact

We knew that this memory manager would have an effect on a few programs where the compiler generated particularly nice code, but we were surprised at the impact it had even on programs that we did not expect to be bottlenecked by allocation speed:

futhark-benchmarks/accelerate/canny/canny.fut
  data/lena256.in:                                                      9.06x
  data/lena512.in:                                                      9.18x

futhark-benchmarks/accelerate/kmeans/kmeans.fut
  data/k5_n50000.in:                                                    1.27x
  data/k5_n200000.in:                                                   1.10x

futhark-benchmarks/accelerate/ray/trace.fut
  #0 ("800i32 600i32 100i32 50.0f32 -100.0f32 -700.0f32 4..."):         1.30x

futhark-benchmarks/finpar/OptionPricing.fut
  OptionPricing-data/small.in:                                          1.10x
  OptionPricing-data/large.in:                                          1.04x
  OptionPricing-data/medium.in:                                         1.26x

futhark-benchmarks/jgf/crypt/crypt.fut
  crypt-data/medium.in:                                                 2.12x

futhark-benchmarks/rodinia/bfs/bfs_parallel_mapwrite.fut
  data/4096nodes.in:                                                    2.03x
  data/graph1MW_6.in:                                                   1.04x
  data/512nodes_high_edge_variance.in:                                  2.93x

futhark-benchmarks/rodinia/lavaMD/lavaMD.fut
  data/10_boxes.in:                                                     1.11x

futhark-benchmarks/rodinia/lud/lud.fut
  data/2048.in:                                                         1.15x
  data/512.in:                                                          1.16x
  data/256.in:                                                          1.13x

The above shows speedup on a range of Futhark benchmark programs on various datasets when using this new allocator compared to the old approach. The speedups are due to the common benchmarking methodology performing several runs of each program, with the free list being left intact after every run, thus letting subsequent runs perform memory allocations without communicating with the GPU at all. This is quite similar to performing multiple runs to “warm up the caches”, or letting a JIT compiler do its work.

While some of these performance improvements were expected - we had for example long known that canny was bottlenecked by memory management - we also saw unexpected improvement in other benchmarks. Some, such as lavaMD, we already felt we were handling quite well, but this new allocator still managed to obtain an extra 10% performance. A surprise to be sure, but a welcome one.

The addition of more sophisticated memory management also had an effect on the transformations done by the compiler itself. Until now, we assumed that allocations would be expensive, and therefore the compiler went to great lengths to hoist them out of loops, even at the cost of performing extra memory copies. This hurt us particularly with stencil-like programs, which are fundamentally a sequential loop surrounding some parallel operation. Previously, we would always suffer an extra copy of the loop result. On benchmarks such as Hotspot, this resulted in the Futhark implementation running around 20% slower than the reference implementation from Rodinia. With the new allocator, allocations are much cheaper, and we can simply keep an allocation inside the body of the loop, which will then be served by a free list entry inserted by a previous iteration. Operationally, the effect is about the same as the conventional approach of implementing double-buffering via pointer-swapping. This change provided a 40% speedup on Hotspot, meaning that the Futhark implementation is now faster than the reference implementation. Our previous behaviour on stencils was a little embarrassing, so this is a very nice improvement.

Conclusions

I am personally happy with the performance improvements, of course, but also a little miffed. It turns out that a relatively little change to memory allocation had a greater impact on performance than most of our sophisticated optimisations, all of which took significantly more effort to implement. But I suppose that such is the life of a compiler writer.