How Futhark manages GPU memory
We recently changed how code generated by the Futhark compiler
allocates and frees GPU memory. In this post, we will discuss what we
used to do, what we do now, why Futhark cannot immediately use a
standard malloc()
approach, and the performance impact of the new
allocator.
Background
When we first implemented the GPU/OpenCL backend for Futhark, we did
not pay much attention to memory management. Whenever we needed some
memory, we would simply call clCreateBuffer()
and get back a cl_mem. This leaked memory, of course, so we eventually added a
simple reference counting scheme to free unused memory with
clReleaseMemObject(). The design of the Futhark language ensures
the absence of reference cycles, and a typical Futhark program
performs few but large allocations. This allowed us to avoid the
usual problems with reference counting, and it worked well for a long
time.
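To sketch the idea (our illustration, not the actual generated code), reference counting for GPU buffers amounts to pairing each cl_mem with a counter:

struct gpu_mem {
  cl_mem mem;   // The underlying OpenCL buffer.
  int refcount; // Number of live references to it.
};

void gpu_mem_retain(struct gpu_mem *m) {
  m->refcount++;
}

void gpu_mem_release(struct gpu_mem *m) {
  if (--m->refcount == 0) {
    // Last reference gone; return the memory to the driver.
    clReleaseMemObject(m->mem);
  }
}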
However, we were still calling the GPU driver for every allocation.
Metaphorically, we were essentially using mmap() and munmap()
all the time instead of using malloc(). As long as every
allocation was followed by a sufficiently expensive operation, this
did not have much impact on overall program performance. Recently,
two factors conspired to make this approach non-viable:
- The quality of the code emitted by the Futhark compiler improved to the point where memory allocation was sometimes the bottleneck, even for non-contrived programs. As an example, we had a GPU kernel computing an FFT that took 500µs to run, of which 300µs were spent allocating memory for storing the result.
- Recent GPU drivers seem to have significantly slowed down allocation. Previously, we observed near-constant allocation times - for example, 100µs on an NVIDIA GTX 780 with CUDA 8.0. But with the CUDA 9.0 drivers, allocation times are now linearly proportional to the size of the allocation, and rather slow too: approximately 15GiB/s (OpenCL measuring program, CUDA measuring program). This change significantly increases how much computation is needed to amortise the cost of an allocation. We have observed almost exactly the same allocation times on an AMD Vega 64, a completely unrelated GPU, so we suspect that the cause is some security mechanism rather than a mere bug, and we cannot simply expect things to go back to the way they used to be. (We also reported the issue to NVIDIA; it is either still open or marked as WONTFIX - it’s a bit hard to tell from their bug tracker.)
We had to do better. The only reasonable solution is to reduce the
number of times we pass memory back and forth between the Futhark
program and the GPU driver. Instead, when the Futhark program “frees”
a piece of memory, it now adds it to a free list, from which future
allocations are served. We then ask for GPU memory only when an
allocation request cannot be served from the free list. This is
pretty much exactly how malloc()
works.
Problems and Solutions
The largest potential problem with the free list approach is the risk of (external) fragmentation, in which sufficient free memory is in principle available for an incoming allocation request, but it is split among multiple free blocks, each of which is individually too small. Consider for example a case where the GPU has 1024MiB of total memory, the free list contains three blocks of 256MiB each, and we wish to allocate 512MiB. Even though the free list contains a total of 768MiB, we cannot satisfy this request with any single block. From the GPU's point of view, the 768MiB in the free list are still in use, and hence we will fail if we ask the GPU for an allocation of 512MiB. Our only recourse is to start taking memory out of the free list and handing it back to the GPU, until the GPU driver is willing to satisfy the 512MiB allocation request. This is a potentially time-consuming process.
Traditional malloc()
implementations deal with this issue by a
variety of means. Unfortunately, we cannot simply use the existing
literature on good malloc()
implementations. The basic technique
used by malloc()
is the ability to coalesce neighbouring blocks
of free memory into a single larger block. This helps reduce
fragmentation. However, this technique is not available to us. The
problem is that the OpenCL API does not make GPU memory available to
us as numeric pointers that we can manipulate. Rather, we interact
with opaque cl_mem
objects. There is no way to combine two
cl_mem
objects into a single larger cl_mem
object, nor to
split a cl_mem
object in two. This sharply limits how clever we
can be about managing the free list.
In the absence of additional information about allocation patterns, the best solution is probably the straightforward approach we sketched above - let the free list grow without bound, and hand memory back to the GPU once allocations start failing. This is the strategy used by the MemoryPool in PyOpenCL, and it really does work well.
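In rough C terms, the strategy might be sketched as follows; this is our own illustration (with a hypothetical free_list_pop() helper), not PyOpenCL's actual implementation:

// Hypothetical helper: remove some block from the free list,
// returning nonzero if the list is empty.
int free_list_pop(cl_mem *mem_out);

cl_mem alloc_or_reclaim(cl_context ctx, size_t size) {
  cl_int error;
  for (;;) {
    cl_mem mem = clCreateBuffer(ctx, CL_MEM_READ_WRITE, size, NULL, &error);
    if (error == CL_SUCCESS) {
      return mem; // The driver satisfied the request.
    }
    cl_mem victim;
    if (free_list_pop(&victim) != 0) {
      return NULL; // Free list empty; genuinely out of memory.
    }
    // Hand a block back to the driver and retry the allocation.
    clReleaseMemObject(victim);
  }
}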
But in Futhark, we do have more information, and we can do better. The trick is that we can write an allocator specifically tailored to the allocation patterns of code generated by the Futhark compiler. When the Futhark compiler generates an allocation, it takes the following form in the imperative IR used in the later parts of the compiler pipeline:
m_i = alloc(size)
Here, the i
is a statically unique label that identifies the
location of the allocation. The trick now is to tag all memory blocks
with the label of the allocation where they were produced and limit
the free list to one block per distinct label, handing excess blocks
back to the GPU. The approach is simple enough that we can show the
actual implementation we use in Futhark. First, a struct
for
describing an entry in the free list:
struct opencl_free_list_entry {
  size_t size;
  cl_mem mem;
  const char *tag;
  unsigned char valid;
};
We represent the labels as const char*
values. Since distinct
string literals will have distinct memory locations, we can just
compare pointers, and need not use strcmp(). The valid field
is used to avoid having to physically remove elements from the free
list itself, which is implemented as an array:
struct opencl_free_list {
struct opencl_free_list_entry *entries; // Pointer to entries.
int capacity; // Number of entries.
int used; // Number of valid entries.
};
The free list is initialised as follows:
void free_list_init(struct opencl_free_list *l) {
  l->capacity = 30; // Picked arbitrarily.
  l->used = 0;
  l->entries = malloc(sizeof(struct opencl_free_list_entry) * l->capacity);
  for (int i = 0; i < l->capacity; i++) {
    l->entries[i].valid = 0;
  }
}
There are also some utility functions for destroying the free list and compacting it by removing invalid elements, but we will skip those. We have a function for inserting an entry into the free list:
void free_list_insert(struct opencl_free_list *l, size_t size, cl_mem mem, const char *tag);
We are eliding the implementation, because it is not very
interesting; most of the logic is about enlarging the array if there
are no empty (“invalid”) spots left.
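As a rough sketch (our reconstruction, assuming the structs above - not the code the Futhark runtime actually uses), such an insertion function might look like this:

void free_list_insert(struct opencl_free_list *l, size_t size,
                      cl_mem mem, const char *tag) {
  int i;
  // Look for an empty ("invalid") spot to reuse.
  for (i = 0; i < l->capacity; i++) {
    if (!l->entries[i].valid) {
      break;
    }
  }
  if (i == l->capacity) {
    // No empty spot left; double the capacity of the array.
    int new_capacity = l->capacity * 2;
    l->entries = realloc(l->entries,
                         sizeof(struct opencl_free_list_entry) * new_capacity);
    for (int j = l->capacity; j < new_capacity; j++) {
      l->entries[j].valid = 0;
    }
    l->capacity = new_capacity;
  }
  l->entries[i].valid = 1;
  l->entries[i].size = size;
  l->entries[i].mem = mem;
  l->entries[i].tag = tag;
  l->used++;
}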
A more interesting function is free_list_find(), which finds a valid entry with a given tag:
/* Find and remove a memory block with the given tag. Returns 0 on
   success, in which case the block's size and memory object are
   written to size_out and mem_out. */
int free_list_find(struct opencl_free_list *l,
                   const char *tag,
                   size_t *size_out, cl_mem *mem_out) {
  int i;
  for (i = 0; i < l->capacity; i++) {
    if (l->entries[i].valid && l->entries[i].tag == tag) {
      l->entries[i].valid = 0;
      *size_out = l->entries[i].size;
      *mem_out = l->entries[i].mem;
      l->used--;
      return 0;
    }
  }
  return 1;
}
We use this in our implementation of opencl_alloc(), which is the
actual function called by code generated by the Futhark compiler. We
will not show the definition of the opencl_context structure; for
now we only need to know that it contains the actual free list (in the
free_list field).
int opencl_alloc(struct opencl_context *ctx,
                 size_t min_size, const char *tag, cl_mem *mem_out) {
  assert(min_size >= 0);
  if (min_size < sizeof(int)) {
    min_size = sizeof(int);
  }

  size_t size;

  if (free_list_find(&ctx->free_list, tag, &size, mem_out) == 0) {
    // Successfully found a free block.  Is it big enough, but not too big?
    if (size >= min_size && size <= min_size*2) {
      return CL_SUCCESS;
    } else {
      // Not just right - free it.
      int error = clReleaseMemObject(*mem_out);
      if (error != CL_SUCCESS) {
        return error;
      }
    }
  }

  // We have to allocate a new block from the driver.
  int error;
  *mem_out = clCreateBuffer(ctx->ctx, CL_MEM_READ_WRITE, min_size, NULL, &error);
  return error;
}
The main heuristic here is that we refuse to re-use an existing block
if it is more than twice as big as needed. Otherwise we run the risk
of internal fragmentation, where we are only using a fraction of
each allocated block. A normal malloc()
implementation would just
split the block, but that’s not an option with OpenCL.
Generated code frees memory by calling opencl_free():
int opencl_free(struct opencl_context *ctx, cl_mem mem, const char *tag) {
  size_t size;
  cl_mem existing_mem;

  // If there is already a block with this tag, then remove it.
  if (free_list_find(&ctx->free_list, tag, &size, &existing_mem) == 0) {
    int error = clReleaseMemObject(existing_mem);
    if (error != CL_SUCCESS) {
      return error;
    }
  }

  int error = clGetMemObjectInfo(mem, CL_MEM_SIZE, sizeof(size_t), &size, NULL);

  if (error == CL_SUCCESS) {
    free_list_insert(&ctx->free_list, size, mem, tag);
  }

  return error;
}
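To make the calling convention concrete, here is a hypothetical example of what generated code for an allocation site labelled m_4 might look like; the names and size are made up for illustration:

static const char *m_4 = "m_4"; // One tag per allocation site.
cl_mem mem_4;
if (opencl_alloc(&ctx, n * sizeof(float), m_4, &mem_4) == CL_SUCCESS) {
  // ... launch kernels that use mem_4 ...
  opencl_free(&ctx, mem_4, m_4); // Ends up on the free list under tag m_4.
}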
That’s really all there is to it. While this implementation still has
some weaknesses (such as not removing elements from the free list if
clCreateBuffer() calls fail), and some heuristics to tune (maybe
we can steal an allocation with another label rather than going
straight to clCreateBuffer()), it works well. Importantly, it
ensures that the free list cannot contain more elements than there are
distinct allocation points in the program. Dynamically, the
effect of this allocator is that an allocation inside a Futhark
function will tend to be serviced by re-using the same memory that was
used last time the function was called.
The reason this very simple allocator works is that we can assume
that the label for an allocation is a reasonable identifying
characteristic. In a C program, this would not fly: not only are many
malloc()
s found within utility functions called from a wide
variety of locations, but the number of allocations is also
significantly higher. Due to the aggressive inlining performed by the
Futhark compiler, and in general the very simple dynamic behaviour of
a compiled Futhark program, we can get away with policies that would
not work in a less constrained environment.
Impact
We knew that this memory manager would have an effect on a few programs where the compiler generated particularly nice code, but we were surprised at the impact it had even on programs that we did not expect to be bottlenecked by allocation speed:
futhark-benchmarks/accelerate/canny/canny.fut
  data/lena256.in: 9.06x
  data/lena512.in: 9.18x
futhark-benchmarks/accelerate/kmeans/kmeans.fut
  data/k5_n50000.in: 1.27x
  data/k5_n200000.in: 1.10x
futhark-benchmarks/accelerate/ray/trace.fut
  #0 ("800i32 600i32 100i32 50.0f32 -100.0f32 -700.0f32 4..."): 1.30x
futhark-benchmarks/finpar/OptionPricing.fut
  OptionPricing-data/small.in: 1.10x
  OptionPricing-data/large.in: 1.04x
  OptionPricing-data/medium.in: 1.26x
futhark-benchmarks/jgf/crypt/crypt.fut
  crypt-data/medium.in: 2.12x
futhark-benchmarks/rodinia/bfs/bfs_parallel_mapwrite.fut
  data/4096nodes.in: 2.03x
  data/graph1MW_6.in: 1.04x
  data/512nodes_high_edge_variance.in: 2.93x
futhark-benchmarks/rodinia/lavaMD/lavaMD.fut
  data/10_boxes.in: 1.11x
futhark-benchmarks/rodinia/lud/lud.fut
  data/2048.in: 1.15x
  data/512.in: 1.16x
  data/256.in: 1.13x
The above shows speedup on a range of Futhark benchmark programs on various datasets when using this new allocator compared to the old approach. The speedups are due to the common benchmarking methodology performing several runs of each program, with the free list being left intact after every run, thus letting subsequent runs perform memory allocations without communicating with the GPU at all. This is quite similar to performing multiple runs to “warm up the caches”, or letting a JIT compiler do its work.
While some of these performance improvements were expected - we had for example long known that canny was bottlenecked by memory management - we also saw unexpected improvement in other benchmarks. Some, such as lavaMD, we already felt we were handling quite well, but this new allocator still managed to obtain an extra 10% performance. A surprise to be sure, but a welcome one.
The addition of more sophisticated memory management also had an effect on the transformations done by the compiler itself. Until now, we assumed that allocations would be expensive, and therefore the compiler went to great lengths to hoist them out of loops, even at the cost of performing extra memory copies. This hurt us particularly with stencil-like programs, which are fundamentally a sequential loop surrounding some parallel operation. Previously, we would always suffer an extra copy of the loop result. On benchmarks such as Hotspot, this resulted in the Futhark implementation running around 20% slower than the reference implementation from Rodinia. With the new allocator, allocations are much cheaper, and we can simply keep an allocation inside the body of the loop, which will then be served by a free list entry inserted by a previous iteration. Operationally, the effect is about the same as the conventional approach of implementing double-buffering via pointer-swapping. This change provided a 40% speedup on Hotspot, meaning that the Futhark implementation is now faster than the reference implementation. Our previous behaviour on stencils was a little embarrassing, so this is a very nice improvement.
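Schematically, the new loop structure looks something like the following sketch (our illustration, not actual compiler output); after the first iteration, each opencl_alloc() is served by the block freed in the previous iteration, so two buffers are recycled just as with manual double-buffering:

static const char *stencil_mem = "stencil_mem"; // Label of the loop's allocation.
cl_mem cur, next;
opencl_alloc(&ctx, bytes, stencil_mem, &cur);
for (int i = 0; i < iterations; i++) {
  // Served from the free list after the first iteration.
  opencl_alloc(&ctx, bytes, stencil_mem, &next);
  // ... run the stencil kernel, reading cur and writing next ...
  opencl_free(&ctx, cur, stencil_mem); // Recycled by the next iteration.
  cur = next;
}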
Conclusions
I am personally happy with the performance improvements, of course, but also a little miffed. It turns out that a relatively small change to memory allocation had a greater impact on performance than most of our sophisticated optimisations, all of which took significantly more effort to implement. But I suppose that such is the life of a compiler writer.