Fork me on GitHub

Tracking source locations

Posted on July 29, 2025

Futhark is a programming language meant for writing fast programs, but as is the case for every programming language meant for writing fast programs, it inevitably happens that a programmer will use it to write a program that is not fast. When this happens, the programmer will likely want to know why their program is not fast, and how to make it faster. A useful tool for answering such questions is a profiler - a tool that tells you how long the different parts of your program take to run. This post is about how profiling in Futhark became slightly more useful with the most recent release.

Initially, Futhark had no real profiling support, except for some semi-documented support for dumping a report of GPU operations. Eventually we added futhark profile, which allows the machine-readable profiling data produced by futhark bench to be turned into human-readable reports. Specifically, the Futhark runtime system will tally up the time spent in various cost centres, which for the GPU backends are GPU kernels and other operations such as copies, and put it in a table. However, the information you get out still looks like this:

                         Cost centre           count             sum             avg             min             max        fraction
------------------------------------------------------------------------------------------------------------------------------------
builtin#replicate_i8.replicate_25280              63        346.11μs          5.49μs          5.12μs          7.17μs          0.0019
copy_dev_to_dev                                    6         80.90μs         13.48μs          8.19μs         27.65μs          0.0004
copy_lmad_dev_to_dev                               1          8.19μs          8.19μs          8.19μs          8.19μs          0.0000
initOperator_6347.gpuseq_25229                     2         20.48μs         10.24μs          9.22μs         11.26μs          0.0001
initOperator_6347.replicate_25196                  2         22.53μs         11.26μs         11.26μs         11.26μs          0.0001
initOperator_6347.segmap_19237                     2         31.74μs         15.87μs         14.34μs         17.41μs          0.0002
main.gpuseq_26401                                 63        382.98μs          6.08μs          5.12μs          7.17μs          0.0021
main.gpuseq_26416                                 63        344.06μs          5.46μs          5.12μs          7.17μs          0.0019
main.replicate_25225                               1        184.32μs        184.32μs        184.32μs        184.32μs          0.0010
main.segmap_19296                                  1         23.55μs         23.55μs         23.55μs         23.55μs          0.0001
main.segmap_19320                                  1         13.31μs         13.31μs         13.31μs         13.31μs          0.0001
main.segmap_23375                                  1         13.31μs         13.31μs         13.31μs         13.31μs          0.0001
main.segmap_23407                                 63        408.58μs          6.49μs          6.14μs          7.17μs          0.0022
main.segmap_23494                                 63        421.89μs          6.70μs          6.14μs          8.19μs          0.0023
main.segmap_23553                                 63       9211.90μs        146.22μs        144.38μs        147.46μs          0.0501
main.segmap_intrablock_23610                      63     100120.57μs       1589.22μs       1573.89μs       1616.90μs          0.5445
main.segmap_intrablock_24377                      63      60806.14μs        965.18μs        955.39μs        985.09μs          0.3307
main.segscan_23448                                63        785.41μs         12.47μs         11.26μs         13.31μs          0.0043
map_transpose_4b                                 127      10655.74μs         83.90μs         80.90μs         89.09μs          0.0579
------------------------------------------------------------------------------------------------------------------------------------

Now a user may reasonable object: “Hold on! I don’t remember my program containing anything called main.segmap_23494!” And indeed, these cost centres refer to compiler-generated names. You can squint to get some meaning out of them: segscan is certainly some kind of scan operation, and segmap is a map. But due to inlining, it can be difficult to guess which functions result in which GPU operations, and optimisations may obscure the relation between source code and generated code - indeed, those segmap_intrablock operations are actually mainly (nested) scans that are then turned into block-level scans via incremental flattening. But clearly it is still not easy to use this information. The profiler will usually just tell the programmer that their program spends all its time executing code with a name the programmer cannot possibly recognise. What is missing is a way to relate generated code with the original source code. I decided to call such information provenance, in the sense of “the ultimate origin of something”. The problem is then to attach provenance to every bit of generated code, and in particular, to the generated GPU kernels.

Unfortunately, the Futhark compiler was never really designed to track provenance. The frontend always annotated all syntactic constructs with precise source information in order to do good error reporting, but it was thrown away during desugaring and lowering into the intermediate language (IR). To improve the usability of profiling, we had to actually propagate source information throughout the entire compiler - and I was quite uncertain about how practical it would be to make such a change to a ten year old code base comprising over a hundred thousand source lines of Haskell. The main challenge is that the Futhark compiler does a lot of optimisations where it rewrites part of the program. Manually propagating provenance is simply not viable or maintainable. We had to come up with some kind of semi-automated general scheme.

Further, there were some conceptual questions about how to track provenance across the compiler, due to the aggressive transformations it does. For example, the user may write something like this:

let b = map f a
let c = map g b
let d = map h c

The compiler may then perform fusion to essentially turn the program into this:

let d = map (\x -> h (g (f x))) a

Now what is the provenance of the resulting map? A reasonable answer is that it is simply the union of the three original lines of code. (Let us ignore the issue of what that looks like if those lines are not actually close to each other.) But the compiler may then later perform fission to split up the map again, if it decides that is the best way to exploit the parallelism in the mapped functions:

let tmp = map (\x -> g (f x)) a
let d = map h tmp

Now what is the provenance of the resulting maps? If we already merged together the provenances of the original three maps into the single fused one, then we have lost some of the original information, and our only choice is to consider that both of the two final maps originate in the three initial maps. In practice, the functions f, g, and h are also not black boxes, and may be intermixed with each other before being split apart.

This is not an edge case: the basic compilation scheme of the Futhark compiler is to fuse as much as possible based on data dependency information in order to achieve some kind of “normal form” representation, then statically schedule the resulting parallel nests via various flattening transformations, based on its understanding of the concrete parallel hardware (GPUs, for the purpose of this post). Naive tracking of provenance thus runs the risk of concluding that every part of the output depends on every part of the input. This is perhaps true in a philosophical sense, but it is not very useful to a programmer.

So this is the dilemma: we want something we don’t have to think about too much, but it also has to be quite precise.

What now

The solution to the first problem is to piggyback on an already existing facility for attaching information to IR statements, which we currently use for certificates. Essentially, each Futhark statement is associated with a set of certificates that function as fake data dependencies, which ensures that safety checks are always executed before the expression they protect. As this is safety-critical, we have fairly high confidence that the compiler propagates these correctly (because we fixed lots of bugs where it didn’t). The way it works is that each statement is associated with a StmAux structure, and whenever you write code that changes, merges, or splits some statement, you use a monadic facility for “propagating” the StmAux of the dependencies. In many cases, code that transforms a single statement will just reuse the original StmAux. By adding a provenance field to the StmAux, we could simply piggyback on all the existing code, as long as we defined a mechanism for merging two different provenances. This latter part remains a bit crude, and simply takes the union of source locations (with sometimes odd results of the source locations are from different files - but this can always be refined later, as it is well separated from everything else).

The second problem, when we cut apart maps, is a bit more tricky. Our solution was a bit of a convenient hack, but it has proven surprisingly accurate and useful in practice. The idea is to disregard the provenance of the maps themselves when deciding on the provenance of a generated GPU kernel, but only look at the provenances of the leaf scalar operations inside of them. Following up on the example above, it means that the provenance of the first map will be the union of the provenance of g and f, and the provenance of the second map will be the provenance of h. This has the slightly odd consequence that the source location of a GPU kernel never points to a map, but to the function (often a lambda) that it is passed, although this does sort of make sense in a way - the map itself is not the cost centre; it is just control flow. It is the stuff inside of it that is the actual computation.

There’s a little bit more to it - the provenance representation is not just a single source location, but can also embed a bit of a (static) stack trace, to properly track the case where the same code is inlined in multiple locations. After these changes, the timeline log produced by futhark profile now ends like this:

...
main.gpuseq_26416
Duration: 5.12 μs
At: LocVolCalib.fut:187:16-90->LocVolCalib.fut:175:33-177:69->LocVolCalib.fut:162:20-66->LocVolCalib.fut:140:14-41->LocVolCalib.fut:100:21-27
Kernel main.gpuseq_26416 with
  grid=(1,1,1)
  block=(1,1,1)
  shared memory=0

main.segmap_intrablock_24377
Duration: 956.416016 μs
At: LocVolCalib.fut:184:17-188:12
Kernel main.segmap_intrablock_24377 with
  grid=(65536,1,1)
  block=(256,1,1)
  shared memory=3072

map_transpose_4b
Duration: 81.919998 μs
At: LocVolCalib.fut:187:16-90->LocVolCalib.fut:175:33-177:69->LocVolCalib.fut:163:6-26
Kernel map_transpose_4b with
  grid=(8,8,256)
  block=(32,4,1)
  shared memory=4224

copy_lmad_dev_to_dev
Duration: 8.192 μs
At: LocVolCalib.fut:184:17-188:12
Kernel copy_lmad_dev_to_dev with
  grid=(1,1,1)
  block=(256,1,1)
  shared memory=0

This is perhaps not the ideal way of reporting the information, but it shows that it actually makes it all the way through the compiler, and can be used in its current form by programmers to understand how to speed up their Futhark programs. (Human trials still pending, since all the students here are on summer vacation.)

What next

The plumbing for tracking provenance now works, which was the part I was most worried about. There are probably places where the compiler propagates provenances in a less-than-useful way (or drops them entirely), which we will address on a case-by-case basis. But the remaining steps are about improving the usability of how the profiling information is presented.

One rather obvious problem is that the provenances are not actually made available in a particularly machine-readable way - they ultimately end up in a timeline log file as above. There is also not any convenient way to associate the table of cost centres, shown at the beginning of this post, with source locations, except by searching for the cost centres in the timeline log. These things are not so difficult to improve - it is just a matter of writing some code for data processing, which I am informed that computers are good at.

I think it is also possible to create a nice heat map of the program source code based on the profiling information, where hot sections of code will show up as red (or whichever colour the HCI people say is associated with speed - maybe blue?). This may be more approachable than reading long text files.

One more tricky problem is that although we track provenance at a fairly fine-grained level of individual scalar operations, the Futhark runtime system only tracks the runtime at the granularity of GPU operations. This means that the profiler can tell you that some particular kernel is slow, and also which span of source code that kernel corresponds to, but it cannot tell you which part of that source is the actual slow part. Fixing that is unfortunately a lot more tricky - we would somehow have to associate source locations with the actual GPU kernel code we generate, and then also make use of the profiling support of the underlying GPU platform API. I know that NVIDIA’s profiler is pretty good, so I think this is technically possible, but it will require significant engineering effort in our kernel code generator.

One very cool thing would be if the profiler could also make suggestions for why a kernel may be slow, phrased in the terminology of the Futhark language, rather than GPU jargon. In particular, one very common reason for why Futhark programs are slow is simply that there is not enough parallelism. It would not be too difficult for the profiler to detect that some very slow kernel consists of very few threads, and suggest that perhaps the program is too sequential. (Suggesting how to parallelise a sequential algorithm is perhaps too ambitious.)

I also need to take a more systematic look at how other compilers handle this problem. It maybe be that I spend a couple of weeks to save myself a day of reading.