Running Futhark on the NVIDIA Jetson Nano
I recently got my hands on an NVIDIA Jetson Nano, which NVIDIA describes as “a small, powerful computer that lets you run multiple neural networks in parallel”. In practice, it resembles a souped-up Raspberry Pi, with a quad-core ARM CPU, a 128-core Maxwell-based NVIDIA GPU, 4GiB of RAM, and a power consumption of 5W. Quite slow compared to a real computer, but fast enough that you can do interesting things with it. Some people are using them for self-driving cars or automated doorbells, but I’ll probably just make it render pretty fractals on the wall display in my office. Since I long ago exceeded my tolerance for writing GPU code by hand, the first step is of course to figure out a way to run Futhark on the device. While the Jetson does not support OpenCL, the Futhark compiler now has a CUDA backend, so it should be possible. This blog post documents how to get it working.
I’ll be assuming that you have a freshly installed Jetson Nano with a working CUDA setup, meaning that you can run nvcc on the command line and compile CUDA programs. For inexplicable reasons, NVIDIA does not set the environment variables correctly out of the box, but setting the following should take care of it:
export PATH=${PATH}:/usr/local/cuda/bin
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/local/cuda/lib64
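As a quick sanity check that the CUDA toolchain really works, you can compile and run a do-nothing kernel. This is just a sketch; check.cu is an arbitrary file name, and the program merely launches an empty kernel and reports whether the launch succeeded:
$ nvcc --version
$ cat > check.cu <<'EOF'
#include <cstdio>
__global__ void noop() { }
int main() {
  noop<<<1, 1>>>();                              // launch an empty kernel
  if (cudaDeviceSynchronize() != cudaSuccess) {  // wait for it and check for errors
    std::printf("CUDA failure\n");
    return 1;
  }
  std::printf("ok\n");
  return 0;
}
EOF
$ nvcc check.cu -o check
$ ./check
ok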
You will need a root partition with at least 32GiB of space.
There are two ways of running Futhark code on the Jetson:
- Run futhark cuda on some other machine, copy the generated .c file to the Jetson, and then compile it to a binary there. Since the C code generated by the Futhark compiler is not machine-specific, it can easily be moved.
- Run an ARM build of the Futhark compiler on the Jetson itself.
I’ll cover the former option first, since it is much simpler. When you run futhark cuda foo.fut, the Futhark compiler will generate a file foo.c and a binary foo. You can then move that foo.c to the Jetson and compile it with:
$ gcc foo.c -o foo -O -std=c99 -lm -lcuda -lnvrtc
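To make the workflow concrete, here is a sketch with a small hypothetical program, sumsquares.fut, that sums the squares of an array of doubles (the file name and the jetson host name are just examples):
$ cat sumsquares.fut
let main (xs: []f64): f64 = f64.sum (map (\x -> x * x) xs)
$ futhark cuda sumsquares.fut   # on the host; generates sumsquares.c (and a binary, if the host has CUDA)
$ scp sumsquares.c jetson:      # copy the generated C file to the Jetson
Then, on the Jetson itself:
$ gcc sumsquares.c -o sumsquares -O -std=c99 -lm -lcuda -lnvrtc
$ echo '[1.0, 2.0, 3.0]' | ./sumsquares
14.0f64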
Note that if your host system does not itself support CUDA, compilation of foo will fail. However, foo.c is still generated, so you can still copy it to the Jetson and finish compilation there. It’s not pretty, but it works. If you use futhark cuda --library, which you likely will for real use, then gcc is not invoked for you, so you will not see any error.
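You can still check the library build on the Jetson itself, though. With --library, the compiler emits foo.c and foo.h for you to build into your own application; a rough sketch, where myprog.c stands for a hypothetical C program of yours that includes foo.h, would be:
$ futhark cuda --library foo.fut   # emits foo.c and foo.h; gcc is not run
$ gcc foo.c -c -o foo.o -O -std=c99
$ gcc myprog.c foo.o -o myprog -O -std=c99 -lm -lcuda -lnvrtc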
Compiling the Futhark compiler on the Jetson
The Jetson uses an ARM CPU, and Futhark binary releases are currently only available for x86-64. Hence, we’ll have to recompile the Futhark compiler from scratch. This is normally a straightforward procedure, but a little trickier when using an exotic architecture (ARM) and a small machine (the Jetson). Specifically, the Futhark compiler is written in Haskell, and while the Glasgow Haskell Compiler (GHC) does support ARM, it is not a so-called “tier 1 platform”, meaning that binary releases are spotty. This looks like it will change in the future, but for now, it takes some effort to get a usable Haskell infrastructure set up on the Jetson.
Ideally, we’d cross-compile an ARM build of Futhark from a beefier machine, but cross-compiling is notoriously difficult, and I could not get it to work. Instead, we’ll compile Futhark on the Jetson itself. Futhark uses the Stack build tool, which fortunately comes compiled for ARM:
$ curl -sSL https://get.haskellstack.org/ | sh
Unfortunately, Futhark’s Stack configuration specifies GHC 8.6.5, and the newest official binary release of GHC on ARM is 8.4.2. While in theory we could use GHC 8.4.2 to compile GHC 8.6.5 on the Jetson, this would take an extremely long time. Instead, we will be using the Nix package manager, which has binary releases of recent GHCs. Installing Nix is non-invasive (we will not be using all of NixOS, which would definitely be invasive):
$ curl https://nixos.org/nix/install | sh
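After the installer finishes, you may have to make nix available in your current shell before stack can find it. For a single-user install that is typically a matter of sourcing the profile script the installer points you at (the exact path is printed at the end of the installation):
$ . $HOME/.nix-profile/etc/profile.d/nix.sh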
While this saves us from compiling GHC itself, we still have to compile a lot of Haskell, and GHC always hungers for memory. First, GHC uses too much RAM-disk space (specifically /var/run), and the default cap of 10% of physical memory is not sufficient. Edit /etc/systemd/logind.conf and set RuntimeDirectorySize=30%. Reboot after this. If you have more systemd knowledge than I, maybe you can avoid the reboot.
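RuntimeDirectorySize controls the size of the per-user runtime tmpfs, so once you are back up after the reboot, a quick way to confirm that the new cap took effect is:
$ df -h /run/user/$(id -u)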
RAM-wise, the Jetson’s 4GiB is not enough. Therefore, set up a 4GiB swap file:
$ sudo fallocate -l 4G /swapfile
$ sudo chmod 600 /swapfile
$ sudo mkswap /swapfile
$ sudo swapon /swapfile
This setup is transient, meaning the swap will be deactivated on the next reboot, but the /swapfile itself will stick around until you delete it yourself.
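If you do end up rebooting midway through the build below, the file keeps the swap signature written by mkswap, so re-enabling the swap is just a matter of:
$ sudo swapon /swapfile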
Now clone the Futhark Git repository as usual, cd into it, and run:
$ stack --nix install --fast -j1
The --nix part tells stack to fetch GHC from Nix, rather than use a non-existent official release. --fast disables the Haskell optimiser, which saves on time and space. -j1 limits concurrency to one job, also to limit memory usage. You may be able to bump this higher (say, -j4) to speed up compilation. If the build crashes at some point due to an out-of-memory situation, simply reduce it to -j1 and carry on. All dependencies that were successfully built will still be available.
The build need not finish in one sitting, which is good, because this will take a long time. When it’s done, you’ll have a futhark binary located in $HOME/.local/bin. To verify that it works, try running part of the Futhark test suite:
$ futhark test --backend=cuda examples
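If the shell cannot find futhark at this point, the likely culprit is that $HOME/.local/bin is not on your PATH; adding it should be enough:
$ export PATH=$HOME/.local/bin:$PATH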
Hopefully, it should work. Congratulations! You can now compile and run Futhark programs on the Jetson. There are no other Jetson-specific considerations that I have noticed. Unfortunately, the CUDA backend is for C, not Python, although we may implement a PyCUDA backend some day. If you want to easily show some graphics, consider Lys, which will certainly also be the topic of a future blog post.