Running Futhark on the NVIDIA Jetson Nano
I recently got my hands on an NVIDIA Jetson Nano, which NVIDIA describes as “a small, powerful computer that lets you run multiple neural networks in parallel”. In practice, it resembles a souped-up Raspberry Pi, with a quad-core ARM CPU, a 128-core Maxwell-based NVIDIA GPU, 4GiB of RAM, and a power consumption of 5W. Quite slow compared to a real computer, but fast enough that you can do interesting things with it. Some people are using them for self-driving cars or automated doorbells, but I’ll probably just make it render pretty fractals on the wall display in my office. Since I long ago exceeded my tolerance for writing GPU code by hand, the first step is of course to figure out a way to run Futhark on the device. While the Jetson does not support OpenCL, the Futhark compiler now has a CUDA backend, so it should be possible. This blog post documents how to get it working.
I’ll be assuming that you have a freshly installed Jetson Nano with a working CUDA setup, meaning that you can run nvcc on the command line and compile CUDA programs. For inexplicable reasons, NVIDIA does not set the environment variables correctly out of the box, but setting the following should take care of it:
export PATH=${PATH}:/usr/local/cuda/bin
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/local/cuda/lib64
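As a quick sanity check that the CUDA toolchain really works, you can compile and run a do-nothing kernel. This is just a sketch; check.cu is an arbitrary file name, and the program merely launches an empty kernel and reports whether the launch succeeded:
$ nvcc --version
$ cat > check.cu <<'EOF'
#include <cstdio>
__global__ void noop() { }
int main() {
  noop<<<1, 1>>>();                              // launch an empty kernel
  if (cudaDeviceSynchronize() != cudaSuccess) {  // wait for it and check for errors
    std::printf("CUDA failure\n");
    return 1;
  }
  std::printf("ok\n");
  return 0;
}
EOF
$ nvcc check.cu -o check
$ ./check
ok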
You will need a root partition with at least 32GiB of space.
There are two ways of running Futhark code on the Jetson:
- Run futhark cuda on some other machine, copy the generated .c file to the Jetson, and then compile it to a binary there. Since the C code generated by the Futhark compiler is not machine-specific, it can easily be moved.
- Run an ARM build of the Futhark compiler on the Jetson itself.
I’ll cover the former option first, since it is much simpler. When you run futhark cuda foo.fut, the Futhark compiler will generate a file foo.c and a binary foo. You can then move that foo.c to the Jetson and compile it with:
$ gcc foo.c -o foo -O -std=c99 -lm -lcuda -lnvrtc
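To make the workflow concrete, here is a sketch with a small hypothetical program, sumsquares.fut, that sums the squares of an array of doubles (the file name and the jetson host name are just examples):
$ cat sumsquares.fut
let main (xs: []f64): f64 = f64.sum (map (\x -> x * x) xs)
$ futhark cuda sumsquares.fut   # on the host; generates sumsquares.c (and a binary, if the host has CUDA)
$ scp sumsquares.c jetson:      # copy the generated C file to the Jetson
Then, on the Jetson itself:
$ gcc sumsquares.c -o sumsquares -O -std=c99 -lm -lcuda -lnvrtc
$ echo '[1.0, 2.0, 3.0]' | ./sumsquares
14.0f64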
Note that if your host system does not itself support CUDA, compilation of foo will fail. However, foo.c is still generated, so you can still copy it to the Jetson and finish compilation there. It’s not pretty, but it works. If you use futhark cuda --library, which you likely will for real use, then gcc is not invoked for you, so you will not see any error.
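You can still check the library build on the Jetson itself, though. With --library, the compiler emits foo.c and foo.h for you to build into your own application; a rough sketch, where myprog.c stands for a hypothetical C program of yours that includes foo.h, would be:
$ futhark cuda --library foo.fut   # emits foo.c and foo.h; gcc is not run
$ gcc foo.c -c -o foo.o -O -std=c99
$ gcc myprog.c foo.o -o myprog -O -std=c99 -lm -lcuda -lnvrtc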
Compiling the Futhark compiler on the Jetson
The Jetson uses an ARM CPU, and Futhark binary releases are currently only available for x86-64. Hence, we’ll have to recompile the Futhark compiler from scratch. This is normally a straightforward procedure, but a little trickier when using an exotic architecture (ARM) and a small machine (the Jetson). Specifically, the Futhark compiler is written in Haskell, and while the Glasgow Haskell Compiler (GHC) does support ARM, it is not a so-called “tier 1 platform”, meaning that binary releases are spotty. This looks like it will change in the future, but for now, it takes some effort to get a usable Haskell infrastructure set up on the Jetson.
Ideally, we’d cross-compile an ARM build of Futhark from a beefier machine, but cross-compiling is notoriously difficult, and I could not get it to work. Instead, we’ll compile Futhark on the Jetson itself. Futhark uses the Stack build tool, which fortunately comes compiled for ARM:
$ curl -sSL https://get.haskellstack.org/ | sh
Unfortunately, Futhark’s Stack configuration specifies GHC 8.6.5, and the newest official binary release of GHC on ARM is 8.4.2. While in theory we could use GHC 8.4.2 to compile GHC 8.6.5 on the Jetson, this would take an extremely long time. Instead, we will be using the Nix package manager, which has binary releases of recent GHCs. Installing Nix is non-invasive (we will not be using all of NixOS, which would definitely be invasive):
$ curl https://nixos.org/nix/install | sh
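After the installer finishes, you may have to make nix available in your current shell before stack can find it. For a single-user install that is typically a matter of sourcing the profile script the installer points you at (the exact path is printed at the end of the installation):
$ . $HOME/.nix-profile/etc/profile.d/nix.sh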
While this saves us from compiling GHC itself, we still have to compile a lot of Haskell, and GHC always hungers for memory. First, GHC uses too much RAM-disk space (specifically /var/run), and the default cap of 10% of physical memory is not sufficient. Edit /etc/systemd/logind.conf and set RuntimeDirectorySize=30%. Reboot after this. If you have more systemd knowledge than I, maybe you can avoid the reboot.
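RuntimeDirectorySize controls the size of the per-user runtime tmpfs, so once you are back up after the reboot, a quick way to confirm that the new cap took effect is:
$ df -h /run/user/$(id -u)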
RAM-wise, the Jetson’s 4GiB is not enough. Therefore, set up a 4GiB swap file:
$ sudo fallocate -l 4G /swapfile
$ sudo chmod 600 /swapfile
$ sudo mkswap /swapfile
$ sudo swapon /swapfile
This setup is transient, meaning the swap will be deactivated on the next reboot, but the /swapfile itself will stick around until you delete it yourself.
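If you do end up rebooting midway through the build below, the file keeps the swap signature written by mkswap, so re-enabling the swap is just a matter of:
$ sudo swapon /swapfile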
Now clone the Futhark Git repository as usual, cd into it, and run:
$ stack --nix install --fast -j1
The --nix part tells stack to fetch GHC from Nix, rather than use a non-existent official release. --fast disables the Haskell optimiser, which saves on time and space. -j1 limits concurrency to one job, also to limit memory usage. You may be able to bump this higher (say, -j4) to speed up compilation. If the build crashes at some point due to an out-of-memory situation, simply reduce it to -j1 and carry on. All dependencies that were successfully built will still be available.
The build need not finish in one sitting, which is good, because this will take a long time. When it’s done, you’ll have a futhark binary located in $HOME/.local/bin. To verify that it works, try running part of the Futhark test suite:
$ futhark test --backend=cuda examples
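If the shell cannot find futhark at this point, the likely culprit is that $HOME/.local/bin is not on your PATH; adding it should be enough:
$ export PATH=$HOME/.local/bin:$PATH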
Hopefully, it should work. Congratulations! You can now compile and run Futhark programs on the Jetson. There are no other Jetson-specific considerations that I have noticed. Unfortunately, the CUDA backend is for C, not Python, although we may implement a PyCUDA backend some day. If you want to easily show some graphics, consider Lys, which will certainly also be the topic of a future blog post.