There is no escape from Futhark
In 1981, Brian Kernighan wrote the article Why Pascal is Not My Favourite Programming Language. Programming language flame wars are an old tradition in our field, and while I don’t think most of the criticism raised in the article was novel at the time, the author’s fame ensured that it would be used as flame fuel by C afficionados thoughout the 80s and beyond. The criticisms cover much of the Pascal language, from minor details to conceptual disagreements. Most of the major problems were resolved in later Pascal revisions or by vendor-specific extensions, and it is my impression that Pascal fans became quite tired of refuting later C programmers who repeated Kernighan’s criticm - indeed, the FreePascal wiki contains a list of point-by-point refutations.
But despite the concrete criticism in Kernighan’s article being of only historical interest for most of my own life, there is one timeless section that has stayed with me. It’s the shortest in the article, so I’ll repeat it in its entirety:
2.6. There is no escape
There is no way to override the type mechanism when necessary, nothing analogous to the ``cast’’ mechanism in C. This means that it is not possible to write programs like storage allocators or I/O systems in Pascal, because there is no way to talk about the type of object that they return, and no way to force such objects into an arbitrary type for another use. (Strictly speaking, there is a large hole in the type-checking near variant records, through which some otherwise illegal type mismatches can be obtained.)
At first glance, this section is about a specific limitation in the type system: no casts between pointers. But the title of the section is the real lesson: there is no escape. Meaning, Pascal has strict rules, which is mostly good, but provides no way of breaking those rules when critically necessary. This is a problem, because when writing a program in a language with weak rules, you might have gotten the program wrong, but if you were careful, you might also have gotten it right! However, if the rules prevent you from writing the program at all, then that’s it. No way out.
I’m not interpreting Kernighan’s words more generally than he intended, as he himself clarifies in his conclusion:
The language is inadequate but circumscribed, because there is no way to escape its limitations. There are no casts to disable the type-checking when necessary. There is no way to replace the defective run-time environment with a sensible one, unless one controls the compiler that defines the “standard procedures.” The language is closed.
I’d like to write about what forms such “escape” can take, how it manifests in a few current languages, and why Futhark still offers no escape.
Offering an escape
The most obvious example of a language with plenty of ways to “escape”
is of course C itself. Especially the pre-ANSI C of 1981 offers
plenty of ways for you to escape the confines of the type system, or
anything else you might want. You can of course freely cast between
pointers of different types. You can take the address of a struct
or some other compound value, and treat it as raw memory with
memcpy()
operations (or just treat it as a char*
pointer and do
pointer arithmetic yourself). You can also do (probably) nonsensical
things like take the address of a function, cast it to int*
, and
then read from or write to that address. The compiler won’t stop you.
This is because underneath C’s language semantics, there is also a
machine semantics we can use to interpret the meaning of a
program. By this I mean both the abstract virtual machine used to
define C as a language in the specification, but also the intuitive
idea of how a C compiler generates machine code and represents values
in memory. While the virtual machine is the “correct” machine model,
most C programmers probably think intuitively about the physical
machine. The C specification itself tries rather carefully to specify
what kind of “escape” is allowed, in the sense of operations that the
language cannot guarantee will be safe (via the type system), but
are still required to compute something predictable. For example, the
void*
pointers returned by malloc()
can be cast to other pointer
types, and then used to store different objects. At least if a
sufficient size was passed to malloc()
- another thing the language
itself cannot check.
The important thing when adding a language escape mechanism is to ensure that you can still write correct code. Adding the ability to “escape” by writing to an arbitrary register in a C program at any time would not be very useful, because the odds of clobbering a register already in use would be high, and later changes to the compiler or program might change which registers are in use at any given point. Escaping must still allow for the construction of well-engineered, robust code.
Of course, escaping into the wilderness beyond the type system is always dangerous. Both implementation-defined and undefined behaviour lurks, and the distinction between the two is often not clear to many C programmers. Worse, C does not make syntactically clear when you escape from the rules, and it is easy to do so by accident.
Escaping in a low-level language
As contrast to C, consider a much newer language designed for more or less the same niche as C: Rust. Because Rust must be useful for the very lowest levels of systems programming, it ultimately has to provide the same avenues for escape as C. The programmer must somehow be able to mutilate memory in arbitrary ways. Yet we also know that even for those programs that need this kind of escape somewhere (say, an OS kernel), most parts of such programs do not. Even an OS kernel consists mostly of quite ordinary code that uses internal abstractions to communicate directly with hardware when necessary. The implementations of these abstraction may need to break the rules, but their callers do not.
This is why Rust provides the often misunderstood
unsafe
keyword,
which lets you locally escape from Rust’s otherwise sound type system.
I think unsafe
is unfortunately named - unchecked
would be a more
precise description of what is really going on - but perhaps unsafe
was chosen to discourage needless use. Code inside an unsafe
block
is not unsafe in the sense that it is necessarily fragile or will
arbitrarily break, but merely that the language cannot check that what
is going on is sane - the programmer is responsible for writing code
that does the right thing. In essence, all of a C program is
“unsafe” in Rust terms, while Rust allows more precise control of
exactly when we escape.
Mentioned for the sake of completeness, ATS is a language that allows fully safe low-level programming. The programmer is responsible for providing a machine-checkable proof that their low-level shenanigans are actually sensible. Providing such proofs can be quite difficult (see this talk by Aditya Siram), which is probably why very few systems languages go in this direction.
Escaping in a high-level language
Low-level languages are not the only ones that need to offer some kind of escape. While Kernighan mostly presented escape as about being able to play tricks with memory, I think the general principle applies also to languages that impose rules at a much higher level of abstraction.
As an example, consider Standard ML - one of the first widely used functional languages. Standard ML is not a pure language and offers features such as mutable references and arrays. Yet Standard ML is very much intended to be used in a functional style, and the majority of Standard ML code presents purely functional interfaces. Reference cells are still often used internally, behind purely functional abstraction boundaries, for performance reasons. We can see this as a form of “escape”, although we are still firmly within the Standard ML semantics themselves. If we make a mistake, then the behaviour of our program as a whole is still well-defined, although our claimed pure API might suddenly have observable side effects. Reflection mechanisms in languages like Java and C# are a similar example - while they don’t allow us to subvert the virtual machine itself, we can break the type systems of the languages as much as we wish.
Haskell has the suggestively named
unsafePerformIO
,
which (together with cousins such as
unsafeInterleaveIO
accursedUnutterablePerformIO
)
allows us to perform arbitrary IO operations in otherwise pure code.
These functions are really bad mojo, because the rules for when they
are safe to use are unclear, and the consequences of misuse subtle.
The only way to really understand when they are safe to use is to
understand the compilation model of your Haskell compiler, and how its
runtime system models IO. Yet if you do have this knowledge, then you
can use them to implement safe abstractions - much like unsafe
in
Rust.
Where “escaping” in a low-level language often lets you do almost anything, escaping in a high-level language is different. Haskell has no function, no matter how frightening its name, that will let you change the value of a local variable. This is because there would be no way to actually write robust code with such a function. In Haskell there is no expectation that a “variable” is eventually going to map to some storage location that can be modified. The compiler may duplicate, move, combine, or otherwise reorganise variables in all kinds of unpredictable ways. You can imagine Haskell providing unsafe functions that let you walk the heap at runtime (and indeed they exist), but it is hard to use them to build robust abstractions. They are primarily used to interoperate with other languages such as C.
Put another way, “escaping” lets us touch things that would normally not be safe to touch (or at least, that the language cannot verify would be safe), but it does not let us touch things that did not exist in the first place.
Escaping in Futhark
Almost all general-purpose languages provide various mechanisms for escape. It seems just as necessary today as it did in 1981. But as I alluded to in the introduction, Futhark does not provide any escape hatches. How come?
The easy answer is that Futhark isn’t a general-purpose language, so it’s fine and expected that not all programs can be written in Futhark. But beyond that, the reason is that the selling point of Futhark is the compiler rather than the language, and adding escape hatches would seriously limit the power of the compiler.
Recall that “escaping” is mostly about doing things that the compiler cannot verify, and may not even understand. This means that the compiler must be very careful not to modify the unsafe code too much, as this might violate the subtle properties that the correctness of the code depends on. For example, consider if we added the ability to treat Futhark arrays as flat byte sequences. Maybe that would be useful. But now the in-memory representation of arrays would matter to the programmer. This is unfortunate, because the Futhark compiler’s GPU pipeline carefully lays out arrays in memory to ensure coalesced memory access, based on its analysis of how the arrays are traversed by the program. The precise in-memory layout is an implementation detail, and varies based on the surrounding code. In fact, the compiler may actually decide that some array is best duplicated in memory, using two different representations, if it is accessed in performance-wise mutually incompatible ways (say, both by row and by column in different loops). It’s also quite common for the compiler to interleave different arrays when performing flattening of nested parallelism. It’s hard to imagine how one would write robust code that directly accessed the byte-sequence representation of arrays.
In most languages, the typical way to use the ability to “escape” is to hide away unsafe code behind safe APIs. If the API is implemented correctly, then the user has no need to care about what goes on behind the procedure calls. Even a compiler usually will not have to care - unless it inlines the functions, it can treat them as black boxes. Unfortunately, many Futhark optimisations depend crucially on inlining. When generating GPU code, the Futhark compiler often has to cross abstraction boundaries because having black-box functions that allocate memory or themselves contain parallellism will not perform well on a GPU. The compiler must take a more global view in order to generate code with acceptable performance.
To sum up, “escaping” is in most languages about allowing the programmer to operate directly on the (abstract or concrete) machine that the compiler targets. To preserve the compiler’s flexibility, we are unwilling to expose any part of the implied abstract machine used internally by the Futhark compiler. This is unusual among general-purpose languages, but not among similarly specialised languages. SQL also does not allow you to directly access the physical table or index layout, even when you ask nicely, presumably to ensure that the database system is able to transparently optimise these things.