Which tokens are the most frequently used in Futhark programs?

Posted on December 20, 2024

Have you ever wondered which tokens are the most frequently used in Futhark programs? By the end of this post you will wonder no more. One of the more obscure commands provided by the futhark binary is futhark tokens, which takes a program, runs it through the lexer, and prints the constituent tokens. For example, given a program foo.fut

def main x = x + 2

we may obtain the following:

$ futhark tokens foo.fut
DEF
ID (Name "main")
ID (Name "x")
EQU
ID (Name "x")
SYMBOL Plus [] (Name "+")
NATLIT (Name "2") 2

The original purpose of this command was to help in developing the formatter, although it’s also mildly handy for debugging the lexer. But how often do you really modify a lexer? Another thing it can be used for is performing crude statistical analysis on Futhark programs, answering interesting questions such as are all parens matched? or what is the most common variable name?. You could certainly write proper programs to answer these questions on Futhark syntax trees, but it’s rather more convenient to feed futhark tokens into a ~~shell script~~ sophisticated data analysis pipeline.

As the subject of my investigation, I use the Futhark benchmark suite - a collection of about 30k source lines of Futhark code (albeit with some duplication), written by various programmers.

First we obtain the raw data:

$ find . -name \*.fut -exec futhark tokens {} \; > tokens.txt

This produces a file containing 327469 lines, which I will not reproduce in its entirety. The Futhark lexer is a bit unusual in that comments are not disregarded, but actually turned into tokens. While these tokens are disregarded by the parser, they are useful to the formatter, which must reinsert them in appropriate locations while formatting the program. I count the number of comments thus:

$ grep ^COMMENT tokens.txt | wc -l
9450

Actually this is not entirely correct, as documentation comments are a different token, so I try again:

$ grep -E '^(DOC|COMMENT)' tokens.txt | wc -l
11116

Out of 327469 tokens, 11116 are comments.

Now let us count things that are by definition more meaningful than comments. Let us compute the number of occurences of each token. Since the name of a variable (or the contents of a comment) is part of the token, this also lets us find the most frequently used variable names:

$ sort tokens.txt |uniq -c | sort -n
...
   4425 ID (Name "x")
   5194 ID (Name "n")
   5573 ID (Name "i32")
   5652 DEF
   6495 LBRACKET
   7218 LET
  10273 RBRACKET
  10395 DOT
  12028 COMMA
  14180 COLON
  18411 EQU
  24604 LPAR
  24604 RPAR

The two most frequently used tokens (evenly tied) are ( and ). This is not such a great surprise. We find = in a distant third. Interestingly, let is used only a bit more than def. This suggests that many Futhark definitions do not have very many local bindings.

One surprising result is that the token RBRACKET, corresponding to ], is used 10273 times, while the token LBRACKET ([) is only used 6495 times. Doesn’t Futhark require matching use of square brackets? It does, but the lexer plays games with [ depending on what comes after. In Futhark, we want to distinguish x [i] (applying a function to an array literal) from x[i] (indexing an array). The solution we decided upon is for the lexer to produce a different token, INDEXING when a [ follos an identifier. And indeed, we find 3626 occurrences of INDEXING in the token list. This still doesn’t quite add up, but we also use a similar trick to detect #[ (for attributes), which makes up the difference.

As shown in the output excerpt above, the single most used identifier in Futhark programs is i32. I strongly suspect this is not because of its use in variable names, but rather as a reference to the type in explicit type ascriptions. The next most used name is n, which is almost certainly due to its common use in size parameters. The name x is of course also a popular choice when we don’t know what else to call something. The most popular symbolic name is +, where I am again supposing that this is not because of its use as a variable name.

We can also answer more exotic questions. For example, the longest variable name at 37 letters is flux_contribution_nb_density_energy_z. The average length of a variable name is 4, and the median is 3. A total of 13151 number literals are used, but only 963 distinct ones (ignoring syntactic differences), with the five most common being 0, 1, 2, 1.0, and 0.0. The constant 255 is also pretty common, with 148 occurences. Generally the most common numbers are the constants you’d expect (such as powers of two), with 15241094284759029579 being a weird outlier at 32 occurrences - it turns out this is originally a hexadecimal constant used in a random number generator. The sum of all number literals is 712554951466320527360, but the median is only 1. As usual, those with the largest numbers exhibit a disproportional influence.