Source file: removing-duplicates.fut

Removing duplicates

The most straightforward way to remove duplicate elements from an array is to sort the array, then use a filter to keep only those elements that are either first in the resulting array or differ from the following element. First we import a sorting function from the merge sort example.

module merge = import "merge-sort"

Then we write a function that removes consecutive duplicate elements.

def neq lte x y = if x `lte` y then !(y `lte` x) else true

def pack lte xs =
  zip3 (indices xs) xs (rotate (-1) xs)
  |> filter (\(i,x,y) -> i == 0 || neq lte x y) |> map (.1)

def pack_i32 = pack (i32.<=)

> pack_i32 [3,0,0,1,1,1,2,2,3,1]

[3i32, 0i32, 1i32, 2i32, 3i32, 1i32]

And then by sorting first, a function that removes any duplicate elements.

def dedup_sort lte xs = merge.sort lte xs |> pack lte

def dedup_sort_i32 = dedup_sort (i32.<=)

> dedup_sort_i32 [3,0,0,1,1,1,2,2,3,1]

[0i32, 1i32, 2i32, 3i32]

One downside of this definition is that the original element order is not preserved, as shown above. We can address this by associating the original indexes with the input sequence and sorting by those at the end.

def nub lte xs =
  zip xs (indices xs)
  |> merge.sort (\(a,_) (b,_) -> a `lte` b)
  |> pack (\(a,_) (b,_) -> a `lte` b)
  |> merge.sort (\(_,i) (_,j) -> i <= j)
  |> map (.0)

def nub_i32 = nub (i32.<=)

> nub_i32 [3,0,0,1,1,1,2,2,3,1]

[3i32, 0i32, 2i32, 1i32]

This is a general solution, but all this sorting is quite inefficient. If we know we will be operating on numbers or number-like data, we can use a radix sort to make the sorting a bit faster. However, if we can make assumptions about the input, even more efficient approaches are possible.

For example, if we are deduplicating k numbers in the range [0,n-1], where k is much larger than n, we can use scatter to figure out which numbers are contained, followed by a filter:

def dedup_scatter [k] (n: i64) (xs: [k]i64) : []i64 =
  scatter (replicate n false) xs (replicate k true)
  |> zip (iota n)
  |> filter (.1)
  |> map (.0)

> dedup_scatter 4 [3,0,0,1,1,1,2,2,3,1]

[0i64, 1i64, 2i64, 3i64]

Note that this also does not preserve original element order.

If we can characterise the elements to deduplicate by an integer, but they also contain other information we wish to preserve, we need to use a generalised histogram. The idea is similar to the scatter above, but instead of storing true/false, we store the lowest index of an element that hits that bucket, then afterwards each element checks whether it “won”.

def nub_hist [k] 't (n: i64) (is: [k]i64) (xs: [k]t) : []t =
  let H = hist i64.min k n is (indices xs)
  in map2 (\i j -> H[i] == j) is (indices xs)
     |> zip xs
     |> filter (.1)
     |> map (.0)

def nub_hist_i32 n is xs : []i32 = nub_hist n is xs

> nub_hist_i32 4 [3,0,0,1,1,1,2,2,3,1] [0,1,2,3,4,5,6,7,8,9]

[0i32, 1i32, 3i32, 6i32]

This prefers the first occurrence of an element. It’s straightforward to change it to prefer the last occurrence instead. One particularly crucial aspect of this definition is that the operator we use for the histogram, i64.min, is likely to be directly supported in hardware, meaning the histogram will be computed very efficiencly.