-- # Removing duplicates -- -- The most straightforward way to remove duplicate elements from an -- array is to sort the array, then use a `filter` to keep only those -- elements that are either first in the resulting array or differ -- from the following element. First we import a sorting function -- from the [merge sort example](merge-sort.html). module merge = import "merge-sort" -- Then we write a function that removes consecutive duplicate -- elements. def neq lte x y = if x `lte` y then !(y `lte` x) else true def pack lte xs = zip3 (indices xs) xs (rotate (-1) xs) |> filter (\(i,x,y) -> i == 0 || neq lte x y) |> map (.1) def pack_i32 = pack (i32.<=) -- > pack_i32 [3,0,0,1,1,1,2,2,3,1] -- And then by sorting first, a function that removes any duplicate -- elements. def dedup_sort lte xs = merge.sort lte xs |> pack lte def dedup_sort_i32 = dedup_sort (i32.<=) -- > dedup_sort_i32 [3,0,0,1,1,1,2,2,3,1] -- One downside of this definition is that the original element order -- is not preserved, as shown above. We can address this by -- associating the original indexes with the input sequence and -- sorting by those at the end. def nub lte xs = zip xs (indices xs) |> merge.sort (\(a,_) (b,_) -> a `lte` b) |> pack (\(a,_) (b,_) -> a `lte` b) |> merge.sort (\(_,i) (_,j) -> i <= j) |> map (.0) def nub_i32 = nub (i32.<=) -- > nub_i32 [3,0,0,1,1,1,2,2,3,1] -- This is a general solution, but all this sorting is quite -- inefficient. If we know we will be operating on numbers or -- number-like data, we can use a [radix sort](radix-sort.html) to -- make the sorting a bit faster. However, if we can make assumptions -- about the input, even more efficient approaches are possible. -- -- For example, if we are deduplicating `k` numbers in the range -- `[0,n-1]`, where `k` is much larger than `n`, we can use -- [scatter](gather-and-scatter.html) to figure out which numbers are -- contained, followed by a filter: def dedup_scatter [k] (n: i64) (xs: [k]i64) : []i64 = scatter (replicate n false) xs (replicate k true) |> zip (iota n) |> filter (.1) |> map (.0) -- -- > dedup_scatter 4 [3,0,0,1,1,1,2,2,3,1] -- Note that this also does not preserve original element order. -- -- If we can *characterise* the elements to deduplicate by an integer, -- but they also contain other information we wish to preserve, we -- need to use a [generalised histogram](histograms.html). The idea -- is similar to the `scatter` above, but instead of storing -- true/false, we store the *lowest index of an element that hits that -- bucket*, then afterwards each element checks whether it "won". def nub_hist [k] 't (n: i64) (is: [k]i64) (xs: [k]t) : []t = let H = hist i64.min k n is (indices xs) in map2 (\i j -> H[i] == j) is (indices xs) |> zip xs |> filter (.1) |> map (.0) def nub_hist_i32 n is xs : []i32 = nub_hist n is xs -- > nub_hist_i32 4 [3,0,0,1,1,1,2,2,3,1] [0,1,2,3,4,5,6,7,8,9] -- This prefers the first occurrence of an element. It's -- straightforward to change it to prefer the last occurrence instead. -- One particularly crucial aspect of this definition is that the -- operator we use for the histogram, `i64.min`, is likely to be -- directly supported in hardware, meaning the histogram will be -- computed very efficiencly.