Here’s a simple problem, with solutions in Perl and Haskell.
@joel: suppose I have a list of strings (they happen to be paths) - how might I find only the longest instance of each path?
@joel: that is give /foo /foo/bar /foo/bar/baz /qux I only want back /foo/bar/baz and /qux
@joel: do I need to build a trie myself? my cpanfu is failing me
As this was a Perl channel, Penfold suggested splitting on “/” into a multilevel hash, then
outputting the leaf nodes. Then pjcj suggested, performance permitting, going through the
list and deleting each entry that is a substring of the previous entry: that is, treating the
strings simply as strings of characters.
This makes it sound like you have to compare each string with the strings shorter than it,
but it turns out that you can get exactly the right behaviour by walking an asciibetically
sorted list and only comparing neighbouring entries. For example, a set of paths similar to
the ones Joel quoted might get sorted as:
- /foo
- /foo/bar
- /foo/bar/baz
- /qux
- /qux/wibble
You can see that this should reduce to
- /foo/bar/baz
- /qux/wibble
as we collapse the previous element every time we realise that it is a prefix
of the following element.
We can do this in a single pass using a fold, which is a pattern that’s less common in
Perl than other idioms from functional programming like map and grep, but
which is implemented in List::Util’s reduce. A typical example of reduce
might be this:
use List::Util qw(reduce);

sub sum { reduce { $a + $b } @_ }

print sum(1, 2, 3);   # = 6
Of course, in that example the two elements $a and $b are the same kind of thing (a number),
and the reducing function ($a+$b) returns a number in turn. But for the longest-paths
problem we need something else: a list of results. This is a common idiom in Haskell:
supplying an explicit “accumulator” as the initial value of the fold. Perl’s reduce, however,
uses the first element of the list as the initial value. That’s often a perfectly good
strategy (it’s what Haskell calls foldl1), but for a long time I thought the Perl
implementation was weak because it didn’t allow an arbitrary initial value. I was wrong:
LeoNerd pointed out in #moose that you can just pass the initial value as the first element
of the list! (This works in Perl, of course, as lists are untyped.) For example, this
subroutine would be a (rather silly) way of turning a list into a list reference:
sub as_list { reduce { [@$a,$b] } [], @_ }
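For comparison, the Haskell distinction here is between foldl, which takes an explicit
initial accumulator, and foldl1, which seeds the fold with the first element of the list.
A quick sketch:

-- foldl1 seeds the fold with the first element, so the accumulator
-- has the same type as the elements:
foldl1 (+) [1, 2, 3]                            -- 6

-- foldl takes an explicit initial accumulator, which can be a
-- different type (here, a list of results):
foldl (\acc x -> acc ++ [show x]) [] [1, 2, 3]  -- ["1","2","3"]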
So, we’re going to call our function like this:
my $longest_list = longest qw( /foo /qux/wibble /foo/bar/baz /qux /foo/bar );
And we can implement it like this:
sub longest {
    reduce {
        my ($acc, $val) = ($a, $b);   # 'accumulator' and 'value'

        my $last = pop @$acc || '';   # The "last" value is an empty string
                                      # the first time around

        # We return a list reference, which will be the accumulator
        # the next time around
        [
            @$acc,
            $val =~ /^\Q$last\E/ ? () : ($last),   # collapsed, if substring
            $val,
        ]
    }
    [],         # initial accumulator (empty list)
    sort @_;    # sorted input list
}
OK, so I promised myself I’d sketch this in Haskell too. It should be slightly simpler
as the hackish popping off the end of the list is replaced with a nice
pattern match.
import Data.List (sort, isPrefixOf)

longest :: [String] -> [String]
longest = foldl aux [] . sort
  where
    aux []         v = [v]
    aux xss@(x:xs) v = v : (if x `isPrefixOf` v then xs else x:xs)
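Running it over Joel’s paths shows the algorithm doing its job:

longest ["/foo", "/foo/bar", "/foo/bar/baz", "/qux", "/qux/wibble"]
  -- ["/qux/wibble","/foo/bar/baz"]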
As often happens when converting an algorithm to Haskell, the result list is
backwards, as we’re consing the new result rather than appending to it. But in
fact, it’s not especially elegant: the conditional concatenation is a bit
clumsy for example. I asked for feedback on #haskell. And, as so often
happens on that channel, I got an altogether more compact version: augustss
suggested something like:
longest2 = nubBy (flip isPrefixOf) . sortBy (flip compare)
nubBy captures exactly the pattern we want here: it reduces a list by dropping elements
that count as duplicates of ones already kept. (Apparently you may need to reverse the
sense of the nubBy predicate in GHC 6.10, which would simplify this to nubBy isPrefixOf;
I’m using 6.6.1. Nothing like backwards compatibility…)
Update: sortBy (flip compare) is dolio’s suggested improvement to the original reverse . sort. The original version does read better I think, but flipping the comparison is a bit more efficient than having to reverse the list after sorting it. It also starts to produce values lazily before finishing the sort.
Update: Pumpkin noted that a trie would be more efficient than nub… D’oh! nub, while it
looks elegant and compact above, actually loses the benefit of our having sorted the list
in the first place: it’s implemented as a set of nested (“Chinese box”) filter functions,
effectively a linked list of comparisons. Each new path gets checked against every path
we’ve kept so far, even though, the list being sorted, only the path it was sorted next to
could ever collapse it.
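I haven’t actually written the trie version, but a minimal sketch of the idea might look
something like this: a character-level trie built on Data.Map, with the same
prefix-of-string semantics as the nub version (the names insertTrie, leavesTrie and
longestTrie are just made up for this sketch).

import qualified Data.Map as M
import Data.Maybe (fromMaybe)

-- Each node maps a character to its sub-trie; a leaf is a node with no children.
newtype Trie = Trie (M.Map Char Trie)

emptyTrie :: Trie
emptyTrie = Trie M.empty

-- Insert a string, sharing any prefix that is already present.
insertTrie :: String -> Trie -> Trie
insertTrie []     t        = t
insertTrie (c:cs) (Trie m) = Trie (M.alter step c m)
  where step sub = Just (insertTrie cs (fromMaybe emptyTrie sub))

-- Read back only the maximal strings: the root-to-leaf paths.
leavesTrie :: Trie -> [String]
leavesTrie (Trie m)
  | M.null m  = [""]
  | otherwise = [ c : rest | (c, sub) <- M.toList m, rest <- leavesTrie sub ]

longestTrie :: [String] -> [String]
longestTrie [] = []
longestTrie ps = leavesTrie (foldr insertTrie emptyTrie ps)

Inserting a path only walks its own characters rather than comparing it against every other
kept path, and there’s no sorting step at all (the Map keeps its keys ordered, so the leaves
come back in lexicographic order for free). Still, a whole data structure feels heavyweight
next to a one-liner, and it turns out the sorted-list approach can be rescued much more
cheaply.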
What we really need is precisely the semantics of Unix’s uniq, which expects a sorted list, and can therefore do a single pass, like our reduce version. Olathe pointed out a solution with map head . group which is almost what we want… except we need the -By version. Given a new function uniqBy, we just need to substitute it for nubBy and we’re suddenly efficient again!
uniqBy eq = map head . groupBy eq

longest3 = uniqBy (flip isPrefixOf) . sortBy (flip compare)
Update 2009-01-27: It occurred to me that instead of all the flipping, we could simplify uniqBy by making it take the last element instead of the head.
longest4 = uniqBy' isPrefixOf . sort

uniqBy' eq = map last . groupBy eq
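A quick check against Joel’s paths should give:

longest4 ["/foo", "/foo/bar", "/foo/bar/baz", "/qux", "/qux/wibble"]
  -- ["/foo/bar/baz","/qux/wibble"]

This version also comes out in ascending order, since we now sort ascending rather than
flipping the comparison.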
Update: X pointed out that there’s a bug: ["/a", "/aa", "/aaa"] will all get smushed
together, as each is a prefix of the next. Whether this counts as a bug relative to the
original specification is debatable: Joel did mention “strings”, but he also mentioned that
they are file paths, so let’s try to make the code do the right thing for paths:
In fact, the basic core uniqBy' isPrefixOf . sort can remain exactly the same.
But instead of a single string "/foo/bar/baz" we’ll want to be passing
it a list of strings like ["foo","bar","baz"]. As Haskell’s list processing
is completely generic, all the building blocks like isPrefixOf and sort
will Just Work exactly as before! The only complication is splitting and joining the
strings, something that’s trivial in Perl. I wrote about splitting words in Haskell
previously, but in the meantime Brent Yorgey has created Data.List.Split, which should do
exactly what we want. However, I can’t install it on my laptop’s Ubuntu-packaged GHC 6.6.1,
so for now I’ll fall back to Text.Regex.splitRegex. Then, to put the path separator back we
need to intersperse it between the components, but that doesn’t flatten the list, so we end
up with something like ["foo", "/", "bar"]. Control.Monad’s join function, generic as it is,
actually does the right thing here: for lists, join is just concat. So we end up with:
import Data.List (sort, groupBy, intersperse, isPrefixOf)
import Control.Monad (join)
import Text.Regex (splitRegex, mkRegex)

-- (uniqBy' is as defined above)
longest5 :: [String] -> [String]
longest5 = map rejoin . uniqBy' isPrefixOf . sort . map split
  where
    split  = splitRegex $ mkRegex "/"
    rejoin = join . intersperse "/"
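The rejoin step works because splitRegex leaves a leading empty component for the leading
slash, and join on lists is simply concat. For instance (note the leading empty string,
which is what puts the initial “/” back):

splitRegex (mkRegex "/") "/foo/bar"       -- ["","foo","bar"]
intersperse "/" ["","foo","bar"]          -- ["","/","foo","/","bar"]
join (intersperse "/" ["","foo","bar"])   -- "/foo/bar"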
We can test this against an input like [ "/a", "/aa", "/a/a" ] to check it’s all
correct.
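For the record, that test should give:

longest5 ["/a", "/aa", "/a/a"]
  -- ["/a/a","/aa"]   ("/a" is collapsed into "/a/a", but "/aa" now survives)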
The same approach should work in Perl, except that the splitting and joining will be slightly
simpler, while the equivalent of the isPrefixOf logic is more complex, as Perl doesn’t have
the same generic list comparisons. I’ll leave that as an exercise for the reader :-)
Talking of which, thanks to Will for the Python/Erlang versions in the
comments, “Anon” for an OCaml solution, and “Programmer” for a Perl challenge
I’ll come back to if time allows.
Update 2009-01-30: Will pointed out that my “improved” testcase isn’t. Better would be ["/a", "/aa", "/b", "/b/b"] to check that /a and /aa are distinct, but /b is merged into /b/b.
He also sent an improved Erlang version, thanks!