Perl, Haskell, stuff
@joel: suppose I have a list of strings (they happen to be paths) -
how might I find only the longest instance of each path?
@joel: that is give /foo /foo/bar /foo/bar/baz /qux I only want back
/foo/bar/baz and /qux
@joel: do I need to build a trie myself? my cpanfu is failing me
As this was a Perl channel, Penfold suggested splitting on "/" into a multilevel hash, then
outputting the leaf nodes. Then pjcj suggested, performance permitting, to go through the
list deleting each entry that is a substring of the previous entry: that is, treating the
strings simply as strings of characters.
This makes it sound like you have to compare each string with the strings shorter than it, but it turns out that you can get exactly the right behaviour on an asciibetically sorted list. For example, a set of paths similar to the ones Joel quoted might get sorted as:
We can do this in a single pass using a fold, which is a pattern that's less common in Perl than other idioms from functional programming like map and grep, but which is implemented in List::Util's reduce. A typical example of reduce might be this:
sub sum {
reduce { $a+$b } @_;
}
print sum (1,2,3); # = 6
Of course there the two elements $a and $b are the same kind of thing (a number) and the
reducing function ($a+$b) returns a number in turn. But for the longest paths problem,
we need something else: a list of results. This is a common idiom in Haskell: using an
"accumulator" as the initial value. However Perl's reduce uses the first element
of the list as the initial value. That's often a very good strategy often known as
foldl1, but for a long time I thought that the Perl implementation was weak as it
didn't allow a more flexible arbitrary value. I was wrong. LeoNerd pointed out in #moose
that you can just pass the init value as the first element of the list! (This works in Perl
of course, as lists are untyped). For example, this subroutine would be a (rather silly)
way of turning a list into a list reference:
sub as_list {
reduce { [@$a,$b] } [], @_
}
So, we're going to call our function like this:
And we can implement it like this:my $longest_list = longest qw( /foo /qux/wibble /foo/bar/baz /qux /foo/bar );
sub longest {
reduce {
my ($acc, $val) = ($a, $b); # 'accumulator' and 'value'
my $last = pop @$acc || ''; # The "last" value is an empty string
# the first time around
# We return a list reference, which will be the accumulator
# the next time around
[ @$acc,
$val =~ /^\Q$last\E/ ? () : ($last), # collapsed, if substring
$val ]
}
[], # initial acumulator (empty list)
sort @_; # sorted input list
}
OK, so I promised myself I'd sketch this in Haskell too. It should be slightly simpler
as the hackish popping off the end of the list is replaced with a nice
pattern match.
longest :: [String] -> [String]
longest = foldl aux [] . sort
where aux [] v = [v]
aux xss@(x:xs) v = v :
(if x `isPrefixOf` v
then xs
else x:xs)
As often happens when converting an algorithm to Haskell, the result list is
backwards, as we're consing the new result rather than appending to it. But in
fact, it's not especially elegant: the conditional concatenation is a bit
clumsy for example. I asked for feedback on #haskell. And, as so often
happens on that channel, I got an altogether more compact version: augustss
suggested something like:
nubBy is precisely the pattern that reduces a list based on whether adjacent elements are to be collapsed or not. (Apparently you may need to reverse the sense of nubBy in GHC 6.10 — I'm using 6.6.1 — which simplifies to nubBy isPrefixOf. Nothing like backwards compatibility...)longest2 = nubBy (flip isPrefixOf) . sortBy (flip compare)
Update: sortBy (flip compare) is dolio's suggested improvement to the original reverse . sort. The original version does read better I think, but flipping the comparison is a bit more efficient than having to reverse the list after sorting it. It also starts to produce values lazily before finishing the sort.
Update: Pumpkin noted that a trie would be more efficient than nub... D'oh! nub, while it looks elegant and compact above is actually losing the benefit of us having sorted the list in the first place (it's implemented using a set of chinese box filter functions - effectively the same as a linked list of comparisons. If we match the nub, this is very efficient — that's the outermost box already! However if we're not a prefix, then we'll uselessly check if we're a prefix of every other path, even though we couldn't possibly be.
What we really need is precisely the semantics of Unix's uniq, which expects a sorted list, and can therefore do a single pass, like our reduce version. Olathe pointed out a solution with map head . group which is almost what we want... except we need the -By version. Given a new function uniqBy, we just need to substitute it for nubBy and we're suddenly efficient again!
Update 2009-01-27: It occurred to me that instead of all the flipping, we could simplify uniqBy by making it take the last element instead of the head.uniqBy eq = map head . groupBy eq longest3 = uniqBy (flip isPrefixOf) . sortBy (flip compare)
Update: X pointed out that there's a bug: ["/a", "/aa", "/aaa"] will all get smushed together as they're substrings. Actually whether this is a bug to the original specification is debatable - Joel did mention "strings", however he did also mention that they are file paths, so let's try to make them do the right thing:longest4 = uniqBy' isPrefixOf . sort uniqBy' eq = map last . groupBy eq
In fact the basic core uniqBy' isPrefixOf . sort can remain exactly the same. But instead of a single string "/foo/bar/baz" we'll want to be passing it a list of strings like ["foo","bar","baz"]. As Haskell's list processing is completely generic, all the building blocks like isPrefixOf and sort will Just Work exactly as before! The only complication is splitting and joining the strings, something that's trivial in Perl. I wrote about splitting words in Haskell previously, but in the mean time Brent Yorgey has created Data.List.Split which should do exactly what we want. However I can't install it on my laptop's ubuntu packaged ghc 6.6.1, so for now I'll fall back to Text.Regex.splitRegex. Then, to join the path separator again we need to intersperse it, but that doesn't flatten the list, so we end up with something like ["foo", "/", "bar"]. Control.Monad's join function, generic as it is, actually does the right thing here. So we end up with:
longest5 :: [String] -> [String]
longest5 = map rejoin . uniqBy' isPrefixOf . sort . map split
where split = splitRegex $ mkRegex "/"
rejoin = join . intersperse "/"
We can test this against an input like [ "/a", "/aa", "/a/a" ] to check it's all
correct.
The same approach should work in Perl, except that splitting/joining will be slightly simpler, while the equivalent isPrefixOf logic is more complex, as it doesn't have the same generic list comparisons. I'll leave that as an exercise to the reader :-)
Talking of which, thanks to Will for the Python/Erlang versions in the comments, "Anon" for an Ocaml solution, and "Programmer" for a Perl challenge I'll come back to if time allows.
Update 2009-01-30: Will pointed out that my "improved" testcase isn't. Better would be ["/a", "/aa", "/b", "/b/b"] to check that /a and /aa are distinct, but /b is merged into /b/b.
He also sent an improved Erlang version, thanks!
Osfameron's blog on Haskell, Perl programming, stuff.
Programmer
January 27th, 2009 at 10:58 am
"Penfold suggested splitting on "/" into a multilevel hash,"
Ugh, one of the absolute worst things with perl: nonsensical handling of nested data structures. Every other modern language does this vastly better.
One of the many reasons Perl is obsolete.
Will Boyce
January 27th, 2009 at 2:44 pm
Python:
Erlang:
x
January 27th, 2009 at 7:57 pm
of course there is a bug here - run it on this testcase: ["/a", "/aa", "/aaa", "/aaaa"]
Anon
January 27th, 2009 at 11:01 pm
Ocaml:
Jedai
January 30th, 2009 at 7:47 pm
Note that System.FilePath has convenient functions to explode and recompose a path, so you could write :
More longest paths, and sick folds. - Just another lambdabananacamel,
January 31st, 2009 at 12:08 am
[...] paths, and sick folds. 31Jan2009 Filed under: Uncategorized Author: osfameron This week's simple longest path exercise seems to have had more mileage in it than I expected. Thanks to everyone's comments and [...]
G. Wade Johnson
February 1st, 2009 at 1:42 am
A simpler change to fix the "/a" and "/aa" problem is to modify the regular expression in the longest method to
That way it only matches a full directory name or the whole string.