More longest paths, and sick folds.

This week’s simple longest path exercise seems to have had more mileage in it than I expected. Thanks to everyone’s comments and suggestions, I’ve updated it a number of times with, among other things, an improved Haskell version that acts on path elements instead of just characters.

But I had intended to do a version of this in Perl. I mentioned last time that the typical approach in this language would be nested hashes… which elicited the comment:

Ugh, one of the absolute worst things with perl: nonsensical handling of nested data structures. Every other modern language does this vastly better.

Them’s fightin’ words! So let’s try it in Perl and see how it looks…

First of all we need to create the multilevel hash. We want to end up with something like this:

        {
          '' => {
                  'a' => {
                           'a' => {}
                         },
                  'qux' => {
                             'wibble' => {}
                           },
                  'foo' => {
                             'bar' => {
                                        'baz' => {}
                                      }
                           },
                  'aa' => {}
                }
        };

The usual way to do this is to keep track of the node we’re currently in, following each item of the path in turn and pointing the $node variable at the next level down. So, if we had a path /foo/bar/baz, we’d start off with an empty hash {}, create the next node { foo => {} }, and then start the process again with that new empty hash {} and the remaining path bar/baz.
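
Something like this, roughly (an untested sketch; make_tree_loop is just a made-up name for it):

sub make_tree_loop {
    my @paths = @_;
    my %hash;
    for my $path (@paths) {
        my $node = \%hash;               # start at the root for each path
        for my $elem (split '/', $path) {
            $node->{$elem} ||= {};       # create the next level if it isn't there yet
            $node = $node->{$elem};      # ...and descend into it
        }
    }
    return \%hash;
}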

Remind you of anything? I thought it looked a little like a fold with an accumulator variable: the $node is the accumulator, which is initialised to {} and which gets set to each corresponding node in turn. The path is the list that we fold over. And it tickled me that we can write:

use List::Util 'reduce';
sub mk_hash {
    my %hash;
    reduce { $a->{$b} ||= {} }
           \%hash,
           split '/'
               for @_;
    return \%hash;
}

In polite company (well, in Perlish polite company at least), calling map in a void context is frowned upon, as it’s abusing a functional operator for its side effects. Here we are abusing reduce, making it destructive on a tree, and discarding its (useless) return value (the final leaf $node element). But it does work exactly as required (the for @_ means that we run this sick fold on each path in turn) and returns a multilevel hash.
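
Just to check it in action (the paths here are reconstructed from the dump above, so the output should look much the same):

use Data::Dumper;

my $tree = mk_hash(qw( /a/a /qux/wibble /foo/bar/baz /aa ));
print Dumper($tree);    # prints a structure like the one shown earlier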

I’m aware that the example above might confirm the original commenter’s dislike of Perl hackishness. Personally I think it’s quite cute, mixing a beautiful functional idiom with pragmatic mutation. This may be brought on by my dabbling with Perl and FP programming languages, and is probably incurable. If you are either a) disgusted or b) confused by the Perl code above, you might like to try implementing the traditional algorithm yourself (please let me know!). We’ll have a look at the equivalent in Haskell shortly.

There isn’t a leaves function built in for multilevel hashes (we do find some in the Tree:: and Graph:: namespaces, but let’s stick with the core data structures for now), so let’s build one. This version actually looks very like the definition you’d write in an FP language, I think.

sub leaves {
    my ($node, $path) = @_;
    if (keys %$node) {
        # We still have to descend into all the children
        map {
              my ($k, $v) = @$_;
              leaves($v, [@$path, $k]);
            }
            kv_list($node);
    } else {
        # the base case - we are at a leaf, so return the path!
        join '/', @$path;
    }
}
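
Putting it together with mk_hash, a quick usage sketch (this leans on a kv_list helper, defined in a moment):

my $tree = mk_hash(qw( /foo/bar/baz /qux/wibble ));
print "$_\n" for leaves($tree, []);    # prints the two paths back, in hash order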

Sadly, though there’s an inbuilt iterator, each, there’s no builtin that returns the hash entries as a list of key/value pairs, so we’ll have to define the kv_list function used above. The easiest way would be:

sub kv_list {
    my $hashref = shift;
    map { [$_ => $hashref->{$_}] } keys %$hashref;
}

Though we could also create an “iterator_to_list” function that acts on each.
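
A rough sketch of one shape that could take, draining the each iterator into a list of [key, value] pairs:

sub iterator_to_list {
    my $hashref = shift;
    my @pairs;
    # each() hands back one key/value pair per call, and an empty list when done
    while (my ($k, $v) = each %$hashref) {
        push @pairs, [$k => $v];
    }
    return @pairs;
}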

And now, the Haskell version: let’s start with creating the tree. Instead of a hash, we’ll use Data.Map, which is similar but has an implementation better suited to purely functional usage. Of course, we can’t use the same technique as the Perl version: we don’t have destructive mutation, and if we did a fold, it would merely return the final leaf node of each path. So we start with a simpler recursive definition. I say “simpler”, but I needed a lot of help with this. Luckily #haskell came to the rescue: rwbarton pointed out that I’d need a newtype to do a recursive map, Heffalump improved it with unNode, and sjanssen and quicksilver helped with a missing Node constructor, among other things. So eventually my naive version of the code looked like this:

import qualified Data.Map as M
import Text.Regex

newtype Node = Node { unNode :: M.Map String Node }
    deriving Show

dive :: Node -> [String] -> Node
dive n     [] = n
dive n (s:ss) = let v  = M.lookup s (unNode n) :: Maybe Node
                    n' = case v of
                             Nothing -> Node M.empty
                             Just v' -> v'
                in Node $ M.insert s (dive n' ss) (unNode n)

splitpath = splitRegex $ mkRegex "/"

make_tree ps = foldl dive (Node M.empty) $ map splitpath ps

This is quite clumsy, and could be improved by replacing the lookup/insertion pair with a single insertWith (which inserts a value, or combines it with the existing one if the key is already there). The Data.Map API is quite large and you need to play with it a bit to get your head around it! And of course I got a better version, from sjanssen:

import qualified Data.Map as M
import Text.Regex
import Data.Monoid

newtype Node = Node { unNode :: M.Map String Node }

instance Monoid Node where
    mempty = Node M.empty
    mappend (Node x) (Node y) = Node $ M.unionWith mappend x y

splitPath = splitRegex $ mkRegex "/"

nodeFromPath = foldr (\x n -> Node $ M.singleton x n) mempty . splitPath

nodeFromPaths :: [FilePath] -> Node
nodeFromPaths = mconcat  . map nodeFromPath

There are a number of clever things here! First of all, he returns the root node using a fold, which I was dubious about being able to do. That’s because instead of diving down from the root, he starts from the right (foldr) and constructs the leaf node, then inserts that as the value of the node above, all the way up to the top. Very cute. Then note that he’s using the Data.Map API elegantly: see how he uses M.singleton where I would naively have done a fromList of a single element, for example. Also, instead of having to descend each tree, updating as he goes, he simply creates a set of single-path trees, then merges them together at the end with M.unionWith.

Finally, it’s using Monoids, which (as well as being a classic Doctor Who monster) is a fancy way of saying “they behave like appendable things” (more or less). In fact you could write the snippet above without the Monoid instance, but it does give us the convenient mempty and mconcat functions.

So, which version do you prefer? I mentioned to sjanssen that I thought the Haskell one had more “concepts”, but actually both versions use folds and some sort of mapping data structure. One is destructive, the other pure.
Of course, as he pointed out, not knowing Perl he found my version incomprehensible.
But if you didn’t know how to write the sick fold above in Perl, you could easily have written a simple recursive version. The Haskell version, on the other hand, does require some knowledge, like the use of recursive newtypes (which I found very confusing, and hard to compile and debug); and, as I’ve mentioned, the Data.Map API is large, as befits its power.

I’m too tired to attempt the leaves function in Haskell — please do comment if you give it a go!

Comments

  1. Senny says:

    I believe “hackish” is usually considered a compliment in reference to Perl.

  2. Jedai says:

    A “leaves” function in Haskell :

    leaves :: Node -> [FilePath]
    leaves = map joinPath . go
      where
        go (Node m)
          | M.null m  = [[]]
          | otherwise = concatMap (\(k,v) -> map (k :) $ go v) $ M.assocs m

    I use joinPath from System.FilePath; it correctly joins the directories (the inverse is splitDirectories rather than splitPath, which keeps too much information).

  3. Programmer says:

    Whose argument were you trying to prove? :) I programmed Perl almost exclusively for 5 years so I know it quite well, and the fold above is not a new concept for me. But it neither addresses nor debunks my comment. Everyone knows nested data structures are hard in Perl; just google for “Perl nested data structures” and see how many tutorials/tools/posts you find trying to make it easier. A wise man once said (of user interfaces) “if a feature needs to be documented it’s mis-implemented”. In other words, requiring documentation can be a kind of code smell. The more a feature has to be explained, the more likely it is a bad feature.

    One of the biggest problems with Perl is still having pointers (and this makes working with nested data structures more painful). Every modern language uses references [1]. Sadly this *still* isn’t going to be fixed in Perl 6.

    [1] The easiest way to understand what I mean here is to look at C++. Look at how pointers in C++ work and look at how C++ references work. Which one do Perl’s “references” most resemble (hint: which one has to be refed/dereffed)? In fact Perl does have what every other modern language calls references: the things that get implicitly passed to functions, map and so on (e.g. $_[0]).

  4. osfameron says:

    @Programmer: yeah to be honest my main aim was to tackle the problem in two languages and see how they compared – at that point you hadn’t qualified the “vastly better” comment, and this was just an exploratory exercise. You’re quite right though that Perl’s references are hard to /learn/; I’ve spent a good bit of time mentoring and training on this, and it is a wart in the language. But the rules are quite consistent, and once you have mastered them it is as /capable/ as any other language. (And I think it’s clear from my post that I found Haskell, for one, to have its own complexities/learning curve.)

    I thought Perl6 would radically simplify this issue though, that was certainly one of the early goals as I understood it.

  5. Programmer says:

    These days there just isn’t a reason to use a language with such a large wart. Pretty much every other modern language gets this right.

    I think Perl 6 does make it easier, but they still have pointers (still confusingly called “references”).