The preparations for this year's Italian Perl workshop are hotting up, and it's looking like it might even top last year's event. I'll just focus on the international track (in English) here: we've already got some great guest speakers lined up:
  • Tim Bunce, the author of DBI, one of the finest database interfaces known to man. Tim is also a smashing bloke, and this is his return invite to the IPW, by popular demand!
  • Jonathan Worthington, one of the core devs for Perl 6 (Rakudo and Parrot VM).
  • Thomas Fuchs, Javascript guru - who says Perl conferences are just about Perl!
  • Amy Hoy/Fuchs, user interaction and product design expert.
And we have at least one other tempting (but not yet confirmed) morsel up our sleeves...

Dave Rolsky (autarch) on Moose...?

Moose is a "postmodern OO system" for Perl. If you love LISP, you might be interested to know it's based on CLOS. If you love Smalltalk, you might appreciate its use of Roles. In any case, Moose is full of awesomeness, and is currently making Perl OO look way more attractive and powerful than it used to...

Dave Rolsky (autarch) is one of the core Moose devs, as well as the author of DateTime and HTML::Mason among others. We'd really like him to talk at the IPW, but there's just the small matter of paying his airfare from the States... Luckily, he's willing to give a special paid day's training on Moose, probably on Wednesday 21st October, the day before the conference. If we can get enough participants, we'll be able to invite him for the workshop too...

All the details are in this announcement, but I'd just like to point out:

  • The ticket price (likely to be 100 Euros for the day) is ridiculously good value compared to the normal going rate - that's because it's a "grassroots" price, solely for the purpose of paying his way to talk at the Italian Perl Workshop (which is entirely free!)
  • Dave doesn't get to Europe much. In fact, up till now, never - so this is a very rare opportunity.
  • Pisa is really central - you can fly there from a massive number of European cities, often direct, and often with a choice of traditional or low-cost airline.
  • There will most likely be a whole track in English: as well as the guest speakers I mentioned above, there'll be others. For example, I'll probably be speaking... why don't you submit a talk too ;-)
If you're interested, please email info@perl.it.
The Perl web framework Catalyst has, and I'm fairly sure this is in common with many such frameworks in any programming language, a concept of a store of data, scoped to the current request, called the "stash". I've mentioned a few times that "I don't like it" and the reaction I get suggests that I don't understand it. So it's a good time for me to be ignorant in public so some clever people that do understand can tell me what I'm missing ;-)

Global status

I say that the stash is "request scoped". But really, for the lifetime of the web request, it's a big global pile of random data. Some things would naturally seem to fit there: big things to do with global status, like the logged-in user. So, for example, a base action might check for POSTed login details, and for session cookies, and set $c->stash->{user} to a MyApp::User object. Then, further down the line, another action might want to read user details, and can do so by retrieving it from the stash. This may seem very convenient. But of course, being in a hash, we don't get certain protections. For example, should later actions be able to read the user object? Should they be able to modify it? What about typing? What if someone sets {user} to a String, or an undefined value? It really seems like it might be better to model this kind of global status as an object, for example:
    use MooseX::Declare;

    class MyApp::Status {
        has 'user' => (
            isa       => Maybe[ MyApp::User ],
            predicate => 'is_logged_in',
            is        => 'rw',
            coerce    => 1,
            );

        coerce 'MyApp::User'
            => from 'Str'
            => via { return MyApp::User->from_name($_) };
        coerce 'MyApp::User'
            => from 'Int'
            => via { return MyApp::User->from_id($_) };

        ... # other global status things
    }
This way we get lots of modern/enlightened/whatever Perl OO niceties like coercion, and methods built for free, like is_logged_in. We could also provide a custom accessor that didn't allow the user to be updated if it's already been set, etc.

Passing information down the chain

I mentioned that a base action could set the logged in status near the beginning of the request. In fact, this is a common pattern in many web frameworks, with a whole pipeline of actions happening in order. For example, Catalyst has "chained actions", where a URL like
    /client1/documents/invoices/23
might build a chain like this:
  • login
  • admin login
  • client specific customizations
  • document template setup
  • invoice template setup
  • fetch invoice number 23
It's unlikely these days that you would write:
    method handle client1_documents_invoices ($invoice_number) {
        $self->login          or return;
        $self->check_is_admin or return;
        my %client_args = $self->do_client1_customizations;
        my %template_args = (
            $self->document_template_setup(%client_args),
            $self->invoice_template_setup (%client_args),
            $self->fetch_invoice( $invoice_number ),
            );
        $self->process_template( %template_args );
    }
That would be ridiculous, as you'd have hundreds of such methods... one of the major selling points of a framework is to allow you to combine these little pieces flexibly and elegantly. So you end up with the elements in the chain as separate actions... But how do we manage the flow of data between these items (the %template_args and %client_args in my contrived example above, for example) ? We use the stash of course! So each action can read the values it needs from the stash, and write things later actions will need back into it.
    method an_action_in_a_long_chain () {
        my $foo = $c->stash->{foo};
        my $bar = do_something_to( $foo );
        $c->stash->bar( $bar );
    }
Hurray! Except that you may have noticed that we've reinvented passing formal parameters and return values using global data. Welcome to 1959, enjoy your stay in COBOL! And of course, if we get parameters via the stash, we lose the benefits of typing. And we have to be really certain that the bit of global data we want has been set by the time our action gets called. And, because we have a global namespace, we have to hope that some other action in the meantime hasn't set the value to the wrong sort of data. It also leads to us thinking about our data as singletons. When we realise later that we need to call an action on two separate objects, what do we do? We only have one global slot for the parameter name, so we now have to set it twice (or localize it), and we have to squirrel away the first return value before it gets overwritten (yuck)! Spaghetti and action-at-a-distance may not be the intent of the stash, but they do feel like the logical progression of the concept to me.

Template values

Perhaps the least heinous usage of the stash is as the final store of data to be sent to the template. Each action in the chain will set some information:
    $c->stash = {
        # set by a navigation component
        breadcrumb => [ 'client1', 'documents', 'invoices' ],
        menu_items => [ 'save', 'edit', 'delete' ],

        # set by the login action
        username => 'osfameron',
        usertype => 'admin',

        # set by the invoice action
        invoice_data => $MyApp::Model::Invoice,

        ...
        };
Again though, what's to prevent one action from stomping over the other's data? Will the template die if it gets a hash for invoice_data instead of a MyApp::Model::Invoice object? What happens if some data it was expecting never got set? What's the alternative? I suspect that a "widget" approach might be saner, and I have to confess I still haven't looked at Reaction yet: perhaps that's what I'm looking for?

How do other frameworks deal with this?

I'm not, by the way, bashing Catalyst in particular - it's what I'm using now, but the frameworks I've used or written in the past have also had the features of pipelines of actions, and some kind of flattened global state (whether it was a stash, or just a hash that was passed down the pipeline). Given the problems I'm seeing, I can't believe that this is the final goal, or even the state of the art. So:
  • What am I missing?
  • How does your framework deal with these issues more elegantly? (For that matter, how does Catalyst?) I'm especially interested in learning about how strongly-typed FP stacks (HAppS?) confront this stuff.
  • Where next?

Is currying monadic?

7 May 2009 In: haskell

Here's a question that came up while I've been trying to implement currying in Perl: is Currying monadic? I've tried a couple of times, but not managed to explain what I mean very well on #haskell, so let's see if a longer writeup explains it better.

My simplistic understanding of monads is that they take various things that are nests of nested expressions, and allow you to reason about them, and given them a pretty syntax that makes it look like they are in fact a sequence of commands.

For example: (pseudocode)

    let a = 1
    let b = 2
    output a+b
Looks like a sequence of commands, but you couldn't just separate each line and string the commands together. Rather you have to consider it as a nest:
    (let a = 1
        (let b = 2
            (output a+b )))
And in many cases, we can go from a nested structure to a monad, for example, we could simplify these horribly nested ifs:
    (if file exists
        (if file is readable
            (if reading the file gave a string
                (if the string matches a username
                    (do some action with the username)))))
... into a nice Maybe monad:
    do check file exists
       check file is readable
       string <- read from file
       check string matches username
       do something with the username
Similarly, nested lists:
    (for each i in list 1
        (for each j in list 2
            (is i the same as j?
                (output i))))
can be abstracted as a List monad:
    do i <- list 1
       j <- list 2
       check i == j
       output i
OK, so I'm not talking about wrapping and unwrapping values, the monad laws, or typing in general. They are what gives this stuff its strong theoretical basis and makes it robust. But the simple-minded "Nest -> Sequence" idea works for me at least, and gives a good feel for where monads can come in useful.

Currying

So... when I was implementing currying, I noted that you could write a currying sub (pseudocode again):
    function add3 (a, b, c) {
        return a+b+c
    }
as something like this:
    function add3 (a) {
        return function (b) {
            return function (c) {
                return a+b+c 
            } 
        } 
    }
This is again a nested expression. So I wondered if you could again "flatten" it with a monadic do block:
    let add3 = do
        a <- get first parameter
        b <- get second parameter
        c <- get third parameter
        return a+b+c
OK, so I "know" that functions in Haskell (which uses currying for functions as a general rule) are the "Reader monad". But I don't understand it well enough to know if that means you can use Reader to implement the above...

(I don't understand Reader at all in fact. I must bang my head against it again, but I find it very confusing - how the monad is represented, what the functions are, and how they get magically applied.)

So, I did attempt to implement it using my Perl module Acme::Monads. This test shows that this sort of works:

    my $add = mdo (2) {
        mbind $x = Curry shift;
        mbind $y = Curry shift;
        return $x+$y;
        };

    say $add->(1)->(2); # 3, yay!
Note that the "return" isn't a monadic Unit but Perl's return, i.e. a plain value. That makes this example strictly speaking not monadic: what I don't know is whether that means the whole idea is fatally flawed, or whether (as I believe) I was just too dumb to fix the errors I got when I tried running with munit... I suspect that it should be possible, with the whole expression then being wrapped by some function (runCurry?) which extracts the final result out of the monad.

Did this explanation make any sense? Please let me know, and any comments on whether it's possible/sane to do this (writing currying as a monad) are appreciated!

Currying in Perl

4 May 2009 In: haskell, perl

"Currying" is a simple idea that is surprisingly powerful on the one hand, and which was surprisingly hard (at least for me) to get my head around - possibly because the concept didn't exist natively in the languages I learnt first.

When you declare a function in currying style, each argument is taken one at a time, returning a new, more specialised function each time, until the function is finally executed at the end.

Consider this function:

    sub add ($left, $right) {
        return $left + $right;
    }
(Note we're assuming that Perl has argument lists... which it can do with Devel::Declare, but we'll come to that in a bit). If this was currying style then we wouldn't call it with
    my $answer = add (1, 3);  # 4
but instead
    my $answer = add(1) #  a function with $left bound to 1
                  ->(3) #  now executed with $right bound to 3

Implementation

This is actually quite simple to implement in pure vanilla Perl.
    sub add {
        my $left = shift;
        return sub {
            my $right = shift;
            return $left + $right;
        };
    }
That isn't very pretty or convenient though... handily there are several modules to encapsulate this behaviour on CPAN. One of them is my Sub::Curried. The docs mention some of the other modules with similar functionality: mine, which uses the shiny goodness of Devel::Declare, has the advantage that you can declare a curried subroutine with a simple, perlish syntax: in fact you do it more or less exactly like the example I gave above. (The difference is that we can't override the sub keyword, so we create a new one, 'curry':
    curry add ($left, $right) {
        return $left + $right;
    }
Using D::D, the moment the Perl parser sees a symbol 'curry' being compiled, it hands control to our custom parser, which then injects code into the source while it's still being compiled. We can do some cunning stuff, including telling it "Hey, when you get to the end of the scope you're compiling, inject some more text!"

I used to keep hold of an array @filled of the partially applied arguments and then apply all in one go at the end. But it seems to be more elegant to actually transform into sometihng like the "vanilla" example I gave above. I say "something like" but it's actually little more complicated:

    sub add {
        return ROUTINE unless @_; # a reference to this subroutine
        check_args(2, @_);        # check we weren't called with >2 args
        my $f = sub {
            my $left = shift;
            return ROUTINE ...
            check_args...         # check we weren't called with >1 arg
            my $f = sub {
                my $right = shift;
                return $left + $right; # actually do the thing
                };
            $f = ...
            };
        # now call the subroutine for each 
        $f = $f->(shift) for @_;
        return $f;
    }
Yikes! The extra boilerplate is there to make sure that we get some niceties:
  • Die with informative error message if we're called with too many arguments.
  • Handle being called with multiple arguments: i.e. treat add(1,2) the same as add(1)->(2)
  • When called with zero arguments, the sub returned is logically (and cutely!) an alias. (I think the ROUTINE trick isn't needed, will probably disappear with a restructure)
  • Handle functions that return multiple values smoothly (not shown above)

Uses of currying

OK, so we've done a lot of furious work behind the scenes to make something look very simple while doing a whole lot of extra work. But why? When you program in Haskell, you'll find currying useful at every turn: it's a hunch that it'd be useful in Perl too. There are certainly some cute examples of it in common use:

Currying the invocant

Modules that do setup on a class would often use class methods: for example:
    package My::Class;
    use base 'Class::Accessor';
    __PACKAGE__->add_accessor('foo');
    __PACKAGE__->add_accessor('bar');
    __PACKAGE__->add_accessor('baz');
__PACKAGE__ refers to the current package, so this is the same as writing:
    My::Class->add_accessor('foo');
    ...
It's also utterly hideous. Moose on the other hand provides a syntax like this:
    package My::Class;
    use Moose;
    has 'foo' => ...
    has 'bar' => ...
    has 'baz' => ...
What's going on? Incredibly, 'has' is actually a class method just like 'add_accessor' was! But the leftmost argument (the invocant, usually referred to as $self for object methods, or $class for class methods) has been curried into it. This is because Perl's importing is dynamic and instead of just copying the method.
    *{CALLER::has} = \&has;
It can do somethingl like this:
    *{CALLER::has} = has($CALLER); # assuming a currying 'has'

Sections

In Haskell you can take references not just to functions, but to operators.
    add = (+)  -- alias
    add 1 2    -- result is 3
    (+) 1 2    -- also 3
You can also take 'sections' of these operators, by 'partially applying' either the left or the right hand side.
    add2       = (+ 2)
    halve      = (/ 2)
    reciprocal = (1.0 /) 
This isn't the same as currying, though you could implement sections with curried functions:
    curry divide ($left, $right) {
        $left / $right
    }

    my $reciprocal = divide(1);         # 1    / $ARG
    my $halve      = flip(divide)->(2); # $ARG / 2
That's rather ugly though, and remembering to flip the arguments is annoying. So, again with Devel::Declare I implemented Sub::Section (not yet on CPAN, the github repo is linked for now).
    my $add2         = op(+ 2);
    my $halve        = op(/ 2);
    my $contains_foo = op(=~ 'foo');
And of course:
    my $greet        = op("Hello " .);
    say $greet->('World');

Talk

I'll be talking about Functional Perl at the NorthWestEngland perlmongers tomorrow Tues 5th in Manchester, UK.
The nice chaps at NorthWestEngland PerlMongers have organized a technical meeting on 5th May, at the Manchester Digital Development Agency.

I'll be doing a 3rd version of my Functional Pe(a)rls talk, about Haskell-inspired craziness in Pure Perl. And Matt Trout and Ian Norton will be talking about OO Database design, and Maildir migration, so there's something for everyone.

Please come and say hello! Perl meetups are quite informal and friendly, and we'll be meeting up for drinks at the Salisbury Pub afterwards (Perl talk strictly optional. Come for the beer, or to tell us how much you prefer C#, or that you'd rather be at PHPNW just down the road... d'oh!)

(rough) Grids in Haskell

10 Mar 2009 In: haskell
(This isn't a full blog post, but a note of a few things about implementing game grids in Haskell).
  • A [[Cell]] structure seems to make sense for a lot of boards. In fact, even the problems I'm looking at might be approached simply indexing into row then col each time you want to access a list. However it feels inelegant, and some of the things I want to do (looking at lines of game pieces in a given direction N/S/E/W) would be more elegant if I can simply traverse the grid in an arbitrary direction.
  • morrow remembered there had been a discussion on haskell-cafe about grids (+ +) recently. It seems to be mainly about infinite grids, but has lots of interesting stuff to absorb.
  • paolino remembered that comonads were useful in grid representation. I don't understand what they are but google finds http://blog.sigfpe.com/2006/12/evaluating-cellular-automata-is.html which looks interesting.
  • I think I want a zipper, but didn't know how it would work on [[Cell]]. Saizan clarified: "if Zipper [] a is the zipper for [a], then the one for [[a]] is Zipper [] (Zipper [] a), and depending on which layer you use the next/previous operations you move in one dimension or the other". This isn't quite true though, as when you move up/down by rows, you have to somehow remember the columnwise index.
  • (Saizan referred to fmap'ing operations to keep the columns in sync, which is, I think, a more higher-order way of doing the same thing :-)
  • dolio posted an ADT representation of a grid. I don't understand that at all yet, but it is pretty... He also wrote a Traversable instance (this would appear to be a generic way to expose functions to move around in your arbitrary datastructure)
  • I decided to try to fold a [[Cell]] into a Cell { value=x, down=d, right=r }. This took me several hours last night, during which I cursed my feeble brain, and the fact that mutable references would have made the task trivial in other languages. (However, mutable references would make the grid useless for many useful things, like keeping intermediate copies of grid state). My no-doubt crappy code is on github (I'll come back to it in more detail later)
  • While I'm whining about haskell being "too hard" I should mention that I've got bogged down in representing grids (for games or spreadsheet) in Java, Perl, and Javascript too, so I think it may be more a case of my brain being feeble. That or I'm just overcomplicating things...
  • (As I remember it, the OO representation of grids is simpler to get started with but then I get bogged down with issues of multiple inheritance etc.)
Comments and suggestions welcome, I should write a full post on this soon.

More longest paths, and sick folds.

31 Jan 2009 In: haskell, perl
This week's simple longest path exercise seems to have had more mileage in it than I expected. Thanks to everyone's comments and suggestions, I've updated with a number of times with, among other things, an improved Haskell version that acts on path elements instead of just characters.

But I had intended to do a version of this in Perl. I mentioned last time that the typical approach in this language would be nested hashes... which elicited the comment:

Ugh, one of the absolute worst things with perl: nonsensical handling of nested data structures. Every other modern language does this vastly better.
Em's fightin' words! So let's try it in Perl and see how it looks...

First of all we need to create the multilevel hash. We want to end up with something like this:

        {
          '' => {
                  'a' => {
                           'a' => {}
                         },
                  'qux' => {
                             'wibble' => {}
                           },
                  'foo' => {
                             'bar' => {
                                        'baz' => {}
                                      }
                           },
                  'aa' => {}
                }
        };
The usual way to do this is to maintain an idea of the node that we're currently in, and follow it to the next item of the path, changing the $node variable to the next level down. So, if we had a path /foo/bar/baz, we'd start off with an empty hash {}, create the next node { foo=> {} } and start the process again with the new empty hash {} and the remaining path bar/baz.

Remind you of anything? I thought it looked a little like a fold with an accumulator variable: the $node is the accumulator, which is initialised to {} and which gets set to each corresponding node in turn. The path is the list that we fold over. And it tickled me that we can write:

use List::Util 'reduce';
sub mk_hash {
    my %hash;
    reduce { $a->{$b} ||= {} }
           \%hash,
           split '/'
               for @_;
    return \%hash;
}
In polite company (well, in Perlish polite company at last), calling map in a void context is frowned upon, as it's abusing a functional operator for its side effects. Here we are abusing reduce, making it destructive on a tree, and discarding its (useless) return value (the final leaf $node element). But it does work exactly as required: (the for @_ means that we run this sick fold on each path in turn) and returns a multilevel hash.

I'm aware that the example above might confirm the original commenter's dislike of Perl hackishness. Personally I think it's quite cute, mixing a beautiful functional idiom with pragmatic mutation. This may be brought on by my dabbling with Perl and FP programming languages, and is probably incurable. If you are either a) disgusted or b) confused by the Perl code above, you might like to try implementing the traditional algorithm (please let me know!) We'll have a look at the equivalent in Haskell shortly.

There isn't a leaves function built in for multilevel hashes (we do find some in the Tree:: and Graph:: namespaces, but let's stick with the core datastructures for now), so let's build one. This version is actually very similar to an FP language definition I think.

sub leaves {
    my ($node, $path) = @_;
    if (keys %$node) {
        # We still have to descend into all the leaves
        map {
              my ($k,$v)=@$_;
              leaves($v, [@$path, $k] );
            }
        kv_list $node
    } else {
        # the base case - we are at a leaf, so return path!
        join '/', @$path;
    }
}
Sadly, though there's an inbuilt iterator each, there isn't a flat list of keys/values, so we'll have to define the function kv_list used above. The easiest way would be:
sub kv_list {
    my $hashref = shift;
    map { [$_ => $hashref->{$_}] } keys %$hashref;
}
Though we could also create an "iterator_to_list" function that acts on each.

And now, the Haskell version: let's start with creating the tree. Instead of a hash, we'll use Data.Map which is similar, but has an implementation better suited to purely functional usage. Of course, we can't use the same technique as the Perl version: we don't have destructive mutation, and if we did a fold, it would merely return the final leaf node of each path. So we start with a simpler recursive definition. I say "simpler", but I needed a lot of help with this. Luckily #haskell came to the rescue: rwbarton pointed out that I'd need a newtype to do a recursive map, Heffalump improved it with unNode, sjanssen and quicksilver helped with a missing node constructors etc. So eventually my naive version of the code looked like:

import qualified Data.Map as M
import Text.Regex

newtype Node = Node { unNode :: M.Map String Node }
    deriving Show

dive :: Node -> [String] -> Node
dive n     [] = n
dive n (s:ss) = let v  = M.lookup s (unNode n) :: Maybe Node
                    n' = case v of
                            Nothing -> Node M.empty
                            Just v' -> v'
                            :: Node
                in Node $ M.insert s (dive n' ss) (unNode n)

splitpath = splitRegex $ mkRegex "/"

make_tree ps = foldl dive (Node M.empty) $ map splitpath ps
This is quite clumsy, and could be improved by replacing the lookup/insertion with a single insertWith. The Data.Map API is quite large and you need to play with it a bit to get your head around it! And of course I got a better version, from sjanssen:
import qualified Data.Map as M
import Text.Regex
import Data.Monoid

newtype Node = Node { unNode :: M.Map String Node }

instance Monoid Node where
    mempty = Node M.empty
    mappend (Node x) (Node y) = Node $ M.unionWith mappend x y

splitPath = splitRegex $ mkRegex "/"

nodeFromPath = foldr (\x n -> Node $ M.singleton x n) mempty . splitPath

nodeFromPaths :: [FilePath] -> Node
nodeFromPaths = mconcat  . map nodeFromPath
There are a number of clever things here! First of all, he returns the root node using a fold, which I was dubious about being able to do. That's because instead of diving from the root, he starts from the right (foldr) and constructs the leaf node, then inserts that as the value of the node above, all the way up to the top. Very cute. Then note that he's using the Data.Map API elegantly: notice how he uses M.singleton where I would have naively done a fromList of a single element, for example. Also, instead of having to descend each tree, updating, he simply creates a set of single path trees, then merges them together at the end with M.unionWith. Finally, it's using Monoids, which (as well as being a classic Doctor Who monster) is a fancy way of saying "they behave like appendable things" (more or less). In fact you could write the snippet above without, but it does give us the convenient mempty and mconcat functions.

So, which version do you prefer? I mentioned to sjanssen that I thought the Haskell one had more "concepts", but actually both versions use folds and some sort of mapping data structure. One is destructive, the other pure. Of course, as he pointed out, not knowing Perl, my version looked incomprehensible. But if you didn't know how to write the sick fold above in Perl, you could have easily written a simple recursive version. Whereas the Haskell version does require some knowledge, like the use of recursive newtypes (which I found very confusing — and hard to compile/debug) and as I've mentioned, the Data.Map API is large, as befits its power.

I'm too tired to attempt the leaves function in Haskell — please do comment if you give it a go!

There’s the nub (snippet in Perl and Haskell)

27 Jan 2009 In: haskell, perl
Here's a simple problem, with solutions in Perl and Haskell.
@joel: suppose I have a list of strings (they happen to be paths) -
       how might I find only the longest instance of each path?
@joel: that is give /foo /foo/bar /foo/bar/baz /qux I only want back
       /foo/bar/baz and /qux
@joel: do I need to build a trie myself? my cpanfu is failing me
As this was a Perl channel, Penfold suggested splitting on "/" into a multilevel hash, then outputting the leaf nodes. Then pjcj suggested, performance permitting, to go through the list deleting each entry that is a substring of the previous entry: that is, treating the strings simply as strings of characters.

This makes it sound like you have to compare each string with the strings shorter than it, but it turns out that you can get exactly the right behaviour on an asciibetically sorted list. For example, a set of paths similar to the ones Joel quoted might get sorted as:

  • /foo
  • /foo/bar
  • /foo/bar/baz
  • /qux
  • /qux/wibble
You can see that this should reduce to
  • /foo/bar/baz
  • /qux/wibble
as the list collapses the previous element every time we realise that it is a substring prefix of the following element.

We can do this in a single pass using a fold, which is a pattern that's less common in Perl than other idioms from functional programming like map and grep, but which is implemented in List::Util's reduce. A typical example of reduce might be this:

  sub sum {
    reduce { $a+$b } @_;
  }
  print sum (1,2,3); # = 6
Of course there the two elements $a and $b are the same kind of thing (a number) and the reducing function ($a+$b) returns a number in turn. But for the longest paths problem, we need something else: a list of results. This is a common idiom in Haskell: using an "accumulator" as the initial value. However Perl's reduce uses the first element of the list as the initial value. That's often a very good strategy often known as foldl1, but for a long time I thought that the Perl implementation was weak as it didn't allow a more flexible arbitrary value. I was wrong. LeoNerd pointed out in #moose that you can just pass the init value as the first element of the list! (This works in Perl of course, as lists are untyped). For example, this subroutine would be a (rather silly) way of turning a list into a list reference:
  sub as_list {
    reduce { [@$a,$b] } [], @_
  }
So, we're going to call our function like this:
  my $longest_list = longest qw( /foo /qux/wibble /foo/bar/baz /qux /foo/bar );
And we can implement it like this:
  sub longest {
    reduce {
        my ($acc, $val) = ($a, $b); # 'accumulator' and 'value'
        my $last = pop @$acc || ''; # The "last" value is an empty string
                                    # the first time around

        # We return a list reference, which will be the accumulator 
        # the next time around
        [ @$acc,
          $val =~ /^\Q$last\E/ ? () : ($last), # collapsed, if substring
          $val ]
      }
      [],      # initial acumulator (empty list)
      sort @_; # sorted input list
  }
OK, so I promised myself I'd sketch this in Haskell too. It should be slightly simpler as the hackish popping off the end of the list is replaced with a nice pattern match.
  longest :: [String] -> [String]
  longest = foldl aux [] . sort
    where aux []         v = [v]
          aux xss@(x:xs) v = v :
                            (if x `isPrefixOf` v
                                then xs
                                else x:xs)
As often happens when converting an algorithm to Haskell, the result list is backwards, as we're consing the new result rather than appending to it. But in fact, it's not especially elegant: the conditional concatenation is a bit clumsy for example. I asked for feedback on #haskell. And, as so often happens on that channel, I got an altogether more compact version: augustss suggested something like:
  longest2 = nubBy (flip isPrefixOf) . sortBy (flip compare)
nubBy is precisely the pattern that reduces a list based on whether adjacent elements are to be collapsed or not. (Apparently you may need to reverse the sense of nubBy in GHC 6.10 — I'm using 6.6.1 — which simplifies to nubBy isPrefixOf. Nothing like backwards compatibility...)

Update: sortBy (flip compare) is dolio's suggested improvement to the original reverse . sort. The original version does read better I think, but flipping the comparison is a bit more efficient than having to reverse the list after sorting it. It also starts to produce values lazily before finishing the sort.

Update: Pumpkin noted that a trie would be more efficient than nub... D'oh! nub, while it looks elegant and compact above is actually losing the benefit of us having sorted the list in the first place (it's implemented using a set of chinese box filter functions - effectively the same as a linked list of comparisons. If we match the nub, this is very efficient — that's the outermost box already! However if we're not a prefix, then we'll uselessly check if we're a prefix of every other path, even though we couldn't possibly be.

What we really need is precisely the semantics of Unix's uniq, which expects a sorted list, and can therefore do a single pass, like our reduce version. Olathe pointed out a solution with map head . group which is almost what we want... except we need the -By version. Given a new function uniqBy, we just need to substitute it for nubBy and we're suddenly efficient again!

  uniqBy eq = map head . groupBy eq
  longest3 = uniqBy (flip isPrefixOf) . sortBy (flip compare)
Update 2009-01-27: It occurred to me that instead of all the flipping, we could simplify uniqBy by making it take the last element instead of the head.
  longest4 = uniqBy' isPrefixOf . sort
  uniqBy' eq = map last . groupBy eq
Update: X pointed out that there's a bug: ["/a", "/aa", "/aaa"] will all get smushed together as they're substrings. Actually whether this is a bug to the original specification is debatable - Joel did mention "strings", however he did also mention that they are file paths, so let's try to make them do the right thing:

In fact the basic core uniqBy' isPrefixOf . sort can remain exactly the same. But instead of a single string "/foo/bar/baz" we'll want to be passing it a list of strings like ["foo","bar","baz"]. As Haskell's list processing is completely generic, all the building blocks like isPrefixOf and sort will Just Work exactly as before! The only complication is splitting and joining the strings, something that's trivial in Perl. I wrote about splitting words in Haskell previously, but in the mean time Brent Yorgey has created Data.List.Split which should do exactly what we want. However I can't install it on my laptop's ubuntu packaged ghc 6.6.1, so for now I'll fall back to Text.Regex.splitRegex. Then, to join the path separator again we need to intersperse it, but that doesn't flatten the list, so we end up with something like ["foo", "/", "bar"]. Control.Monad's join function, generic as it is, actually does the right thing here. So we end up with:

  longest5 :: [String] -> [String]
  longest5 = map rejoin . uniqBy' isPrefixOf . sort . map split
      where split   = splitRegex $ mkRegex "/"
            rejoin  = join . intersperse "/"
We can test this against an input like [ "/a", "/aa", "/a/a" ] to check it's all correct.

The same approach should work in Perl, except that splitting/joining will be slightly simpler, while the equivalent isPrefixOf logic is more complex, as it doesn't have the same generic list comparisons. I'll leave that as an exercise to the reader :-)

Talking of which, thanks to Will for the Python/Erlang versions in the comments, "Anon" for an Ocaml solution, and "Programmer" for a Perl challenge I'll come back to if time allows.

Update 2009-01-30: Will pointed out that my "improved" testcase isn't. Better would be ["/a", "/aa", "/b", "/b/b"] to check that /a and /aa are distinct, but /b is merged into /b/b.

He also sent an improved Erlang version, thanks!

I have now been without my Nokia N95 8Gb for over two weeks and am increasingly irritable about Vodafone's rather shoddy customer service.

One assistant checked my phone in for repair, but even though I'd indicated possible water damage didn't preventatively arrange the insurance cover. The assistant the weekend later said this was a "training issue", oops, and my phone would be delivered on the 23rd of December.

Handily that was when I was travelling back from my work Christmas party. The next day would be fine, but apparently the booking system can't actually handle choosing the correct date for a delivery. But that's ok! All we had to do was game the system: pencil in a delivery for the 23rd, and get me to call back to get it booked to the 24th (the phone delivery elves work on Christmas Eve after all). Worryingly, my several phone calls were met with confusion as to why I was calling now and not the 24th (er, because that's the day after my delivery is booked), the 23rd (er, because the phone might already be out for delivery), and finally the 22nd; also with denial of there being any note left by the shop on my records to warn them that a change of date would be required. But finally, I was told to not worry: the phone would go out for delivery on the 23rd, but the company would make 3 attempts to deliver, so I'd get it on the 24th anyway. Phew.

Vodasanta did not come down my chimney on the 24th.

I went into the shop after Christmas, and was told that the delivery company had been waiting for me to call them to arrange delivery on the 24th, why didn't I call? (Apparently there were no notes on my records about the calls I did make. Naughty elves.)

So, after going into town to visit the Liverpool vodafone store 3 times, making several calls and waiting in for a whole day for a failed delivery (8:30-5:30 is apparently a decent delivery "slot"), the best they can do is arrange to deliver the handset to me. Again. The next available date is the 30th, when I'm at my parents: and they can't do better than that same time range, so if my family have other plans for the day (hardly unreasonable) then I will have to put off the delivery way into the new year.

Of course they couldn't make a reasonable offer at retaining customer satisfaction by, say, offering me an equivalent handset. This is because the insurance people already have the N95 8Gb in stock (how nice for them). I pushed this point and said I would let this lie and be happy if they could substitute me for an E71 right now in store (it was my original second choice of handset) but apparently that's "not allowed" as it's a PDA style phone. It's very important, apparently, for them only to offer a "standard" service: messing me about for 2 weeks comes as standard; actually delivering the phone that I've paid an insurance premium for does not.

The one gesture that Vodafone have made is to refund the £25 excess on the insurance claim. That's appreciated of course, but I had to ask for it after expressing my clear frustration both in the shop and over the phone to the insurance team.

Ho ho ho.

Edit (20 Jan 2009) I got a reply from a James from Vodafone on 31st December suggesting I should get in touch and he would try to resolve. Good news right? Sadly, all I got from the online form submission (including a reference I'd assumed would dispatch the query to him) was a confused reply from a generic customer service representative. I emailed James direct and heard nothing - I'm currently assuming he was an impostor pretending to be from Vodafone for comic effect.

I've since made an official written complaint to Vodafone head office, the insurance service, and the Liverpool store. I've had once reply so far from the insurance, which is fairly generic, though it does mention "On checking our records we can see that you did not make contact until 27 December 2008" - apparently it is far more likely that I lied about calling 3 times than that their record keeping is at fault.

Edit (26 Jan 2009) (Been a bit busy with long weekend away to write this up till now.) Got an email from Paul at Vodafone's customer service on the 22nd. Apparently James is not an impostor, but merely failed to get my messages because a) the code I typed in my response didn't work (apparently I shouldn't have put it in parentheses, which seems a bit silly) and b) didn't receive the email I sent (to the address he left on my comment form). Apparently the refusal to substitute the N95 8Gb with another model is because it "was and still is an expensive handset", which I do agree makes sense for their logistics, but which clearly doesn't for me as a customer.

Paul has made a credit for 2 months' free line rental, which is at least a reasonable gesture: thanks.

Crossword puzzles in Haskell

18 Dec 2008 In: haskell

Every year or so I come back to the problem of writing a crossword puzzle compiler/player. I think Javascript would be the most promising for a web-based player, though I've given it a go in Java and Perl too. Modeling the problem is interesting in an Object Oriented language - I would find myself getting bogged down with "Lines" and the similarities between rows (Across) and columns (Down). I have a suspicion that OO Roles might be a more expressive way to model this. Anyway, given that I've not been writing much about Haskell 1, this is a good time to redress the balance.

In the OO implementations, Cells would refer to the "Light" (group of adjacent cells running Across/Down) that they're in. And the Light would of course refer to the cells... this idea filled me with terror in Haskell, as it involves "Tying the Knot", which seems terribly clever and confusing. As it happens, you can often get away with having mutual references in Functional Programming, as they just spontaneously turn out not to be necessary. So far, this seems to be the case, though I think that if I take it to the point of navigating the grid from a UI, I may need an explicit structure to manage this, like a zipper.

This is a "literate Haskell" post. You should be able to save it as "crossword.lhs" and run it with runghc or ghci (calling the main function to test it). We start off with some imports: the List and Char modules (which I've naughtily not specified) and the handy join function to flatten a list.


> import Data.List
> import Data.Char
> import Control.Monad (join)

They say that much of the work in Haskell is defining the types. Direction is an Enum for the direction of a Light. Cell is either a Block (a black square) or a Cell (either blank, or already filled in). I guess I could model Block | Blank | Cell but don't yet see an advantage to that.

> data Direction = Across | Down
>     deriving (Show, Eq, Ord)
> 
> data Cell = Cell  (Maybe Char) Coord
>           | Block              Coord
>     deriving (Show, Eq)
>
> type Coord  = (Int, Int)
>
> coord (Cell _ c) = c
> coord (Block  c) = c

Directions can be sorted (which may come in useful for showing Across clues before Down ones), and this can be done automatically by deriving Eq and Ord. But how would we sort a Cell? We'll do it by the (x,y) coordinates, which means I can't use automatic derivation. Perhaps I could swap the order of arguments and still use that, but for now I defined a custom compare.

> instance Ord Cell
>     where compare l r = compare (coord l) (coord r)

Lights (distinct from Clues, which may be spread over 1 or more Lights) are a list of cells in a given direction. (It would be nice to specify that the cells really are contiguous, not sure if this is something that fundeps would be useful for?)

> data Light     = Light [Cell] Direction
>     deriving (Show, Eq, Ord)
>
> -- we'll sometimes want to know the first cell in a Light:
> headC (Light (c:_) _) = c

Of course Lights are numbered... but with the algorithm that I'm using, we don't know the number (like 5 Across) at the time we create it. I created a new type, LightN, but perhaps I should have modeled with a Maybe Int instead.

> data LightN    = LightN Int Light
>     deriving Show

Now we need some test data. I started with a grid from the wonderful Araucaria, Cryptic Crossword No. 24298.

> grid = textToCells [ 
>     "TRIPOD#        ",
>     "# # # # # # # #",
>     "PARSIFAL#RESCUE",
>     "# # # # # # # #",
>     "###            ",
>     "# # # # # ### #",
>     "ANNA###        ",
>     "# # # # # # # #",
>     "        ###PARE",
>     "# ### # #U# # #",
>     "         S  ###",
>     "# # # # #E# # #",
>     "      #INFERNAL",
>     "# # # # #U# # #",
>     "FRAGMENT#L     "
>     ]

We're going to want to output the grid, lights, and other objects, so let's define some functions to do that.

> showCell (Block _)         = '#'
> showCell (Cell (Just c) _) = c
> showCell _                 = ' '

> showLightN :: LightN -> String
> showLightN (LightN n l@(Light cs d)) =
>        show n ++ " " 
>     ++ show d 
>  -- ++ " " ++ show (coord $ headC l) 
>     ++ ": "
>     ++ show (map showCell cs)
>     ++ " (" ++ (show $ length cs) ++ ")"
>     

Similarly, we want to parse the list of strings above into a list of crossword cells. textToCells threads the row and column number with every character in the grid by zipping the list with the infinite list [1..], which I think is quite cute, though there are no doubt more elegant versions (list comprehensions?)

> charToCell :: Char -> Coord -> Cell
> charToCell '#' = Block
> charToCell ' ' = Cell Nothing
> charToCell  c
>     | isAlpha c = Cell (Just c)
>     | otherwise = error $ "Invalid character '" ++ [c] ++ "'"
> 
> textToCells :: [[Char]] -> [[Cell]]
> textToCells                     = zipWith  makeRow       [1..] 
>     where makeRow  row          = zipWith (makeCell row) [1..] 
>           makeCell row col char = charToCell char (row,col)

But working out what cells are is the easy part! We now want to know which cells form a light — i.e. groups of more than 1 non-block cell in either direction Across/Down. To get data for both directions, it's easiest to run in two passes, one in the normal direction, the other transposed. (I did consider trying to do both at the same time, but it hurt my brain: a one pass solution involving magical fumplicative arroids or somesuch is left as an exercise to the very clever reader).

> lights dir grid = concatMap 
>                     (flip lightsInLine dir) 
>                     $ rot dir grid
>     where rot Across = id
>           rot Down   = transpose
>
> lightsInLine :: [Cell] -> Direction -> [Light]
> lightsInLine cells dir = 
>     let l  = filter isMultiCell 
>                $ groupBy areCells cells
>     in  map (\c -> Light c dir) l
>
> areCells x y = isCell x && isCell y 
> isCell (Cell _ _) = True
> isCell  _         = False
> 
> isMultiCell (x:y:_) | areCells x y = True
> isMultiCell _ = False

So... where have we got with all of this modeling? Well, we can now find all the Across and Down lights. But then we'll want to number them. To do that, we'd have to sort them (by the coordinate). Across and Down lights can have the same number (like 5 Across and 5 Down in our example grid) so we want to group by lights that have the same head cell. Then we can thread the light number again, using the zipWith ... [1..] trick:

> allLights = join $ zipWith (map . LightN) [1..] gs
>         where gs = groupBy eqHead ls
>               ls = sort $ (lights Across grid)
>                        ++ (lights Down   grid)
>               eqHead l r = (headC l) == (headC r)

And finally, we can see the result of all the hard work, with a list of all the lights, and their current (partial) solutions:

> main = mapM_ putStrLn $ map showLightN allLights

Obviously this is only a start on the problem. For modeling, we now need a concept of a Clue (Clue String [Light]) and a solution - should the solution belong to the clue? or to the [Light]s that it's made up of. How do we link the answer grid (where the lights contain the correct characters) with the play grid, which contains the current letters that the player believes to be right? And how do we update the cells, lights, and grid while playing (or creating) a crossword?

Suggestions on these questions, and improvements or advice on the current code are greatly appreciated!


  1. I had a complaint about this from a Planet Haskell reader: and though the FAQ does suggest that it's ok, or even encouraged to write about other things, perhaps I should also write a little about Haskell... ;-)

    back

About this blog

Osfameron's blog on Haskell, Perl programming, stuff.


Categories