I've been prototyping ways for customers to import data into our system. While some kind of JSON/XML structured data would be easiest for us, our customers like spreadsheets. That's cool of course, and it's easy to mock up imports like this.

    Product   Price    Category
    Brie      £ 2.00   Dairy
    Chablis   £ 5.00   Wine

Of course, not all data fits this kind of very flat structure: a field might need to hold multiple values, for example. We could do something like

    Product   Price    Tags
    Brie      £ 2.00   Dairy,Cheese,Food
    Chablis   £ 5.00   Wine,Alcohol,Drink

but it might be nicer for the customer to organize like this instead:

    Product   Price    Tags
    Brie      £ 2.00   Dairy
                       Cheese
                       Food
    Chablis   £ 5.00   Wine
                       Alcohol
                       Drink

The problem is that we can no longer simply parse the table row by row. We have to remember where each definition started, and finalise an item once it has been completely defined (which we only find out when the next item starts, or when the rows run out). I started prototyping this in Perl, and ended up with some typical imperative code like this:

    my (@products, $product);
    for my $r ( @rows ) {
        my $row = get_row_data($r);
        if ($row->has_product) {
            push @products, $product if $product;
            $product = Product->new( 
                product => $row->product, 
                price   => $row->price 
            );
        } else {
            die "No product" unless $product;
        }
        $product->add_tag($row->tag);
    }
    push @products, $product if $product;   # flush the last product, unless there were no rows

This is quite yucky. See how I'm repeating the push @products line: once when we see a new product, and again after we've exited the loop. Then there are the multiple assignments to $product, and so on.

All of which seems to work OK, but I really dislike this logic, and am beginning to think it's a perfect example of an imperative antipattern. So, let's see if we can do it more elegantly. Thinking about how I'd do it functionally, in Haskell, I'd use groupBy over the rows, joining each row to the one before it until a row with a non-blank product turns up.

So let's try that in Perl! We'd want something like:

    my @items = groupBy sub { 
            my (undef, $next_row) = @_;
            ! $next_row->has_product
        },
        map get_row_data($_),
            @rows;

    my @products = map {
        my $row_1 = $_->[0];
        Product->new(
            product => $row_1->product,
            price   => $row_1->price,
            tags    => [ map $_->tag, @$_ ],
        );
    } @items;

That has less repetition, and less accidental complexity. First we group the lines. Then we turn the groups of related rows into new objects. Job done.

The groupBy function is interesting: notice how its callback takes 2 arguments, ($this_item, $next_item), and returns true when the next item belongs in the same group as the current one. As we don't care about the current row, only the next one, we just unpack (undef, $next_row).

groupBy is also problematic though, in that it doesn't exist in any of the normal places I'd have expected (List::Util, List::MoreUtils etc.). So let's create it. The ideal way might be to translate Haskell's definition from Data.List, but given that Perl lists don't have lazy semantics, and that we'd then have to implement span etc., let's just write a noddy imperative version for now:

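    sub groupBy {
        my ($fn, @elems) = @_;
        return unless @elems;           # no elements means no groups
        my @groups;
        my $prev  = shift @elems;       # not $a/$b, which sort() reserves
        my @group = ($prev);
        while (@elems) {                # test the list, so 0, '' and undef survive
            my $next = shift @elems;
            if ($fn->($prev, $next)) {
                push @group, $next;     # $next continues the current group
            } else {
                push @groups, [@group]; # close the current group
                @group = ($next);       # and start a new one
            }
            $prev = $next;
        }
        push @groups, [@group];         # flush the final group
        return @groups;
    }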

This of course has much of the unpleasantness I was complaining about before, but at least it's encapsulated, and allows us to use groupBy neatly. (And we can always come back and clean up the internals later).
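To check that the pieces fit together, here is a minimal end-to-end sketch run against the sample spreadsheet from earlier. It makes some assumptions that are not in the real code: plain hashrefs stand in both for the objects returned by get_row_data and for Product, and a continuation row is modelled as a hashref with no product key. It carries its own copy of groupBy (with an empty-list guard and lexicals that avoid $a/$b) so that it runs on its own.

```perl
use strict;
use warnings;

# Self-contained copy of groupBy: groups adjacent elements for which
# $fn->($previous, $next) returns true.
sub groupBy {
    my ($fn, @elems) = @_;
    return unless @elems;
    my @groups;
    my $prev  = shift @elems;
    my @group = ($prev);
    while (@elems) {
        my $next = shift @elems;
        if ($fn->($prev, $next)) {
            push @group, $next;
        } else {
            push @groups, [@group];
            @group = ($next);
        }
        $prev = $next;
    }
    push @groups, [@group];
    return @groups;
}

# Hypothetical stand-in for the parsed rows: continuation rows carry only a tag.
my @rows = (
    { product => 'Brie',    price => '2.00', tag => 'Dairy'   },
    {                                        tag => 'Cheese'  },
    {                                        tag => 'Food'    },
    { product => 'Chablis', price => '5.00', tag => 'Wine'    },
    {                                        tag => 'Alcohol' },
    {                                        tag => 'Drink'   },
);

# Keep extending the current group while the next row has no product.
my @items = groupBy(
    sub { my (undef, $next_row) = @_; !defined $next_row->{product} },
    @rows,
);

# Plain hashes stand in for Product objects; +{ } forces a hash constructor.
my @products = map {
    my $first = $_->[0];
    +{
        product => $first->{product},
        price   => $first->{price},
        tags    => [ map $_->{tag}, @$_ ],
    };
} @items;

printf "%s (%s): %s\n", $_->{product}, $_->{price}, join(', ', @{ $_->{tags} })
    for @products;
# Brie (2.00): Dairy, Cheese, Food
# Chablis (5.00): Wine, Alcohol, Drink
```

The predicate ignores its first argument here, so it does not matter whether groupBy compares each row with its immediate predecessor (as this version does) or with the first row of the group.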

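(And yes, I know I could declare groupBy with a (&@) prototype so that I could omit the 'sub' around the block, as with map/grep. But this is more annoying than useful: a prototyped groupBy accepts a literal block or \&function, but refuses a code reference held in a variable unless I bypass the prototype with &groupBy(...).)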
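For comparison, here is a rough sketch of what a prototyped variant could look like. The name group_by_block and the consecutive helper are made up for illustration, and the body is a slightly different take on the grouping loop (it compares each element with the last element of the last group), not the version above.

```perl
use strict;
use warnings;

# Hypothetical prototyped variant. The (&@) prototype has to be declared
# before any call to it is compiled.
sub group_by_block(&@) {
    my ($fn, @elems) = @_;
    my @groups;
    for my $elem (@elems) {
        if (@groups && $fn->($groups[-1][-1], $elem)) {
            push @{ $groups[-1] }, $elem;   # extend the current group
        } else {
            push @groups, [$elem];          # start a new group
        }
    }
    return @groups;
}

# Block syntax: no 'sub' keyword and no comma after the block.
my @runs = group_by_block { $_[1] == $_[0] + 1 } 1, 2, 3, 7, 8, 10;
# ([1, 2, 3], [7, 8], [10])

# A named function can be passed as \&name.
sub consecutive { $_[1] == $_[0] + 1 }
my @same = group_by_block \&consecutive, 1, 2, 3, 7, 8, 10;

# A code reference stored in a scalar is rejected by the prototype at
# compile time, so this call bypasses the prototype with the & sigil.
my $pred = \&consecutive;
my @also = &group_by_block($pred, 1, 2, 3, 7, 8, 10);
```

Whether the block syntax is worth the restriction on stored code references is a matter of taste; the unprototyped groupBy takes any code reference, at the cost of writing sub { ... } and a comma.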

How would you tackle this task? Is there even an elegant way to do it imperatively?