Spreadsheet imports: imperative vs. functional — Just another lambdabananacamel,

I’ve been prototyping ways for customers to import data into our system.
While some kind of JSON/XML structured data would be easiest for us,
our customers like spreadsheets. That’s cool of course, and it’s easy to
mock up imports like this.

Product	Price	Category
Brie	£ 2.00	Dairy
Chablis	£ 5.00	Wine

Of course not all data fits nicely with this kind of very flat structure.
For example multiple values. We could do something like

Product	Price	Tags
Brie	£ 2.00	Dairy,Cheese,Food
Chablis	£ 5.00	Wine,Alcohol,Drink

but it might be nicer for the customer to organize like this instead:

Product	Price	Tags
Brie	£ 2.00	Dairy
		Cheese
		Food
Chablis	£ 5.00	Wine
		Alcohol
		Drink

The problem is that we can’t really parse the table row by row, simply.
We have to remember where the definition started, and finalise when the
item has been completely defined (which will happen when the next item
starts).
I started prototyping this in Perl, and ended up with some typical imperative
code like this:

    my (@products, $product);
    for my $r ( @rows ) {
        my $row = get_row_data($r);
        if ($row->has_product) {
            push @products, $product if $product;
            $product = Product->new( 
                product => $row->product, 
                price   => $row->price 
            );
        } else {
            die "No product" unless $product;
        }
        $product->add_tag($row->tag);
    }
    push @products, $product;

This is quite yucky. See how I’m repeating the push @products line:
first time when seeing a new product, and second when we’ve exited the loop.
Then we have the multiple assignments to $product and so on.

All of which seems to work OK, but I really dislike this logic, and am beginning
to think it’s a perfect example of an imperative antipattern. So, let’s see if
we can do it more elegantly. Thinking about how I’d do it functionally, in
Haskell, I’d use groupBy on each row, connecting the rows up until the
next row has a non-blank product.

So let’s try that in Perl! We’d want something like:

    my @items = groupBy sub { 
            my (undef, $next_row) = @_;
            ! $next_row->has_product
        },
        map get_row_data->($_),
            @rows;

    my @products = map {
        my $row_1 = $_->[0];
        Product->new(
            product => $row_1->product,
            price   => $row_1->price,
            tags    => [ map $_->tag, @$_ ],
        );
        } 
        @items;

That has less repetition, and less accidental complexity. First we group
the lines. Then we turn the groups of related rows into new objects. Job
done.

The groupBy function is interesting: notice how the callback to it
takes 2 arguments ($this_item, $next_item), though as we don’t care
about the current row, only the next one, we’re just doing (undef, $next_row).

groupBy is also problematic though, in that it doesn’t exist in any of
the normal places I’d have expected (List::Util, List::MoreUtils etc.). So
let’s create it. The ideal way might be to translate Haskell’s definition from
the Prelude, but given that Perlish lists don’t have lazy semantics, and we
then have to implement span etc., let’s just write a noddy imperative
version for now:

    sub groupBy {
        my ($fn, @elems) = @_;
        my @groups;
        my $a = shift @elems;
        my $b;
        my @group = ($a);
        while ($b = shift @elems) {
            if ($fn->($a, $b)) {
                push @group, $b;
            } else {
                push @groups, [@group];
                @group = ($b);
            }
            $a = $b;
        }
        push @groups, [@group];
        return @groups;
    }

This of course has much of the unpleasantness I was complaining about before,
but at least it’s encapsulated, and allows us to use groupBy neatly.
(And we can always come back and clean up the internals later).

(And yes, I know I could define groupBy (&@) so that I could omit the
‘sub‘ around the block like with map/grep. But this is more annoying
than useful, as I can’t then simply call: groupBy \&function.)

How would you tackle this task? Is there even an elegant way to do it
imperatively?

Comments

Evan Carroll says:

December 29, 2009 at 7:58 pm

I’m downvoting this on reddit simply because without a sample raw input and code it seems like a bad attempt to solicit feed back. Put the sample code, and raw examples up on gist.github.com or something…
osfameron says:

December 30, 2009 at 1:35 pm

@EvanCarroll, interesting comment. That said, this was for a piece of $real_work, and the code/samples above are extracted from that and simplified a bit. Turning it into a mini-project with sample code and data would be a nice-to-have, but I don’t feel it’s vital to have those for every blog post you make…

And yes, I was soliciting feedback, in the sense that I’d like to know how other people would approach the problem: I’ve already presented 2 possible ways to do it (1 of which I’m using myself), so I don’t feel like I’m holding much back. Am I missing your point?
Evan Carroll says:

December 30, 2009 at 6:12 pm

Yes, with sample data. I’d be able to show you a mock up of how a spreadsheet import would be done: for csv — it is as simple as Text::CSV which can handle headers even and return a hash ref for each row. For fixed width stuff, try my DataExtract::FixedWidth. As for a statefull parser you don’t have to do all this stuff you’re doing. All you have to do is isolate a primary key and the parse the row using the same criteria, if that primary key is null you append the previous hashref or perform appropriately. But again, without a code sample, and raw input it becomes more work to throw new approaches.

How would you tackle this task? Is there even an elegant way to do it imperatively?

I’m just not sure how to answer that, if there was no real code, or real input data to present the answer with.
RandomGuy says:

December 31, 2009 at 11:39 am

Would you mind tagging your perl stuff somehow so it doesn’t show up on the haskell world blog?
Olivier MenguÃ© (dolmen) says:

March 9, 2011 at 11:39 am

@Evan: for fixed with stuff there is also my own module: Text::FixedLengthMultiline