Spreadsheet imports: imperative vs. functional

I’ve been prototyping ways for customers to import data into our system.
While some kind of JSON/XML structured data would be easiest for us,
our customers like spreadsheets. That’s cool of course, and it’s easy to
mock up imports like this.

Product Price Category
Brie £ 2.00 Dairy
Chablis £ 5.00 Wine

Of course not all data fits nicely with this kind of very flat structure.
For example multiple values. We could do something like

Product Price Tags
Brie £ 2.00 Dairy,Cheese,Food
Chablis £ 5.00 Wine,Alcohol,Drink

but it might be nicer for the customer to organize like this instead:

Product Price Tags
Brie £ 2.00 Dairy
    Cheese
    Food
Chablis £ 5.00 Wine
    Alcohol
    Drink

The problem is that we can’t really parse the table row by row, simply.
We have to remember where the definition started, and finalise when the
item has been completely defined (which will happen when the next item
starts).
I started prototyping this in Perl, and ended up with some typical imperative
code like this:

    my (@products, $product);
    for my $r ( @rows ) {
        my $row = get_row_data($r);
        if ($row->has_product) {
            push @products, $product if $product;
            $product = Product->new(
                product => $row->product,
                price   => $row->price
            );
        } else {
            die "No product" unless $product;
        }
        $product->add_tag($row->tag);
    }
    push @products, $product;

This is quite yucky. See how I’m repeating the push @products line:
first time when seeing a new product, and second when we’ve exited the loop.
Then we have the multiple assignments to $product and so on.

All of which seems to work OK, but I really dislike this logic, and am beginning
to think it’s a perfect example of an imperative antipattern. So, let’s see if
we can do it more elegantly. Thinking about how I’d do it functionally, in
Haskell, I’d use groupBy on each row, connecting the rows up until the
next row has a non-blank product.

So let’s try that in Perl! We’d want something like:

    my @items = groupBy sub {
            my (undef, $next_row) = @_;
            ! $next_row->has_product
        },
        map get_row_data->($_),
            @rows;

    my @products = map {
        my $row_1 = $_->[0];
        Product->new(
            product => $row_1->product,
            price   => $row_1->price,
            tags    => [ map $_->tag, @$_ ],
        );
        }
        @items;

That has less repetition, and less accidental complexity. First we group
the lines. Then we turn the groups of related rows into new objects. Job
done.

The groupBy function is interesting: notice how the callback to it
takes 2 arguments ($this_item, $next_item), though as we don’t care
about the current row, only the next one, we’re just doing (undef,
$next_row)
.

groupBy is also problematic though, in that it doesn’t exist in any of
the normal places I’d have expected (List::Util, List::MoreUtils etc.). So
let’s create it. The ideal way might be to translate Haskell’s definition from
the Prelude, but given that Perlish lists don’t have lazy semantics, and we
then have to implement span etc., let’s just write a noddy imperative
version for now:

    sub groupBy {
        my ($fn, @elems) = @_;
        my @groups;
        my $a = shift @elems;
        my $b;
        my @group = ($a);
        while ($b = shift @elems) {
            if ($fn->($a, $b)) {
                push @group, $b;
            } else {
                push @groups, [@group];
                @group = ($b);
            }
            $a = $b;
        }
        push @groups, [@group];
        return @groups;
    }

This of course has much of the unpleasantness I was complaining about before,
but at least it’s encapsulated, and allows us to use groupBy neatly.
(And we can always come back and clean up the internals later).

(And yes, I know I could define groupBy (&@) so that I could omit the
sub‘ around the block like with map/grep. But this is more annoying
than useful, as I can’t then simply call: groupBy \&function.)

How would you tackle this task? Is there even an elegant way to do it
imperatively?

Comments

  1. Evan Carroll says:

    I’m downvoting this on reddit simply because without a sample raw input and code it seems like a bad attempt to solicit feed back. Put the sample code, and raw examples up on gist.github.com or something…

  2. osfameron says:

    @EvanCarroll, interesting comment. That said, this was for a piece of $real_work, and the code/samples above are extracted from that and simplified a bit. Turning it into a mini-project with sample code and data would be a nice-to-have, but I don’t feel it’s vital to have those for every blog post you make…

    And yes, I was soliciting feedback, in the sense that I’d like to know how other people would approach the problem: I’ve already presented 2 possible ways to do it (1 of which I’m using myself), so I don’t feel like I’m holding much back. Am I missing your point?

  3. Evan Carroll says:

    Yes, with sample data. I’d be able to show you a mock up of how a spreadsheet import would be done: for csv — it is as simple as Text::CSV which can handle headers even and return a hash ref for each row. For fixed width stuff, try my DataExtract::FixedWidth. As for a statefull parser you don’t have to do all this stuff you’re doing. All you have to do is isolate a primary key and the parse the row using the same criteria, if that primary key is null you append the previous hashref or perform appropriately. But again, without a code sample, and raw input it becomes more work to throw new approaches.

    How would you tackle this task? Is there even an elegant way to do it imperatively?

    I’m just not sure how to answer that, if there was no real code, or real input data to present the answer with.

  4. RandomGuy says:

    Would you mind tagging your perl stuff somehow so it doesn’t show up on the haskell world blog?

  5. Olivier Mengué (dolmen) says:

    @Evan: for fixed with stuff there is also my own module: Text::FixedLengthMultiline