I’ve been prototyping ways for customers to import data into our system.
While some kind of JSON/XML structured data would be easiest for us,
our customers like spreadsheets. That’s cool of course, and it’s easy to
mock up an import like this:
Product | Price | Category |
---|---|---|
Brie | £ 2.00 | Dairy |
Chablis | £ 5.00 | Wine |
Of course, not all data fits nicely into this kind of very flat structure.
Take a field with multiple values, for example. We could do something like
Product | Price | Tags |
---|---|---|
Brie | £ 2.00 | Dairy,Cheese,Food |
Chablis | £ 5.00 | Wine,Alcohol,Drink |
but it might be nicer for the customer to organize like this instead:
| Product | Price | Tags |
|---|---|---|
| Brie | £ 2.00 | Dairy |
| | | Cheese |
| | | Food |
| Chablis | £ 5.00 | Wine |
| | | Alcohol |
| | | Drink |
The problem is that we can no longer simply parse the table one row at a time.
We have to remember where the definition started, and finalise the item once
it has been completely defined (which we only know when the next item
starts, or when the rows run out).
I started prototyping this in Perl, and ended up with some typical imperative
code like this:

    my (@products, $product);
    for my $r ( @rows ) {
        my $row = get_row_data($r);
        if ($row->has_product) {
            # A new item starts here, so the previous one (if any) is complete
            push @products, $product if $product;
            $product = Product->new(
                product => $row->product,
                price   => $row->price
            );
        } else {
            die "No product" unless $product;
        }
        $product->add_tag($row->tag);
    }
    # The last item is never followed by a new one, so finalise it by hand
    push @products, $product if $product;
This is quite yucky. See how I’m repeating the push @products line:
once when we see a new product, and again after we’ve exited the loop.
Then we have the multiple assignments to $product and so on.
It all seems to work OK, but I really dislike this logic, and am beginning
to think it’s a perfect example of an imperative antipattern. So, let’s see if
we can do it more elegantly. Thinking about how I’d do it functionally, in
Haskell, I’d use groupBy over the rows, extending the current group until a
row with a non-blank product starts the next one.
So let’s try that in Perl! We’d want something like:

    my @items = groupBy(
        sub {
            my (undef, $next_row) = @_;
            ! $next_row->has_product   # stay in the group while the product cell is blank
        },
        map { get_row_data($_) } @rows
    );
    my @products = map {
        my $group = $_;
        my $row_1 = $group->[0];       # only the first row carries product and price
        Product->new(
            product => $row_1->product,
            price   => $row_1->price,
            tags    => [ map { $_->tag } @$group ],
        );
    } @items;
That has less repetition, and less accidental complexity. First we group
the lines. Then we turn the groups of related rows into new objects. Job
done.
The groupBy function is interesting: notice how its callback takes
2 arguments ($this_item, $next_item). As we don’t care about the
current row, only the next one, we just unpack them as (undef,
$next_row).
groupBy is also problematic though, in that it doesn’t exist in any of
the normal places I’d have expected (List::Util, List::MoreUtils etc.). So
let’s create it. The ideal way might be to translate Haskell’s definition from
Data.List, but given that Perl lists don’t have lazy semantics, and we’d
then have to implement span etc., let’s just write a noddy imperative
version for now:

    sub groupBy {
        my ($fn, @elems) = @_;
        return unless @elems;            # no elements, no groups

        # Not named $a/$b: those are the special variables used by sort
        my $prev  = shift @elems;
        my @group = ($prev);
        my @groups;
        while (@elems) {                 # test the length so false values (0, '') survive
            my $next = shift @elems;
            if ($fn->($prev, $next)) {
                push @group, $next;
            } else {
                push @groups, [@group];
                @group = ($next);
            }
            $prev = $next;
        }
        push @groups, [@group];
        return @groups;
    }
This of course has much of the unpleasantness I was complaining about before,
but at least it’s encapsulated, and allows us to use groupBy neatly.
(And we can always come back and clean up the internals later).
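
As a rough idea of what such a cleanup could look like, here is a sketch that keeps the newest group as the last element of @groups, so there is no separate "current group" to flush at the end. It is run end to end on made-up data: plain hashes stand in for the objects returned by get_row_data and for the Product class, and the sample rows are my own, so treat the details as assumptions rather than the real import code.

```perl
use strict;
use warnings;

# Compact variant: the newest group always lives at $groups[-1],
# and the previous element is simply the last element of that group.
sub groupBy {
    my ($fn, @elems) = @_;
    return unless @elems;
    my @groups = ([ shift @elems ]);
    for my $next (@elems) {
        my $prev = $groups[-1][-1];
        if ($fn->($prev, $next)) { push @{ $groups[-1] }, $next }
        else                     { push @groups, [$next] }
    }
    return @groups;
}

# Hypothetical rows: a blank product means "continuation of the item above"
my @rows = (
    { product => 'Brie',    price => '2.00', tag => 'Dairy'   },
    { product => '',        price => '',     tag => 'Cheese'  },
    { product => '',        price => '',     tag => 'Food'    },
    { product => 'Chablis', price => '5.00', tag => 'Wine'    },
    { product => '',        price => '',     tag => 'Alcohol' },
    { product => '',        price => '',     tag => 'Drink'   },
);

my @items = groupBy(
    sub { my (undef, $next) = @_; $next->{product} eq '' },
    @rows
);

my @products = map {
    my $group = $_;
    +{
        product => $group->[0]{product},
        price   => $group->[0]{price},
        tags    => [ map { $_->{tag} } @$group ],
    };
} @items;

printf "%s: %s\n", $_->{product}, join(',', @{ $_->{tags} }) for @products;
```

One caveat on semantics: like the version above, this compares each element with the one immediately before it, whereas Haskell’s Data.List.groupBy passes the first element of the current group as the first argument. Because the predicate here ignores its first argument, the difference doesn’t matter for this import.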
(And yes, I know I could define groupBy with a (&@) prototype so that I
could omit the sub keyword before the block, as with map/grep. But this is
more annoying than useful, as I can’t then simply call: groupBy \&function.)
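
For comparison, the prototyped form would look roughly like the sketch below. The name group_by_block is made up for illustration, and the body is just a compact adjacent-pairs grouping, not necessarily what you’d ship.

```perl
use strict;
use warnings;

# Hypothetical prototyped variant. The (&@) prototype has to be seen by the
# compiler before the call site, so the sub is defined first.
sub group_by_block(&@) {
    my ($fn, @elems) = @_;
    return unless @elems;
    my @groups = ([ shift @elems ]);
    for my $next (@elems) {
        if ($fn->($groups[-1][-1], $next)) { push @{ $groups[-1] }, $next }
        else                               { push @groups, [$next] }
    }
    return @groups;
}

# Block syntax: no "sub" keyword and no comma after the block
my @runs = group_by_block { $_[0] == $_[1] } 1, 1, 2, 3, 3, 3;

print join(' ', map { '[' . join(',', @$_) . ']' } @runs), "\n";
# prints: [1,1] [2] [3,3,3]
```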
How would you tackle this task? Is there even an elegant way to do it
imperatively?
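
One possible imperative answer, offered as a sketch rather than the definitive fix: push each new product onto the list as soon as its first row appears, then attach tags to whichever product is currently last. That removes both the duplicated push and the separate $product variable. Plain hashes stand in for Product and get_row_data here, and the sample rows are made up.

```perl
use strict;
use warnings;

# Hypothetical rows: a blank product means "continuation of the item above"
my @rows = (
    { product => 'Brie',    price => '2.00', tag => 'Dairy'   },
    { product => '',        price => '',     tag => 'Cheese'  },
    { product => '',        price => '',     tag => 'Food'    },
    { product => 'Chablis', price => '5.00', tag => 'Wine'    },
    { product => '',        price => '',     tag => 'Alcohol' },
    { product => '',        price => '',     tag => 'Drink'   },
);

my @products;
for my $row (@rows) {
    if ($row->{product} ne '') {
        # Register the item immediately, so nothing is left to flush after the loop
        push @products, {
            product => $row->{product},
            price   => $row->{price},
            tags    => [],
        };
    }
    die "No product\n" unless @products;

    # Every row, first or continuation, adds its tag to the most recent item
    push @{ $products[-1]{tags} }, $row->{tag};
}

printf "%s: %s\n", $_->{product}, join(',', @{ $_->{tags} }) for @products;
```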