Haskell 'words' and Perl 'split'

Haskell’s prelude has a function words that splits a string by spaces.

 Prelude> words "nice cup of tea" ["nice","cup","of","tea"]

Apparently the question comes up quite regularly on irc or
haskell-cafe as to why this function is specialised to split only on
whitespace. Perl’s split, for example, can split on any character, or indeed string or regular expression.

As quicksilver has suggested, the split function is more complicated than you might think:

  • character/string/regular expression
  • include the split token in the list?
  • collapse empty tokens?

and therefore perhaps the reason that there is no general function, is that it would require a very complex type.

I thought about this some more, and if I’ve got anything at all
about functional programming, it’s that you can build up progressively
more complicated functions from smaller pieces. I wondered if I could
do the same with split.

After playing a little with unfoldr, I decided I was better
off using a simple recursive solution for now until I understand it
better. But the first thing I need is a function to split a string just
once.

 > splitOne = splitOne' [] >  where splitOne' acc p []         = (acc, Nothing, Nothing)                                                              >        splitOne' acc p xs@(x:xs') =  >            let m = p xs                                                                         >            in case m of                                                                                    >                Just (s, rest) -> (acc, Just s, Just rest)                                                  >                Nothing        -> splitOne' (acc++[x]) p xs'

splitOne takes a function which either returns the separator and the
rest of the string, or Nothing. We iterate the list of characters,
stopping at the first where this function matches. Examples of these
functions:

 > onCharP p xs@(x:xs') | p x    = Just ([x], xs') >                      | otherwise = Nothing                                                                                  > onChar c = onCharP (==c)
 > onSpace = onCharP isSpace > onComma = onChar  ','

At which point we can do:

 *Main> splitOne onSpace "nice cup of tea" ("nice",Just " ",Just "cup of tea") *Main> splitOne onSpace "nice" ("nice",Nothing,Nothing)

So we now need to run for the whole length of the string, which is where the actual split function comes in.

 > split t p [] = [] > split t p xs = let (tok,sep,rest) = splitOne p xs >                    res = t (tok,sep) >                 in  case rest of >                      Nothing    -> res >                      Just rest' -> res ++ split t p rest'

split takes a transformation function as well as a
predicate. This takes the lists of (separator,token) and transforms
them as required.

 > onlyToken :: (t, t1) -> [t] > onlyToken (x,_)  = [x]
 > -- onlyWord ("",_) = [] > onlyWord ([],_) = [] > onlyWord (x, _) = [x]
 > tokenAndSep (t, Nothing) = [t] > tokenAndSep (t, Just s)  = [t,s]

This means that you can write the words function, as well as a function to split on commas, with different behaviours.

 > words  = split onlyWord    onSpace > commas = split tokenAndSep onComma

As quicksilver suggested, split does indeed have a rather complicated type:

 *Main> :t split split :: (([a1], Maybe a2) -> [a]) -> ([a1] -> Maybe (a2, [a1])) -> [a1] -> [a]

but the final function is simple enough. I did promise that we’d be
able to split on words as well as characters, and this is why splitOne
runs the predicate against xs instead of just the head of the list.

 > onPrefix :: Eq a => [a] -> [a] -> Maybe ([a], [a]) > onPrefix = onPrefix' [] > where onPrefix' :: Eq a => [a] -> [a] -> [a] -> Maybe ([a], [a])            >        onPrefix' acc [] s2 = Just (acc, s2) >        onPrefix' acc _  [] = Nothing                                         >        onPrefix' acc s1 s2  >           | (head s1) == (head s2)  >                 = onPrefix' (acc++[head s1]) (tail s1) (tail s2) >           | otherwise = Nothing

Which gives us:

 *Main> split onlyWord (onPrefix "and") "sex and drugs and rock and roll" ["sex "," drugs "," rock "," roll"]

OK, this is still missing two important things from the Perl function:

  • split on a regexp. You could use some parser combinator as the splitOne predicate.
  • split a certain number of fields and then stop, like so:
      DB> x split /,/, "red,green,yellow,blue", 3  0  'red'  1  'green'  2  'yellow,blue'

    This one, I haven’t really thought enough about how to implement, without complicating things.

As usual when I start on Haskell, I’m missing some obvious way to
make the API not suck horribly, so comments and criticism very welcome.

Update Nov 15:

Thanks to everyone who responded here and on reddit (it is a constant source of pleasure and amazement that people find these silly posts worth commenting on :-)

Andrew, vincentk, and geezusfreak commented that splitting a string
is really a form of parsing, and therefore they would use a proper
parser library like Parsec. Good point, though split in Perl is often
“just good enough” as a lightweight solution. I’d like to see a
solution in Parsec or similar though.

Mamie Camacho suggested to use Text.Regex:

I had a play with this and it’s quite simple to do something like:

 > splitRegex (mkRegex "\\s*,\\s*") "eggs,ham, whatever"

Haskell’s string quoting is less pleasant than Perl’s (having to
quote the backtick) and this version doesn’t seem to have the semantics
of keeping the separator in capture brackets. (e.g. if you use the
string “(\\s*,\\s*)“)

Other good suggestions included starting from the source of words and deriving a more general solution from that. Thanks again!