Sunday, December 31, 2006

 

A simple regex engine in Haskell

UPDATE: sorear from #haskell pasted a cool version of this here. His version is a Parsec Parser that returns another Parsec Parser! How cool is that? His version doesn't actually return the matches themselves, just a Bool, but still, it's a clever hack.

I recently wrote a simple regex engine (and parser) in Haskell. The implementation is far from optimal, but I'm still pretty excited about how easy it was, especially the parser. Thanks to psykotic for the inspiration: Without his regex engine in 14 lines of Python as an example, I probably wouldn't have tried something like this.

I'm going to walk over the implementation and give examples. The impatient can skip down to the end to see the full source.

import Text.ParserCombinators.Parsec

data Match = MkMatch String String deriving (Show, Eq)

type Matcher = Match -> [Match]

We'll need Text.ParserCombinators.Parsec later. The "Match" datatype has a single constructor, "MkMatch" (The constructor could have simply been called "Match", but I found that confusing), which takes two strings: The first string is the part of the text that matched, and the second string is the remainder. And the "Matcher" type synonym is for pretty type signatures. Armed with these types, we can write a function that matches (or does not match) a single character:

matchOne :: (Char -> Bool) -> Matcher
matchOne _ (MkMatch _ "") = []
matchOne f (MkMatch xs (y:ys))
  | f y = [MkMatch (xs ++ [y]) ys]
  | otherwise = []

matchOne takes a predicate function and returns a Matcher. We can use it like so:

*Main> matchOne (=='x') (MkMatch "" "xyz")
[MkMatch "x" "yz"]

That works fine to just match one thing, but it's hard to chain things together. We can take advantage of the fact that Haskell's built-in list datatype is a monad:

*Main> [MkMatch "" "xyz"] >>= matchOne (=='x')          
[MkMatch "x" "yz"]
*Main> [MkMatch "" "xyz"] >>= matchOne (=='x') >>= matchOne (=='y') >>= matchOne (=='z')
[MkMatch "xyz" ""]
*Main> [MkMatch "" "xyz"] >>= matchOne (=='x') >>= matchOne (=='y') >>= matchOne (=='z') >>= matchOne (=='a')
[]

Saying =='x' all the time is tedious:

rechar :: Char -> Matcher
rechar c = matchOne (==c)

Example:

*Main> [MkMatch "" "xyz"] >>= rechar 'x' >>= rechar 'y' >>= rechar 'z'                                       
[MkMatch "xyz" ""]

Now that we have matchOne, it's trivial to define the . regex metacharacter:

dot :: Matcher
dot = matchOne (const True)

"dot" is a Matcher that always matches one character. Matching single characters is all fine and dandy, but it's about time we had some more interesting way of combinding Matchers. Here is a simple (and wrong) implementation of the * quantifier:

badstar :: Matcher -> Matcher
badstar f x = (f x >>= badstar f) ++ [x]

It seems to work fine for simple cases:

*Main> [MkMatch "" "xxx123"] >>= badstar (rechar 'x')
[MkMatch "xxx" "123",MkMatch "xx" "x123",MkMatch "x" "xx123",MkMatch "" "xxx123"]

Notice how it returns all the matches, not just the first one. Pretty cool, eh?

But consider this (even simpler) case:

*Main> [MkMatch "" ""] >>= badstar (rechar 'x')
[MkMatch "" ""]

No real surprise here: * means "zero or more, as many as possible". So we're just seeing a zero-width match. No problem, right? Wrong. Look at this:

*Main> [MkMatch "" ""] >>= badstar (badstar (rechar 'x'))
*** Exception: stack overflow

Non-termination rears its ugly head. The problem with badstar is that whatever we're trying to match zero or more of might also have a zero-width match. In other words, in f x >>= badstar f, the list returned by f x may contain x! This leads to infinite recursion. My solution is to simply filter x out of f x:

star :: Matcher -> Matcher
star f x = (filter (/=x) (f x) >>= star f) ++ [x]

Which works how we want it to:

*Main> [MkMatch "" ""] >>= star (star (rechar 'x'))
[MkMatch "" ""]
*Main> [MkMatch "" "xx"] >>= star (star (rechar 'x'))
[MkMatch "xx" "",MkMatch "xx" "",MkMatch "x" "x",MkMatch "" "xx"]

However, I am not 100% sure that this solution covers every case. If anybody has any insight on this, I'd love to hear it.

Moving along, we can write combinators for alteration and sequencing:

alt :: [Matcher] -> Matcher
alt fs x = concatMap ($ x) fs

ordered :: [Matcher] -> Matcher
ordered fs x = foldl (>>=) [x] fs

alt and ordered both "reduce" a list of Matchers to a single Matcher. Remember that Matcher is just Match -> [Match], so we can do concatMap ($ x) fs with no problem.

It seems like ordered is probably a builtin function somewhere, but I couldn't find a standard equivalent.

Now we can write, for example, c(a|d)*r

*Main> [MkMatch "" "caddr"] >>= ordered [ rechar 'c', star (alt [ rechar 'a', rechar 'd' ]), rechar 'r' ]
[MkMatch "caddr" ""]

Rounding out our metacharacters are + and ?

plus :: Matcher -> Matcher
plus f = ordered [ f, star f ]

reoptional :: Matcher -> Matcher
reoptional f = alt [ f, return ]

And that's the basic engine. It's not very good: It use backtracking, but doesn't support any of the fun features (eg, backreferences) that a backtracking regex engine is supposed to have. I haven't done any benchmarks or profiling, but I doubt it's very fast. Also, there's not any sort of anchoring: everything has to match at the beginning the of string. But none of that matters: It's just a toy for me to learn about backtracking.

Now for the parser:

literalchar :: Parser Char
literalchar = letter <|> digit <|> oneOf " ,-_;:'\"" 

parseregex :: Parser Matcher
parseregex = do
  xs <- sepBy1 (many1 atomWithMod) (char '|')
  return (alt (map ordered xs))

parsechar :: Parser Matcher
parsechar = literalchar >>= (return . rechar)

atomWithMod :: Parser Matcher
atomWithMod = do
  a <- atom
  choice (map (\(c, f) -> char c >> return (f a)) mods) <|> return a
  where
    mods = [('+', plus), ('*', star), ('?', reoptional)]

atom :: Parser Matcher
atom = subexpression <|> charclass <|> wildcard <|> parsechar

subexpression :: Parser Matcher
subexpression = do
  char '('
  expr <- parseregex
  char ')'
  return expr 

charclass :: Parser Matcher
charclass = do
  char '['
  xs <- many1 ((try range) <|> parsechar)
  char ']'
  return (alt xs)

range :: Parser Matcher
range = do
  from <- literalchar
  char '-'
  to <- literalchar
  return (matchOne (`elem` [from..to]))

wildcard :: Parser Matcher
wildcard = char '.' >> return dot

matches :: String -> String -> [Match]
matches re str = [MkMatch "" str] >>= (runparser parseregex re)

runparser :: GenParser tok () a -> [tok] -> a
runparser p input = case (parse p "" input) of
  Right x -> x
  Left err -> error (show err)

I'm not going to talk about the parser very much because it doesn't seem nearly as interesting. The parser is fairly simple. It supports subexpressions, character classes, alternation, *, +, and ?. It doesn't support escaping characters, (ie, "\." to match a literal dot character.) but that would be fairly easy to add.

Some fun using the matches helper function:

*Main> matches "[0-9]+(,[0-9]+)*" "1,230,058" 
[MkMatch "1,230,058" "",MkMatch "1,230,05" "8",MkMatch "1,230,0" "58",MkMatch "1,230" ",058",MkMatch "1,23" "0,058",MkMatch "1,2" "30,058",MkMatch "1" ",230,058"]
*Main> matches "((great )*grand )?((fa|mo)ther)" "great great grand mother" 
[MkMatch "great great grand mother" ""]
*Main> matches "c[ad]+r" "caddr" 
[MkMatch "caddr" ""]
*Main> matches "(x*)*" "x" 
[MkMatch "x" "",MkMatch "" "x"]

And that's it. If anybody has any feedback or ideas on how to simplify it, I'd love to hear them. Oh, and here's the full source:

import Text.ParserCombinators.Parsec

data Match = MkMatch String String deriving (Show, Eq)

type Matcher = Match -> [Match]

matchOne :: (Char -> Bool) -> Matcher
matchOne _ (MkMatch _ "") = []
matchOne f (MkMatch xs (y:ys))
  | f y = [MkMatch (xs ++ [y]) ys]
  | otherwise = []

rechar :: Char -> Matcher
rechar c = matchOne (==c)

dot :: Matcher
dot = matchOne (const True)

badstar :: Matcher -> Matcher
badstar f x = (f x >>= badstar f) ++ [x]

star :: Matcher -> Matcher
star f x = (filter (/=x) (f x) >>= star f) ++ [x]

alt :: [Matcher] -> Matcher
alt fs x = concatMap ($ x) fs

ordered :: [Matcher] -> Matcher
ordered fs x = foldl (>>=) [x] fs

plus :: Matcher -> Matcher
plus f = ordered [ f, star f ]

reoptional :: Matcher -> Matcher
reoptional f = alt [ f, return ]

literalchar :: Parser Char
literalchar = letter <|> digit <|> oneOf " ,-_;:'\"" 

parseregex :: Parser Matcher
parseregex = do
  xs <- sepBy1 (many1 atomWithMod) (char '|')
  return (alt (map ordered xs))

parsechar :: Parser Matcher
parsechar = literalchar >>= (return . rechar)

atomWithMod :: Parser Matcher
atomWithMod = do
  a <- atom
  choice (map (\(c, f) -> char c >> return (f a)) mods) <|> return a
  where
    mods = [('+', plus), ('*', star), ('?', reoptional)]

atom :: Parser Matcher
atom = subexpression <|> charclass <|> wildcard <|> parsechar

subexpression :: Parser Matcher
subexpression = do
  char '('
  expr <- parseregex
  char ')'
  return expr 

charclass :: Parser Matcher
charclass = do
  char '['
  xs <- many1 ((try range) <|> parsechar)
  char ']'
  return (alt xs)

range :: Parser Matcher
range = do
  from <- literalchar
  char '-'
  to <- literalchar
  return (matchOne (`elem` [from..to]))

wildcard :: Parser Matcher
wildcard = char '.' >> return dot

matches :: String -> String -> [Match]
matches re str = [MkMatch "" str] >>= (runparser parseregex re)

runparser :: GenParser tok () a -> [tok] -> a
runparser p input = case (parse p "" input) of
  Right x -> x
  Left err -> error (show err)

Comments: Post a Comment

<< Home

This page is powered by Blogger. Isn't yours?