Some progressive Haskell hackers may wish to switch from Parsec to Megaparsec. This tutorial explains practical differences between the two libraries that you will need to address if you choose to undertake the switch. Remember, all the functionality available in Parsec is available in Megaparsec and often in better form.
- Renamed things
- Removed things
- Completely changed things
- Character parsing
- Expression parsing
- What happened to
- What’s next?
You’ll mainly need to replace the “Parsec” part of your imports with “Megaparsec”. That’s pretty simple. A typical import section of a module that uses Megaparsec looks like this:

    -- this module contains commonly useful tools:
    import Text.Megaparsec
    -- this module depends on type of data you want to parse, you only need to
    -- import one of these:
    import Text.Megaparsec.String          -- if you parse ‘String’
    import Text.Megaparsec.ByteString      -- if you parse strict ‘ByteString’
    import Text.Megaparsec.ByteString.Lazy -- if you parse lazy ‘ByteString’
    import Text.Megaparsec.Text            -- if you parse strict ‘Text’
    import Text.Megaparsec.Text.Lazy       -- if you parse lazy ‘Text’
    -- if you need to parse permutation phrases:
    import Text.Megaparsec.Perm
    -- if you need to parse expressions:
    import Text.Megaparsec.Expr
    -- if you need to parse languages:
    import qualified Text.Megaparsec.Lexer as L

So the only noticeable difference is that Megaparsec has no Text.Megaparsec.Token module; it is replaced by Text.Megaparsec.Lexer. See the section “What happened to” for details.
Renamed things
Megaparsec introduces a more consistent naming scheme, so some things are called differently, but renaming functions is an easy, mechanical task. Here are renamed items:
Removed things
Parsec also has many names for the same or similar things. Megaparsec usually has one function per task that does its job well. Here are the items that were removed in Megaparsec and the reasons for their removal:
- parseFromFile: reading from a file and then parsing its contents is trivial for every instance of Stream, and this function provides no way to use newer methods for running a parser.
- modifyState: ad-hoc backtracking user state has been eliminated.
- tokens: there are now slightly different versions of these functions under the same names.
- Consumed: these are not public data types anymore, because they are low-level implementation details.
- runP: these were essentially synonyms for other functions.
Completely changed things
In Megaparsec 5 the modules Text.Megaparsec.Error are completely different from those found in Parsec and Megaparsec 4. Take some time to look at the documentation of the modules if your use case requires operations on error messages or positions. You may like the fact that we now have well-typed and extensible error messages.
- The Stream type class now has an updatePos method that gives precise control over how textual positions advance during parsing.
- Note that the argument order of label has been flipped (the label itself goes first now), so you can now write: myParser = label "my parser" $ ….
- … <?> "") idiom to “hide” some “expected” tokens from error messages, use
- The token parser is more powerful: its first argument provides full control over the reported error message, while its second argument specifies how to report a missing token when the input stream is empty.
- The tokens parser lets you control how tokens are compared (yes, we have case-insensitive matching).
- The unexpected parser lets you specify precisely what is unexpected in a well-typed manner.
- Tab width is not hard-coded anymore; use setTabWidth to change it from the default.
- Now you can reliably test error messages: equality for them is defined properly (in Parsec Expect "foo" is equal to Expect "bar"), and error messages are also well-typed and customizable.
- To render an error message in a custom way, call parseErrorPretty on it.
- count' m n p allows you to parse from m to n occurrences of p.
- You now have eitherP out of the box.
- Token-based combinators like string' backtrack by default, so it’s not necessary to use try with them (beginning from 4.4.0). This feature does not affect performance.
- The failure combinator allows you to fail with an arbitrary error message, even a completely custom one.
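A few of the items above can be tried together in a small sketch. It assumes Megaparsec 5 with the String-based Parser type; the parser names (identifier, keyword, shortNumber, wordOrDigits) and the runAll helper are made up for illustration and are not part of the library.

```haskell
import Text.Megaparsec
import Text.Megaparsec.String (Parser)

-- 'label' takes the label first and the parser second.
identifier :: Parser String
identifier = label "identifier" (some letterChar)

-- 'string' backtracks on failure, so the second alternative is still
-- tried although both keywords start with "le"; no 'try' is needed.
keyword :: Parser String
keyword = string "lexeme" <|> string "let"

-- Between two and four digits.
shortNumber :: Parser String
shortNumber = count' 2 4 digitChar

-- Either a word (Left) or a run of digits (Right).
wordOrDigits :: Parser (Either String String)
wordOrDigits = eitherP (some letterChar) (some digitChar)

-- Hypothetical helper: run a parser on the whole input and render
-- either the result or the pretty-printed parse error as text.
runAll :: Show a => Parser a -> String -> String
runAll p input = either parseErrorPretty show (parse (p <* eof) "" input)
```

In GHCi, something like `putStrLn (runAll shortNumber "1")` should show how a failed parse is reported by parseErrorPretty.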
Character parsing
New character parsers in Text.Megaparsec.Char may be useful if you work with Unicode:
Ever wanted to have case-insensitive character parsers? Here you go:
Expression parsing
makeExprParser takes its arguments in flipped order: the term parser comes first, the operator table second. To specify the associativity of infix operators, use one of these three constructors:
- InfixN: non-associative infix
- InfixL: left-associative infix
- InfixR: right-associative infix
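Here is a minimal sketch of an integer expression parser built this way, assuming Megaparsec 5. The grammar, the evaluation to Integer, and all names (sc, lexeme, symbol, term, expr, table) are choices made for this example. Rows of the table are listed from highest to lowest precedence.

```haskell
import Control.Applicative (empty)
import Control.Monad (void)
import Text.Megaparsec
import Text.Megaparsec.Expr
import Text.Megaparsec.String (Parser)
import qualified Text.Megaparsec.Lexer as L

-- Space consumer without comment support (both comment parsers are 'empty').
sc :: Parser ()
sc = L.space (void spaceChar) empty empty

lexeme :: Parser a -> Parser a
lexeme = L.lexeme sc

symbol :: String -> Parser String
symbol = L.symbol sc

-- A term is an integer literal or a parenthesized expression.
term :: Parser Integer
term = lexeme L.integer <|> between (symbol "(") (symbol ")") expr

-- Term parser first, operator table second.
expr :: Parser Integer
expr = makeExprParser term table

table :: [[Operator Parser Integer]]
table =
  [ [ InfixR ((^) <$ symbol "^") ]   -- right-associative power
  , [ InfixL ((*) <$ symbol "*") ]   -- left-associative product
  , [ InfixL ((+) <$ symbol "+")     -- left-associative sum and difference
    , InfixL ((-) <$ symbol "-") ]
  ]
```

For example, `parseMaybe expr "2 ^ 3 ^ 2"` groups to the right because of InfixR, while `parseMaybe expr "10 - 3 - 2"` groups to the left.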
What happened to
The Text.Parsec.Token module was extremely inflexible, and thus it has been eliminated. In Megaparsec you have Text.Megaparsec.Lexer instead, which doesn’t impose anything on the user but provides useful helpers. The module can also parse indentation-sensitive languages.
Let’s quickly describe how you go about writing your lexer with Text.Megaparsec.Lexer. First, you should import the module qualified; we will use L as its synonym here.
Start writing your lexer by defining what counts as white space in your language. The space, skipLineComment, and skipBlockComment helpers can be useful here:

    sc :: Parser () -- ‘sc’ stands for “space consumer”
    sc = L.space (void spaceChar) lineComment blockComment
      where lineComment  = L.skipLineComment "//"
            blockComment = L.skipBlockComment "/*" "*/"

This is generally called a space consumer. Often you’ll need only one space consumer, but you can define as many of them as you want. Note that this new module lets you avoid consuming newline characters automatically: just use something other than void spaceChar as the first argument of space. Even better, you can control what white space is on a per-lexeme basis:

    lexeme :: Parser a -> Parser a
    lexeme = L.lexeme sc

    symbol :: String -> Parser String
    symbol = L.symbol sc

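To see these pieces working together, here is a small self-contained sketch, assuming Megaparsec 5. The intList and wholeInput parsers are hypothetical examples, not library functions. Since lexeme and symbol only consume white space after a token, leading white space is skipped once at the very start.

```haskell
import Control.Monad (void)
import Text.Megaparsec
import Text.Megaparsec.String (Parser)
import qualified Text.Megaparsec.Lexer as L

-- Space consumer: white space, "//" line comments, "/* */" block comments.
sc :: Parser ()
sc = L.space (void spaceChar) lineComment blockComment
  where
    lineComment  = L.skipLineComment "//"
    blockComment = L.skipBlockComment "/*" "*/"

lexeme :: Parser a -> Parser a
lexeme = L.lexeme sc

symbol :: String -> Parser String
symbol = L.symbol sc

-- Hypothetical example: a bracketed, comma-separated list of integers.
intList :: Parser [Integer]
intList = between (symbol "[") (symbol "]") (lexeme L.integer `sepBy` symbol ",")

-- Skip leading white space once, then require the parser to reach end of input.
wholeInput :: Parser a -> Parser a
wholeInput p = sc *> p <* eof
```

With these definitions, `parseMaybe (wholeInput intList) "[1, 2 /* two */ , 3]"` accepts comments and white space between any two tokens.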
Note that all tools in Megaparsec work with any instance of MonadParsec. All commonly useful monad transformers like WriterT are instances of MonadParsec out of the box. For example, suppose you want to collect the contents of comments (say, they are documentation strings of a sort). You may want to have a backtracking user state where you put the last encountered comment satisfying some criteria; then, when you parse a function definition, you can check the state and attach the doc-string to your parsed function. It’s all possible and easy with Megaparsec:

    import Control.Monad.State.Lazy

    …

    type MyParser = StateT String Parser

    skipLineComment' :: MyParser ()
    skipLineComment' = …

    skipBlockComment' :: MyParser ()
    skipBlockComment' = …

    sc :: MyParser ()
    -- qualified, so that it is not confused with ‘space’ from Text.Megaparsec
    sc = L.space (void spaceChar) skipLineComment' skipBlockComment'

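One way the elided parts could look is sketched below, assuming Megaparsec 5 and the mtl package. Everything specific here is an assumption made for the example: the "--|" doc-comment prefix, the docComment, definition, and runMyParser names, and the decision to leave out block comments.

```haskell
import Control.Applicative (empty)
import Control.Monad (void)
import Control.Monad.State.Lazy
import Text.Megaparsec
import Text.Megaparsec.String (Parser)
import qualified Text.Megaparsec.Lexer as L

-- The state holds the text of the last doc comment seen.
type MyParser = StateT String Parser

-- Hypothetical convention: a line comment starting with "--|" is a doc string.
docComment :: MyParser ()
docComment = do
  void (string "--|")
  text <- many (noneOf "\n")
  put text

-- Space consumer that records doc comments; block comments are not supported.
sc :: MyParser ()
sc = L.space (void spaceChar) docComment empty

-- Hypothetical "definition": a name paired with the doc comment preceding it.
definition :: MyParser (String, String)
definition = do
  sc
  name <- some letterChar
  doc  <- get
  return (name, doc)

-- Run a stateful parser on the whole input, starting with an empty doc string.
runMyParser :: MyParser a -> String -> Maybe a
runMyParser p = parseMaybe (evalStateT p "")
```

Because the state lives in StateT on top of the parser, it is rolled back together with the input whenever the parser backtracks.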
Parsing of indentation-sensitive language deserves its own tutorial, but let’s take a look at basic tools upon which you can build. First of all you should work with space consumer that doesn’t eat newlines automatically. This means you’ll need to pick them up manually.
The main helper is called indentGuard. It takes a parser that will be used to consume white space (indentation) and a predicate of type Int -> Bool. If, after running the given parser, the column number does not satisfy the given predicate, the parser fails with the message “incorrect indentation”; otherwise it returns the current column number.
In simple cases you can explicitly pass around the value returned by indentGuard, i.e. the current level of indentation. If you prefer to preserve some sort of state, you can achieve backtracking state by combining StateT and ParsecT, like this:

    StateT Int Parser a

Here we have state of type Int. You can use put as usual, although it may be better to write a modified version of indentGuard that gets the current indentation level (the indentation level on the previous line), then consumes the indentation of the current line, performs the necessary checks, and puts the new level of indentation.
Later update: we now have full support for indentation-sensitive parsing, see lineFold.
Character and string literals
Parsing of string and character literals is done a bit differently than in Parsec. You have a single helper, charLiteral, which parses a character literal. It does not parse surrounding quotes, because different languages may quote character literals differently. The purpose of this parser is to help with parsing of conventional escape sequences (the literal character is parsed according to the rules defined in the Haskell report).
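
    charLiteral :: Parser Char
    -- note the qualified ‘L.charLiteral’: the unqualified name would be
    -- a recursive call to this very definition
    charLiteral = char '\'' *> L.charLiteral <* char '\''
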
You can also use charLiteral to parse string literals. This is a simplified version that will accept plain (not escaped) newlines in string literals (it’s easy to make it conform to Haskell syntax; this is left as an exercise for the reader):

    stringLiteral :: Parser String
    stringLiteral = char '"' >> manyTill L.charLiteral (char '"')

I should note that in charLiteral we use built-in support for parsing all the tricky combinations of characters. Parsec, on the other hand, re-implements the whole thing. Given that it has no proper testing at all, I cannot tell for sure that it works.
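As a partial take on the exercise mentioned above, the sketch below rejects plain newlines inside string literals. It assumes Megaparsec 5; strictStringLiteral is a hypothetical name, and this version still does not claim to cover every rule of Haskell string syntax.

```haskell
import Text.Megaparsec
import Text.Megaparsec.String (Parser)
import qualified Text.Megaparsec.Lexer as L

-- Like the simplified string literal parser, but a raw (unescaped)
-- newline inside the literal makes the parser fail.
strictStringLiteral :: Parser String
strictStringLiteral = char '"' >> manyTill literalChar (char '"')
  where
    -- Refuse a raw newline, otherwise defer to 'L.charLiteral',
    -- which handles escape sequences such as \n and \t.
    literalChar = notFollowedBy newline *> L.charLiteral
```

An escaped newline written as a backslash followed by n is still accepted and yields a newline character in the result.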
Parsing of numbers is easy:

    integer :: Parser Integer
    integer = lexeme L.integer

    float :: Parser Double
    float = lexeme L.float

    number :: Parser Scientific
    number = lexeme L.number -- similar to ‘naturalOrFloat’ in Parsec

Note that Megaparsec internally uses standard Haskell functions to parse floating point numbers, so no precision is lost (and it’s tested). Parsec, on the other hand, again re-implements the whole thing. The approach taken by the Parsec authors is to parse the digits one by one and then re-create the floating point number by means of floating point arithmetic. This cannot work reliably, because every intermediate operation may introduce a rounding error; a floating point number has to be parsed by a dedicated conversion routine (it’s usually done at a low level, in C libraries). As a result, the numbers produced by Parsec’s built-in floating point parser could be incorrect. This is a known bug now, but it took a long time until we “discovered” it because, again, Parsec has no test suite. (Update: it took one year, but Parsec’s maintainer has recently merged a pull request that seems to fix that and released Parsec 3.1.11.)
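To make the difference between the two approaches concrete, here is a minimal sketch that uses only base. It is not Parsec’s actual code: naiveFloat is a hypothetical function that mimics the digit-by-digit idea, and preciseFloat stands for the “use standard Haskell functions” approach. Both take the digits before and after the decimal point as separate strings.

```haskell
import Data.Char (digitToInt)

-- Rebuild the fractional part digit by digit with floating point
-- arithmetic. Every division by 10 may round, so errors can accumulate.
naiveFraction :: String -> Double
naiveFraction = foldr step 0
  where
    step d acc = (acc + fromIntegral (digitToInt d)) / 10

-- Naive approach: integer part plus the reconstructed fraction.
naiveFloat :: String -> String -> Double
naiveFloat whole frac = fromIntegral (read whole :: Integer) + naiveFraction frac

-- Standard approach: let 'read' convert the complete literal in one step.
preciseFloat :: String -> String -> Double
preciseFloat whole frac = read (whole ++ "." ++ frac)
```

Comparing `naiveFloat w f` with `preciseFloat w f` over many inputs is an easy way to look for literals where the last bits differ.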
Hexadecimal and octal numbers do not parse “0x” or “0o” prefixes, because different languages may have other prefixes for this sort of numbers. We should parse the prefixes manually:

    hexadecimal :: Parser Integer
    hexadecimal = lexeme $ char '0' >> char' 'x' >> L.hexadecimal

    octal :: Parser Integer
    octal = lexeme $ char '0' >> char' 'o' >> L.octal

Since the Haskell report says nothing about the sign in numeric literals, basic parsers like integer do not parse a sign. You can easily create parsers for signed numbers with the help of signed:

    signedInteger :: Parser Integer
    signedInteger = L.signed sc integer

    signedFloat :: Parser Double
    signedFloat = L.signed sc float

    signedNumber :: Parser Scientific
    signedNumber = L.signed sc number

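The number parsers can be combined into one literal parser, as in the sketch below. It assumes Megaparsec 5; integerLiteral is a hypothetical name, and the space consumer here ignores comments to keep the example short. The try is needed because the hexadecimal branch may consume the leading 0 before it fails on a plain decimal number.

```haskell
import Control.Applicative (empty)
import Control.Monad (void)
import Text.Megaparsec
import Text.Megaparsec.String (Parser)
import qualified Text.Megaparsec.Lexer as L

-- Space consumer without comment support.
sc :: Parser ()
sc = L.space (void spaceChar) empty empty

lexeme :: Parser a -> Parser a
lexeme = L.lexeme sc

-- Optional sign, optional white space after it, then a decimal integer.
signedInteger :: Parser Integer
signedInteger = L.signed sc (lexeme L.integer)

-- "0x" or "0X" prefix followed by hexadecimal digits.
hexadecimal :: Parser Integer
hexadecimal = lexeme (char '0' >> char' 'x' >> L.hexadecimal)

-- Hypothetical combined literal: try the prefixed form first,
-- then fall back to a signed decimal integer.
integerLiteral :: Parser Integer
integerLiteral = try hexadecimal <|> signedInteger
```

Passing sc to L.signed means white space is allowed between the sign and the digits; pass a parser that consumes nothing if your language forbids that.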
And that’s it: shiny and new, Text.Megaparsec.Lexer is at your service. Now you can implement anything you want without the need to copy and edit the entire Text.Parsec.Token module (people had to do it sometimes, you know).
What’s next?
Changes you may want to perform may be more fundamental than those described here. For example, previously you may have had to use a workaround because Text.Parsec.Token was not sufficiently flexible. Now you can replace it with a proper solution. If you want to use the full potential of Megaparsec, take time to read about its features; they can help you improve your code.