Haskell applicative parsers

Motivation

When I first started to learn Haskell, I was very annoyed by the widespread use of complex abstractions instead of some specific solutions. It seemed to me that it would be much better to always follow the KISS principle and write bicycles using elementary language constructions than to sort through all these types of classes in order to write one supposedly convenient structure somewhere in the end.

I lacked a good example where the efforts spent on mastering the "materiel" would pay off. For me, one of the most successful such examples was the parsers. Now I often talk about them when they ask me for which common tasks you can beautifully use Haskell.

I want to suggest that beginners also go through this path and create from scratch a small base of functions for conveniently implementing parsers, and then use it to write your own parser, the code of which will almost literally repeat the grammar that is being analyzed.

I hope it will help someone to overcome the fear of abstractions and teach them how to use them (yes, I still think that sometimes it is more efficient to write a bicycle).

I have no goal and no desire to make a Haskell course out of the article from scratch, so I assume that the reader is familiar with the syntax and independently developed simple programs. Just in case, I will briefly talk about the type classes before proceeding to the description of the implementation.

For those who have never written in Haskell, but want to understand what is going on here, I recommend that you first look at the corresponding page on Learn X in Y minutes . As an excellent Russian-language book for beginners, I advise Denis Shevchenko, "About Haskell like a human being" .

I will try to use the most simple language constructs that beginners can understand. At the end of the article there is a link to the source repository, where in some parts of the code a more convenient and short record is used, which may be less clear at a glance.

And yes, gentlemen, Haskelists, many things are explained very simply and clumsily, for particular cases, not very abstract, without the use of terms from category theory and other scary words. I am glad that you know them and of course easily mastered them. I also know them, but I do not consider it necessary to throw out such a volume of information in this context to unprepared readers.

Type Classes

Type classes in Haskell have nothing to do with classes in C ++ and other object-oriented languages. If we draw an analogy with OOP, the types of classes are more like the overloading of methods and functions.

Classes define what actions can be performed with objects of the types that are included in the class. For example, all numbers can be compared by equality, but everything can be ordered except complex ones, and functions in general cannot be compared at all. The class of types that can be compared is called Eq , ordered - Ord (types do not have to be numeric). What can be printed by translating into a string belongs to the class Show , it has the “opposite” class Read , which determines how to convert the strings to objects of the desired type.

For a set of standard classes of types (such as Eq , Show , Read ...), you can ask the compiler to implement the necessary functionality in the standard way, using the deriving keyword after determining the type:

 data Point = Point { xCoord :: Float , yCoord :: Float } deriving (Eq, Show)

You can define your own type classes:

 class PrettyPrint a where pPrint :: a -> String

Here PrettyPrint is the class name, a is a type variable. The where keyword is followed by a list of so-called class methods, i.e. functions that can be applied to objects of type from this class.

In order to denote the belonging of a data type to a class, the following construction is used:

 instance PrettyPrint Point where pPrint (Point xy) = "(" ++ show x ++ ", " ++ show y ++ ")"

The language allows you to specify restrictions on the types of classes to which the function arguments should refer:

 showVsPretty :: (Show a, PrettyPrint a) => a -> (String, String) showVsPretty x = (show x, pPrint x)

For each function call, the compiler checks whether these requirements for the type are met, and if it fails, it displays an error (of course, this happens at the compilation stage).

 >>> showVsPretty (Point 2 3) ("Point {xCoord = 2.0, yCoord = 3.0}","(2.0, 3.0)") >>> showVsPretty "str" error: No instance for (PrettyPrint [Char]) arising from a use of 'showVsPretty'

Implementation

The parser receives as input a string that should parse according to predefined rules and get the value of the type we need (for example, an integer). In this case, the input line may not end, and the remainder will serve as an input for further parsing. In addition, our parser will generally be non-deterministic, i.e. will return several possible parse results as a list.

For describing one result of a parser, a two-element tuple (String, a) suitable, where a is a type variable that can designate any custom type.

Since the parser parses the string according to some rules, we describe it as a function that takes a string as input and returns a list of results:

 newtype Parser a = Parser { unParser :: String -> [(String, a)] }

We will consider the parsing successful if the list of results consists of one element and the input string has been completely processed. We implement an auxiliary function that attempts to perform an unambiguous parsing of the entire string:

 parseString :: String -> Parser a -> Maybe a parseString s (Parser p) = case (ps) of [("", val)] -> Just val _ -> Nothing

Simple parsers

We implement several simple parsers, which will then come in handy in building more complex combinations.

We begin by parsing a single character that must satisfy the predicate. If the input string is empty, then the result of the work is an empty list. Otherwise, check the value of the predicate on the first character of the string. If the return value is True , then the parsing result is this character; return it along with the rest of the string. Otherwise, parsing also ends in failure.

 predP :: (Char -> Bool) -> Parser Char predP p = Parser f where f "" = [] f (c : cs) | pc = [(cs, c)] | otherwise = []

Now we can write a parser that accepts a specific character at the beginning of a line. To do this, use the predP just written and give it as an argument a function that compares its argument with the symbol we need:

 charP :: Char -> Parser Char charP char = predP (\c -> c == char)

The following simplest case: a parser that accepts only a specific string entirely. stringP call it stringP . The function inside the parser compares the input string with the required one and, if the lines are equal, returns a list of one element: a pair of empty lines (nothing left at the input) and the original one. Otherwise, the parsing failed, and an empty list of results is returned.

 stringP :: String -> Parser String stringP s = Parser f where fs' | s == s' = [("", s)] | otherwise = []

Quite often, you need to skip characters that have a certain property while they are at the beginning of a line (for example, whitespace characters). At the same time, the result of the analysis is not important to us and will not be useful in the future. We write the skip function, which skips the initial characters of the string, while the true value of the predicate is preserved. As a result of the analysis we use an empty tuple.

 skip :: (Char -> Bool) -> Parser () skip p = Parser (\s -> [(dropWhile ps, ())])

The following two parsers are very similar to each other. Both check the input string prefix, only the first returns the prefix if successful, and the second returns an empty tuple, i.e. allows you to skip a random string at the beginning of the entry For implementation, the isPrefixOf function, defined in the Data.List module, is Data.List .

 prefixP :: String -> Parser String prefixP s = Parser f where f input = if s `isPrefixOf` input then [(drop (length s) input, s)] else [] skipString :: String -> Parser () skipString s = Parser f where f input = if s `isPrefixOf` input then [(drop (length s) input, ())] else []

A little later, we will look at a simpler implementation of the last function and get rid of code duplication.

Parser as a functor

We can distinguish a whole class of container types for which the following is true: if you know how to transform objects inside a container, then you can convert the containers themselves. The simplest example is the list as a container and the map function, which is available in almost all high-level languages. Indeed, you can go through all the elements of the list of type [a] , apply the function a -> b to each and get a list of type [b] .

This type class is called Functor , the class has one fmap method:

 class Functor f where fmap :: (a -> b) -> fa -> fb

Suppose we already know how to parse strings into objects of some type a , and, moreover, we know how to convert objects of type a into objects of type b . Can we say that then there is a parser for objects of type b ?

If we express this in the form of a function, then it will have the following type:

 (a -> b) -> Parser a -> Parser b

This type coincides with the type of the fmap function, so we will try to make the parser a functor. Create from scratch a parser of values of type b , which will first call the first parser (we already have one), and then apply the function to the results of its parsing.

 instance Functor Parser where fmap :: (a -> b) -> Parser a -> Parser b fmap f (Parser p1) = Parser p2 where p2 :: String -> [(String, b)] p2 s = convert (p1 s) convert :: [(String, a)] -> [(String, b)] convert results = map (\(s, val) -> (s, f val)) results

The fmap function has a convenient infix synonym: fmap fx == f <$> x .

If we use a function as an argument for fmap , which simply replaces its first argument with a new value, we will get another useful operation that has already been implemented for all functors even in duplicate (they differ only in the order of the arguments):

 (<$) :: Functor f => a -> fb -> fa ($>) :: Functor f => fa -> b -> fb

Remember the parser that skips a particular string ( skipString )? Now you can implement it as follows:

 skipString :: String -> Parser () skipString s = () <$ prefixP s

Parser Combinations

In Haskell, all functions are curried by default and allow partial use. This means that the function of n arguments is actually a function of one argument, which returns a function of n-1 arguments:

 cons :: Int -> [Int] -> [Int] cons = (:) cons1 :: [Int] -> [Int] cons1 = cons 1 -- функция cons применена частично

Apply the function of three arguments to some value inside the parser, using fmap . Types will be as follows:

 f :: c -> a -> b p :: Parser c (fmap fp) :: Parser (a -> b)

The parser of function turned out ?! Of course, it is possible that the input line actually contains the function representation, but I would like to be able to use this function, or rather combine the Parser (a -> b) parsers Parser (a -> b) and Parser a , to get the Parser b :

 applyP :: Parser (a -> b) -> Parser a -> Parser b

The type of this function is very similar to the type fmap , only the function itself that needs to be applied is also in the container. This gives an intuitive understanding of how the implementation of the applyP function should look like: get the function from the container (as a result of applying the first parser), get the values to which the function should be applied (the result of applying the second parser) and “pack” the values converted by this function back in the container (create a new parser). In the implementation we will use list comprehension:

 applyP :: Parser (a -> b) -> Parser a -> Parser b applyP (Parser p1) (Parser p2) = Parser f where fs = [ (sx, fx) | (sf, f) <- p1 s, -- p1 применяется к исходной строке (sx, x) <- p2 sf] -- p2 применяется к строке, оставшейся после предыдущего разбора

There is a class Applicative , which has a method with the same prototype. The second class method is called pure and is used to "wrap" or "lift" ( lift ) a value, including a functional one. When implemented for a parser, the pure function adds its argument to the result of the parser, without changing the input string.

 class Functor f => Applicative f where pure :: a -> fa (<*>) :: f (a -> b) -> fa -> fb instance Applicative Parser where pure x = Parser (\s -> [(s, x)]) pf <*> px = Parser (\s -> [ (sx, fx) | (sf, f) <- unParser pf $ s, (sx, x) <- unParser px $ sf])

The applyP function is the <*> from the Applicative class. Types belonging to this class are called applicative functors.

For applicative functors, two auxiliary functions are implemented that will be useful to us:

 (*>) :: fa -> fb -> fb (<*) :: fa -> fb -> fa

These functions perform two consecutive actions and return the result of only one of them. For parsers, they can be used, for example, in order to skip leading gaps before parsing the part of the line that carries the meaning.

By combining <$> and <*> , you can create very convenient designs. Consider the following data type:

 data MyStructType = MyStruct { field1 :: Type1 , field2 :: Type2 , field3 :: Type3 }

The constructor of MyStruct values is also a function, in this case it is of type Type1 -> Type2 -> Type3 -> MyStructType . You can work with the constructor as with any other function. Suppose that parsers are already written for the structure field types:

 parser1 :: Parser Type1 parser2 :: Parser Type2 parser3 :: Parser Type3

Using the fmap function, you can partially apply MyStruct to the first of these parsers:

 parserStruct' :: Parser (Type2 -> Type3 -> MyStructType) parserStruct' = MyStruct <$> parser1

Let's try to continue to use the function, which is now "inside" the parser. For this you need to use <*> :

 parserStruct'' :: Parser (Type3 -> MyStructType) parserStruct'' = parserStruct' <*> parser2 parserStruct :: Parser MyStructType parserStruct = parserStruct'' <*> parser3

As a result, we got a parser for the whole structure (of course, here we use the assumption that in the initial line of the presentation of its fields go in a row). The same can be done in one line:

 parserStruct :: Parser MyStructType parserStruct = MyStruct <$> parser1 <*> parser2 <*> parser3

Such designs will often be found in the example of use.

Now suppose that we are trying to write a parser that parses simple arithmetic expressions in which integers and identifiers can be present as operands. Create a separate type of Operand for them:

 data Operand = IntOp Int | IdentOp String

If we can already parse integers and identifiers (for example, as in C), then we need one parser for operands that can parse one or the other. This parser is an alternative of the other two, so we need a function that can combine parsers so that the results of their work are combined. The result of the parser is a list, and the union of lists is their concatenation. We implement the altP function combining two parsers:

 altP :: Parser a -> Parser a -> Parser a altP (Parser p1) (Parser p2) = Parser (\s -> p1 s ++ p2 s)

Then the operand parser can be implemented using this function (here it is assumed that parserInt and parserIdent already described somewhere:

 parserOperand :: Parser Operand parserOperand = altP parserIntOp parserIdentOp where parserIntOp = IntOp <$> parserInt parserIdentOp = IdentOp <$> parserIdent

Of course, for the alternatives have already come up with a separate class, which is called Alternative . There is another method, empty , which describes the neutral element for the alternative operation. In our case, this is a parser that never parses anything, i.e. always returns an empty list of results. For the parser, the implementation of the Alternative class methods looks like this:

 class Applicative f => Alternative f where empty :: fa (<|>) :: fa -> fa -> fa instance Alternative Parser where empty = Parser (const []) px <|> py = Parser (\s -> unParser px s ++ unParser py s)

The <|> operation is the altP function, only in the infix notation, which is more convenient to use, combining several parsers in a row.

For all types in this class, two functions are implemented, some and many type fa -> f [a] . Each of them can be expressed through the other:

 some v = (:) <$> v <*> many v many v = some v <|> pure []

In terms of parsers, these functions allow you to parse data sequences, if you know how to parse one data element. If used, some sequence must be non-empty.

Usage example

Now we are ready to write our parser, for example, for simple arithmetic expressions with the following grammar:

  expr ::= constExpr | binOpExpr | negExpr const ::= int int ::= digit{digit} digit ::= '0' | ... | '9' binOpExpr ::= '(' expr ' ' binOp ' ' expr ')' binOp ::= '+' | '*' negExpr ::= '-' expr

The expression consists of integer constants, unary minus and two infix binary operations: addition and multiplication. Brackets are required around an expression with a binary operation, the operation symbol is separated from the operands by exactly one space, leading and trailing spaces are not allowed.

Examples of correct expression writing:

 "123" "-(10 + 42)" "(1 + ((2 + 3) * (4 + 5)))"

Examples of incorrect entries:

 " 666 " "2 + 3" "(10 * 10)"

We declare the necessary data types (the expression itself and the binary operation):

 data Expr = ConstExpr Int | BinaryExpr Expr Operator Expr | NegateExpr Expr data Operator = Add | Mul

You can start parsing! The expression itself consists of three alternatives. So we write:

 -- expr ::= constExpr | binOpExpr | negExpr exprParser :: Parser Expr exprParser = constParser <|> binParser <|> negParser

The constant is a positive integer. In our data type, it is "wrapped" in the constructor, so we cannot use the parser for an integer directly, but we can use fmap to get the value of the desired type.

 -- const ::= int constParser :: Parser Expr constParser = ConstExpr <$> intParser

The integer, according to the grammar, is represented as a non-empty sequence of numbers. To parse a single digit, we use the auxiliary function predP and the predicate isDigit from the Data.Char module. Now, to build a parser, we use the function some (not many , because there must be at least one number) to parse the sequence of numbers. The result of this parser returns a list of all possible parsing options, starting with the longest entry. For example, if the input string is "123ab", the list of results will be as follows: [("ab", "123"), ("3ab", "12"), ("23ab", "1")] . We need to parse the longest sequence of numbers and convert it to type Int . The whole implementation is as follows:

 -- int ::= digit{digit} -- digit ::= '0' | ... | '9' intParser :: Parser Int intParser = Parser $ \s -> let res = unParser (some digitParser) s in case res of [] -> [] ((rest, i) : xs) -> [(rest, read i)] where digitParser = predP isDigit

The next variant of the expression is the use of a binary operation. According to the grammar, the input string must first include an opening bracket, the first operand, a space, an operation symbol, another space, the second operand, and a closing bracket. To parse individual characters (parentheses and spaces) use the function charP . Operands are expressions, and there is already a parser for them to parse ( exprParser ). To parse the binary operation symbol, we will describe the auxiliary parser just below. It remains to carefully combine this set of parsers. At the beginning and at the end of the expression there should be brackets: you need to check this, but discard the result itself. For this we use *> and <* :

 binParser :: Parser Expr binParser = charP '(' *> ??? <* charP ')'

Between these parsers for the parentheses, the construction of the expression should be done using the BinaryExpr constructor and parsers for the expression and operation. Let's not forget about the spaces around the operation symbol, using the same method as for parentheses. This part is as follows:

 BinaryExpr <$> exprParser -- первый операнд <*> (charP ' ' *> binOpParser <* charP ' ') -- операция, окружённая пробелами <*> exprParser -- второй операнд

Substitute this expression instead of question marks:

 -- binOpExpr ::= '(' expr ' ' binOp ' ' expr ')' binParser :: Parser Expr binParser = charP '(' *> (BinaryExpr <$> exprParser <*> (charP ' ' *> binOpParser <* charP ' ') <*> exprParser ) <* charP ')'

A binary operation is either the + character, which understands the value of Add , or * , which understands Mul :

 -- binOp ::= '+' | '*' binOpParser :: Parser Operator binOpParser = plusParser <|> multParser where plusParser = charP '+' $> Add multParser = charP '*' $> Mul

Remained the simplest part of the grammar, the negation of expression. With the symbol - do the same as with brackets and spaces. Next, we apply the NegateExpr constructor to the result of recursive parsing:

 -- negExpr ::= '-' expr negParser = charP '-' *> (NegateExpr <$> exprParser)

So, all parts of the parser are implemented. The code in many ways resembles a grammar and completely coincides with it in structure.

The source code is available on GitLab: https://gitlab.com/fierce-katie/applicative-parsers-demo .

It is easier to estimate its volume and degree of expressiveness there, since there are far fewer comments. The project can be assembled with the Stack utility and you can launch a primitive interpreter using the parser we wrote:

 $ stack build $ stack exec demo-parser

For those who want to practice further on their own, I can advise the following:

Grammar can be improved in every way, for example, to allow leading and trailing spaces, add new operations, etc.
The parser translates the string into the internal representation of the expression. This expression can be calculated and converted by the interpreter so that it prints not the result of the parsing but the result of the calculation.
Examine the capabilities of the parsec , attoparsec , applicative-parsec and optparse-applicative libraries, try them out.

Thanks for attention!