Search This Blog

Monday, July 4, 2016

A Little 25 year old Functional Parser Gem

1 Intro

When I was playing around inside Mark Jones' gofer sources in the early 90s I saw this piece of commented C code. The code was impenetrable… at first. But the comment was elegant and beautiful and cleared up the code nicely.. on careful reading.

Why is it needed?

Well gofer is parsed with a conventional yacc parser.
Problem is gofer (like Haskell) allows for runtime declared binary operators with
3 fixities infixr, infixl and infix
So

infixr 8 ^
infixl 7 *

means ^ is at precedence level 8 and right-associates
whereas * is at level 7 and left associates
Plain infix means non-associative; ie its a syntax error to use the operator associatively.

So the yacc parser collects all operators with one single precedence and associativity treated as left and then regroups the tree based on the fixity declarations.

Mark Jones' original gofer Gofer with pugofer modifications
I give this here as a beautiful example of how functional programming can clear up arcane imperative programming

2 C Code

static Cell local TidyInfix(e)       /* convert InfixExpr to Expr        */
Cell e; {                            /* :: InfixExpr                     */
    Cell   s   = NIL;                /* :: TidyStack                     */
    Syntax sye = APPLIC;             /* Syntax of op in e (init unknown) */
    Syntax sys = APPLIC;             /* Syntax of op in s (init unknown) */
    Cell   temp;

    while (nonNull(tl(e))) {
        if (isNull(s)) {
            s           = e;
            e           = arg(fun(s));
            arg(fun(s)) = NIL;
            sys         = sye;
            sye         = APPLIC;
        }
        else {
            if (sye==APPLIC) {         /* calculate sye (if unknown)       */
                sye = syntaxOf(textOf(fun(fun(e))));
                if (sye==APPLIC) sye=DEF_OPSYNTAX;
            }
            if (sys==APPLIC) {         /* calculate sys (if unknown)       */
                sys = syntaxOf(textOf(fun(fun(s))));
                if (sys==APPLIC) sys=DEF_OPSYNTAX;
            }

            if (precOf(sye)==precOf(sys) &&                      /* amb    */
                   (assocOf(sye)!=assocOf(sys) || assocOf(sye)==NON_ASS)) {
                ERROR(row) "Ambiguous use of operator \"%s\" with \"%s\"",
                           textToStr(textOf(fun(fun(e)))),
                           textToStr(textOf(fun(fun(s))))
                EEND;
            }
            else if (precOf(sye)>precOf(sys) ||                  /* shift  */
                       (precOf(sye)==precOf(sys) && assocOf(sye)==LEFT_ASS)) {
                temp        = arg(fun(e));
                arg(fun(e)) = s;
                s           = e;
                e           = temp;
                sys         = sye;
                sye         = APPLIC;
            }
            else {                                               /* reduce */
                temp        = arg(fun(s));
                arg(fun(s)) = arg(e);
                arg(e)      = s;
                s           = temp;
                sys         = APPLIC;
                /* sye unchanged */
            }
        }
    }

    e = hd(e);
    while (nonNull(s)) {
        temp        = arg(fun(s));
        arg(fun(s)) = e;
        e           = s;
        s           = temp;
    }

    return e;
}

3 Explanation

/* expressions involving a sequence of two or more infix operator symbols
 * are parsed as elements of type:
 *    InfixExpr ::= [Expr]
 *               |  ap(ap(Operator,InfixExpr),Expr)
 *
 * thus x0 +1 x1 ... +n xn is parsed as: +n (....(+1 [x0] x1)....) xn
 *
 * Once the expression has been completely parsed, this parsed form is
 * `tidied' according to the precedences and associativities declared for
 * each operator symbol.
 *
 * The tidy process uses a `stack' of type:
 *    TidyStack ::= ap(ap(Operator,TidyStack),Expr)
 *               |  NIL
 * when the ith layer of an InfixExpr has been transferred to the stack, the
 * stack is of the form: +i (....(+n NIL xn)....) xi
 *
 * The tidy function is based on a simple shift-reduce parser:
 *
 *  tidy                :: InfixExpr -> TidyStack -> Expr
 *  tidy [m]   ss        = foldl (\x f-> f x) m ss
 *  tidy (m*n) []        = tidy m [(*n)]
 *  tidy (m*n) ((+o):ss)
 *             | amb     = error "Ambiguous"
 *             | shift   = tidy m ((*n):(+o):ss)
 *             | reduce  = tidy (m*(n+o)) ss
 *                         where sye     = syntaxOf (*)
 *                               (ae,pe) = sye
 *                               sys     = syntaxOf (+)
 *                               (as,ps) = sys
 *                               amb     = pe==ps && (ae/=as || ae==NON_ASS)
 *                               shift   = pe>ps || (ps==pe && ae==LEFT_ASS)
 *                               reduce  = otherwise
 *
 * N.B. the conditions amb, shift, reduce are NOT mutually exclusive and
 * must be tested in that order.
 *
 * As a concession to efficiency, we lower the number of calls to syntaxOf
 * by keeping track of the values of sye, sys throughout the process.  The
 * value APPLIC is used to indicate that the syntax value is unknown.
 */

4 Yacc usage

And this is how the yacc usage of tidyInfix occurs

exp   : opExp COCO sigType           {$$ = ap(ESIGN,pair($1,$3));}
      | opExp                        {$$ = $1;}
      | error                        {syntaxError("expression");}
      ; 
opExp  : pfxExp                      {$$ = $1;}
       | pfxExp op pfxExp            {$$ = opapd($1, $2, $3);}
       | opExp0                      {$$ = tidyInfix($1);}
       ;
opExp0  : opExp0 op pfxExp           {$$ = opap($1, $2, $3);}
        | pfxExp op pfxExp op pfxExp {$$ = 
                   opap(opap(singleton($1), $2, $3),$4,$5);}
etc

5 Haskell (Modified gofer)

On grappling with it I felt it would be well to goferize (haskellize) it.
Here is the literate source I wrote as an executable — not exactly same version but and analogous intention to mpj's original.

1. Preliminaries
----------------
This script demonstrates how to write a simple shift-reduce parser.
In the interests of simplicity, we handle expressions with single-digit
operands and operators from the set {^, *, +, <, &}

The parser will build trees of the following type:

> ctype Exp where
>     S : Int -> Exp
>     Pow, Mul, Add, Less, And : Exp -> Exp -> Exp

The precedences as we may expect are as follows:

> opPrecAl = [('^', 5), ('*', 4), ('+', 3), ('<', 2), ('&', 1)]

We must also handle associativities of operators. Operators may be left,
right or non-associative. The following enumerated type captures this:

> ctype Associativity where
>     L, R, N : Associativity

The operators have the following associativities:

> opAssocAl = [('^', R), ('*', L), ('+', L), ('<', N), ('&', L)]

Given a character (that better be an operator), the following functions
get its precedence and associativity:

> precedence : Char -> Int
> associativity : Char -> Associativity
> precedence = assoc.opPrecAl
> associativity = assoc.opAssocAl

2. A rough and simple distinction between operator and operand
--------------------------------------------------------------

> isOperator.c = c `elem` "^*+<&"
> isOperand.c  = not.(isOperator.c)

3. Shift Reduce Parsing Basics
------------------------------
4 basic actions may be distinguished 
- shifting an operator into the stack,
- reducing the top two elements of the output by the top of the stack,
- shifting an operand onto the output (we'll call this SOp - Shift Operand)
- error - an ill-formed expression was detected
Since this is central to our enterprise we capture it into a type

> ctype Action where
>     Shift, Reduce, Error, Sop : Action


4. Relation between Parse-Actions and Associativities
-----------------------------------------------------
In order to see the close connection between parsing actions and
associativities for the moment, let us just consider expressions involving
only 1 kind of operator eg. "1^2^3" or "1+2+3" or "1<2<3"

And consider further the situation when 1 and 2 have been output and
one operator is on the stack and the next one is seen. ie we have

 input  = "^3"
 stk    = ['^']
 output = [S.1, S.2]

because the '^' is right-associative, the '^' in the input takes precedence
the '^' in the stack and therefore it gets pushed. ie we must shift

However if we look at "1+2+3" in a similar configuration, we see that a
reduction is called for.

Finally, when the second '<' in "1<2<3" is seen it is a syntax-error.

We can therefore capture our discoveries in the association-list:

> assocActionAl = [(R, Shift), (L, Reduce), (N, Error)]

5. Comparing Operators
----------------------

Now let io be the input operator, so be the stack-top operator, and
continuing with our temporary assumption that the expressions involve
only one kind of operator, we see that precedences of io and so must be
same. Even the associativities must be the same.  The precedences in fact
don't count. It is the associativities alone that (because of our temporary
assumption) determine the parsing actions.  We capture this information in

> sameCmp : Char -> Char -> Action
> sameCmp.io.so = assoc.assocActionAl.ai
>        where ai = associativity.io   -- same as associativity.so

Now we are in a position to generalise to expressions with an arbitrary mix
of operators. cmp takes an input operator, and the stack-top operator and
returns the appropriate action.

> cmp : Char -> Char -> Action
> cmp.io.so
>    | pi >  ps              = Shift
>    | pi <  ps              = Reduce
>    | pi == ps && ai /= as  = Error
>    | otherwise             = sameCmp.io.so
>       where pi = precedence.io
>             ps = precedence.so
>             ai = associativity.io
>             as = associativity.so

6. State Transformation
-----------------------

We now come to shift-reduce parsing proper.
We observe that the computation proceeds in the context of a state
consisting of three components: an input, a stack, and an output.

> type ParseState = ([Char], [Char], [Exp])

The four parser actions transform one such state into another

> type STrans = ParseState -> ParseState  -- state-transformer

The four actions are

> acSop, acShift, acReduce, acError : STrans

When an operand like '3' is output it must be made into the integer 3 and
enclosed by the leaf-constructor for expression to make S.3 .
Note: op better be an operand

> acSop.(op::input, stk, out) = (input, stk, f.op :: out)
>        where f.x  = S.(ord.x - ord.'0')

Shift is simple, it moves the top of the input into the stack.
Note: op better be an operator

> acShift.(op::input, stk, out) = (input, op::stk, out)

Before we can write reduce we must observe that a '+' in the input becomes
a Add in the output and so on.

> opCnstrAl = [('^', Pow), ('*', Mul), ('+', Add), ('<', Less), ('&', And)]


which we use in the reduce action as follows:

> acReduce.(input, op::stk, r::l::out) = (input, stk, rOpl :: out)
>       where rOpl  = (assoc.opCnstrAl.op).l.r

> acError.s@(input, stk, output) =  error.("acError " ++ (show.s))

7. Finally the parser!
----------------------

> parse : String -> Exp
> parse.input = p.initState
>    where initState = (input, [], [])

> p.s = case s of
>   ([], [], [result])      -> result
>   ([], optor::stk, out)   -> p.(acReduce.s)
>   (i::input, [], out)
>      | isOperand.i        -> p.(acSop.s)
>      | otherwise          -> p.(acShift.s)
>   (i::input, o::stk, out) 
>      | isOperand.i        -> p.(acSop.s)
>      | otherwise          -> case cmp.i.o of
>                                     Shift  -> p.(acShift.s)
>                                     Reduce -> p.(acReduce.s)
>                                     Error  -> p.(acError.s)
>                                     Sop    -> error."Should not happen"

8. Some Other Needed Stuff
--------------------------

> assoc.[].x = error.("assoc: Not found " ++ show.x)
> assoc.((x,y)::al).x'
>   | x == x'   = y
>   | otherwise = assoc.al.x'

> instance Text.Associativity
> instance Eq.Associativity where
>    x == y = gord.x == gord.y
> instance Text.Exp
> instance (Text.a, Text.b, Text.c) => Text.(a,b,c)
> primitive gord "primCharToInt" : a -> Int

ctype Exp where
     S : Int -> Exp
     Pow, Mul, Add, Less, And : Exp -> Exp -> Exp

> evaluate.(S.x) = x
> evaluate.(Add.l.r) = evaluate.l + evaluate.r
> evaluate.(Mul.l.r) = evaluate.l * evaluate.r

2 comments:

  1. Missing definitions starting with ctype mean the Haskell won't compile.

    ReplyDelete
  2. Thats gofer with modifications: http://blog.languager.org/2014/09/pugofer.html
    Basically ctype is data with GADT syntax – this was done 10 years before GADTs ie 25 years ago.
    Also another pervasive changes that will be needed for haskell:
    Interchange ':' ←→ '::'
    Replace 'f.x' with 'f x'

    If you wish I could haskellize it but need a few days.. travelling now

    ReplyDelete