Designing a language is hard, and M won't change that
http://weblogs.asp.net/fbouma/archive/2008/11/05/designing-a-language-is-hard-and-m-won-t-change-that.aspx
I was a bit surprised by the scale of the positive hype around 'M' in Oslo. Fortunately, Roger Alsing wrote a 'Back to reality' post about M, which stood out as a lone critical remark against the sea of positivity. I like it when someone chooses to be the more sceptical voice in the crowd: that way an outsider gets a better chance to see the whole picture instead of just one side. This post is written in that light, to give my two cents in line with what Roger said and to give some more background on what I think he meant.
What's the idea of 'M', the codename for the textual language in Oslo? It's supposed to make it easy to write DSLs. A DSL is a 'Domain Specific Language', which in short is a name for a language (in any shape or form!) to express behavior, data structures, data and what have you, for a specific, narrow, well-defined domain. So if you want your application to have a custom set of rules, definable by the user, a DSL might be a good mechanism to formulate those rules in. Make no mistake about what a DSL is though: a grid, a form, or text in a text file can all be representations of a DSL. It's a concept, not a physical toolkit.
Everyone who has ever written a parser for a grammar, defined a grammar for a language or designed a language has probably read or heard about the 'Dragon book', or more officially: Compilers: Principles, Techniques, and Tools, by Aho, Sethi and Ullman. That book explains in detail what's needed to define a language, recognize in a text the elements defined by the language's grammar, parse it, and do something with the parse result. It's not a book for the hobby programmer, and for a reason: computer languages, and thus also DSLs, are complex beasts to handle. The domain of computer languages has been studied a lot in the past decades and it's one of the few fields in Computer Science which can be considered a bit more mature than the rest: today, the aspects of computer languages, grammars, lexers, parsers and friends are well understood and documented in many papers and books.
One of the nice things about a relatively mature field is that the techniques for working with the material are fairly mature as well. This means that if you want to define a grammar, or need a way to understand texts written in that grammar, the solutions are available to you in mature form, with tooling that has proven its value for many years. The question that often arises is of course: with all this tooling and knowledge, is it possible for a less educated / hobby programmer (or a non-programmer) to define a language for a given domain so the domain's complexity is easier to handle? (Because that's the main purpose of DSLs.) The answer is, sadly: no, it's not. As the M toolset is, too, a toolset for producing a parser and doing something with the input, it doesn't make things easier either. I'll explain why below.
The L(AL)R(k) vs. LL(k) difference
If you already wonder what LALR, LR and LL stand for, I've proven my point. I've linked them to the Wikipedia articles, so if you're not familiar with the terms you can read up and follow the rest of the argument. LALR and LR parsers are so-called 'rightmost derivation' parsers, which can be seen as 'bottom up': they use a stack and a state machine to recognize Non-Terminals in the input so that, in the end, the complete input is understood.
When an LR parser understands it has seen a Non-Terminal N, it replaces the elements it matched (Terminals and/or Non-Terminals) with the Non-Terminal it just recognized. This is called a 'Reduce', and it's typically the event used to trigger handler code outside the parser, e.g. to build up the Abstract Syntax Tree (AST). This process continues until the whole input stream has been reduced to a single Non-Terminal, the start symbol. As the Wikipedia article describes, writing a parser for LR(k) grammars by hand is hard, as you have to produce tables of data for the parser engine, which is a state machine. A generator can help there. Good examples of such generators are the Gold Parser and YACC. The Gold Parser comes with a visual toolkit which allows you to test and debug your grammars.
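To illustrate the shift/reduce mechanics in a very simplified form, here is a small sketch in Python. It's a hand-rolled operator-precedence parser, not a table-driven LR engine, and the token handling is deliberately naive, but the bottom-up idea is the same: each 'reduce' pops a handle off the stack and pushes the recognized Non-Terminal back as an AST node, and that reduce is exactly the point where handler code could hook in.

```python
# A minimal sketch of the shift/reduce idea for expressions like "a+b*c".
# '*' binds tighter than '+'. On each reduce we pop the handle off the
# stacks and push an AST node (a plain tuple) for the recognized non-terminal.

PREC = {'+': 1, '*': 2}

def parse(tokens):
    values = []  # operand / AST-node stack
    ops = []     # operator stack

    def reduce():
        # 'reduce' event: replace  E op E  on the stacks with a single E node
        op = ops.pop()
        right = values.pop()
        left = values.pop()
        values.append((op, left, right))
        print("reduce:", values[-1])

    for tok in tokens:
        if tok.isalnum():
            values.append(tok)                      # shift a terminal
        else:
            # reduce while the operator on the stack binds at least as tightly
            while ops and PREC[ops[-1]] >= PREC[tok]:
                reduce()
            ops.append(tok)                         # shift the operator
    while ops:
        reduce()
    return values[0]

print(parse(list("a+b*c")))   # -> ('+', 'a', ('*', 'b', 'c'))
```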
The main problem with LALR/LR languages is that grammars can lead to conflicts, in the form of reduce/reduce conflicts or shift/reduce conflicts. These conflicts all arise inside the parser's state machine, where it is in state S and the path taken to get there, together with the current input, decides which state will follow. Reduce/reduce conflicts occur when the parser sees an input element E and the stack holds a group of elements which together with E form a match for two (or even more!) Non-Terminals: it can't decide, based on E and the stack contents, which Non-Terminal should replace the set of elements on the stack plus E. A shift/reduce conflict arises when the parser sees an element E and has to decide whether to replace the set of elements on the stack plus E with Non-Terminal N1, or to push E onto the stack (shift) to try to form the elements required for Non-Terminal N2.
These conflicts have to be resolved to end up with a proper language that doesn't contain a bug in itself. It's sometimes sufficient to pick one action over the other (e.g. shift instead of reduce in a shift/reduce conflict), but in general that's not advised. Add to that that as soon as a language is ambiguous (like C# and VB.NET are, for example), it gets harder to solve these conflicts: the parser has to 'look ahead' to see what's coming in order to decide what to do. LALR parsers therefore suffer less from these conflicts, yet it's something one has to deal with in a lot of cases. And to be able to solve these conflicts, one has to understand how the parsing process works.
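The textbook example of a shift/reduce conflict is the 'dangling else'. The toy illustration below (just Python tuples standing in for AST nodes, nothing tool-specific) shows why the parser can't decide: the input has two valid parses, and the usual resolution of preferring the shift makes the 'else' bind to the nearest 'if'.

```python
# The classic shift/reduce conflict: the "dangling else". With the grammar
#   Stmt -> 'if' Expr 'then' Stmt
#         | 'if' Expr 'then' Stmt 'else' Stmt
# the input below has two valid parses; after reading the inner then-branch
# the parser cannot tell whether to reduce the inner 'if' or shift the 'else'.

source = "if a then if b then s1 else s2"

# prefer shift (the usual resolution): 'else' binds to the nearest 'if'
shift_parse  = ('if', 'a', ('if', 'b', 's1', 's2'))

# prefer reduce: 'else' binds to the outer 'if'
reduce_parse = ('if', 'a', ('if', 'b', 's1'), 's2')

print(shift_parse != reduce_parse)   # True: two different trees, same input
```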
There's another type of parser, the LL(k) parser. An LL(k) parser is a leftmost-derivation parser, which means it's a top-down parser. It doesn't use shift/reduce tables and an explicit stack; instead, it simply looks at the input and calls the handler for the Non-Terminal that the input symbol starts, leaving the handling to that handler. So instead of waiting until the parser has seen the complete 'If' expression before reducing it to the 'If' Non-Terminal and calling that Non-Terminal's handler, it calls the 'If' Non-Terminal handler as soon as it sees the Terminal 'if' and expects that handler to take care of the rest. LL(k) parsers can be seen as state machines similar to LR/LALR parsers, yet the state machine is less visible: it's inside the handler code.
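To make that concrete, here is a minimal sketch of such a top-down parser in Python. The token stream and the tiny 'if ... then ...' construct are invented for illustration: each Non-Terminal gets its own handler, and seeing the Terminal 'if' immediately dispatches to the handler that builds the If node.

```python
# A minimal recursive-descent (LL-style) parser sketch over a list of
# string tokens. Each non-terminal has its own handler method; the AST is
# built inside those handlers as plain tuples.

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def next(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def parse_statement(self):
        if self.peek() == 'if':
            return self.parse_if()        # dispatch on the first terminal
        return ('expr', self.next())      # fallback: a bare expression

    def parse_if(self):
        self.next()                       # consume 'if'
        cond = self.next()                # condition (simplified to one token)
        self.next()                       # consume 'then'
        body = self.parse_statement()
        return ('if', cond, body)         # AST node built inside the handler

print(Parser(['if', 'x', 'then', 'y']).parse_statement())
# -> ('if', 'x', ('expr', 'y'))
```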
As the handling of a Non-Terminal is done in code you have full control over, and you're there along the way, it's much easier to handle errors: if you expect a ')' and it's not there, you can report the error, act as if you've seen that bracket and move on, to handle more errors (or even to ignore the error, if the language has to be very flexible). An LR/LALR parser can't do that easily; it simply has to give up: it sees an element E which, combined with the stack contents, can't be used to either shift or reduce. This makes it easier for LL(k) parsers to handle ambiguous languages and to keep parsing when errors occur. Of course generators for LL(k) are available as well; a good example is the widely known ANTLR, which, despite what its name might suggest, is an LL(k) toolkit.
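That recovery pattern is easy to bolt onto the sketch above. The helper below is again just an illustration, reusing the hypothetical Parser class from the previous snippet: it reports the missing token and then pretends it was there, so parsing can continue and later errors still surface.

```python
# Error recovery in a recursive-descent handler: if the expected token is
# missing, report it and act as if it had been seen, then carry on.
def expect(parser, expected):
    if parser.peek() == expected:
        return parser.next()
    print(f"error: expected {expected!r}, got {parser.peek()!r}")
    return expected   # pretend the token was there and move on

# e.g. inside a handler for a parenthesized expression:
#     expect(self, '(') ; expr = self.parse_expression() ; expect(self, ')')
```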
One core difference between LL(k) and LALR/LR is that the code to build the AST sits inside the grammar for LL(k) and outside the grammar for LALR/LR(k). This is logical, because LL(k) is handled with custom handlers for the grammar's Non-Terminals, so the code which handles Non-Terminal N is custom made for N and contains the code to build the AST node for N along the way (using reusable classes/methods of course, but the structure of the handler is unique to that Non-Terminal). To be able to do that, the custom code is specified inside the grammar, which is fed to the generator, which in turn produces a proper parser for the grammar, including the custom code. For LALR/LR grammars the code can live outside the grammar because the generator doesn't generate code, it generates data: the tables that tell the state machine what to do when a given input is seen.
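A rough way to picture that split (the production names and callback shape below are invented for illustration, not any particular generator's API): with an LALR/LR tool the grammar stays pure data, and the host program supplies one reduce callback per production, which the generated state machine invokes on every reduce.

```python
# "Fill in the blanks": one AST-building callback per production, living
# entirely outside the grammar. The parser engine would look up the
# production it just reduced and call the matching handler with the
# children it popped off the stack.
reduce_handlers = {
    "Expr -> Expr '+' Term":    lambda kids: ('+', kids[0], kids[2]),
    "Term -> Term '*' Factor":  lambda kids: ('*', kids[0], kids[2]),
    "Factor -> id":             lambda kids: kids[0],
}

# Conceptually, the engine does:  node = reduce_handlers[production](children)
# whereas in the LL(k) sketch earlier the equivalent code sits inside the
# hand-written handler for each non-terminal.
```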
My experience with both LALR/LR and LL is that with LL it's easier to write a grammar but harder to produce the custom code for the final parser: it's very hard to debug code that lives inside a grammar, which is used to generate a parser, whose task is to build an AST from the input, which in turn is used to do something with that input. For LALR/LR it's harder to get the grammar right and harder to get good error recovery, but writing the handling code to build the AST is relatively easy: you just fill in the blanks for every Non-Terminal in your grammar. I know that in one of the Oslo presentations it was shown that the language could be debugged. The ANTLR tools can do that too (for Java), but does what the M tools can do make things any less complex than the ANTLR tools? One still has to know what's on the screen, what everything means, and what's handled by what at which moment.
So when I look at M's elements, I can only conclude that it has the same characteristics as the well-known tooling and building blocks for parsers, grammars and the like available to everyone today: it's complex. With 'complex' I mean: you need proper knowledge about computer languages, parsers and grammars to understand what's going on and how to use the available tools properly. There's nothing wrong with that: if something is hard, well... that's life; deal with it by understanding the concepts which make a hard problem manageable. But to me it's impossible that all the crucial aspects of writing computer languages are suddenly unimportant with M's elements on the table. Sure, people will write DSLs with M's elements, but today you can do that too with ANTLR or the Gold Parser, use a visual toolkit to design, test and debug the language, and you're likely up and running in a day or two. That doesn't mean a person who has no clue what LL(k) means or what an AST is can create a language for a given domain with ANTLR: some knowledge about parsers, grammars etc. is required. And with M's elements, this isn't any different at all.
That's not bad, not at all. It is, however, a different reality from the picture Microsoft tried to paint with the Oslo material provided through the PDC, and therefore I fully agree with what Roger said in his blog post.
The core issue is that to be able to write a successful DSL, you have to be able to create a grammar, within the restrictions of the system in which you have to specify that grammar. With the restrictions, rules and aspects of the various parsing techniques as the guideline for what you can do inside the grammar, it's still very hard to define a grammar for a DSL: not only does the DSL have to be expressive and compact enough, it also has to work on all valid input and produce a proper result in the expected form. If you want to experience how complex this is, try to write a very simple business-rule grammar, in EBNF, which for example allows you to specify validation rules for objects. The difficulty is, for example, in 'all valid input'. What's valid input for your language? How do you determine that? You do that by defining rules, which will eventually become your grammar. An initial start is probably easy, but it gets out of hand pretty quickly once you want to add another couple of Non-Terminals, or want to re-use a Terminal inside multiple Non-Terminals, which might all of a sudden make the language ambiguous and thus lead either to conflicts or to more complex code in your handlers.
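To give an idea of what such an exercise starts out as, here is a toy version of such a grammar, held in a Python string. The rule names and productions are invented for illustration, not taken from any real product. Even at this size the questions from the paragraph above appear: what exactly is a valid Field, and does combining 'and' and 'or' without precedence rules already make the language ambiguous?

```python
# A toy EBNF for object-validation rules, kept as data for illustration.
# Every extension (nested conditions, functions, cross-object references)
# forces you to revisit these productions and check for new ambiguities.
VALIDATION_RULES_EBNF = """
RuleSet    = { Rule } ;
Rule       = "rule" Identifier ":" Condition ";" ;
Condition  = Comparison { ( "and" | "or" ) Comparison } ;
Comparison = Field Operator Value ;
Operator   = "==" | "!=" | "<" | ">" | "<=" | ">=" ;
Field      = Identifier { "." Identifier } ;
Value      = Number | String | Field ;
"""
```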
I appreciate that Microsoft is making an effort in this direction, and the future will tell what the M toolbox will become. I truly hope it can meet the expectations of the many people around it, though I seriously doubt it will become a set of tools used by what looks like one of the big parts of the target audience: the group of people who lack knowledge of grammars, parsers, languages and the like, because to me it doesn't lower the bar of complexity one bit. While that's not something to worry about in general, it is an aspect we have to be careful about: there's already enough complexity in our field. We need elements that make things more understandable, that make complexity more manageable, not ones that muddy the waters as if it's suddenly as easy as tic-tac-toe to write a grammar, a language, a parser and AST-interpreter code, and as if every non-programmer can create DSLs.