The MAR9000's eye

Post #1

Marco LOMBARDO
2014-02-15

If you are a Java developer interested in Domain Specific Languages (DSLs) and code generation, sooner or later you are going to play a bit with ANTLR. If you are that kind of person, you probably also know Martin Fowler's bliki. Now something personal: in general I dislike working with graphical tools when I can do the same thing with code or the command line (who knows, maybe one of my next posts will explain why). I also dislike storing in a database things that are much more comfortable in the file system. All these reasons drove me to implement my own bliki.

I have also given WordPress a chance (indeed, a Spanish blog that I translate into Italian is maintained with WordPress), but let's talk about the static part of this site.

The first (bad) grammar

Because I wanted to experiment with ANTLR, I decided to write a small language to define a blog post. Once you have such a language and can parse your posts, you can use the data to:

create a main page with only the latest posts.
create a page to show only posts tagged with a specific label.
create your RSS feed.
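
Once the posts are parsed, these three uses come down to sorting and filtering. What follows is only a minimal sketch: the Post and PostIndex classes are hypothetical names invented for illustration (they are not part of the code of this site), and the fields simply mirror the post format described in this article.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical value class: the fields mirror the post format, nothing more.
final class Post {
    final String title;
    final String url;
    final String date;        // yyyy-MM-dd, so string order equals date order
    final List<String> tags;

    Post(String title, String url, String date, String... tags) {
        this.title = title;
        this.url = url;
        this.date = date;
        this.tags = Arrays.asList(tags);
    }
}

// Hypothetical helper: selects the posts needed by each generated page.
final class PostIndex {
    // Newest posts first, at most 'limit' of them (main page, RSS feed).
    static List<Post> latest(List<Post> posts, int limit) {
        return posts.stream()
                .sorted(Comparator.comparing((Post p) -> p.date).reversed())
                .limit(limit)
                .collect(Collectors.toList());
    }

    // Only the posts carrying the given tag (per-tag page).
    static List<Post> withTag(List<Post> posts, String tag) {
        return posts.stream()
                .filter(p -> p.tags.contains(tag))
                .collect(Collectors.toList());
    }
}
```

The main page and the RSS feed would both start from latest(posts, n), while each tag page would start from withTag(posts, "antlr") and similar calls.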

A post will be something like:

title: 
url: 
date: 
tags: antlr,java, ..
content: the HTML part of the post

If you are new to ANTLR, your first grammar will probably look like this:

post: title url date tags content;

title: 'title:' LINE;
url: 'url:' LINE;
date: 'date:' DIGIT DIGIT DIGIT DIGIT '-' DIGIT DIGIT '-' DIGIT DIGIT NL;
tags: 'tags:' WORDS? (',' WORDS)* '\n';
content: 'content:' .*;

DIGIT: [0-9];
WORDS: ([a-zA-Z0-9] | ' ')+;
LINE: ~[\r\n]* NL;
NL: '\r'? '\n';

If you try this grammar with:

$ antlr4 BadBlog.g4
$ javac *.java
$ grun BadBlog post -tokens test1.post

You will receive errors such as line 1:0 missing 'title:' at 'title: something\n'. Why does ANTLR say that title: is missing when it is actually in the file?

The first good grammar

The explanation is stated on page 15 of The Definitive ANTLR 4 Reference:

Note that lexers try to match the longest string possible

Our lexer consumes title: while matching the LINE rule, as the output of the preceding grun command shows:

[@0,0:18='title: something\n',<11>,1:0]

Token type 11 is LINE.
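
To see the longest-match rule at work without ANTLR, here is a small hand-written simulation. It is a sketch under stated assumptions: LongestMatchDemo is a hypothetical class, the three regular expressions are simplified stand-ins for the 'title:' literal and the WORDS and LINE rules of the first grammar, and this is not how ANTLR is implemented internally. Every rule is tried at the start of the input and the one that consumes the most characters wins; on a tie, the rule declared first is kept.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of ANTLR or of this site's code.
final class LongestMatchDemo {
    // Simplified stand-ins for three token rules, kept in declaration order.
    static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("'title:'", Pattern.compile("title:"));
        RULES.put("WORDS", Pattern.compile("[a-zA-Z0-9 ]+"));
        RULES.put("LINE", Pattern.compile("[^\\r\\n]*\\r?\\n"));
    }

    // Returns the name of the rule matching the longest prefix of the input,
    // or null if no rule matches. The strict '>' keeps the earlier rule on a tie.
    static String winner(String input) {
        String best = null;
        int bestLength = 0;
        for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
            Matcher m = rule.getValue().matcher(input);
            if (m.lookingAt() && m.end() > bestLength) {
                best = rule.getKey();
                bestLength = m.end();
            }
        }
        return best;
    }
}
```

Calling LongestMatchDemo.winner("title: something\n") returns "LINE": the literal matches 6 characters, while LINE matches the whole line, so the parser never gets to see a 'title:' token.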

The solution is to implement everything at the lexer level (I introduce a "..." line to end the content rule):

post: TITLE URL DATE TAGS CONTENT;

TITLE: 'title:' .*? NL;
URL: 'url:' .*? NL;
DATE: 'date:' .*? NL;
TAGS: 'tags:' .*? NL;
CONTENT: 'content:' .*? NL '...' NL;
NL : '\r'? '\n';

If you test this, you will see that the grammar successfully parses the file, at the price of also getting the starting and ending strings when you access the parse tree: for example, TITLE().getText() will also contain title:.
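
With this grammar the cleanup has to happen in Java, after parsing. A minimal sketch of such a cleanup follows; TokenText is a hypothetical helper (not taken from the code of this site) that works on plain strings, such as the ones returned by getText(), so it does not depend on the ANTLR runtime.

```java
// Hypothetical helper, for illustration only.
final class TokenText {
    // Drops the leading tag (e.g. "title:") and the surrounding whitespace,
    // trailing newline included, from the text of a token.
    static String value(String tokenText, String tag) {
        if (!tokenText.startsWith(tag)) {
            throw new IllegalArgumentException("expected text starting with " + tag);
        }
        return tokenText.substring(tag.length()).trim();
    }

    // CONTENT tokens also carry the "..." terminator line, which is dropped too.
    static String content(String tokenText) {
        String body = value(tokenText, "content:");
        if (body.endsWith("...")) {
            body = body.substring(0, body.length() - 3);
        }
        return body.trim();
    }
}
```

For example, TokenText.value("title: something\n", "title:") returns "something".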

The island grammar

With our grammar we basically want to parse:

a tag, like title:, associated with one line.
a tag, like content:, associated with more lines.
a list of words, as in tags:.

This is our meta-model, which the grammar we are going to write expresses formally.

The lexer respects rule precedence, but the problem here is that the LINE rule has no start condition: once it starts matching, it will always match more characters than, for instance, WORDS. The solution is lexer modes, but to use them you have to split your grammar into a lexer grammar and a parser grammar; see BlogLexer.g4 and BlogParser.g4. You need one sequence that enters a mode and another that switches back to the default mode. Each mode has its own lexer rules: for instance, after title: we match characters up to a newline, while after content: a newline on its own is nothing special and we match a longer sequence, as you can see by reading the grammar.
The only thing worth remarking is how we match a long sequence of characters: the lexer emits them through the CH rule, and the parser joins them together into a chars object.
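
The idea behind modes can be shown with a toy, hand-written lexer. This is a sketch of the concept only, not the content of BlogLexer.g4 and not ANTLR's API: ModeLexerSketch, its Mode enum and the TAG/TEXT token names are all hypothetical, and the input is assumed to be well formed. A tag switches the lexer out of the default mode, and the end of the value (a newline, or the "..." line for content) switches it back.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy lexer illustrating mode switching.
final class ModeLexerSketch {
    enum Mode { DEFAULT, LINE, CONTENT }

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Mode mode = Mode.DEFAULT;
        int pos = 0;
        while (pos < input.length()) {
            if (mode == Mode.DEFAULT) {
                // Default mode: skip whitespace, then read a tag up to ':'.
                if (Character.isWhitespace(input.charAt(pos))) {
                    pos++;
                    continue;
                }
                int colon = input.indexOf(':', pos);
                if (colon < 0) {
                    throw new IllegalArgumentException("tag expected at offset " + pos);
                }
                String tag = input.substring(pos, colon + 1);
                tokens.add("TAG(" + tag + ")");
                // The tag decides which mode is entered.
                mode = tag.equals("content:") ? Mode.CONTENT : Mode.LINE;
                pos = colon + 1;
            } else {
                // LINE stops at the first newline; CONTENT only at a "..." line.
                String stop = (mode == Mode.LINE) ? "\n" : "\n...";
                int end = input.indexOf(stop, pos);
                if (end < 0) {
                    end = input.length();
                }
                tokens.add("TEXT(" + input.substring(pos, end).trim() + ")");
                pos = Math.min(input.length(), end + stop.length());
                mode = Mode.DEFAULT;   // switch back to the default mode
            }
        }
        return tokens;
    }
}
```

Because ':' only has a special meaning in the default mode, a value such as url: http://... is read as a single piece of text, and a newline inside the content does not end it.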

I created an Eclipse project for this blog, so you can play with my grammar:

use compile-lexer.launch to compile the lexer, then
use compile-parser.launch to compile the parser, then
refresh the Eclipse project, then
create the HTML from the templates using update-web-gen.launch.

There is also a grun.launch that gives, from within Eclipse, the same output as the grun command; but while developing a new grammar, at least a small one, it is easier to work from the command line.
The rest of the code is, at the moment, simple code that parses post files and outputs HTML files using StringTemplate.

Conclusion

When you develop a grammar you usually look at the final result produced by the parser; however, you must not forget that the parser only receives what the lexer prepares.
