Fun with ANTLR -- The AREV translation project

Certainly the most fun I've had with antlr has been the AREV to VB translator. Here's a writeup I did shortly after the project was over. It was written in '98.

Deep in Southern Oregon, living in a monstrous web of code written in an obsolete 4GL on an antiquated OS, lie the brains of the country's 30th largest mail order company. Every year for the past 8 years, including the year it was installed, hoards of programmers spend January through August copying, pasting and tweaking code to add some new features to the system. From September through December they're on call to keep the system up and running, and called frequently. This company ships 90% of the year's orders in the six weeks before Christmas. And despite this anchor of an information system they continue to grow, expanding their sales volume and maintenance headaches every year.

This year they realized they just couldn't handle the volume with their networking system; they absolutely had to switch over to TCP/IP and a modern OS. Unfortunately the user interface code, which had a good portion of the business logic in it, was written in an old, unsupported version of Advanced Revelation R/BASIC (AREV) and wasn't portable to this modern OS. They decided to try and port it to Visual Basic.

So I got a call from the project manager. "Hey, can you help me with some Perl code?" he asked.

Immediately I responded "You really want to use Python, don't you?"

"Well, I wanted to get started on a tool to help us port this AREV code to VB."

"In that case you really want to use ANTLR. And I just happen to have returned from the ANTLR conference only a month ago...."

So I got the job and had the task of parsing and translating the AREV code to VB. Oh the fun I had! Starting from a truly ambiguous language I wrote a parser, three analysis passes, four transformation passes, and the final pass to output the resulting code. I preserved comments along the way and formatted it nicely too. I worked with an excellent team of AREV and VB experts who figured out how to translate things by hand. From those examples I wrote the ANTLR code to do it. It took me six months to write this translator which processed 240,000 lines of code. Overall the translator is actually a small part of the whole project. Building the libraries for the translated code to use and debugging have taken a team of six programmers over nine months to get it almost completely working.

This was my first major ANTLR project. I love the fact that I can examine and understand the generated code to figure out exactly what it was doing and why. I leaned heavily on the new grammar inheritance feature, writing a tree grammar that did nothing and then subclassing it to specialize for the particular pass I was doing. I had a fixed set of code to transform, so I could find all the parsing errors that would ever happen. The whole project was a stop gap measure to buy time for a real rewrite from scratch, so this code was basically throwaway code, excellent for a learning project. I can't compare ANTLR to any other toolkits since I haven't used any others, but I was very happy with ANTLR's features and was always able to do what I needed to do to get the job done.

ANTLR's tree manipulation facilities provided the foundation for straightforward translation. First you parse a program and built an Abstract Syntax Tree (AST). An AST lets you organize a program they way you want to think about it: functions, variables, expressions, each would have a tree structure to it. For instance the expression A*B+C could be a tree with '*' as the root and two children. One child would be A, the other would be a subtree rooted with '+', with children B and C. Trees give a two dimensional structure to the program which makes it very easy to manipulate.

ANTLR lets you write tree grammars to represent the structure of the tree. Since you have the entire program in the tree, you can also get context for any tree structure you are looking at. For instance I needed to know whether I saw a return statement was nested inside a conditional or not to determine the logical end of a routine since AREV didn't force the programmer to break things down into functions or subroutines manually. You can think of a subtree like a return expression as an object representing the expression, but with a tree grammar you can also give yourself the context of where that expression occurs, which is very important for translating.

Most of the actual translation happens by manipulating trees. I used ANTLR's grammar inheritance feature to write a complete tree grammar that did nothing except recognize my AST. Then for each pass I subclassed that grammar and overrode only what I needed to. For instance, in my pass where I rewrote things like a+=b into a = a + b, I only had to override a few expression rules to do that work, the rest of the grammar was inherited from the superclass.

Parsing was certainly the hardest part. This language has no reserved keywords. Also it has constructs called dynamic arrays which can look like this:

  IF R.CUST.ADDR[1,6] = 'PO BOX'
  

The <> are for array references, [] for substring references. Of course < is also the less than operator and > is greater than. You can't even assume that >= is "greater than equals" because it could be the closing of a dynamic array reference being assigned to something. ANTLR's syntactic predicates let me do infinite lookahead to try "<" as a dynamic array reference and if that failed then assume it's actually "less than". The parser took a little over two months to write, which included tree building and the superclass tree parser which didn't do anything but walk the tree. My parser actually found syntax errors that the AREV interpreter didn't!

With the parser out of the way, the fun really began! The above AREV code needed to be translated from that postfix notation to a series of function calls like this:

  IF (arvEQ(arvSUBSTRING(arvEXTRACT(R_CUST_ADDR, CUA_ADDR_LN1, 0, 0), 1, 6), "PO BOX") 
  

And, dynamic array references and substring operators could be on the left hand of an assignment as well, which requires a different translation. AREV also has a quirky equals operator that will coerce strings into integers if the string holds integers and is compared to an integer. This is why there is the arvEQ call instead of an equals operator above. Getting this transformation to work actually proved to be straightforward, to my surprise.

My first pass was to do some "#include" style preprocessing. Then I created a bunch of tables with all the symbols and labels. Following that a simple transformation of "a+=b" and kin, which VB doesn't support, into "a = a + b". Then I built a table of all assignments to variables to facilitate callgraph analysis. This language lets you specify which module it will jump to next in a variable. Program names usually show up as string constants somewhere, as opposed to being constructed from expressions, so I cached all the assignments to each variable. For callgraph analysis I found the variables that were called as programs and then found all the assignments to them and transitively expanded those until I had all the string constants assigned to this variable. I filtered those against the known program names to get a list of the programs that this one could possibly be calling.

The kitchen sink pass does the dynamic array transforms as specified above. It also turns the AREV case construct, which bears no relations to a C style case statement, into a series of if-else statements. AREV also has many language constructs that are like loops for specialized file processing. These were converted into plain loops with extra code thrown in. The print statement had many different behaviors based on the semantics of the call. These were analyzed and translated appropriately. For each file I also did some control flow analysis to determine where the main routine ended and error handling code began. The error handlers were only reached by gosub. Each of these was analyzed so they could be broken up into individual subroutines. I also commented out any unused labels. Certain language constructs were so complex and used infrequently enough that they were simply transformed into comments with warnings, to be translated by hand later.

The next to last pass analyzes for stranded, unreachable code. Also, various expression idioms were translated into more efficient VB expressions.

The final pass generates the VB code. It handles name-mangling and also generates declarations for all the variables, inferring their types when possible. It also pretty prints the code.

It was great working with a team of experts in both languages. Once they could describe how they would translate something by hand it was usually straightforward to do it in ANTLR. I typically made some changes, ran the translator overnight, then talked with them in the morning to see if there were any VB compilation problems or other tweaks needed. This was a great win over translating by hand because any changes we made would affect all the code at once, in the same way. This was a very iterative process, and really needed to be. So many little details come up that would have never been predicted.

Once we wound down and they started to get the monster actually running I expected to be called in to fix some stuff that was done wrong or forgotten. After a couple of months they finally found a bug in the translator. I fixed it and re-ran the translator on everything. To our relief it had only affected a few lines of two programs. The system is now in final testing and will be going into production soon.

ANTLR proved to be a robust and effective toolkit. For this kind of translation task the tree rewriting facilities seemed a natural fit to the problem. I know that professionals doing translation tasks shaped ANTLR, and it really shows. There were no mysteries as to how it works that I couldn't figure out by just reading the generated code or asking the ANTLR community about when I really got stuck.