An Incremental Approach to Compiler Construction

[This is part 2 of the Writing Compilers series]

The first ‘real’ language I tried learning was C.  I had written extensively in BASIC throughout middle school, but never felt satisfied.  It was too far removed from the hardware, too distant from the ‘true’ nature of the computer.  C, on the other hand, seemed like a backdoor left open by some inattentive operator, with direct access to the control room.

I purchased a copy of “C – The Complete Reference” at a local bookstore, my first technical manual.

I believed the tome to be an ancient wizard’s codex containing all the knowledge needed to master C programming, and by extension the computer.  I was unprepared for the depth of technical background and vocabulary required to put the book to use.  The arcane, almost alien, language consisted of symbols jumbled in random order, indirect references to objects of unknown identity lost in translation to the English language.

Simply put, “C – The Complete Reference” was not a good introductory programming guide for middle school students.

I turned instead to Pascal, the ‘native’ language for my new computer – a Macintosh 512K aka “fat mac.”  Apple’s API toolbox, the ROM itself, used the Pascal calling conventions.  It would be some years before I returned to C, this time with little doubt I could master it.

Compiler Construction

As a first stab at compiler construction, let’s try An Incremental Approach to Compiler Construction by Abdulaziz Ghuloum.  He describes the “back-to-front” technique of compiler construction.  I was introduced to this technique by my coworker Jonathan Sobel, also from IU.  Jonathan says it’s a valuable technique because you “always have a working compiler,” so if you make a mistake you can just back up a step.

Roughly, the paper suggests writing small programs using an existing compiler’s back-end to generate code.  By examining the code and abstracting, you can generate the same code.  Repeating this process for each of the semantic concepts in your language allows you to build up a collection of code-generation templates.

Let’s see if we can extend this technique to dynamic compilation!

Fixnum Compilation in C

Total-functional languages are very similar to Haskell’s pure-functional semantics, and I’m partial (heh!) to its syntax and runtime, so I’ll be using Haskell as the implementation language.  I’m writing this on a Macintosh with an x86-64 CPU, using clang instead of gcc, but I don’t foresee any roadblocks after reading the paper.  I created a file “fixnum.c” which contained the “return 42” example, and compiled it with clang:

    $ cat > fixnum.c
    #include <stdio.h>
    int fixnum()
      return 42;
    int main(int argc, char** argv)
      printf("%d\n", fixnum());
      return 0;

    $ clang -S -O3 -fomit-frame-pointer fixnum.c
    $ cat fixnum.s
      movl$42, %eax
    $ clang fixnum.s
    $ ./a.out

Fixnum Compilation in Haskell

So far so good!

Let’s implement the Haskell side of this:

    $ cat > haskell_fixnum.hs
    module Main where

    data Expr
      = Fixnum Int

    fixnum :: Int -> String
    fixnum a = "   movl $" ++ (show a) ++ ", %eax\n" ++
               "   retq\n\n"

    compile :: Expr -> String
    compile (Fixnum x) = fixnum x

    main = putStr $ compile (Fixnum 42)

Here’s a quick rundown of the Haskell code:

  • The “Expr” datatype is our abstract syntax.
  • “fixnum” takes an Int and generates a String containing the assembly code.  The emitter looks clunky because we don’t have a prelude yet, but this will slim up in future iterations.
  • “compile” takes an expression in our abstract syntax and calls the appropriate code generator.
  • Finally, the main function invokes “compile” on a test expression.

Now build an executable application:

    $ ghc Main.hs; ./Main > haskell_fixnum.s
    $ cat haskell_fixnum.s

       movl $42, %eax

    $ clang haskell_fixnum.s 
    Undefined symbols for architecture x86_64:
      "_main", referenced from:
         implicit entry/start for main executable
    ld: symbol(s) not found for architecture x86_64
    clang: error: linker command failed with exit code 1 (use -v to see invocation)

Oops!  Clang needs a main function to call.  As a quick fix, I added a global export and added the label for ‘main’:

    $ cat > haskell_fixnum.s

        .global  _main
       movl $42, %eax

    $ clang haskell_fixnum.s 
    $ ./a.out
    $ echo $?

There’s no output from “a.out” since our program doesn’t call printf, but we are returning an integer as the result code of our application, and that makes it easy to print out using a shell command.

Now for some test cases! First I remove the ‘retq’ from fixnum, and put it in the epilogue instead.

    fixnum :: Int -> String
    fixnum a = "   movl $" ++ (show a) ++ ", %eax"

    tests :: [Expr]
    tests =
        -- Fixnum Test Cases
        [ (Fixnum 0)
        , (Fixnum 1)
        , (Fixnum 127)       -- 7-bit transition cases
        , (Fixnum 128)
        , (Fixnum 255)       -- 8-bit
        , (Fixnum 256)
        , (Fixnum 65535)     -- 16-bit
        , (Fixnum 65536)
        , (Fixnum 16777215)  -- 24-bit
        , (Fixnum 16777216)
        , (Fixnum 4294967295)-- 32-bit

    prolog   = "   .global _main\n_main:\n"
    epilogue = "   retq\n"

    main = do
           putStr prolog
           putStr . unlines . map compile $ tests
           putStr epilogue

Here we’ve put some test expressions in a list and then mapped the compile function over the list. Unlines just glues those strings together with newlines.  I prepend the label “_main” to the assembly so clang will accept it.

    $ ghc Main.hs; ./Main > haskell_fixnum.s
    $ clang haskell_fixnum.s
    $ ./a.out
    $ echo $?

Not much to see, but the assembler accepted it!
Let’s check the disassembly:

$ otool -tv a.out 
(__TEXT,__text) section
0000000100000f80	movl	$0x0, %eax
0000000100000f85	movl	$0x1, %eax
0000000100000f8a	movl	$0x7f, %eax
0000000100000f8f	movl	$0x80, %eax
0000000100000f94	movl	$0xff, %eax
0000000100000f99	movl	$0x100, %eax
0000000100000f9e	movl	$0xffff, %eax
0000000100000fa3	movl	$0x10000, %eax
0000000100000fa8	movl	$0xffffff, %eax
0000000100000fad	movl	$0x1000000, %eax
0000000100000fb2	movl	$0xffffffff, %eax
0000000100000fb7	retq

Looks good!

Next Steps

We now have the framework for a native code generating compiler written in Haskell.  In the next post I’ll go through some initial dynamic-compilation concepts, then we’ll get back to the paper to add more language constructs.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s