[This is part 2 of the Writing Compilers series]
The first ‘real’ language I tried learning was C. I had written extensively in BASIC throughout middle school, but never felt satisfied. It was too far removed from the hardware, too distant from the ‘true’ nature of the computer. C, on the other hand, seemed like a backdoor left open by some inattentive operator, with direct access to the control room.
I purchased a copy of “C – The Complete Reference” at a local bookstore, my first technical manual.
I believed the tome to be an ancient wizard’s codex containing all the knowledge needed to master C programming, and by extension the computer. I was unprepared for the depth of technical background and vocabulary required to put the book to use. The arcane, almost alien, language consisted of symbols jumbled in random order, indirect references to objects of unknown identity lost in translation to the English language.
Simply put, “C – The Complete Reference” was not a good introductory programming guide for middle school students.
I turned instead to Pascal, the ‘native’ language for my new computer – a Macintosh 512K aka “fat mac.” Apple’s API toolbox, the ROM itself, used the Pascal calling conventions. It would be some years before I returned to C, this time with little doubt I could master it.
Compiler Construction
As a first stab at compiler construction, let’s try An Incremental Approach to Compiler Construction by Abdulaziz Ghuloum. He describes the “back-to-front” technique of compiler construction. I was introduced to this technique by my coworker Jonathan Sobel, also from IU. Jonathan says it’s a valuable technique because you “always have a working compiler,” so if you make a mistake you can just back up a step.
Roughly, the paper suggests writing small programs using an existing compiler’s back-end to generate code. By examining the code and abstracting, you can generate the same code. Repeating this process for each of the semantic concepts in your language allows you to build up a collection of code-generation templates.
Let’s see if we can extend this technique to dynamic compilation!
Fixnum Compilation in C
Total-functional languages are very similar to Haskell’s pure-functional semantics, and I’m partial (heh!) to its syntax and runtime, so I’ll be using Haskell as the implementation language. I’m writing this on a Macintosh with an x86-64 CPU, using clang instead of gcc, but I don’t foresee any roadblocks after reading the paper. I created a file “fixnum.c” which contained the “return 42” example, and compiled it with clang:
$ cat > fixnum.c #include <stdio.h> int fixnum() { return 42; } int main(int argc, char** argv) { printf("%d\n", fixnum()); return 0; } $ clang -S -O3 -fomit-frame-pointer fixnum.c $ cat fixnum.s ... _fixnum: movl$42, %eax retq ... $ clang fixnum.s $ ./a.out 42
Fixnum Compilation in Haskell
So far so good!
Let’s implement the Haskell side of this:
$ cat > haskell_fixnum.hs module Main where data Expr = Fixnum Int fixnum :: Int -> String fixnum a = " movl $" ++ (show a) ++ ", %eax\n" ++ " retq\n\n" compile :: Expr -> String compile (Fixnum x) = fixnum x main = putStr $ compile (Fixnum 42)
Here’s a quick rundown of the Haskell code:
- The “Expr” datatype is our abstract syntax.
- “fixnum” takes an Int and generates a String containing the assembly code. The emitter looks clunky because we don’t have a prelude yet, but this will slim up in future iterations.
- “compile” takes an expression in our abstract syntax and calls the appropriate code generator.
- Finally, the main function invokes “compile” on a test expression.
Now build an executable application:
$ ghc Main.hs; ./Main > haskell_fixnum.s $ cat haskell_fixnum.s movl $42, %eax retq $ clang haskell_fixnum.s Undefined symbols for architecture x86_64: "_main", referenced from: implicit entry/start for main executable ld: symbol(s) not found for architecture x86_64 clang: error: linker command failed with exit code 1 (use -v to see invocation)
Oops! Clang needs a main function to call. As a quick fix, I added a global export and added the label for ‘main’:
$ cat > haskell_fixnum.s .global _main _main: movl $42, %eax retq $ clang haskell_fixnum.s $ ./a.out $ echo $? 42
There’s no output from “a.out” since our program doesn’t call printf, but we are returning an integer as the result code of our application, and that makes it easy to print out using a shell command.
Now for some test cases! First I remove the ‘retq’ from fixnum, and put it in the epilogue instead.
fixnum :: Int -> String fixnum a = " movl $" ++ (show a) ++ ", %eax" tests :: [Expr] tests = -- Fixnum Test Cases [ (Fixnum 0) , (Fixnum 1) , (Fixnum 127) -- 7-bit transition cases , (Fixnum 128) , (Fixnum 255) -- 8-bit , (Fixnum 256) , (Fixnum 65535) -- 16-bit , (Fixnum 65536) , (Fixnum 16777215) -- 24-bit , (Fixnum 16777216) , (Fixnum 4294967295)-- 32-bit ] prolog = " .global _main\n_main:\n" epilogue = " retq\n" main = do putStr prolog putStr . unlines . map compile $ tests putStr epilogue
Here we’ve put some test expressions in a list and then mapped the compile function over the list. Unlines just glues those strings together with newlines. I prepend the label “_main” to the assembly so clang will accept it.
$ ghc Main.hs; ./Main > haskell_fixnum.s $ clang haskell_fixnum.s $ ./a.out $ echo $? 255
Not much to see, but the assembler accepted it!
Let’s check the disassembly:
$ otool -tv a.out a.out: (__TEXT,__text) section _main: 0000000100000f80 movl $0x0, %eax 0000000100000f85 movl $0x1, %eax 0000000100000f8a movl $0x7f, %eax 0000000100000f8f movl $0x80, %eax 0000000100000f94 movl $0xff, %eax 0000000100000f99 movl $0x100, %eax 0000000100000f9e movl $0xffff, %eax 0000000100000fa3 movl $0x10000, %eax 0000000100000fa8 movl $0xffffff, %eax 0000000100000fad movl $0x1000000, %eax 0000000100000fb2 movl $0xffffffff, %eax 0000000100000fb7 retq
Looks good!
Next Steps
We now have the framework for a native code generating compiler written in Haskell. In the next post I’ll go through some initial dynamic-compilation concepts, then we’ll get back to the paper to add more language constructs.