I’m just a “regular” software engineer. So when I attended a recent conference on functional programming, my reasons for attending were often remarked upon. I was asked to offer some insight into how I ended up in the land of Functional Programming, and what lessons I might impart about my journey.
To set the stage: I’ve been programming from a young age, and graduated from college with a major in Computer Science. I’ve used every major language in that time for personal projects, and almost all of them in professional projects. A sampling of languages I had used before this story includes: Pascal, C/C++, C#, Java, and Objective-C.
You may infer that all of my formative experiences and all of my formal training focused solely on procedural and object-oriented languages. At the time of this story, I had complete confidence in these tools and my skills to solve any software engineering challenge that might arise. But in spite of this preparedness, it was the very tools I was taught to use that would betray me.
In 2003, recently hired into Cisco Systems, I proposed a project which collected the latent resources of enterprise mobile workers (laptops, home and office desktop machines, etc.) to drive new business continuance objectives (with 9/11 a recent specter, this was a hot issue).
The resulting peer-to-peer distributed system would maintain backups, versioning, and cross-referenced indexes of critical business data and services, allowing small enterprise computing clusters to reform after major disruptions. Of particular interest was the distributed and fault-tolerant execution of these business critical applications by a loose and ever-changing collection of failure-prone mobile components.
Management gave permission to go ahead with a proof of concept, giving me lab space and a co-op student to complete the work. As an expert software engineer using the latest tools and libraries, skilled in development of distributed applications, and wielding the freedom and energy of youth, I expected this project to be a slam-dunk: nothing more than an interesting challenge, a chance to refine my skills as a system architect.
The reality was a quagmire of heisenbugs, poorly documented side effects, constant memory faults, race conditions, and disk corruption. Over the next few months, I scaled back the scope of the project in a desperate attempt to reduce the error contour to a reasonable space. The final (albeit successful) demo was a mere shadow of the complete design I started with.
The failure (in my opinion) of this project shook the very foundation of my being. I fully understood each component individually. Why would so much trouble get stirred up by gluing the pieces together?
It became my Arthurian quest to find the true culprit, and complete my original vision.
A database, at its heart, is a series of API calls which mutate an internal data structure representing the tables, views, and query indexes seen by the client. Good databases don’t just mutate their data structure directly; first they write the API call to a file called a Log (not the kind that collects error messages, but similar). The Log is forced to disk to ensure a record of the API call is saved in case the computer crashes while updating the structure in memory. After performing the update in memory, the structure is stored on disk and the Log Entry can be garbage collected.
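To make the log-then-update ordering concrete, here is a minimal sketch in Python. It is not the code from the project; `TinyStore`, its file names, and the `put`/`delete` operations are hypothetical names chosen for illustration, and it assumes operations are safe to replay more than once.

```python
import json
import os
import tempfile


class TinyStore:
    """Hypothetical key-value store with a write-ahead log (illustration only)."""

    def __init__(self, directory):
        self.log_path = os.path.join(directory, "store.log")
        self.snapshot_path = os.path.join(directory, "store.snapshot")
        self.data = {}
        self._recover()

    def _recover(self):
        # Load the latest snapshot, then replay any calls logged after it.
        if os.path.exists(self.snapshot_path):
            with open(self.snapshot_path) as f:
                self.data = json.load(f)
        if os.path.exists(self.log_path):
            with open(self.log_path) as f:
                for line in f:
                    self._apply(json.loads(line))

    def _apply(self, op):
        # The only place the in-memory structure is mutated.
        if op["kind"] == "put":
            self.data[op["key"]] = op["value"]
        elif op["kind"] == "delete":
            self.data.pop(op["key"], None)

    def execute(self, op):
        # 1. Record the call and force it to disk before touching memory.
        with open(self.log_path, "a") as f:
            f.write(json.dumps(op) + "\n")
            f.flush()
            os.fsync(f.fileno())
        # 2. Only then update the in-memory structure.
        self._apply(op)

    def checkpoint(self):
        # Store the structure on disk; the log entries are then redundant.
        tmp = self.snapshot_path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(self.data, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.snapshot_path)
        open(self.log_path, "w").close()  # truncate the log


workdir = tempfile.mkdtemp()
store = TinyStore(workdir)
store.execute({"kind": "put", "key": "a", "value": 1})
store.checkpoint()
store.execute({"kind": "put", "key": "b", "value": 2})

# Simulate a restart after a crash: snapshot plus log rebuild the state.
recovered = TinyStore(workdir)
```

After the simulated restart, `recovered.data` holds both keys: `a` comes from the snapshot and `b` from replaying the log.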
While this achieves crash-tolerance, it relies heavily on the hardware for reliability. It also does not offer continuous service; in the event of a failure, the database is inaccessible until the process can load the latest snapshot and read through the log checking that the work has been completed. Lastly, it is centralized and does not allow the loose collaboration of multiple computers I envisioned.
What I needed was a technique called Fault Tolerance. This occurs when you have multiple, failure-prone computers working together to do the job of a single, more reliable computer. This means some resources are wasted by duplicating work, but it ensures non-stop service in the event of a failure. It was exactly what I needed as a foundation for my system.
It works like this:
- Multiple copies of a database exist on different computers.
- Whenever an operation is to be performed on the database, it is sent to all of them.
- Each server writes the operation to its log, updates its copy of the database, then notifies the original server that it completed successfully.
- If one of the servers fails, the other copies are used to create a new replica on another server.
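The steps above can be sketched in a few lines of Python. This is a toy, single-process model under stated assumptions: `Replica` and `Cluster` are hypothetical names, "servers" are plain objects, and a failure is simulated by flipping a flag rather than by a real network fault.

```python
class Replica:
    """Hypothetical in-process stand-in for one database server."""

    def __init__(self, name):
        self.name = name
        self.log = []
        self.data = {}
        self.alive = True

    def apply(self, op):
        key, value = op
        self.log.append(op)     # write the operation to the log first
        self.data[key] = value  # then update the local copy
        return True             # acknowledge success to the sender


class Cluster:
    def __init__(self, names):
        self.replicas = [Replica(n) for n in names]

    def live(self):
        return [r for r in self.replicas if r.alive]

    def execute(self, op):
        # Send the operation to every live copy and count acknowledgements.
        acks = [r.apply(op) for r in self.live()]
        return len(acks)

    def replace_failed(self, new_name):
        # Build a fresh replica by replaying a survivor's log.
        survivor = self.live()[0]
        fresh = Replica(new_name)
        for op in survivor.log:
            fresh.apply(op)
        self.replicas = self.live() + [fresh]
        return fresh


cluster = Cluster(["a", "b", "c"])
cluster.execute(("x", 1))
cluster.replicas[1].alive = False  # simulate server "b" failing
acks = cluster.execute(("y", 2))   # only "a" and "c" acknowledge
fresh = cluster.replace_failed("d")
```

The new replica `d` ends up with the same data as the survivors, because it was rebuilt from the same sequence of operations.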
The concept of my project was (seemingly) simple: record the API calls onto the logs of multiple mobile workers spread across the network. In the event of a catastrophe, collect the partial logs of many replicas to re-form the global stream of database transactions. Along the way, new master servers could be brought online whenever a few laptops are in range, and the business critical service could continue operating.
There are many interesting design decisions made as a new system is created, but I’ll focus on the crucial one for this story. I chose early on to allow 3rd party ‘apps’ to share the features of the core system: fault-tolerance, distributed execution, parallelism and load balancing. This means the framework itself must be a library accessible to applications written by other engineers — possibly from other companies.
And here we reach the crux of this story: Pascal, C++, Java, and every other language I had used up to that point were incapable of enforcing anything but the most trivial invariants on someone else’s program.
Why does this matter?
Because a fault-tolerant system only works if all the parts are fault-tolerant. That old adage about chains being as strong as the weakest link adequately describes this situation. Each component of the system expected each other component to also be written using fault-tolerant design principles. If any module deviated, the whole house of cards collapsed. Only one module was allowed to break this rule – my foundation library.
The payout for this hyper-focused design is impressive. A system like this can continue operating in the face of multiple simultaneous failures with clients completely unaware. Imagine Netflix-scale video streaming while you’re firing a machine gun into the rack of servers. Now imagine doing that every day for years on end without a single interruption in service, not one skipped frame or pause, just business as usual.
To achieve this, each application must update its state using only a stream of inputs fed from my library. If two copies of the application are fed the same stream, they must arrive at the same result. This is called Deterministic Execution, a property typically associated with mathematical functions and some models of computation (like Turing Machines). For instance, it allows recording the stream of API calls and replaying them on a new server, possibly years later, and knowing it will produce the same outputs.
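A small illustration of the idea, assuming a hypothetical bank-balance application: the state is computed purely by folding a transition function over the input stream, so any replica fed the same stream reaches the same state. The names `step` and `replay` and the event format are invented for this sketch.

```python
from functools import reduce


def step(balance, event):
    """Pure transition: the next state depends only on the state and the input."""
    kind, amount = event
    if kind == "deposit":
        return balance + amount
    if kind == "withdraw" and amount <= balance:
        return balance - amount
    return balance  # ignore unknown or invalid events


def replay(events, initial=0):
    # Fold the transition function over the recorded input stream.
    return reduce(step, events, initial)


stream = [("deposit", 100), ("withdraw", 30), ("withdraw", 500), ("deposit", 5)]
replica_a = replay(stream)
replica_b = replay(stream)  # a second copy, possibly on another machine
```

Both replicas finish with a balance of 75, and replaying the recorded stream later gives the same answer, because `step` consults nothing except its arguments.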
Most programming languages do not guarantee deterministic execution. Only toy applications end up with this property. By the time an engineer has built a real-world application, it is not likely to execute deterministically. Simply reading the current time or checking for the existence of a file on the hard drive is enough to cause divergent behavior.
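As a hedged illustration of how little it takes, compare two versions of a hypothetical deadline check. The first reads the wall clock, so two replicas (or one replay years later) can disagree. The second takes the timestamp as part of its input, which is one common way to restore determinism in your own code. Both function names are invented for this sketch.

```python
import time


def is_expired_nondeterministic(deadline):
    # Reads the wall clock: the answer depends on when and where this runs.
    return time.time() > deadline


def is_expired(deadline, now):
    # Deterministic variant: the timestamp arrives with the input stream.
    return now > deadline


# Recorded inputs as (deadline, recorded "now") pairs; replaying them
# always yields the same results.
logged_inputs = [(1000.0, 999.0), (1000.0, 1001.0)]
results = [is_expired(deadline, now) for deadline, now in logged_inputs]
```

Here `results` is `[False, True]` on every run, whereas the first function's answer for a given deadline changes as real time passes.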
There are a variety of solutions to this problem for your own programs, but there is no way to ensure someone else’s program follows the rules. Even worse, it is not possible in general to check an arbitrary program for determinism, as this is equivalent to solving the Halting Problem. If you don’t start with the property, you will never have it. The best you can do is write good documentation and hope they read it. This is not the kind of foundation I wanted to build my system on!
As you might have guessed by now, all of those issues I had run into were caused by non-deterministic behaviors in the libraries I was using, even libraries that were supposed to be deterministic. As an example, when you run multiple copies of the same program on different machines, the memory is fragmented in non-deterministic ways. This caused out-of-memory exceptions at unpredictable times, and the unlucky ones coincided with the system failures I was inducing during testing. The result: recurrent bugs that seemed to move around randomly in the codebase!
I searched for ways to enforce the key system invariants using language features like C++ template template parameters. I scoured books and forums, asked senior programmers, and read the documentation of dozens of competitors’ products.
The simple fact was C++ could not even express the problem, let alone offer a solution.
It took a long time, but I began to accept that the language I spent almost two decades mastering was but a journeyman’s tool. This journey had taught me the limits of that tool. I had no idea where to go from here, but I was driven to find a solution.
With a goal in hand, it became significantly simpler to filter the wide field of computer languages. In my case, I needed a language in which all programs would meet the constraints required to execute in a distributed, fault-tolerant infrastructure. This meant:
- Pure – No side effects
- Total – Always returns a result
- Strong Static Types – Enforce invariants on other modules
- Expressive – Capable of writing any algorithm I would need to implement
- Mature – Ready to use libraries for enterprise-grade software
- Documented – In-depth exposition and tutorials
- Compiled – Native execution, stand-alone binaries
- Open Source – Not required, but strongly desired
As you can see, almost every existing computer language is eliminated by this list. In particular, none of the languages I had ever used met even the first requirement!
In the end I settled on Haskell – a purely functional, non-strict language. It doesn’t fit perfectly (it is not Total), but it has everything else! To close that gap, I wrote a new language (which is Total) that compiled to Haskell. The rest of the project progressed remarkably smoothly. Finally I could write the fault-tolerance library that I had intended to write in the first place!
The resulting system worked flawlessly and withstood months of torture testing. My vision was made reality: any program written in this language would automatically be fault-tolerant, even programs written without forethought for fault-tolerance or distributed execution.
I landed in the world of Functional Programming following best-of-breed engineering practice. I built a working, enterprise-grade system, fixed issues that arose, then discovered a persistent bug. By practical root-cause-analysis it was revealed that the basic assumptions of my architecture could not be upheld.
In researching solutions, I discovered that the problem arose from my choice of tool (C++), not from my design, architecture, or implementation. The solution was to abandon the flawed tool and pick the right tool for this job: Haskell.
Your programming language is just a tool. Like all tools, it has limits. The process of learning these limits consumed almost 8 years of my professional career and made me an expert in fields of software engineering I had never considered. It has been the single most transformative experience in my career.
Learning is hard, it takes a long time, and it changes the way you think. I hope this story gives you a sense of my journey and spurs you to learn new patterns of thought too.