Synthetic Biology and Big Ball of Mud

Researchers at the MIT OpenWetWare project are working to engineer synthetic biology: creating reusable, composable biological components that can be combined into useful organisms. In the process, they are discovering that biological systems don’t follow the patterns of good architecture familiar to us from software.

In software engineering, architecture is seen as critical both to the success of a large implementation project and to the ongoing maintenance of the code it produces. Since software is very expensive, a great deal of effort goes into choosing the right architecture and making sure it is faithfully executed. Some of the biggest changes in software over the last 30 years, like object-oriented programming and service-oriented architecture, have come about because of the need for clearer architecture in the ever larger collections of software that run modern life.

Architecture analysis also extends to describing anti-patterns, those things which do not work and are to be avoided. Probably the most famous anti-pattern is the Big Ball of Mud. A big ball of mud is what you get when software has evolved for many years without any clear architecture, where all abstraction barriers and divisions of responsibility have eroded in the interest of expediency and incremental functionality. The interesting thing about big balls of mud is that as a rule they work, and they are remarkably common. It can be very difficult to maintain them, almost maddening. But as long as they are economically valuable enough to justify their care and feeding, they tend to persist.

What people fail to realize about big balls of mud is that they represent a certain kind of efficiency. Maybe not efficiency in the top-down, best-way-to-solve-the-problem sense, but every change is efficient in its own way. Engineers, either fearful of breaking the system or just lazy, make the smallest possible modifications, or the ones they are best able to understand. Abstractions are violated, variables and identifiers are reused, values are overloaded. All of these changes represent efficiency as much as they represent “bad” software engineering.

Biological systems follow the same pattern. Because of the pressures of evolution, they are filled with abstraction violations, repurposed functionality, and tight coupling. The mechanics of genetics and evolution drive change, and generally result in the smallest possible sufficient change. Tight coupling leads to efficiency and reuse. Like a big ball of mud in software, the system is difficult to understand by decomposing it into parts. Like these software systems, complex biological systems are filled with shortcuts that work most of the time, implementation details that blossom into major behaviors.

Read Montague covers this in his recent book Your Brain is (Almost) Perfect: How We Make Decisions. The original title was the clever Why Choose This Book (bringing to mind Abbie Hoffman’s Steal This Book). While most of the narrative focuses on decision making in the face of limited information, the overriding principle Montague argues from is that biological systems favor efficiency above all else. He argues that to become more efficient, computers will need to emulate both the frailties and the strengths of biological systems.

As a student of programming languages and programming methodology, I’m intrigued by the potential for developing software systems that are highly efficient due to their high degree of coupling. The closest thing currently available is whole program analysis, as in the Haskell Jhc compiler or the GCC Link Time Optimizer. However, while both of these systems consider the whole program for optimization, neither is likely to produce a highly coupled or minimal program. For that, the state of the art is the superoptimizer, either the original superoptimizer from 1987 or more recent attempts like TOAST. In a superoptimizer, the entire space of potential programs is searched to find the shortest implementation of a given function. This can be very helpful in core tight loops, but exhaustive search is only feasible for very simple functions. None of these compare to the results of millions of years of evolution.
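The core idea of a superoptimizer can be sketched in a few lines. This is a toy, not Massalin's original or TOAST: the four-operation accumulator "machine" here is entirely hypothetical, and like real superoptimizers that test against sample inputs, a match on the test vectors is evidence of correctness rather than proof.

```python
from itertools import product

# Hypothetical toy instruction set acting on a single register.
# The names and operations are illustrative, not a real ISA.
OPS = {
    "inc":    lambda r: r + 1,
    "dec":    lambda r: r - 1,
    "double": lambda r: r * 2,
    "neg":    lambda r: -r,
}

def superoptimize(target, tests, max_len=4):
    """Exhaustively search instruction sequences, shortest first,
    for one matching `target` on every test input."""
    for length in range(1, max_len + 1):
        for seq in product(OPS, repeat=length):
            ok = True
            for x in tests:
                r = x
                for op in seq:
                    r = OPS[op](r)
                if r != target(x):
                    ok = False
                    break
            if ok:
                return seq  # shortest matching sequence
    return None

# Find the shortest sequence computing f(x) = 2x + 2.
print(superoptimize(lambda x: 2 * x + 2, tests=range(-5, 6)))
# -> ('inc', 'double'), i.e. (x + 1) * 2
```

The search is exponential in sequence length, which is why the technique is confined to very short code sequences.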

But what if we had something comparable to millions of years of evolution, but for software? For example, if we had nigh-infinite computational power available inexpensively by using cloud servers in off hours, how could we take advantage of it? To me, the interesting question is not how to build an evolution machine, but rather how to constrain it so that the result is a program you find useful.

In biology and bioengineering, the approach is called forced evolution. There are a variety of techniques, but the one I’m familiar with works for cases where you want to produce chemical X from source chemical Y. First, a bacterium is engineered, by any means necessary, with a biological pathway that lets it survive by producing X from Y, however inefficiently. Next, all other ways for the bacterium to survive are knocked out by disabling the genes behind those pathways. Finally, the organism is left to reproduce and evolve in a Y-rich environment. Hopefully, the organisms that best optimize the Y->X pathway reproduce more and come to dominate in this environment. The result should be a bacterium much more efficient than could have been designed from scratch, at least using current bioengineering.
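The software analogue of this protocol is a selection loop in which the *only* route to survival is the pathway we care about. Here is a minimal sketch, assuming a deliberately trivial "pathway": each organism's genome is just two numbers defining a conversion x = a*y + b, and fitness is solely how well it converts feedstock Y into the target X. All names and parameters are invented for illustration.

```python
import random

random.seed(1)

FEEDSTOCK = [random.uniform(0, 10) for _ in range(20)]  # the Y-rich environment

def target(y):
    # The conversion we want the population to discover: x = 3y + 7.
    return 3 * y + 7

def fitness(genome):
    a, b = genome
    # Survival depends ONLY on conversion accuracy -- the analogue of
    # knocking out every other pathway. Higher (less negative) is better.
    return -sum(abs((a * y + b) - target(y)) for y in FEEDSTOCK)

def mutate(genome, rate=0.1):
    # Small random changes, the analogue of point mutations.
    return [g + random.gauss(0, rate) for g in genome]

population = [[random.uniform(-1, 1), random.uniform(-1, 1)] for _ in range(50)]
for generation in range(500):
    population.sort(key=fitness, reverse=True)
    survivors = population[:10]          # only efficient converters reproduce
    population = [mutate(random.choice(survivors)) for _ in range(50)]

best = max(population, key=fitness)
print(best)  # should drift toward a ≈ 3, b ≈ 7
```

Of course, a real forced-evolution run would be searching a vastly larger genome space, with no designer who already knows the answer; the point of the sketch is only the shape of the loop: engineer, knock out alternatives, then select.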

Can we use forced evolution to build better software? I think this breaks down into two parts. First, can we make this work? The place where evolution of simple bacteria has the advantage is in parallel computation. Each instance of the bacterium not only has its own variant of the software but also implements its own instance of the hardware to run it. Right now computational resources, even for simple compute jobs, are still many orders of magnitude more expensive than biological resources. That said, they are falling fast. One minute of compute on EC2 is fourteen-hundredths of a cent; the spot price for low-priority computation is about a third of that; and the price of a fixed amount of computation is falling steadily. This kind of workload is quite happy to take advantage of modern parallel processors. Forced evolution of computational processes might be economical before we know it.

The second question is, assuming all that works, would we want the software it produced? Today, scientists are spending vast resources trying to understand the wetware developed by natural evolution. The worst software systems are quite simple in comparison. I would expect software created by evolution to be quite opaque. More worryingly, it might be sensitive to various characteristics of the machines or environment in which it evolved. The software would likely contain dead code, unexercised race conditions, unprotected parallel data access, and many other artifacts of unrestrained expedient modifications. A simple unit test suite, even with perfect coverage, would not be able to catch all these issues. We would need a stricter environment, a set of limitations on the solution space or tests for the result that would cull badly behaved implementations, rather than allowing them to take over.
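One cheap form such a stricter environment could take is perturbation testing: run every candidate not just on the known cases but under deliberately varied conditions, and cull anything whose answer is environment-sensitive. A toy sketch, using input order as a stand-in for the hardware and timing variations a real harness would have to randomize:

```python
import random

random.seed(0)

def survives(candidate, cases, trials=10):
    """Cull candidates that fail the tests OR that only pass them
    under the exact conditions they evolved in."""
    # Functional correctness on the known cases.
    if any(candidate(list(xs)) != expected for xs, expected in cases):
        return False
    # Environmental robustness: the answer must not depend on
    # incidental input order (a proxy for machine/environment quirks).
    for xs, expected in cases:
        for _ in range(trials):
            shuffled = list(xs)
            random.shuffle(shuffled)
            if candidate(shuffled) != expected:
                return False
    return True

# Two candidate "find the maximum" implementations. The bad one
# passes the unit tests only because the test inputs happen to be
# sorted -- exactly the kind of accidental dependency evolution loves.
cases = [([1, 2, 3], 3), ([1, 2, 5], 5)]
good = lambda xs: max(xs)
bad = lambda xs: xs[-1]
print(survives(good, cases), survives(bad, cases))  # True False
```

Order-shuffling catches only one narrow class of environmental sensitivity; a serious harness would also vary scheduling, memory layout, timing, and machine, but the culling principle is the same.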

If we could figure out how to control the correctness and environmental sensitivity of evolved software, then we could also control it for designed software. Given that most designed software rapidly trends towards a big ball of mud, the most immediate benefit of any work in this area might not be controlling pseudo-biological processes, but controlling the human-driven ones, keeping them from getting out of hand.


  1. There has been work in the lab on evolving FPGA behaviors. Simple two-tone discrimination was evolved a few years ago. There’s a link to a modern article about that work here, although I think the work itself was more like 10 years ago.

  2. Nathan Williams says:

    I was thinking of the same FPGA experiments when I read this, particularly their reliance on what we’d consider unspecified behavior of the device, and the extreme dependence on exact hardware conditions (including temperature).

  3. Thanks, it was very interesting to read about superoptimization. My favorite paper was the one by Joshi, Nelson, and Randall on their Denali project, which shows how to find near-optimal opcode sequences implementing simple functions without trying every possible sequence. But seeing the effort/reward tradeoff in Denali, I would suspect that this approach is going to be limited to micro-optimization, rather than discovering algorithms or abstractions.

    Besides the difficulty of the search, one problem with evolving larger software is, how do we specify the behavior that we are searching for? As you point out, if we specify test vectors, then we’re not searching for software that is correct, only software that is mostly correct. But giving a complete description of the software’s behavior could be almost as hard as writing the software.

    Maybe there are problem domains where we can take a cue from biology and choose to require not software that is perfect, but software that is mostly right most of the time. Though it’s not clear that humans do better, in any case :)