Proactive Assumption Violation: Avoiding Bugs By Behaving Badly
Bugs are a fact of life in software, and probably always will be. Some are unavoidable, but many can be avoided through good architecture, defensive programming, immutability, and other techniques. One major source of bugs, and of especially frustrating bugs, is non-deterministic behavior. Every programmer has experienced bugs which don’t reproduce, which require a special environment, special timing, or even just luck to trigger. To avoid these bugs, programmers learn to favor determinism, making sure their software behaves the same way every time.
But sometimes a little extra non-determinism can help avoid bugs. When designing a library, the specification may contain caveats which the implementation does not exercise. If a program only ever uses one implementation, or doesn’t exercise the full range of behaviors during testing, then it may come to depend on behaviors which are artifacts of that implementation. When an alternative implementation is used, or circumstances exercise other behaviors, the program will exhibit bugs. What I would like to suggest is that the implementation proactively violate any assumptions the client might make, by deliberately and non-deterministically taking advantage of every caveat in the interface specification. This forces clients to code defensively, and helps to eliminate this class of bugs.
For example, an interface for Set might include iteration, but say that iteration order is unspecified. Most Set implementations, like Java’s HashSet, will iterate in a stable order if the set doesn’t change. The order might even be consistent from one run of the program to the next. But if programs depend on iteration stability, substituting a different implementation, such as one based on a splay tree, will introduce bugs. If instead the implementation of Set iteration deliberately returned a different iteration order each time, then programs would be unable to depend upon it.
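A minimal sketch of such an implementation (my own illustration, not from any standard library): a Set wrapper whose iterator deliberately hands back the elements in a freshly shuffled order on every call, so clients can never come to depend on any particular order.

```java
import java.util.*;

// Sketch: a Set whose iteration order is deliberately different on every
// call, proactively violating any assumption of a stable order.
public class ShufflingSet<E> extends AbstractSet<E> {
    private final Set<E> backing = new HashSet<>();
    private final Random random = new Random();

    @Override public boolean add(E e) { return backing.add(e); }
    @Override public int size() { return backing.size(); }

    @Override public Iterator<E> iterator() {
        // Snapshot and shuffle the elements before handing out an iterator.
        List<E> snapshot = new ArrayList<>(backing);
        Collections.shuffle(snapshot, random);
        // Unmodifiable, so Iterator.remove can't silently edit the snapshot
        // instead of the backing set.
        return Collections.unmodifiableList(snapshot).iterator();
    }
}
```

A test suite could substitute this for the production Set to smoke out any order dependence early, at the cost of some iteration overhead.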
For a real-world example, consider the standard C function memcpy. According to the specification, if the source and destination buffers overlap, the behavior is undefined. But what does undefined really mean in practice? Recently, glibc on Linux switched to a new memcpy implementation, one which copies backwards (high bytes first, low bytes last), because it is faster on modern hardware. The result is a dramatic change in behavior for overlapping buffers, leading to difficult-to-isolate bugs like Red Hat Bug 639477: Strange sound on mp3 flash website. That bug was eventually tracked down to memcpy using valgrind. But if the original memcpy had been more deliberately harmful in the case of overlapping buffers, then bugs like this would not have been created in the first place.
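To see why copy direction matters for overlapping buffers, here is a small Java analog (the method names are hypothetical, invented for illustration). Copying a range onto an overlapping range one element higher gives completely different results depending on whether the copy runs low-to-high or high-to-low:

```java
// Sketch: two byte-at-a-time copies that agree on non-overlapping ranges
// but disagree wildly when source and destination overlap.
public class OverlapDemo {
    // Copy len bytes within array a, low index first (the traditional order).
    static void copyForward(byte[] a, int src, int dst, int len) {
        for (int i = 0; i < len; i++) a[dst + i] = a[src + i];
    }

    // Same copy, high index first -- the direction the newer memcpy uses.
    static void copyBackward(byte[] a, int src, int dst, int len) {
        for (int i = len - 1; i >= 0; i--) a[dst + i] = a[src + i];
    }
}
```

Starting from `{1,2,3,4,5}` and copying indices 0..3 onto 1..4, the forward copy smears the first byte across the array (`{1,1,1,1,1}`), while the backward copy shifts the data intact (`{1,1,2,3,4}`). Any client that relied on one direction breaks silently when handed the other.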
Another place where this comes up is thread safety. Many APIs are not thread safe, but the consequences of using them in a thread-unsafe way are benign or rare. Take the Java DOM API for XML documents. This is not a thread safe API, which is not surprising for a complex mutable data structure. What is a bit surprising is that even just reading the Java DOM from multiple threads can have unintended consequences. This is because of a cache for reuse of Node objects, and the failure mode is that very occasionally accessor functions return null when they are called from multiple threads simultaneously. Debugging a program that was suffering from this behavior took several hours, because the undesirable behavior occurs very infrequently. Is there a way we can apply the principle of proactive assumption violation to make this sort of bug less common?
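One possible approach (a sketch of my own, not part of the DOM API): wrap the non-thread-safe object in a holder that records its owning thread and fails loudly and immediately on any access from another thread, instead of failing rarely and silently.

```java
// Sketch: single-thread confinement enforced proactively. Any cross-thread
// access fails every time, not just on an unlucky cache interaction.
public class ThreadConfined<T> {
    private final T value;
    private final Thread owner = Thread.currentThread();

    public ThreadConfined(T value) { this.value = value; }

    public T get() {
        if (Thread.currentThread() != owner) {
            throw new IllegalStateException(
                "Accessed from " + Thread.currentThread().getName()
                + " but owned by " + owner.getName());
        }
        return value;
    }
}
```

A DOM Document wrapped this way would have turned a rare null from an accessor into a deterministic exception on the very first cross-thread read, turning hours of debugging into a stack trace.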
Systems that cope with infrastructure faults by degrading their behavior are another case where proactive assumption violation can reduce bugs later. NoSQL databases are well known for taking approaches like eventual consistency in order to offer better performance and availability. But that means that when the system is under heavy load or suffering from partial outages, consistency may take a long time to resolve. I ran into this as a user of Netflix the other night. My television and my laptop had two different ideas of what my current Netflix queue contained; my television couldn’t see the recent updates. It turns out there is even a slide deck from Netflix describing the architecture choices that led to my undesirable user experience. Most of the time things are consistent enough that I wouldn’t see this asynchrony. But when I do see it, as a user there is no obvious way to get around it or even to know that it is happening. Would a NoSQL infrastructure that let consistency drift more often end up with better average user experience?
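Applied to storage, the same idea might look like this hypothetical sketch: a key-value store that, in test mode, deliberately serves a stale value some fraction of the time, forcing clients to design for eventual consistency before a production outage does it for them.

```java
import java.util.*;

// Sketch: a store that proactively returns stale reads with a configurable
// probability, simulating eventual consistency under load.
public class StaleReadStore<K, V> {
    private final Map<K, V> current = new HashMap<>();
    private final Map<K, V> previous = new HashMap<>();
    private final Random random = new Random();
    private final double staleProbability;

    public StaleReadStore(double staleProbability) {
        this.staleProbability = staleProbability;
    }

    public void put(K key, V value) {
        // Remember the overwritten value so it can be served as a stale read.
        if (current.containsKey(key)) previous.put(key, current.get(key));
        current.put(key, value);
    }

    public V get(K key) {
        if (random.nextDouble() < staleProbability && previous.containsKey(key)) {
            return previous.get(key);   // deliberately stale
        }
        return current.get(key);
    }
}
```

Run with a nonzero staleProbability in testing, a UI built on this store would have to show users when their view might be out of date, rather than leaving the asynchrony invisible.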
Clients of interfaces often make assumptions about how those interfaces work, assumptions that are explicitly or implicitly not part of the spec. But if implementations of the interface don’t violate those assumptions, then programs can be developed which require them to be true. This leads to unexpected and expensive bugs if and when those assumptions are violated. One solution is for implementations to deliberately violate these assumptions, for no other reason than to force clients of their interface to future-proof. The result is more work up front for programmers, but fewer bugs in the long run.
What do you think about proactive assumption violation? Is it a technique you have ever used? Have you experienced bugs which would be avoided if others had employed proactive assumption violation?